Certifying LLM Safety against Adversarial Prompting

Abstract: Large language models (LLMs) are vulnerable to adversarial attacks that add malicious tokens to an input prompt to bypass the safety guardrails of an LLM and cause it to produce harmful content. In this work, we introduce erase-and-check, the first framework for defending against adversarial prompts with certifiable safety guarantees. Given a prompt, our procedure erases tokens individually and inspects the resulting subsequences with a safety filter.
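
The abstract describes erase-and-check at a high level: erase tokens from the prompt one at a time and run a safety filter on each resulting subsequence, flagging the prompt if any check comes back harmful. Below is a minimal Python sketch of that idea, assuming a tokenized prompt and a caller-supplied `is_harmful` safety filter; the function and parameter names are illustrative, not taken from the paper's code.

```python
from typing import Callable, List

def erase_and_check(prompt_tokens: List[str],
                    is_harmful: Callable[[List[str]], bool]) -> bool:
    """Sketch of the erase-and-check idea: label the prompt harmful if the
    safety filter flags the full prompt or any subsequence obtained by
    erasing a single token."""
    # Check the full prompt first.
    if is_harmful(prompt_tokens):
        return True
    # Erase each token individually and re-check the remaining subsequence.
    for i in range(len(prompt_tokens)):
        subsequence = prompt_tokens[:i] + prompt_tokens[i + 1:]
        if is_harmful(subsequence):
            return True
    return False
```

The intuition behind the certificate is that if an adversarial token were responsible for slipping a harmful prompt past the filter, at least one erased subsequence would expose the underlying harmful content to the safety filter.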
