PAL: Proxy-Guided Black-Box Attack on Large Language Models

Abstract: Large Language Models (LLMs) have surged in popularity in recent months, but they have demonstrated concerning capabilities to generate harmful content when manipulated. While techniques like safety fine-tuning aim to minimize harmful use, recent works have shown that LLMs remain vulnerable to attacks that elicit toxic responses. In this work, we introduce the Proxy-Guided Attack on LLMs (PAL), the

