Haize Labs Red-Teaming Resistance Leaderboard — LLM robustness benchmark
AI Impact Summary
Haize Labs has launched the Red-Teaming Resistance Leaderboard to quantify how well frontier LLMs withstand adversarial, human-crafted jailbreak prompts. The benchmark aggregates datasets such as AdvBench, AART, Beavertails, Do Not Answer (DNA), RedEval-HarmfulQA, and RedEval-DangerousQA, and uses LlamaGuard and GPT-4 to classify model outputs as Safe or Unsafe, reporting results as a Safe-rate across categories such as Adult Content and Physical Harm. The emphasis on high-quality, human-authored prompts and the inclusion of major models (GPT-4, Claude-2) highlights real-world safety gaps that can persist even in enterprise API deployments, where additional safety classifiers may shield raw model responses. For technical teams, the leaderboard offers a comparative view of model robustness and safety posture across vendors and platforms, informing risk assessment and governance discussions.
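To make the metric concrete, here is a minimal sketch of how a per-category Safe-rate could be computed from classifier verdicts. This is an illustration, not Haize Labs' published code: the `(category, label)` record layout and the `safe_rate_by_category` function are hypothetical, though the "Safe"/"Unsafe" labels and category names mirror the leaderboard's description.

```python
# Sketch (assumed, not the leaderboard's actual implementation):
# compute Safe-rate per category from (category, verdict) pairs,
# where verdicts come from a safety classifier such as LlamaGuard or GPT-4.
from collections import defaultdict

def safe_rate_by_category(verdicts):
    """verdicts: iterable of (category, label) pairs, label in {"Safe", "Unsafe"}."""
    totals = defaultdict(int)
    safe = defaultdict(int)
    for category, label in verdicts:
        totals[category] += 1
        if label == "Safe":
            safe[category] += 1
    # Safe-rate = fraction of responses judged Safe within each category.
    return {cat: safe[cat] / totals[cat] for cat in totals}

# Example with hypothetical classifier outputs:
verdicts = [
    ("Adult Content", "Safe"),
    ("Adult Content", "Unsafe"),
    ("Physical Harm", "Safe"),
    ("Physical Harm", "Safe"),
]
print(safe_rate_by_category(verdicts))
# {'Adult Content': 0.5, 'Physical Harm': 1.0}
```

A higher Safe-rate in a category indicates the model more consistently refused or deflected the adversarial prompts in that category.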
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info