Haize Labs Red-Teaming Resistance Leaderboard — LLM robustness benchmark
AI Impact Summary
Haize Labs has launched the Red-Teaming Resistance Leaderboard to quantify how well frontier LLMs withstand adversarial, human-crafted jailbreak prompts. The benchmark aggregates datasets such as AdvBench, AART, Beavertails, Do Not Answer (DNA), RedEval-HarmfulQA, and RedEval-DangerousQA, and uses LlamaGuard and GPT-4 to classify model outputs as Safe or Unsafe, reporting results as a Safe-rate across categories such as Adult Content and Physical Harm. The emphasis on high-quality, human-authored prompts and the inclusion of major models (GPT-4, Claude-2) highlights real-world safety gaps that can persist even in enterprise API deployments, where additional safety classifiers may shield raw model responses. For technical teams, the leaderboard offers a comparative view of model robustness and safety posture across vendors and platforms, informing risk assessment and governance discussions.
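To make the metric concrete, here is a minimal sketch of how a per-category Safe-rate could be computed from classifier verdicts. This is an illustration, not Haize Labs' published code: the `(category, label)` record layout and the `safe_rate_by_category` function are hypothetical, though the "Safe"/"Unsafe" labels and category names mirror the leaderboard's description.

```python
# Sketch (assumed, not the leaderboard's actual implementation):
# compute Safe-rate per category from (category, verdict) pairs,
# where verdicts come from a safety classifier such as LlamaGuard or GPT-4.
from collections import defaultdict

def safe_rate_by_category(verdicts):
    """verdicts: iterable of (category, label) pairs, label in {"Safe", "Unsafe"}."""
    totals = defaultdict(int)
    safe = defaultdict(int)
    for category, label in verdicts:
        totals[category] += 1
        if label == "Safe":
            safe[category] += 1
    # Safe-rate = fraction of responses judged Safe within each category.
    return {cat: safe[cat] / totals[cat] for cat in totals}

# Example with hypothetical classifier outputs:
verdicts = [
    ("Adult Content", "Safe"),
    ("Adult Content", "Unsafe"),
    ("Physical Harm", "Safe"),
    ("Physical Harm", "Safe"),
]
print(safe_rate_by_category(verdicts))
# {'Adult Content': 0.5, 'Physical Harm': 1.0}
```

A higher Safe-rate in a category indicates the model more consistently refused or deflected the adversarial prompts in that category.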
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info