LiveCodeBench Leaderboard — Holistic, contamination-free evaluation of code LLMs
AI Impact Summary
LiveCodeBench introduces a time-aware, contamination-resistant benchmark for code LLMs, evaluating four scenarios: Code Generation, Self Repair, Code Execution, and Test Output Prediction. Problems are sourced from LeetCode, AtCoder, and CodeForces and annotated with release dates, so performance can be tracked over time to detect training-data leakage. Observed leaders (GPT-4-Turbo across most scenarios, Claude-3-Opus on test-output prediction, Mistral-Large on natural-language reasoning tasks) give teams a concrete basis for comparing models and prioritizing improvements for coding workloads.
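The release-date annotations make the contamination check mechanical: bucket problems by release month and compare a model's pass rate before and after its training cutoff. Below is a minimal sketch of that windowed scoring, assuming a hypothetical Result record rather than LiveCodeBench's actual data schema; a sharp drop in pass rate on post-cutoff months is the leakage signal.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date

@dataclass
class Result:
    """One (problem, model) evaluation outcome. Hypothetical record shape."""
    release_date: date  # when the problem was published on LeetCode/AtCoder/CodeForces
    passed: bool        # whether the model's solution passed the problem's tests

def pass_rate_by_month(results: list[Result]) -> dict[str, float]:
    """Bucket problems by release month and compute the pass rate per bucket.

    Comparing buckets before and after a model's training cutoff is the
    time-windowed evaluation LiveCodeBench's date annotations enable.
    """
    buckets: dict[str, list[bool]] = defaultdict(list)
    for r in results:
        buckets[r.release_date.strftime("%Y-%m")].append(r.passed)
    return {month: sum(vals) / len(vals) for month, vals in sorted(buckets.items())}

if __name__ == "__main__":
    # Toy data for illustration only; not actual benchmark results.
    demo = [
        Result(date(2023, 5, 10), True),
        Result(date(2023, 5, 21), True),
        Result(date(2023, 9, 3), False),  # problem released after a presumed cutoff
        Result(date(2023, 9, 18), True),
    ]
    for month, rate in pass_rate_by_month(demo).items():
        print(f"{month}: pass rate = {rate:.2f}")
```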
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info