BigCodeBench: New code-generation benchmark with 1,140 tasks across 139 libraries
AI Impact Summary
BigCodeBench introduces a broad, contamination-resistant benchmark for code-generation LLMs, emphasizing practical, function-level tasks drawn from 139 libraries. It uses 1,140 tasks with multiple test cases, calibrated Pass@1, and an Elo-based ranking to better reflect real-world performance and tool-use capabilities. For engineering teams, integrating the PyPI-distributed evaluation framework (EvalPlus-based prototype) will tighten benchmarking signals and steer model development toward robust multi-library code composition and instruction-following under realistic constraints.
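The calibrated Pass@1 score mentioned above builds on the standard unbiased pass@k estimator used across code-generation benchmarks. A minimal sketch of that estimator follows (the calibration step itself is specific to BigCodeBench and not reproduced here; function and variable names are illustrative):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples, drawn without replacement from n generations of which c pass
    all test cases, is correct."""
    if n - c < k:
        return 1.0  # too few failures to fill k samples: success guaranteed
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Pass@1 reduces to the fraction of passing samples, c / n:
print(pass_at_k(10, 3, 1))  # 0.3 (3 of 10 generations pass)
```

With multiple test cases per task, a generation counts toward `c` only if it passes every test, which is what makes the benchmark's per-task test suites tighten the signal.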
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info