BigCodeBench: New code-generation benchmark with 1,140 tasks across 139 libraries
AI Impact Summary
BigCodeBench introduces a broad, contamination-resistant benchmark for code-generation LLMs, emphasizing practical, function-level tasks drawn from 139 libraries. It uses 1,140 tasks with multiple test cases, calibrated Pass@1, and an Elo-based ranking to better reflect real-world performance and tool-use capabilities. For engineering teams, integrating the PyPI-distributed evaluation framework (EvalPlus-based prototype) will tighten benchmarking signals and steer model development toward robust multi-library code composition and instruction-following under realistic constraints.
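The calibrated Pass@1 score mentioned above builds on the standard unbiased pass@k estimator used across code-generation benchmarks. A minimal sketch of that estimator follows (the calibration step itself is specific to BigCodeBench and not reproduced here; function and variable names are illustrative):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples, drawn without replacement from n generations of which c pass
    all test cases, is correct."""
    if n - c < k:
        return 1.0  # too few failures to fill k samples: success guaranteed
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Pass@1 reduces to the fraction of passing samples, c / n:
print(pass_at_k(10, 3, 1))  # 0.3 (3 of 10 generations pass)
```

With multiple test cases per task, a generation counts toward `c` only if it passes every test, which is what makes the benchmark's per-task test suites tighten the signal.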
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info