Amazon has introduced SWE-PolyBench, the first industry benchmark to evaluate AI coding agents' ability to navigate and understand complex codebases. The benchmark measures system performance on GitHub issues, a task format that has spurred the development of capable coding agents and become the de facto standard for coding-agent benchmarking. SWE-PolyBench contains over 2,000 curated issues across four languages, plus a stratified subset of 500 issues for rapid experimentation. The benchmark aims to advance AI performance in real-world scenarios.

Key features of SWE-PolyBench at a glance:

- Multi-Language Support: Java (165 tasks), JavaScript (1,017 tasks), TypeScript (729 tasks), and Python (199 tasks).
- Extensive Dataset: 2,110 instances drawn from 21 repositories ranging from web frameworks to code editors and ML tools, on the same scale as the full SWE-Bench but covering more repositories.
- Task Variety: Includes bug fixes, feature requests, and code refactoring.
- Faster Experimentation: SWE-PolyBench500 is a stratified subset for efficient experimentation (see the loading sketch after this list).
- Leaderboard: A public leaderboard with a rich set of metrics for transparent benchmarking.
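As a minimal sketch of how one might pull the benchmark for local experimentation, the snippet below loads the full set and the 500-instance subset with the Hugging Face `datasets` library. The dataset IDs, split name, and the `language` column are assumptions for illustration and are not confirmed by this article.

```python
# Minimal sketch: loading SWE-PolyBench for local experimentation.
# Assumption: the full benchmark and the stratified 500-instance subset are
# published on the Hugging Face Hub under the IDs below; the exact IDs,
# split name, and column names may differ.
from collections import Counter

from datasets import load_dataset

full = load_dataset("AmazonScience/SWE-PolyBench", split="test")        # ~2,110 instances
subset = load_dataset("AmazonScience/SWE-PolyBench_500", split="test")  # 500-instance subset

# Rough check of the per-language breakdown described above
# (assumes each instance carries a "language" field).
print(Counter(row["language"] for row in full))
```

Starting from the stratified subset keeps evaluation runs short while preserving the language and task-type mix of the full benchmark, which is the stated purpose of SWE-PolyBench500.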