We developed Web-Bench as a benchmark for evaluating the performance of LLMs on real-world web projects.

Web-Bench contains 50 projects, each consisting of 20 tasks with sequential dependencies. The tasks implement project features in order, simulating real-world human development workflows.
In designing Web-Bench, we aimed to cover the foundational elements of Web development: Web Standards and Web Frameworks. The projects were designed by engineers with 5–10 years of experience, and their scale and complexity make each one a significant challenge. On average, a single project takes a senior engineer 4–8 hours to complete.
With our reference benchmark agent (Web-Agent), the state-of-the-art model (Claude 3.7 Sonnet) achieves only 25.1% Pass@1, significantly lower than the SWE-Bench Verified (65.4%) and Full (33.8%) scores (as of April 2025).
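
To make the effect of sequential dependencies on a Pass@1-style score concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the benchmark's actual scoring code: it assumes a failed task blocks every later task in the same project, and that the score is the fraction of all tasks passed on the first attempt. The function names and the toy data are hypothetical.

```python
from typing import Sequence


def tasks_passed_in_order(results: Sequence[bool]) -> int:
    """Count tasks passed before the first failure.

    Assumption: each task builds on the previous one, so a failed
    task blocks every later task in the same project.
    """
    passed = 0
    for ok in results:
        if not ok:
            break
        passed += 1
    return passed


def pass_rate(projects: Sequence[Sequence[bool]]) -> float:
    """Fraction of all tasks passed on the first attempt, across projects."""
    total = sum(len(p) for p in projects)
    if total == 0:
        raise ValueError("no tasks to score")
    passed = sum(tasks_passed_in_order(p) for p in projects)
    return passed / total


# Hypothetical toy data: 2 projects with 4 tasks each
# (the benchmark itself has 50 projects with 20 tasks each).
toy_projects = [
    [True, True, False, True],    # third task fails, so the fourth is not counted
    [True, False, False, False],  # only the first task counts
]
toy_score = pass_rate(toy_projects)
print(f"{toy_score:.1%}")  # prints "37.5%"
```

Under this reading, an early failure is costly: in the first toy project the fourth task would pass in isolation, but it contributes nothing because the third task failed before it.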