Web-Bench

CC BY · Introduced 2025-05-12

We developed Web-Bench as a benchmark for evaluating the performance of LLMs on real-world web projects.

  1. Web-Bench contains 50 projects, each consisting of 20 tasks with sequential dependencies. The tasks implement project features in order, simulating how engineers build a real-world project step by step.

  2. Web-Bench is designed to cover the foundational elements of Web development: Web Standards and Web Frameworks. The projects were designed by engineers with 5–10 years of experience, and their scale and complexity make each one a significant challenge. On average, a single project takes a senior engineer 4–8 hours to complete.

  3. With the benchmark agent we provide (Web-Agent), the state-of-the-art model (Claude 3.7 Sonnet) achieves only 25.1% Pass@1, significantly lower than the scores on SWE-Bench Verified (65.4%) and SWE-Bench Full (33.8%) as of April 2025.
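
To make the project structure and the Pass@1 metric concrete, here is a minimal sketch of how a score could be computed over projects whose tasks depend on each other. It is an illustration under stated assumptions, not the Web-Bench evaluator: the names `Project`, `run_project`, `pass_at_1`, and `toy_attempt` are hypothetical, and the rule that a project stops at its first failed task is assumed from the sequential dependencies rather than stated above.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Project:
    """A hypothetical project: an ordered list of task names."""
    name: str
    tasks: List[str]


def run_project(project: Project, attempt: Callable[[str, str], bool]) -> int:
    """Return how many tasks pass before the first failure.

    Assumption: each task builds on the previous one, so evaluation of a
    project stops at the first task that fails on its single attempt.
    """
    passed = 0
    for task in project.tasks:
        if not attempt(project.name, task):
            break
        passed += 1
    return passed


def pass_at_1(projects: List[Project], attempt: Callable[[str, str], bool]) -> float:
    """Percentage of all tasks passed on the first attempt, across projects."""
    total = sum(len(p.tasks) for p in projects)
    passed = sum(run_project(p, attempt) for p in projects)
    return 100.0 * passed / total if total else 0.0


# Toy data: 2 projects of 20 tasks each (the benchmark itself has 50 projects).
projects = [
    Project(f"project-{i}", [f"task-{j}" for j in range(1, 21)])
    for i in range(2)
]


def toy_attempt(project_name: str, task: str) -> bool:
    """Stand-in for an agent: succeeds only on the first 5 tasks of a project."""
    return int(task.split("-")[1]) <= 5


score = pass_at_1(projects, toy_attempt)
print(f"Pass@1: {score:.1f}%")  # 10 of 40 tasks pass -> Pass@1: 25.0%
```

Under this assumed stopping rule, an early failure costs every later task in the same project, which is one way sequential dependencies can keep scores low even when an agent handles individual features well.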