
Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


A Case Study of Web App Coding with OpenAI Reasoning Models

Yi Cui

2024-09-19 | Code Generation

Abstract

This paper presents a case study of coding tasks performed by OpenAI's latest reasoning models, o1-preview and o1-mini, in comparison with other frontier models. The o1 models deliver SOTA results on WebApp1K, a single-task benchmark. To this end, we introduce WebApp1K-Duo, a harder benchmark that doubles the number of tasks and test cases. On the new benchmark, the o1 models' performance declines significantly, falling behind Claude 3.5. Moreover, they consistently fail when confronted with atypical yet correct test cases, a trap that non-reasoning models occasionally avoid. We hypothesize that this performance variability is due to instruction comprehension. Specifically, the reasoning mechanism boosts performance when all expectations are captured but exacerbates errors when key expectations are missed, an effect potentially influenced by input length. As such, we argue that the coding success of reasoning models hinges on a top-notch base model and SFT to ensure meticulous adherence to instructions.

Results

Task             Dataset             Metric  Value  Model
Code Generation  WebApp1k-Duo-React  pass@1  0.679  claude-3-5-sonnet
Code Generation  WebApp1k-Duo-React  pass@1  0.667  o1-mini
Code Generation  WebApp1k-Duo-React  pass@1  0.652  o1-preview
Code Generation  WebApp1k-Duo-React  pass@1  0.531  gpt-4o-2024-08-06
Code Generation  WebApp1k-Duo-React  pass@1  0.49   deepseek-v2.5
Code Generation  WebApp1k-Duo-React  pass@1  0.449  mistral-large-2
Code Generation  WebApp1K-React      pass@1  0.952  o1-preview
Code Generation  WebApp1K-React      pass@1  0.939  o1-mini
Code Generation  WebApp1K-React      pass@1  0.834  deepseek-v2.5
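
The results are reported as pass@1. This page does not say how the scores were computed, so the following is only a minimal sketch of the commonly used unbiased pass@k estimator, which for k = 1 reduces to the fraction of generated samples that pass all tests. The function names and the (n, c) input format are hypothetical and are not taken from the paper or its code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int = 1) -> float:
    """Unbiased pass@k estimate for a single task.

    n: number of samples generated for the task
    c: number of those samples that pass every test case
    k: number of samples the metric is allowed to draw
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples fail, any draw of k contains a passing one.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average the per-task estimate over (n, c) pairs, one pair per task."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For example, `benchmark_pass_at_k([(10, 3), (10, 5)])` averages the per-task rates 3/10 and 5/10. Under this assumed definition, a benchmark score such as 0.679 would be the mean per-task pass rate across all tasks.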

Related Papers

CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning (2025-07-18)
Towards Formal Verification of LLM-Generated Code from Natural Language Prompts (2025-07-17)
MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks (2025-07-16)
Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training (2025-07-16)
The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs (2025-07-15)
Kodezi Chronos: A Debugging-First Language Model for Repository-Scale, Memory-Driven Code Understanding (2025-07-14)
CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks (2025-07-14)
CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance (2025-07-14)