StableToolBench
StableToolBench is a benchmark for tool learning that aims to combine stability with realism, building upon its predecessor, ToolBench. It was developed to address the instability of previous tool learning benchmarks, which relied either on small-scale hand-crafted online tools or on large-scale real online APIs whose responses fluctuated as API status changed¹².
Here are some key features of StableToolBench:
- Virtual API System: A caching system ensures consistent responses to repeated API calls, and API simulators powered by Large Language Models (LLMs) stand in for unavailable APIs, preserving the diverse API environment of ToolBench¹.
- New Set of Solvable Queries: State-of-the-art LLMs judge task solvability in advance, filtering the query set down to solvable queries and reducing randomness in evaluation¹.
- Stable Evaluation System: A two-phase evaluation process uses GPT-4 as an automatic evaluator, reporting metrics such as Solvable Pass Rate (SoPR) and Solvable Win Rate (SoWR) to assess how well LLMs use tools².
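The virtual API system's core idea, serve cached responses when available and fall back to an LLM simulator when the real API fails, can be sketched as follows. This is a minimal illustration, not the benchmark's actual implementation; `call_real_api` and `simulate_with_llm` are hypothetical placeholders for the real backend and the LLM-based simulator.

```python
import hashlib
import json

# In-memory cache; StableToolBench persists cached responses so that
# repeated evaluations see identical API behavior.
CACHE = {}

def cache_key(api_name, params):
    # Deterministic key over the API name and its arguments.
    payload = json.dumps({"api": api_name, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def call_virtual_api(api_name, params, call_real_api, simulate_with_llm):
    key = cache_key(api_name, params)
    if key in CACHE:                                     # 1. serve from cache
        return CACHE[key]
    try:
        response = call_real_api(api_name, params)       # 2. try the real API
    except Exception:
        response = simulate_with_llm(api_name, params)   # 3. simulate if unavailable
    CACHE[key] = response                                # remember for stability
    return response
```

Once a query's responses are cached, later runs of the benchmark replay them verbatim, which is what removes the dependence on live API status.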
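The two evaluation metrics can be sketched as simple aggregations over the GPT-4 evaluator's judgements. The scoring below (Solved = 1.0, Unsure = 0.5, Unsolved = 0.0 for SoPR, and a preferred-vs-reference fraction for SoWR) is an assumption for illustration; consult the paper for the exact protocol.

```python
def solvable_pass_rate(judgements):
    # SoPR: average pass score over the solvable queries.
    # Scoring weights here are assumed, not taken from the paper verbatim.
    scores = {"Solved": 1.0, "Unsure": 0.5, "Unsolved": 0.0}
    return sum(scores[j] for j in judgements) / len(judgements)

def solvable_win_rate(preferences):
    # SoWR: fraction of solvable queries where the evaluator preferred
    # the candidate model's answer over a reference answer.
    return sum(preferences) / len(preferences)
```

For example, judgements of `["Solved", "Unsure", "Unsolved", "Solved"]` would give a SoPR of 0.625 under this scoring.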
(1) THUNLP-MT/StableToolBench - GitHub. https://github.com/THUNLP-MT/StableToolBench
(2) StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models. https://arxiv.org/abs/2403.07714
(3) StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models. https://paperreading.club/page?id=214642
(4) Papers with Code - StableToolBench. https://paperswithcode.com/paper/stabletoolbench-towards-stable-large-scale