diaforge-utc-r-0725
DiaFORGE UTC: Unified Tool-Calling Conversations Dataset
ActionsDialogTextsCC BY-NC-SAIntroduced 2025-07-04
Dataset for our paper Disambiguation-Centric Finetuning Makes Enterprise Tool-Calling LLMs More Realistic and Less Risky which includes 5000 enterprise tools and the corresponding dialogues generated using DiaFORGE UTC data engine.
The dataset is generated with the data generation engine. The engine simulates a user agent and an assistant agent in a dialogue, where the user agent has a persona and the assistant agent has access to a set of tools. Detailed information about the data generation process can be found in the paper.
Each entry in the dataset contains:
- seed: Gold tool for the generated dialogue.
- user_persona: Persona of the simulated user agent.
- messages: List of messages in the dialogue, where each message is a dictionary containing:
- distractor_tools: List of distractor tools that are not the gold tool but are relevant to the dialogue. These tools are used by user agent to generate utterances that are hard to disambiguate from the gold tool.
- retrieved_tools: List of tools retrieved by the assistant agent and used by the assistant agent in order to ask clarifying questions to the user agent.