Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Making Large Language Models Better Data Creators

Dong-Ho Lee, Jay Pujara, Mohit Sewak, Ryen W. White, Sujay Kumar Jauhar

2023-10-31 · Instruction Following · Prompt Engineering · Visual Question Answering

Paper · PDF · Code (official)

Abstract

Although large language models (LLMs) have advanced the state-of-the-art in NLP significantly, deploying them for downstream applications is still challenging due to cost, responsiveness, control, or concerns around privacy and security. As such, trainable models are still the preferred option in some cases. However, these models still require human-labeled data for optimal performance, which is expensive and time-consuming to obtain. To address this issue, several techniques have been proposed that reduce human effort by labeling or generating data with LLMs. Although these methods are effective for certain applications, in practice they encounter difficulties in real-world scenarios: labeling data requires careful data selection, while generating data necessitates task-specific prompt engineering. In this paper, we propose a unified data creation pipeline that requires only a single formatting example, and which is applicable to a broad range of tasks, including traditionally problematic ones with semantically devoid label spaces. In our experiments we demonstrate that instruction-following LLMs are highly cost-effective data creators, and that models trained with these data exhibit performance better than those trained with human-labeled data (by up to 17.5%) on out-of-distribution evaluation, while maintaining comparable performance on in-distribution tasks. These results have important implications for the robustness of NLP systems deployed in the real world.
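The "single formatting example" idea from the abstract can be illustrated with a minimal sketch. This is an assumption-laden reconstruction, not the authors' implementation: the prompt wording, function names, and JSON-lines output format are all hypothetical stand-ins, and the actual LLM call is omitted.

```python
import json

def build_creation_prompt(format_example: dict, n_items: int = 5) -> str:
    """Build a task-agnostic data-creation prompt from one formatting example.

    The instruction text below is an illustrative guess at what such a
    template could look like; the paper's actual prompt may differ.
    """
    return (
        "You are a data creator. Produce new training examples in exactly "
        "the same JSON format as the example below.\n"
        f"Example:\n{json.dumps(format_example)}\n"
        f"Now create {n_items} new, diverse examples, one JSON object per line."
    )

def parse_created_data(llm_output: str) -> list[dict]:
    """Parse one JSON object per line, skipping malformed lines."""
    items = []
    for line in llm_output.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            items.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # tolerate occasional formatting slips from the LLM
    return items
```

For example, feeding the prompt built from `{"text": "great movie", "label": "positive"}` to an instruction-following LLM and passing its raw output through `parse_created_data` would yield a list of dicts ready to train a small task-specific model, which is the cost/privacy trade-off the abstract motivates.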

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Visual Question Answering (VQA) | ViP-Bench | GPT-4 score (bbox) | 48.3 | ViP-LLaVA-13B (Visual Prompt) |
| Visual Question Answering (VQA) | ViP-Bench | GPT-4 score (human) | 48.2 | ViP-LLaVA-13B (Visual Prompt) |
| Visual Question Answering | ViP-Bench | GPT-4 score (bbox) | 48.3 | ViP-LLaVA-13B (Visual Prompt) |
| Visual Question Answering | ViP-Bench | GPT-4 score (human) | 48.2 | ViP-LLaVA-13B (Visual Prompt) |

Related Papers

- AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning (2025-07-17)
- Leveraging Language Prior for Infrared Small Target Detection (2025-07-17)
- Emotional Support with LLM-based Empathetic Dialogue Generation (2025-07-17)
- Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
- How Many Instructions Can LLMs Follow at Once? (2025-07-15)
- DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering (2025-07-15)
- Prompt Engineering in Segment Anything Model: Methodologies, Applications, and Emerging Challenges (2025-07-13)
- Multilingual Multimodal Software Developer for Code Generation (2025-07-11)