Dataset Generation

Base Model: h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2
Seed Instructions: Selected from the databricks/databricks-dolly-15k dataset
Generation Approach: Instructions are evolved iteratively, both in depth and in breadth, using prompts written in a conversational syntax
Total Instructions: 2,304 instruction-tuning samples

Dataset Sources

Repository: Bitbucket Project
Paper: Pre-Print

Structure

The dataset entries consist of:

Instruction
Response
Evolution Strategy (in-depth or in-breadth)
Category (of the original instruction)
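As a rough illustration of the entry layout listed above, one record could be modeled as follows. This is a minimal sketch: the Python field names and the sample values are assumptions made for illustration, not the dataset's actual column names or contents.

```python
from dataclasses import dataclass

# Strategy labels taken from the field list above.
ALLOWED_STRATEGIES = ("in-depth", "in-breadth")


@dataclass(frozen=True)
class EvolInstructEntry:
    """One dataset entry. Field names are hypothetical, chosen for illustration."""

    instruction: str          # the evolved instruction
    response: str             # the response paired with the instruction
    evolution_strategy: str   # "in-depth" or "in-breadth"
    category: str             # category of the original seed instruction

    def __post_init__(self) -> None:
        # Reject any strategy label other than the two named in the card.
        if self.evolution_strategy not in ALLOWED_STRATEGIES:
            raise ValueError(
                f"unknown evolution strategy: {self.evolution_strategy!r}"
            )


# Made-up sample values, not taken from the dataset.
example = EvolInstructEntry(
    instruction="Explain why the sky appears blue, including the role of wavelength.",
    response="Shorter wavelengths of sunlight are scattered more strongly by air molecules.",
    evolution_strategy="in-depth",
    category="open_qa",
)
```

Validating the strategy at construction time keeps mislabeled rows from slipping into later filtering or analysis steps.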

Usage

The Evol-Instruct Dataset was produced by automatically evolving seed instructions to increase their complexity and diversity. It is intended for instruction tuning of language models across a wide range of tasks.
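A minimal sketch of how entries might be prepared for instruction tuning is shown here. The dictionary keys, the sample records, and the prompt template are all assumptions made for illustration; the card does not specify a template or column names.

```python
from collections import Counter

# Hypothetical in-memory records; keys and values are illustrative only.
records = [
    {
        "instruction": "List three uses of a paperclip and rank them by practicality.",
        "response": "1. Holding papers. 2. Resetting devices. 3. Makeshift hook.",
        "evolution_strategy": "in-depth",
        "category": "brainstorming",
    },
    {
        "instruction": "Suggest three unconventional uses of a rubber band.",
        "response": "Jar opener grip, cable tie, bookmark.",
        "evolution_strategy": "in-breadth",
        "category": "brainstorming",
    },
]


def to_training_text(record: dict) -> str:
    """Join instruction and response using an assumed prompt template."""
    return (
        f"### Instruction:\n{record['instruction']}\n\n"
        f"### Response:\n{record['response']}"
    )


def count_by_strategy(rows: list) -> Counter:
    """Count how many entries each evolution strategy produced."""
    return Counter(row["evolution_strategy"] for row in rows)


training_texts = [to_training_text(r) for r in records]
strategy_counts = count_by_strategy(records)
```

Counting entries per strategy is a quick way to check the balance between in-depth and in-breadth samples before training.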

Citation

If you find our work useful, please cite our paper as follows:

@misc{surge2024openbezoar,
      title={OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data}, 
      author={Chandeepa Dissanayake and Lahiru Lowe and Sachith Gunasekara and Yasiru Ratnayake},
      year={2024},
      eprint={2404.12195},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Dataset Authors

Chandeepa Dissanayake, Lahiru Lowe, Sachith Gunasekara, and Yasiru Ratnayake