SurgeGlobal/Orca
Texts · Apache 2.0 · Introduced 2024-04-18
Dataset Generation
- Base Model: h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2
- Seed Instructions: Derived from the FLAN-v2 Collection.
- Generation Approach: Explanation tuning with detailed responses generated from h2ogpt-gm-oasst1-en-2048-falcon-40b-v2.
- Total Instructions: 5,507 explanation tuning data samples.
Dataset Sources
- Repository: Bitbucket Project
- Paper: Pre-Print
Structure
Each dataset entry consists of:
- Query
- Response
- System Message (when applicable)
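A minimal sketch of this structure, assuming field names (`system`, `query`, `response`) that mirror the list above; the sample content is hypothetical:

```python
# Hypothetical sample entry; field names are assumptions based on the
# structure listed above (Query, Response, optional System Message).
sample = {
    "system": "You are a helpful assistant that explains its reasoning step by step.",
    "query": "Why does ice float on water?",
    "response": "Ice floats because it is less dense than liquid water ...",
}

def build_prompt(entry: dict) -> str:
    """Assemble a single prompt string; the system message is optional."""
    parts = []
    if entry.get("system"):
        parts.append(entry["system"])
    parts.append(entry["query"])
    return "\n\n".join(parts)

print(build_prompt(sample))
```

Entries without a system message simply yield the query on its own.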
Usage
The Orca Dataset is intended for fine-tuning language models to imitate not only the style but also the reasoning process of large foundation models (LFMs), thereby improving the safety and quality of model responses.
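For supervised fine-tuning, each record can be mapped to chat-style messages. The sketch below assumes the field names `system`, `query`, and `response`; adapt it to the actual column names in the released dataset:

```python
def to_chat_messages(entry: dict) -> list[dict]:
    """Convert one Orca-style record into chat messages for fine-tuning.

    Field names ('system', 'query', 'response') are assumptions; the
    system message is included only when present.
    """
    messages = []
    if entry.get("system"):
        messages.append({"role": "system", "content": entry["system"]})
    messages.append({"role": "user", "content": entry["query"]})
    messages.append({"role": "assistant", "content": entry["response"]})
    return messages

# Hypothetical example record:
example = {
    "system": "Explain your reasoning step by step.",
    "query": "What is 2 + 2?",
    "response": "Adding 2 and 2 gives 4.",
}
print(to_chat_messages(example))
```

The resulting message list can be fed to a chat template before tokenization.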
Citation
If you find our work useful, please cite our paper as follows:
@misc{surge2024openbezoar,
      title={OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data},
      author={Chandeepa Dissanayake and Lahiru Lowe and Sachith Gunasekara and Yasiru Ratnayake},
      year={2024},
      eprint={2404.12195},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
Dataset Authors
Chandeepa Dissanayake, Lahiru Lowe, Sachith Gunasekara, and Yasiru Ratnayake