SurgeGlobal/LaMini

Modality: Texts | License: Apache 2.0 | Introduced: 2024-04-18

Overview

The LaMini Dataset is a synthetic instruction dataset generated with the h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2 model. It is designed for instruction-tuning pre-trained language models to specialize them for a variety of downstream tasks.

Dataset Generation

  • Base Model: h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2.
  • Seed Instructions: Sourced from databricks/databricks-dolly-15k dataset.
  • Generation Approach: Example-guided and topic-guided strategies.
  • Total Instructions: 1,504 unique instruction examples.

Structure

Each entry in the dataset contains:

  • Instruction
  • Response
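For illustration, a record has the following shape. The field contents below are hypothetical examples, not actual entries from the dataset:

```python
# A hypothetical record showing the two fields each entry contains.
# The text values are illustrative, not taken from LaMini itself.
sample = {
    "instruction": "Explain what instruction tuning is in one sentence.",
    "response": (
        "Instruction tuning fine-tunes a pre-trained model on "
        "instruction-response pairs so it learns to follow natural-language commands."
    ),
}

print(sorted(sample.keys()))
```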

Usage

The LaMini Dataset can be used to fine-tune language models, improving their ability to follow instructions and generate relevant responses.
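Before fine-tuning, each instruction/response pair is typically rendered into a single training string. The template below is a common Alpaca-style convention used here only as an assumption; adapt it to whatever format your model expects:

```python
def format_prompt(record):
    """Render an instruction/response pair as one training string.

    The "### Instruction / ### Response" template is an illustrative
    convention, not a format mandated by the LaMini dataset.
    """
    return (
        "### Instruction:\n"
        f"{record['instruction']}\n\n"
        "### Response:\n"
        f"{record['response']}"
    )

# Hypothetical record for demonstration.
record = {"instruction": "Name a primary color.", "response": "Red."}
print(format_prompt(record))
```

Applied over the whole dataset, this yields the plain-text corpus on which a causal language model can be fine-tuned.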

Access

The dataset is hosted on the Hugging Face Hub and can be loaded directly with the `datasets` library: https://huggingface.co/datasets/SurgeGlobal/LaMini

Citation

If you find our work useful, please cite our paper as follows:

@misc{surge2024openbezoar,
      title={OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data}, 
      author={Chandeepa Dissanayake and Lahiru Lowe and Sachith Gunasekara and Yasiru Ratnayake},
      year={2024},
      eprint={2404.12195},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Dataset Authors

Chandeepa Dissanayake, Lahiru Lowe, Sachith Gunasekara, and Yasiru Ratnayake