SurgeGlobal/LaMini

Modality: Texts | License: Apache 2.0 | Introduced: 2024-04-18

Overview

The LaMini Dataset is a synthetic instruction dataset generated with the h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2 model. It is designed for instruction-tuning pre-trained language models to specialize them for a variety of downstream tasks.

Dataset Generation

  • Base Model: h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2.
  • Seed Instructions: Sourced from databricks/databricks-dolly-15k dataset.
  • Generation Approach: Example-guided and topic-guided strategies.
  • Total Instructions: 1,504 unique instruction examples.

Structure

Each entry in the dataset contains:

  • Instruction
  • Response
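For illustration, a record has the following shape. The field contents below are hypothetical examples, not actual entries from the dataset:

```python
# A hypothetical record showing the two fields each entry contains.
# The text values are illustrative, not taken from LaMini itself.
sample = {
    "instruction": "Explain what instruction tuning is in one sentence.",
    "response": (
        "Instruction tuning fine-tunes a pre-trained model on "
        "instruction-response pairs so it learns to follow natural-language commands."
    ),
}

print(sorted(sample.keys()))
```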

Usage

The LaMini Dataset can be used to fine-tune language models, improving their ability to follow instructions and generate relevant responses.
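Before fine-tuning, each instruction/response pair is typically rendered into a single training string. The template below is a common Alpaca-style convention used here only as an assumption; adapt it to whatever format your model expects:

```python
def format_prompt(record):
    """Render an instruction/response pair as one training string.

    The "### Instruction / ### Response" template is an illustrative
    convention, not a format mandated by the LaMini dataset.
    """
    return (
        "### Instruction:\n"
        f"{record['instruction']}\n\n"
        "### Response:\n"
        f"{record['response']}"
    )

# Hypothetical record for demonstration.
record = {"instruction": "Name a primary color.", "response": "Red."}
print(format_prompt(record))
```

Applied over the whole dataset, this yields the plain-text corpus on which a causal language model can be fine-tuned.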

Access

The dataset is hosted on the Hugging Face Hub and can be loaded directly with the `datasets` library: https://huggingface.co/datasets/SurgeGlobal/LaMini

Citation

If you find our work useful, please cite our paper as follows:

@misc{surge2024openbezoar,
      title={OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data}, 
      author={Chandeepa Dissanayake and Lahiru Lowe and Sachith Gunasekara and Yasiru Ratnayake},
      year={2024},
      eprint={2404.12195},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Dataset Authors

Chandeepa Dissanayake, Lahiru Lowe, Sachith Gunasekara, and Yasiru Ratnayake