LIMA

Introduced 2023-05-18

The LIMA dataset is a valuable resource used in natural language processing (NLP) research. Let me provide you with some details:

  1. Origin and Purpose:

    • The LIMA dataset is derived from the LLaMa language model, which has an impressive 65 billion parameters.
    • It serves as a fine-tuned version of the LLaMa model, specifically adjusted using approximately 1,000 prompts and responses.
  2. Performance and Applications:

    • LIMA demonstrates remarkable performance by learning to follow specific response formats from just a handful of examples in the training data.
    • The dataset covers a wide range of tasks, including complex queries such as planning trip itineraries and speculating about alternate history.
    • Interestingly, the model tends to generalize well to unseen tasks that were not part of the training data.
  3. License:

    • The licensing of the LIMA dataset depends on the source data it was derived from:
      • If the source data has a stricter license than CC BY-NC-SA, the LIMA dataset follows the same restrictions.
      • Otherwise, it adheres to the CC BY-NC-SA license.

(1) GAIR/lima · Datasets at Hugging Face. https://huggingface.co/datasets/GAIR/lima. (2) GAIR/lima at main - Hugging Face. https://huggingface.co/datasets/GAIR/lima/tree/main. (3) 日本語LIMAデータセットlima-jaを作成したので公開します. https://zanote.net/ai/lima-ja/. (4) Paper page - LIMA: Less Is More for Alignment - Hugging Face. https://huggingface.co/papers/2305.11206. (5) undefined. https://huggingface.co/datasets/GAIR/lima/.