LCStep

Texts | MIT | Introduced 2024-09-02

For our experiments, we collected a dataset of procedural knowledge about the LangChain Python library, which many existing LLMs did not see during training. We chose LangChain as the domain because it was released in 2022, after the knowledge cutoff of many web-scale LLMs, including GPT-3.5, and because its popularity means it is extensively documented.

The LCStep dataset was collected from 180 tutorial pages in the Python section of the LangChain website. We used an LLM-enabled pipeline with human oversight and quality review to extract 276 procedures from these tutorials, representing each procedure in a structured format.
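As a rough illustration of what a structured procedure record might look like, the sketch below models one procedure as a goal plus an ordered list of steps. The class name `Procedure`, its fields (`goal`, `steps`, `source_url`), and the placeholder values are assumptions made for this example only; they are not the actual LCStep schema, which is not specified here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Procedure:
    """Hypothetical record for one extracted procedure (not the real LCStep schema)."""

    goal: str                                        # what the procedure accomplishes
    steps: List[str] = field(default_factory=list)   # ordered step descriptions
    source_url: str = ""                             # tutorial page the procedure came from

    def to_text(self) -> str:
        """Render the procedure as a goal line followed by numbered steps."""
        lines = [f"Goal: {self.goal}"]
        lines += [f"{i}. {step}" for i, step in enumerate(self.steps, start=1)]
        return "\n".join(lines)


# Placeholder content only; no real dataset entry is reproduced here.
example = Procedure(
    goal="Placeholder goal taken from a tutorial page",
    steps=["Placeholder first step", "Placeholder second step"],
)
rendered = example.to_text()
print(rendered)
```

Keeping the steps as an ordered list rather than free text makes it straightforward to count, compare, or re-serialize procedures, which is one plausible reason to prefer a structured representation.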