vpfrc_llm_vulnerability_classifier

VPFRC LLM Vulnerability Classifier Data

Texts | MIT | Introduced 2024-12-16

LLM-Based Vulnerability Classification in Police Narratives

This repository contains datasets used in our research on applying large language models (LLMs) to identify indicators of vulnerability in police incident narratives. These resources support the replication of findings in our paper: "Using Instruction-Tuned Large Language Models to Identify Indicators of Vulnerability in Police Incident Narratives."

Project Overview

Law enforcement frequently encounters vulnerable individuals, but identifying vulnerability factors in police records remains challenging. Our research explores how LLMs can assist in identifying four key vulnerability indicators in police Field Interrogation and Observation (FIO) narratives:

  • Mental health issues
  • Drug abuse
  • Alcoholism
  • Homelessness

This project advances police research methodology by:

  1. Evaluating LLM performance in vulnerability classification against human labelers
  2. Comparing different LLM architectures and prompt engineering approaches
  3. Investigating potential demographic biases through counterfactual analysis
  4. Developing a reusable framework for qualitative text analysis
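
As an illustration of the first point, agreement between a human labeler and an LLM on one binary indicator can be summarized with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. This is a minimal sketch, not the evaluation code used in the paper: the function name `cohen_kappa` and the two label lists are hypothetical, and the paper may report different metrics.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up labels for one indicator (1 = present, 0 = absent).
human = [1, 0, 1, 1, 0, 0, 1, 0]
model = [1, 0, 1, 0, 0, 0, 1, 1]

kappa = cohen_kappa(human, model)
print(f"kappa = {kappa:.2f}")
```

In this toy example the raters agree on 6 of 8 items (0.75) while chance agreement is 0.5, so kappa is 0.5. The same function could be applied once per indicator to compare human and model labels.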

Datasets

This repository includes four key datasets:

  • boston_narratives_test_classified_4000.csv: 4,000 narratives classified with our LLM pipeline, including all labels and model explanations
  • counterfactual_narratives_all_coded.csv: Systematically generated counterfactual narratives with varied demographic characteristics
  • examples_for_counterfactuals.csv: 100 base narratives used for counterfactual generation
  • labelled_fio_data_for_analysis.csv: 500 pre-processed examples with human and GPT-4o labels
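
The files are plain CSV, so they can be read with Python's standard `csv` module. The sketch below reads rows as dictionaries and computes, per group, the share of rows carrying a positive label, which is one simple way to look at counterfactual variants side by side. The column names (`narrative_id`, `variant`, `homelessness`) and the `"1"` positive value are assumptions made for illustration; check the actual header of each file before relying on them.

```python
import csv
import io
from collections import defaultdict


def read_rows(source):
    """Read CSV rows as dicts from an open text file or file-like object."""
    return list(csv.DictReader(source))


def positive_rate_by_group(rows, group_col, label_col, positive="1"):
    """Return {group: share of rows whose label equals `positive`}."""
    counts = defaultdict(lambda: [0, 0])  # group -> [positives, total]
    for row in rows:
        tally = counts[row[group_col]]
        tally[0] += int(row[label_col] == positive)
        tally[1] += 1
    return {group: pos / total for group, (pos, total) in counts.items()}


# Tiny in-memory stand-in for a real file; all column names are hypothetical.
sample = io.StringIO(
    "narrative_id,variant,homelessness\n"
    "1,A,1\n"
    "1,B,0\n"
    "2,A,1\n"
    "2,B,1\n"
)

rows = read_rows(sample)
rates = positive_rate_by_group(rows, "variant", "homelessness")
print(rates)
```

To use a real file, open it with `open(path, newline="", encoding="utf-8")` (the encoding is an assumption) and pass the file object to `read_rows`; `rows[0].keys()` then shows the actual column names.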

Code Repository

The complete codebase for replicating our research is available in our GitHub repository, llm-deductive-coding. The code for this paper is in the boston_fio_paper directory.

The repository includes:

  • Data preprocessing scripts
  • Classification pipeline implementation
  • Counterfactual generation code
  • Analysis notebooks
  • Visualization tools

Citation

If you use these resources in your research, please cite our paper:

@article{author2023llm,
  title={Using Instruction-Tuned Large Language Models to Identify Indicators of Vulnerability in Police Incident Narratives},
  author={Relins, S. and Birks, D. and Lloyd, C.},
  journal={arXiv preprint},
  year={2023},
  note={Currently under review for the Journal of Quantitative Criminology}
}

License

These datasets are released under the MIT License. The original Boston FIO data is released under the Open Data Commons Public Domain Dedication and License (PDDL).

Contact

For questions about this research or datasets, please contact the authors or open an issue in our GitHub repository.