DAPFAM

A Domain‑Aware Patent Retrieval Dataset Aggregated at the Family Level

Texts · cc-by-nc-sa-4.0 · Introduced 2025-06-27


See the accompanying paper: Ayaou et al., 2025 — “DAPFAM: A Domain‑Aware Patent Retrieval Dataset Aggregated at the Family Level” (arXiv:2506.22141).

Summary

DAPFAM provides 1 247 balanced query patent families and 45 336 target families with relevance labels derived from forward and backward citations (≈ 50 K pairs). Each relevant link is explicitly marked in‑domain or out‑of‑domain according to the overlap of the two families' 3‑character IPC codes, enabling rigorous cross‑domain evaluation.

  • Full text (title · abstract · claims · description) plus rich metadata for every family.
  • Multi‑jurisdictional, English‑only text (families may originate in US, JP, EP, CN, …).
  • Parquet qrel file: qrels_all.parquet.
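The in‑domain / out‑of‑domain distinction described in the summary can be illustrated with a minimal sketch. It assumes that a link counts as in‑domain when the two families share at least one 3‑character IPC prefix; the exact rule is defined in the paper, and the helper names (ipc3, domain_label) and the "in" / "out" label strings are hypothetical.

```python
def ipc3(code: str) -> str:
    """Return the 3-character IPC prefix (section + class), e.g. 'H04L 9/32' -> 'H04'."""
    return code.strip().upper()[:3]


def domain_label(query_ipcs, target_ipcs) -> str:
    """Label a relevant pair from the IPC code lists of the two families.

    Assumption (not confirmed by the dataset card): any shared
    3-character prefix makes the pair in-domain.
    """
    query_prefixes = {ipc3(code) for code in query_ipcs}
    target_prefixes = {ipc3(code) for code in target_ipcs}
    return "in" if query_prefixes & target_prefixes else "out"


# The first pair shares the prefix G06, the second shares none.
print(domain_label(["H04L 9/32", "G06F 21/60"], ["G06F 17/30"]))  # in
print(domain_label(["A61K 31/00"], ["H01M 10/05"]))               # out
```

The labels shipped in the qrel file should be preferred over recomputing them; the sketch is only meant to make the labelling idea concrete.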

Dataset Structure

corpus.parquet     # 45 336 rows, targets – every original column from the paper
queries.parquet    # 1 247 rows, queries – same columns + abstract_keywords
qrels_all.parquet  # (all | in | out) four‑column tables → query_id · relevant_id · relevance_score · domain_rel
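As a rough sketch of how the four qrel columns might be used, the following computes recall@k over all relevant pairs and over the in‑domain and out‑of‑domain subsets. Several things here are assumptions rather than facts from the dataset card: that relevance_score > 0 marks a relevant pair, that domain_rel holds string labels (the toy data uses hypothetical "in" / "out" values), and all function names and IDs. The load_rows helper needs pandas with a Parquet engine such as pyarrow; the rest uses only the standard library.

```python
from collections import defaultdict


def load_rows(path):
    """Read a Parquet file into a list of dicts (needs pandas and a Parquet engine)."""
    import pandas as pd  # imported lazily so the rest of the sketch runs without it
    return pd.read_parquet(path).to_dict("records")


def relevant_by_query(qrels, domain=None):
    """Map query_id -> set of relevant_id, optionally restricted to one domain_rel value.

    Assumption: a relevance_score above zero marks a relevant pair.
    """
    relevant = defaultdict(set)
    for row in qrels:
        if row["relevance_score"] <= 0:
            continue
        if domain is not None and row["domain_rel"] != domain:
            continue
        relevant[row["query_id"]].add(row["relevant_id"])
    return dict(relevant)


def recall_at_k(ranking, relevant, k):
    """Mean recall@k over the queries that have at least one relevant target."""
    scores = []
    for query_id, targets in relevant.items():
        top_k = set(ranking.get(query_id, [])[:k])
        scores.append(len(top_k & targets) / len(targets))
    return sum(scores) / len(scores) if scores else 0.0


# Toy data with hypothetical IDs and hypothetical domain_rel values.
qrels = [
    {"query_id": "q1", "relevant_id": "t1", "relevance_score": 1, "domain_rel": "in"},
    {"query_id": "q1", "relevant_id": "t2", "relevance_score": 1, "domain_rel": "out"},
    {"query_id": "q2", "relevant_id": "t3", "relevance_score": 1, "domain_rel": "out"},
]
# A retrieval run: query_id -> ranked list of target family IDs.
ranking = {"q1": ["t1", "t9", "t2"], "q2": ["t8", "t3"]}

for domain in (None, "in", "out"):
    subset = relevant_by_query(qrels, domain)
    print(domain or "all", recall_at_k(ranking, subset, k=2))
```

With the real files, qrels = load_rows("qrels_all.parquet") would replace the toy list; it is worth inspecting the actual values of domain_rel and relevance_score before filtering on them.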

Citation

@misc{ayaou2025dapfam,
    title={DAPFAM: A Domain-Aware Patent Retrieval Dataset Aggregated at the Family Level},
    author={Iliass Ayaou and Denis Cavallucci and Hicham Chibane},
    year={2025},
    eprint={2506.22141},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Quick Stats

  • Queries: 1,247
  • Corpus (targets): 45,336
  • Qrels (all): 49,869
  • Qrels (in): 19,736
  • Qrels (out): 5,193