DAPFAM
A Domain‑Aware Patent Retrieval Dataset Aggregated at the Family Level
Texts · cc-by-nc-sa-4.0 · Introduced 2025-06-27
See the accompanying paper: Ayaou et al., 2025 — “DAPFAM: A Domain‑Aware Patent Retrieval Dataset Aggregated at the Family Level” (arXiv:2506.22141).
Summary
DAPFAM provides 1 247 balanced query patent families and 45 336 target families, with relevance labels derived from forward and backward citations (≈ 50 K pairs). Each relevant link is explicitly marked in‑domain or out‑of‑domain according to whether the 3‑character IPC codes of the two families overlap, enabling rigorous cross‑domain evaluation.
- Full text (title · abstract · claims · description) plus rich metadata for every family.
- Multi‑jurisdictional, English‑only text (families may originate in US, JP, EP, CN, …).
- Parquet qrel file: qrels_all.parquet.
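The in‑domain / out‑of‑domain distinction can be illustrated with a minimal sketch. It assumes that a pair counts as in‑domain when the query and target families share at least one 3‑character IPC prefix (for example `H04` from `H04L 9/32`). The function names and the `"in"` / `"out"` label strings are hypothetical and chosen only for illustration; the paper remains the authoritative description of the labelling rule.

```python
def ipc3(codes):
    """Return the set of 3-character IPC prefixes (section + class) for a family."""
    return {code.strip().upper()[:3] for code in codes if code and code.strip()}


def domain_label(query_ipc, target_ipc):
    """Label a pair "in" if the two families share any 3-character IPC prefix, else "out".

    Assumption: a single shared prefix is enough to count as in-domain.
    """
    return "in" if ipc3(query_ipc) & ipc3(target_ipc) else "out"


# Illustrative IPC codes, not taken from the dataset.
print(domain_label(["H04L 9/32", "G06F 21/60"], ["G06Q 10/00"]))  # shared prefix G06
print(domain_label(["H04L 9/32"], ["A61K 31/00"]))                # no shared prefix
```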
Dataset Structure
corpus.parquet # 45 336 rows, targets – every original column from the paper
queries.parquet # 1 247 rows, queries – same columns + abstract_keywords
qrels_all.parquet # (all | in | out) four‑column tables → query_id · relevant_id · relevance_score · domain_rel
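One possible way to work with the qrel file is sketched below. It assumes pandas and a parquet engine such as pyarrow are installed and that `qrels_all.parquet` sits in the working directory; the four column names come from the listing above. The helper names are hypothetical, and the grouping makes no assumption about which values `domain_rel` actually takes.

```python
from collections import defaultdict

# Column names as listed in the dataset structure.
QREL_COLUMNS = ("query_id", "relevant_id", "relevance_score", "domain_rel")


def load_qrel_rows(path="qrels_all.parquet"):
    """Read the qrel file into a list of dicts (pandas is imported lazily)."""
    import pandas as pd

    frame = pd.read_parquet(path, columns=list(QREL_COLUMNS))
    return frame.to_dict("records")


def qrels_by_domain(rows):
    """Group rows as {domain_rel: {query_id: {relevant_id: relevance_score}}}."""
    grouped = defaultdict(lambda: defaultdict(dict))
    for row in rows:
        per_query = grouped[row["domain_rel"]][row["query_id"]]
        per_query[row["relevant_id"]] = row["relevance_score"]
    return {domain: dict(queries) for domain, queries in grouped.items()}


# Made-up rows for illustration; real rows would come from load_qrel_rows().
demo_rows = [
    {"query_id": "q1", "relevant_id": "t1", "relevance_score": 1, "domain_rel": "in"},
    {"query_id": "q1", "relevant_id": "t2", "relevance_score": 1, "domain_rel": "out"},
    {"query_id": "q2", "relevant_id": "t3", "relevance_score": 1, "domain_rel": "in"},
]
print(qrels_by_domain(demo_rows))
```

The nested-dictionary layout mirrors the qrel format that most retrieval evaluation tools accept, so each domain subset can be scored separately.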
Citation
@misc{ayaou2025dapfam,
title={DAPFAM: A Domain-Aware Patent Retrieval Dataset Aggregated at the Family Level},
author={Iliass Ayaou and Denis Cavallucci and Hicham Chibane},
year={2025},
eprint={2506.22141},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Quick Stats
- Queries: 1,247
- Corpus (targets): 45,336
- Qrels (all): 49,869
- Qrels (in): 19,736
- Qrels (out): 5,193