Paper Information
Citation: Piroi, F., Lupu, M., Hanbury, A., Magdy, W., Sexton, A. P., & Filippov, I. (2012). CLEF-IP 2012: Retrieval Experiments in the Intellectual Property Domain. CLEF 2012 Working Notes, CEUR Workshop Proceedings, Vol. 1178.
Publication: CLEF 2012 Working Notes (CEUR-WS Vol. 1178)
Patent Retrieval and the CLEF-IP 2012 Benchmark
This is a Resource paper (benchmark infrastructure). It establishes a standardized test bed for the Intellectual Property (IP) Information Retrieval community by defining tasks, curating datasets (topics and relevance judgments), and establishing evaluation protocols. The paper does not propose a new method itself but aggregates and analyzes the performance of participant systems on these shared tasks.
Motivation for Standardized IP Information Retrieval
The volume of patent applications is increasing rapidly, necessitating automated methods to help patent experts find prior art and classify documents.
- Economic Impact: Thorough searches are critical due to the high economic value of granted patents.
- Complexity: Patent workflows are highly specific; examiners need to find prior art for individual claims as well as whole documents, and often rely on non-textual data such as flowcharts and chemical diagrams.
- Gap: Existing general IR tools are insufficient for the specific granularity (passages, images, structures) required in the IP domain.
Novel Multi-modal Tasks: Claims, Flowcharts, and Chemicals
The 2012 edition of the lab introduced three specific tasks targeting different modalities of patent data:
- Passage Retrieval starting from Claims: Moving beyond document-level retrieval to identifying specific relevant passages based on claim text.
- Flowchart Recognition: A new image analysis task requiring the extraction of structural information (nodes, edges, text) from patent images.
- Chemical Structure Recognition: A dual task of segmenting molecular diagrams from full pages and recognizing them into structural files (MOL), specifically addressing the challenge of Markush structures in patents.
Benchmarking Setup and Evaluation
The “experiments” were the benchmarking tasks themselves, performed by participants (e.g., University of Birmingham, SAIC, TU Vienna).
- Passage Retrieval: Participants retrieved documents and passages for 105 test topics (sets of claims) from a corpus of 1.5 million patents. Performance was measured using PRES, Recall, and MAP at the document level, and AP/Precision at the passage level.
- Flowchart Recognition: Participants extracted graph structures from 100 test images. Evaluation compared the submitted graphs to ground truth using a distance metric based on the Maximum Common Subgraph (MCS).
- Chemical Structure:
- Segmentation: Identifying bounding boxes of chemical structures in 30 multipage TIFF patents.
- Recognition: Converting 865 “automatic” (standard MOL) and 95 “manual” (Markush/complex) diagrams into structure files.
Key Findings and Baseline Results
- Passage Retrieval: Approaches varied from two-step retrieval (document then passage) to full NLP techniques. Translation tools were universally used due to the multilingual corpus (English, German, French).
- Chemical Recognition: The best performing system (UoB, run uob-4) achieved 92% recall on total structures (886/960), with 96% on the automatic set and 57% on the manual set. SAIC achieved 83% total recall. The manual evaluation highlighted a critical need for standards extending MOL files to support Markush structures, which are common in patents but poorly supported by current tools.
- Flowchart Recognition: The evaluation was not completed at the time of writing the workshop notes. The evaluation required a combination of structural matching and edit-distance for text labels because OCR outputs rarely “hard-matched” the gold standard.
Chemical Structure Recognition Results
Segmentation (SAIC, best run using OSRA native rendering):
| Tolerance (px) | Precision | Recall | $F_1$ |
|---|---|---|---|
| 0 | 0.708 | 0.686 | 0.697 |
| 10 | 0.793 | 0.769 | 0.781 |
| 20 | 0.821 | 0.795 | 0.808 |
| 40 | 0.867 | 0.840 | 0.853 |
| 55 | 0.887 | 0.860 | 0.873 |
Recognition (automatic and manual sets):
| System | Auto (#/865) | Auto % | Manual (#/95) | Manual % | Total (#/960) | Total % |
|---|---|---|---|---|---|---|
| SAIC | 761 | 88% | 38 | 40% | 799 | 83% |
| UoB-1 | 832 | 96% | 44 | 46% | 876 | 91% |
| UoB-2 | 821 | 95% | 56 | 59% | 877 | 91% |
| UoB-3 | 821 | 95% | 44 | 46% | 865 | 90% |
| UoB-4 | 832 | 96% | 54 | 57% | 886 | 92% |
Reproducibility Details
Data
The collection focuses on European Patent Office (EPO) and WIPO documents published up to 2002.
1. Passage Retrieval Data
- Corpus: >1.5 million XML patent documents (EP and WO sources).
- Training Set: 51 topics (sets of claims) with relevance judgments (18 DE, 21 EN, 12 FR).
- Test Set: 105 topics (35 per language).
- Topic Source: Extracted manually from search reports listing “X” or “Y” citations (highly relevant prior art).
2. Flowchart Data
- Format: Black and white TIFF images.
- Training Set: 50 images with textual graph representations.
- Test Set: 100 images.
- Ground Truth: A defined textual format describing nodes (NO), directed edges (DE), undirected edges (UE), and meta-data (MT).
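The ground-truth files can be read with a tag-based line parser. The summary only names the record tags (NO, DE, UE, MT), so the field layout in the sample below is a hypothetical illustration, not the official format:

```python
# Hypothetical sample of the flowchart ground-truth format; the record
# tags (NO, DE, UE, MT) are from the paper, the field layout is assumed.
SAMPLE = """\
MT title Example flowchart
NO n1 start Start
NO n2 process Read input
DE n1 n2
"""

def parse_flowchart(text):
    """Group records by tag: nodes, directed/undirected edges, meta-data."""
    graph = {"NO": [], "DE": [], "UE": [], "MT": []}
    for line in text.splitlines():
        if not line.strip():
            continue
        tag, _, rest = line.partition(" ")
        graph[tag].append(rest.split())
    return graph

g = parse_flowchart(SAMPLE)
```

A parser of this shape yields the node/edge lists needed to build the graphs compared in the evaluation.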
3. Chemical Structure Data
- Segmentation: 30 patent files rendered as 300dpi monochrome multipage TIFFs.
- Recognition (Automatic Set): 865 diagram images fully representable in standard MOL format.
- Recognition (Manual Set): 95 diagram images containing Markush structures or variability not supported by standard MOL.
Algorithms
Ground Truth Generation:
- Qrels Generator: An in-house tool was used to manually map search report citations to specific XML passages (XPaths) for the passage retrieval task.
- McGregor Algorithm: Used for the flowchart evaluation to compute the Maximum Common Subgraph (MCS) between participant submissions and ground truth.
Evaluation
Passage Retrieval Metrics:
- Document Level: PRES (Patent Retrieval Evaluation Score), Recall, MAP. Cut-off at 100 documents.
- Passage Level: $AP(D)$ (Average Precision at document level) and $Precision(D)$ (Precision at document level), averaged across all relevant documents for a topic.
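PRES, the main document-level metric, can be sketched as follows. This follows the definition by Magdy and Jones (2010), which the task adopts with a cut-off of 100: relevant documents not retrieved within the cut-off are assumed to appear just after it, so PRES is 1 when all relevant documents are ranked at the top and 0 when none are found:

```python
def pres(found_ranks, n_rel, n_max):
    """Patent Retrieval Evaluation Score (sketch after Magdy & Jones, 2010).

    found_ranks: 1-based ranks (<= n_max) of the relevant docs retrieved.
    n_rel:       total number of relevant docs for the topic.
    n_max:       the rank cut-off (100 in CLEF-IP 2012).
    """
    missing = n_rel - len(found_ranks)
    # Missed relevant docs get worst-case ranks n_max+1 .. n_max+missing.
    ranks = list(found_ranks) + [n_max + i for i in range(1, missing + 1)]
    mean_rank = sum(ranks) / n_rel
    return 1 - (mean_rank - (n_rel + 1) / 2) / n_max
```

With all three relevant docs at ranks 1–3 the score is 1.0; with none retrieved it is 0.0, and partial recall at deep ranks falls in between.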
Flowchart Recognition Metric:
- Graph Distance ($d$): Defined quantitatively based on the Maximum Common Subgraph (MCS) between a target flowchart ($F_t$) and a submitted flowchart ($F_s$): $$ \begin{aligned} d(F_t, F_s) &= 1 - \frac{|mcs(F_t, F_s)|}{|F_t| + |F_s| - |mcs(F_t, F_s)|} \end{aligned} $$ where $|F|$ represents the size of the graph (nodes + edges).
- Levels: Evaluated at three levels: Basic (structure only), Intermediate (structure + node types), and Complete (structure + types + text labels).
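Given graph sizes and the size of the maximum common subgraph, the distance itself is a one-line Jaccard-style computation. Computing the MCS (via McGregor's algorithm) is the hard, NP-complete part and is omitted from this sketch:

```python
def graph_distance(size_t, size_s, size_mcs):
    """Distance between target and submitted flowcharts per the paper's
    formula, where sizes count nodes + edges and size_mcs is the size of
    their maximum common subgraph (computed separately, e.g. McGregor)."""
    return 1 - size_mcs / (size_t + size_s - size_mcs)
```

Identical graphs give distance 0.0, graphs with no common subgraph give 1.0.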
Chemical Structure Metrics:
- Segmentation: Precision, Recall, and $F_1$ based on bounding box matches. A match is valid if borders are within a tolerance (0 to 55 pixels).
- Recognition:
- Automatic: Comparison of InChI strings generated by Open Babel.
- Manual: Visual comparison of images rendered by MarvinView.
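The segmentation scoring can be sketched as tolerance-based bounding-box matching followed by precision/recall/F1. The paper does not spell out its matching procedure in this summary, so the greedy one-to-one pairing below is an assumption for illustration:

```python
def boxes_match(a, b, tol):
    """A submitted box matches a ground-truth box when each of its four
    borders lies within `tol` pixels of the corresponding border.
    Boxes are (left, top, right, bottom) tuples in pixels."""
    return all(abs(x - y) <= tol for x, y in zip(a, b))

def segmentation_scores(gold, submitted, tol):
    """Greedy one-to-one matching (an assumed pairing strategy),
    then precision, recall, and F1 over the matched boxes."""
    remaining = list(gold)
    tp = 0
    for box in submitted:
        for g in remaining:
            if boxes_match(box, g, tol):
                remaining.remove(g)  # each gold box may match only once
                tp += 1
                break
    prec = tp / len(submitted) if submitted else 0.0
    rec = tp / len(gold) if gold else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```

Sweeping `tol` from 0 to 55 pixels reproduces the shape of the segmentation table above: scores rise as the border tolerance loosens.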
Reproducibility
The CLEF-IP 2012 benchmark data was distributed to registered participants through the CLEF evaluation framework. The patent corpus is derived from the MAREC dataset (EPO and WIPO documents published until 2002). Evaluation tools for segmentation (bounding box comparison) and recognition (InChI comparison via Open Babel) were developed in-house by the organizers. The McGregor algorithm implementation for flowchart evaluation was also custom-built.
No public code repositories or pre-trained models are associated with this paper, as it is a benchmarking infrastructure paper. The evaluation protocols and data formats are fully described in the paper.
| Artifact | Type | License | Notes |
|---|---|---|---|
| CLEF-IP 2012 data | Dataset | Unknown | Distributed to registered CLEF participants; no persistent public archive |
| MAREC corpus | Dataset | Unknown | Source patent corpus (EPO/WIPO documents up to 2002) |
- Status: Partially Reproducible
- Missing components: The benchmark datasets were distributed to participants and are not hosted on a persistent public repository. The in-house evaluation tools (qrels generator, segmentation comparator, flowchart distance calculator) are not publicly released.
Citation
```bibtex
@inproceedings{piroi2012clefip,
  title     = {CLEF-IP 2012: Retrieval Experiments in the Intellectual Property Domain},
  author    = {Piroi, Florina and Lupu, Mihai and Hanbury, Allan and Magdy, Walid and Sexton, Alan P. and Filippov, Igor},
  booktitle = {CLEF 2012 Working Notes},
  series    = {CEUR Workshop Proceedings},
  volume    = {1178},
  year      = {2012},
  publisher = {CEUR-WS.org},
  url       = {https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-PiroiEt2012.pdf}
}
```
