Paper Information
Citation: Piroi, F., Lupu, M., Hanbury, A., Magdy, W., Sexton, A. P., & Filippov, I. (2012). CLEF-IP 2012: Retrieval Experiments in the Intellectual Property Domain. CLEF 2012 Working Notes, CEUR Workshop Proceedings, Vol. 1178.
Publication: CLEF 2012 Working Notes (CEUR-WS Vol. 1178)
Patent Retrieval and the CLEF-IP 2012 Benchmark
This is a Resource paper (benchmark infrastructure). It establishes a standardized test bed for the Intellectual Property (IP) Information Retrieval community by defining tasks, curating datasets (topics and relevance judgments), and establishing evaluation protocols. The paper does not propose a new method itself but aggregates and analyzes the performance of participant systems on these shared tasks.
Motivation for Standardized IP Information Retrieval
The volume of patent applications is increasing rapidly, necessitating automated methods to help patent experts find prior art and classify documents.
- Economic Impact: Thorough searches are critical due to the high economic value of granted patents.
- Complexity: Patent workflows are highly specific; examiners need to find prior art for individual claims as well as whole documents, and often rely on non-textual data such as flowcharts and chemical diagrams.
- Gap: Existing general IR tools are insufficient for the specific granularity (passages, images, structures) required in the IP domain.
Novel Multi-modal Tasks: Claims, Flowcharts, and Chemicals
The 2012 edition of the lab introduced three specific tasks targeting different modalities of patent data:
- Passage Retrieval starting from Claims: Moving beyond document-level retrieval to identifying specific relevant passages based on claim text.
- Flowchart Recognition: A new image analysis task requiring the extraction of structural information (nodes, edges, text) from patent images.
- Chemical Structure Recognition: A dual task of segmenting molecular diagrams from full pages and recognizing them into structural files (MOL), specifically addressing the challenge of Markush structures in patents.
Benchmarking Setup and Evaluation
The “experiments” were the benchmarking tasks themselves, performed by participants (e.g., University of Birmingham, SAIC, TU Vienna).
- Passage Retrieval: Participants retrieved documents and passages for 105 test topics (sets of claims) from a corpus of 1.5 million patents. Performance was measured using PRES, Recall, and MAP at the document level, and AP/Precision at the passage level.
- Flowchart Recognition: Participants extracted graph structures from 100 test images. Evaluation compared the submitted graphs to ground truth using a distance metric based on the Maximum Common Subgraph (MCS).
- Chemical Structure:
- Segmentation: Identifying bounding boxes of chemical structures in 30 multipage TIFF patents.
- Recognition: Converting 865 “automatic” (standard MOL) and 95 “manual” (Markush/complex) diagrams into structure files.
Key Findings and Baseline Results
- Passage Retrieval: Approaches varied from two-step retrieval (document then passage) to full NLP techniques. Translation tools were universally used due to the multilingual corpus (English, German, French).
- Chemical Recognition: The best performing system (UoB, run uob-4) achieved 92% recall on total structures (886/960), with 96% on the automatic set and 57% on the manual set. SAIC achieved 83% total recall. The manual evaluation highlighted a critical need for standards extending MOL files to support Markush structures, which are common in patents but poorly supported by current tools.
- Flowchart Recognition: The evaluation was not completed at the time of writing the workshop notes. The evaluation required a combination of structural matching and edit-distance for text labels because OCR outputs rarely “hard-matched” the gold standard.
Chemical Structure Recognition Results
Segmentation (SAIC, best run using OSRA native rendering):
| Tolerance (px) | Precision | Recall | $F_1$ |
|---|---|---|---|
| 0 | 0.708 | 0.686 | 0.697 |
| 10 | 0.793 | 0.769 | 0.781 |
| 20 | 0.821 | 0.795 | 0.808 |
| 40 | 0.867 | 0.840 | 0.853 |
| 55 | 0.887 | 0.860 | 0.873 |
Recognition (automatic and manual sets):
| System | Auto (#/865) | Auto % | Manual (#/95) | Manual % | Total (#/960) | Total % |
|---|---|---|---|---|---|---|
| SAIC | 761 | 88% | 38 | 40% | 799 | 83% |
| UoB-1 | 832 | 96% | 44 | 46% | 876 | 91% |
| UoB-2 | 821 | 95% | 56 | 59% | 877 | 91% |
| UoB-3 | 821 | 95% | 44 | 46% | 865 | 90% |
| UoB-4 | 832 | 96% | 54 | 57% | 886 | 92% |
Reproducibility Details
Data
The collection focuses on European Patent Office (EPO) and WIPO documents published up to 2002.
1. Passage Retrieval Data
- Corpus: >1.5 million XML patent documents (EP and WO sources).
- Training Set: 51 topics (sets of claims) with relevance judgments (18 DE, 21 EN, 12 FR).
- Test Set: 105 topics (35 per language).
- Topic Source: Extracted manually from search reports listing “X” or “Y” citations (highly relevant prior art).
2. Flowchart Data
- Format: Black and white TIFF images.
- Training Set: 50 images with textual graph representations.
- Test Set: 100 images.
- Ground Truth: A defined textual format describing nodes (NO), directed edges (DE), undirected edges (UE), and meta-data (MT).
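The ground-truth files can be read with a tag-based line parser. The summary only names the record tags (NO, DE, UE, MT), so the field layout in the sample below is a hypothetical illustration, not the official format:

```python
# Hypothetical sample of the flowchart ground-truth format; the record
# tags (NO, DE, UE, MT) are from the paper, the field layout is assumed.
SAMPLE = """\
MT title Example flowchart
NO n1 start Start
NO n2 process Read input
DE n1 n2
"""

def parse_flowchart(text):
    """Group records by tag: nodes, directed/undirected edges, meta-data."""
    graph = {"NO": [], "DE": [], "UE": [], "MT": []}
    for line in text.splitlines():
        if not line.strip():
            continue
        tag, _, rest = line.partition(" ")
        graph[tag].append(rest.split())
    return graph

g = parse_flowchart(SAMPLE)
```

A parser of this shape yields the node/edge lists needed to build the graphs compared in the evaluation.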
3. Chemical Structure Data
- Segmentation: 30 patent files rendered as 300dpi monochrome multipage TIFFs.
- Recognition (Automatic Set): 865 diagram images fully representable in standard MOL format.
- Recognition (Manual Set): 95 diagram images containing Markush structures or variability not supported by standard MOL.
Algorithms
Ground Truth Generation:
- Qrels Generator: An in-house tool was used to manually map search report citations to specific XML passages (XPaths) for the passage retrieval task.
- McGregor Algorithm: Used for the flowchart evaluation to compute the Maximum Common Subgraph (MCS) between participant submissions and ground truth.
Evaluation
Passage Retrieval Metrics:
- Document Level: PRES (Patent Retrieval Evaluation Score), Recall, MAP. Cut-off at 100 documents.
- Passage Level: $AP(D)$ (Average Precision at document level) and $Precision(D)$ (Precision at document level), averaged across all relevant documents for a topic.
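PRES, the main document-level metric, can be sketched as follows. This follows the definition by Magdy and Jones (2010), which the task adopts with a cut-off of 100: relevant documents not retrieved within the cut-off are assumed to appear just after it, so PRES is 1 when all relevant documents are ranked at the top and 0 when none are found:

```python
def pres(found_ranks, n_rel, n_max):
    """Patent Retrieval Evaluation Score (sketch after Magdy & Jones, 2010).

    found_ranks: 1-based ranks (<= n_max) of the relevant docs retrieved.
    n_rel:       total number of relevant docs for the topic.
    n_max:       the rank cut-off (100 in CLEF-IP 2012).
    """
    missing = n_rel - len(found_ranks)
    # Missed relevant docs get worst-case ranks n_max+1 .. n_max+missing.
    ranks = list(found_ranks) + [n_max + i for i in range(1, missing + 1)]
    mean_rank = sum(ranks) / n_rel
    return 1 - (mean_rank - (n_rel + 1) / 2) / n_max
```

With all three relevant docs at ranks 1–3 the score is 1.0; with none retrieved it is 0.0, and partial recall at deep ranks falls in between.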
Flowchart Recognition Metric:
- Graph Distance ($d$): Defined quantitatively based on the Maximum Common Subgraph (MCS) between a target flowchart ($F_t$) and a submitted flowchart ($F_s$): $$ \begin{aligned} d(F_t, F_s) &= 1 - \frac{|mcs(F_t, F_s)|}{|F_t| + |F_s| - |mcs(F_t, F_s)|} \end{aligned} $$ where $|F|$ represents the size of the graph (nodes + edges).
- Levels: Evaluated at three levels: Basic (structure only), Intermediate (structure + node types), and Complete (structure + types + text labels).
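Given graph sizes and the size of the maximum common subgraph, the distance itself is a one-line Jaccard-style computation. Computing the MCS (via McGregor's algorithm) is the hard, NP-complete part and is omitted from this sketch:

```python
def graph_distance(size_t, size_s, size_mcs):
    """Distance between target and submitted flowcharts per the paper's
    formula, where sizes count nodes + edges and size_mcs is the size of
    their maximum common subgraph (computed separately, e.g. McGregor)."""
    return 1 - size_mcs / (size_t + size_s - size_mcs)
```

Identical graphs give distance 0.0, graphs with no common subgraph give 1.0.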
Chemical Structure Metrics:
- Segmentation: Precision, Recall, and $F_1$ based on bounding box matches. A match is valid if borders are within a tolerance (0 to 55 pixels).
- Recognition:
- Automatic: Comparison of InChI strings generated by Open Babel.
- Manual: Visual comparison of images rendered by MarvinView.
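The segmentation scoring can be sketched as tolerance-based bounding-box matching followed by precision/recall/F1. The paper does not spell out its matching procedure in this summary, so the greedy one-to-one pairing below is an assumption for illustration:

```python
def boxes_match(a, b, tol):
    """A submitted box matches a ground-truth box when each of its four
    borders lies within `tol` pixels of the corresponding border.
    Boxes are (left, top, right, bottom) tuples in pixels."""
    return all(abs(x - y) <= tol for x, y in zip(a, b))

def segmentation_scores(gold, submitted, tol):
    """Greedy one-to-one matching (an assumed pairing strategy),
    then precision, recall, and F1 over the matched boxes."""
    remaining = list(gold)
    tp = 0
    for box in submitted:
        for g in remaining:
            if boxes_match(box, g, tol):
                remaining.remove(g)  # each gold box may match only once
                tp += 1
                break
    prec = tp / len(submitted) if submitted else 0.0
    rec = tp / len(gold) if gold else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```

Sweeping `tol` from 0 to 55 pixels reproduces the shape of the segmentation table above: scores rise as the border tolerance loosens.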
Reproducibility
The CLEF-IP 2012 benchmark data was distributed to registered participants through the CLEF evaluation framework. The patent corpus is derived from the MAREC dataset (EPO and WIPO documents published until 2002). Evaluation tools for segmentation (bounding box comparison) and recognition (InChI comparison via Open Babel) were developed in-house by the organizers. The McGregor algorithm implementation for flowchart evaluation was also custom-built.
No public code repositories or pre-trained models are associated with this paper, as it is a benchmarking infrastructure paper. The evaluation protocols and data formats are fully described in the paper.
| Artifact | Type | License | Notes |
|---|---|---|---|
| CLEF-IP 2012 data | Dataset | Unknown | Distributed to registered CLEF participants; no persistent public archive |
| MAREC corpus | Dataset | Unknown | Source patent corpus (EPO/WIPO documents up to 2002) |
- Status: Partially Reproducible
- Missing components: The benchmark datasets were distributed to participants and are not hosted on a persistent public repository. The in-house evaluation tools (qrels generator, segmentation comparator, flowchart distance calculator) are not publicly released.
Citation
```bibtex
@inproceedings{piroi2012clefip,
  title     = {CLEF-IP 2012: Retrieval Experiments in the Intellectual Property Domain},
  author    = {Piroi, Florina and Lupu, Mihai and Hanbury, Allan and Magdy, Walid and Sexton, Alan P. and Filippov, Igor},
  booktitle = {CLEF 2012 Working Notes},
  series    = {CEUR Workshop Proceedings},
  volume    = {1178},
  year      = {2012},
  publisher = {CEUR-WS.org},
  url       = {https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-PiroiEt2012.pdf}
}
```
