A Standardized Resource for Chemical Mixtures
This is a Resource and Method paper that introduces two complementary standards for representing chemical mixtures: the detailed Mixfile format for comprehensive mixture descriptions and the compact MInChI (Mixtures InChI) specification for canonical mixture identifiers.
The Missing Format for Complex Formulations
Chemical informatics has a fundamental gap: current standards excel at representing pure individual molecules (SMILES, InChI, Molfile), but there has been no corresponding standard for multi-component mixtures. This matters because real-world chemistry predominantly relies on complex mixtures.
Everyday chemical work frequently involves:
- Reagents with specified purity (e.g., “$\geq$ 97% pure”)
- Solutions and formulations
- Complex mixtures like “hexanes” (which contains multiple isomers)
- Drug formulations with active ingredients and excipients
Without a machine-readable standard, chemists are forced to describe these mixtures in plain text that software can’t parse or analyze systematically. This creates barriers for automated safety analysis, inventory management, and data sharing.
Dual Design: Comprehensive Mixfiles and Canonical MInChIs
The authors propose a two-part solution:
- Mixfile: A detailed, hierarchical JSON format that captures the complete composition of a mixture
- MInChI: A compact, canonical string identifier derived from Mixfile data
This dual approach gives you both comprehensive description (Mixfile) and simple identification (MInChI), similar to how you might have both a detailed recipe and a short name for a dish.
What Makes a Good Mixture Format?
The authors identify three essential properties any mixture format must capture:
- Compound: What molecules are present?
- Quantity: How much of each component?
- Hierarchy: How are components organized (e.g., mixtures-of-mixtures)?
The hierarchical aspect is crucial. Consider “hexanes”: it is a named mixture containing specific proportions of n-hexane, 2-methylpentane, 3-methylpentane, etc. Your mixture format needs to represent both the individual isomers and the fact that they’re grouped under the umbrella term “hexanes.”
Mixfile Format Details
Mixfile uses JSON as its foundation, making it both human-readable and easy to parse in modern programming languages. The core structure is a hierarchical tree where each component can contain:
- name: Component identifier
- molfile/smiles/inchi: Molecular structure (molfile is the primary source of truth)
- quantity/units/relation: Concentration data with optional relation operators
- contents: Array of sub-components for hierarchical mixtures
- identifiers: Database IDs or URLs for additional information
Simple Example
A basic Mixfile might look like:
{
  "mixfileVersion": 0.01,
  "name": "Acetone, ≥99%",
  "contents": [
    {
      "name": "acetone",
      "smiles": "CC(=O)C",
      "quantity": 99,
      "units": "%",
      "relation": ">="
    }
  ]
}
Note that the paper specifies distinct fields for molecular structures: molfile (the primary source of truth), smiles, inchi, and formula. Concentration data uses separate quantity, units, and relation fields.
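As a rough illustration of how these fields might be consumed, the sketch below loads the acetone example with Python's standard json module and renders the quantity, units, and relation fields as a short label. The helper name describe_quantity and the label layout are assumptions made for this example; they are not part of the Mixfile specification.

```python
import json

# The acetone example from above, embedded as text so the sketch is self-contained.
MIXFILE_TEXT = """
{
  "mixfileVersion": 0.01,
  "name": "Acetone, ≥99%",
  "contents": [
    {"name": "acetone", "smiles": "CC(=O)C",
     "quantity": 99, "units": "%", "relation": ">="}
  ]
}
"""


def describe_quantity(component: dict) -> str:
    """Render quantity/units/relation as a label (hypothetical helper)."""
    if "quantity" not in component:
        return "unspecified"
    quantity = component["quantity"]
    # A two-element list is treated as a [min, max] range.
    if isinstance(quantity, list):
        amount = f"{quantity[0]}-{quantity[1]}"
    else:
        amount = str(quantity)
    relation = component.get("relation", "")
    units = component.get("units", "")
    return f"{relation}{amount}{units}"


mixture = json.loads(MIXFILE_TEXT)
for component in mixture["contents"]:
    print(component["name"], describe_quantity(component))
```

Keeping the relation separate from the number means a consumer can treat "at least 99%" differently from "exactly 99%" without parsing free text.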
Complex Example: Mixture-of-Mixtures
For something like “ethyl acetate dissolved in hexanes,” you’d have:
{
  "mixfileVersion": 0.01,
  "name": "Ethyl acetate in hexanes",
  "contents": [
    {
      "name": "ethyl acetate",
      "smiles": "CCOC(=O)C",
      "quantity": 10,
      "units": "%"
    },
    {
      "name": "hexanes",
      "contents": [
        {
          "name": "n-hexane",
          "smiles": "CCCCCC",
          "quantity": 60,
          "units": "%"
        },
        {
          "name": "2-methylpentane",
          "smiles": "CC(C)CCC",
          "quantity": 25,
          "units": "%"
        }
      ]
    }
  ]
}
This hierarchical structure elegantly captures the “recipe” of complex mixtures while remaining machine-readable.
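Because every node may carry its own contents array, a consumer typically walks the tree recursively. The following is a minimal sketch of such a walk, assuming the Mixfile has already been parsed into Python dictionaries; the function name iter_leaves is hypothetical and chosen only for this example.

```python
from typing import Iterator, Tuple


def iter_leaves(node: dict, path: Tuple[str, ...] = ()) -> Iterator[Tuple[Tuple[str, ...], dict]]:
    """Yield (path, component) for every component that has no sub-components."""
    here = path + (node.get("name", "?"),)
    children = node.get("contents", [])
    if not children:
        yield here, node
        return
    for child in children:
        yield from iter_leaves(child, here)


mixture = {
    "mixfileVersion": 0.01,
    "name": "Ethyl acetate in hexanes",
    "contents": [
        {"name": "ethyl acetate", "smiles": "CCOC(=O)C", "quantity": 10, "units": "%"},
        {
            "name": "hexanes",
            "contents": [
                {"name": "n-hexane", "smiles": "CCCCCC", "quantity": 60, "units": "%"},
                {"name": "2-methylpentane", "smiles": "CC(C)CCC", "quantity": 25, "units": "%"},
            ],
        },
    ],
}

for path, leaf in iter_leaves(mixture):
    # The path records which named group each leaf belongs to.
    print(" > ".join(path), leaf.get("smiles"))
```

The path returned with each leaf preserves the grouping, so n-hexane is still known to belong to "hexanes" even after flattening.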
MInChI: Canonical Mixture Identifiers
While Mixfiles provide comprehensive descriptions, you also need simple identifiers for database storage and searching. This is where MInChI comes in.
A MInChI string is structured as:
MInChI=0.00.1S/<components>/<indexing>/<concentration>
- Header: Version information (0.00.1S in the paper’s specification)
- Components: Standard InChI for each unique molecule, sorted alphabetically by the InChI strings themselves, then concatenated with &
- Indexing: Hierarchical structure using curly braces {} for branches and & for adjacent nodes; uses 1-based integer indices referring to the sorted InChI list
- Concentration: Quantitative information for each component, with units converted to canonical codes
Why This Matters
MInChI strings enable simple database searches:
- Check if a specific component appears in any mixture
- Compare different formulations of the same product
- Identify similar mixtures based on string similarity
Validating the Standard Through Practical Tooling
The paper demonstrates the format’s capabilities through several practical applications and a proof-of-concept implementation:
Text Extraction Algorithm
The authors demonstrate a proof-of-concept algorithm that uses regular expressions and chemical name recognition to parse plain-text mixture descriptions into structured Mixfile data. The algorithm:
- Applies regex rules to remove filler words and extract concentrations
- Looks up cleaned names against a custom chemical database
- Falls back to OPSIN for SMILES generation from chemical names
- Generates 2D coordinates for molecular structures
Graphical Editor
An open-source editor provides:
- Drag-and-drop interface for building hierarchical structures
- Chemical structure sketching and editing
- Database lookup (e.g., PubChem integration)
- Automatic MInChI generation
- Import/export capabilities
Example Use Cases
The paper validates the format through real-world applications:
- Safety compliance: Automated hazard assessment based on concentration-dependent properties (e.g., solid osmium tetroxide vs. 1% aqueous solution)
- Inventory management: Precise, searchable laboratory records
- Data extraction: Parsing vendor catalogs and safety data sheets
Outcomes and Future Extensibility
The work successfully establishes the first standardized, machine-readable formats for chemical mixtures. Key achievements:
- Comprehensive representation: Mixfile captures component identity, quantity, and hierarchy
- Canonical identification: MInChI provides compact, searchable identifiers
- Practical tooling: Open-source editor and text extraction demonstrate feasibility
- Real-world validation: Format handles diverse use cases from safety to inventory
Limitations and Future Directions
The authors acknowledge areas for improvement:
- Machine learning improvements: Better text extraction using modern NLP techniques
- Extended coverage: Support for polymers, complex formulations, analytical results
- Community adoption: Integration with existing chemical databases and software
The hierarchical design makes Mixfile suitable for both “recipe” descriptions (how to make something) and analytical results (what was found). This flexibility should help drive adoption across different use cases in chemistry and materials science.
Reproducibility Details
Open Source Tooling & Data
The central repository for validating and establishing the MInChI standard is github.com/IUPAC/MInChI, but the tools and datasets used to develop the paper’s proofs-of-concept are hosted elsewhere:
- Graphical Editor & App codebase: The Electron application and Mixfile handling codebase (console.js) can be found at github.com/cdd/mixtures.
- Text Extraction Data: The 5,615 extracted mixture records mentioned in the text extraction validation can be accessed as training data inside the cdd/mixtures repository under reference/gathering.zip.
Algorithms
This section provides the specific algorithmic logic, schema definitions, and standardization rules needed to replicate the Mixfile parser or MInChI generator.
The Strict Mixfile JSON Schema
To implement the format, a parser must recognize these specific fields:
Root Structure:
{
  "mixfileVersion": 0.01,
  "header": {},
  "contents": []
}
Component Fields:
- name: string (required if no structure is provided)
- molfile: string (the primary source of truth for molecular structure)
- smiles, inchi, formula: derived/transient fields for convenience
- quantity: number OR [min, max] array for ranges
- units: string (must map to supported ontology)
- relation: string (e.g., ">", "~", ">=")
- contents: recursive array for hierarchical mixtures
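The field rules above can be checked with a small recursive validator. The sketch below is one possible reading of those rules, not an official schema check: the function names are hypothetical, and treating molfile, smiles, or inchi as "a structure" is an assumption made for this example. It does not verify that units map to a supported ontology.

```python
from typing import List

# Assumption: any of these fields counts as "a structure is provided".
STRUCTURE_FIELDS = ("molfile", "smiles", "inchi")


def _is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def check_component(component: dict, where: str) -> List[str]:
    """Return a list of human-readable problems for one component subtree."""
    problems = []
    has_structure = any(field in component for field in STRUCTURE_FIELDS)
    if "name" not in component and not has_structure:
        problems.append(f"{where}: needs a name or a structure")
    quantity = component.get("quantity")
    if quantity is not None:
        is_range = (
            isinstance(quantity, list)
            and len(quantity) == 2
            and all(_is_number(q) for q in quantity)
        )
        if not (_is_number(quantity) or is_range):
            problems.append(f"{where}: quantity must be a number or [min, max]")
    for i, child in enumerate(component.get("contents", [])):
        problems.extend(check_component(child, f"{where}.contents[{i}]"))
    return problems


def check_mixfile(document: dict) -> List[str]:
    """Check the root structure, then every component beneath it."""
    problems = []
    if "mixfileVersion" not in document:
        problems.append("root: missing mixfileVersion")
    for i, child in enumerate(document.get("contents", [])):
        problems.extend(check_component(child, f"contents[{i}]"))
    return problems


print(check_mixfile({"mixfileVersion": 0.01, "contents": [{"quantity": "lots"}]}))
```

Returning a list of problems instead of raising on the first one makes it easier to report every issue in a large file at once.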
MInChI Generation Algorithm
To generate MInChI=0.00.1S/..., the software must follow these steps:
Component Layer:
- Calculate standard InChI for all structures in the mixture
- Sort distinct InChIs alphabetically by the InChI string itself
- Join with & to form the structure layer
Hierarchy & Concentration Layers:
- Traverse the Mixfile tree recursively
- Indexing: Use integer indices (1-based) referring to the sorted InChI list
- Grouping: Use {} to denote hierarchy branches and & to separate nodes at the same level
- Concentration: Convert all quantities to canonical unit codes and apply scaling factors
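The component and indexing steps can be sketched in a few lines of Python. This is a simplified illustration, not the reference generator: it assumes each leaf already carries a precomputed standard InChI in its inchi field (a real implementation would derive it from the molfile), it strips the InChI=1S/ prefix before joining, it keeps children in their Mixfile order, and it emits an empty index for leaves without a structure. All of those choices, and the function names, are assumptions. The concentration layer is omitted.

```python
INCHI_PREFIX = "InChI=1S/"


def collect_inchis(node: dict, found=None) -> set:
    """Gather the distinct InChI strings found anywhere in the tree."""
    found = set() if found is None else found
    if "inchi" in node:
        found.add(node["inchi"])
    for child in node.get("contents", []):
        collect_inchis(child, found)
    return found


def index_layer(node: dict, index_of: dict) -> str:
    """Branches become {...}; siblings are joined with &; leaves become indices."""
    children = node.get("contents", [])
    if children:
        return "{" + "&".join(index_layer(c, index_of) for c in children) + "}"
    inchi = node.get("inchi")
    return str(index_of[inchi]) if inchi else ""


def minchi_layers(mixture: dict):
    """Return (component layer, indexing layer) for a parsed Mixfile tree."""
    ordered = sorted(collect_inchis(mixture))  # plain string ordering
    index_of = {inchi: i for i, inchi in enumerate(ordered, start=1)}  # 1-based
    stripped = [
        s[len(INCHI_PREFIX):] if s.startswith(INCHI_PREFIX) else s for s in ordered
    ]
    return "&".join(stripped), index_layer(mixture, index_of)


mixture = {
    "name": "acetone in aqueous methanol",
    "contents": [
        {"name": "acetone", "inchi": "InChI=1S/C3H6O/c1-3(2)4/h1-2H3"},
        {
            "name": "aqueous methanol",
            "contents": [
                {"name": "water", "inchi": "InChI=1S/H2O/h1H2"},
                {"name": "methanol", "inchi": "InChI=1S/CH4O/c1-2/h2H,1H3"},
            ],
        },
    ],
}

components, indexing = minchi_layers(mixture)
print(components)
print(indexing)
```

Water is listed before methanol in the tree but sorts after it, so the indexing layer refers to it as 3. This shows that the indices point into the sorted InChI list and not into the original Mixfile order.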
Unit Standardization Table
Replication requires mapping input units to these canonical MInChI codes:
| Input Unit | MInChI Code | Scale Factor |
|---|---|---|
| % | pp | 1 |
| w/w% | wf | 0.01 |
| v/v% | vf | 0.01 |
| mol/L (M) | mr | 1 |
| mmol/L | mr | $10^{-3}$ |
| g/L | wv | $10^{-3}$ |
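One straightforward way to apply this table is a dictionary lookup followed by a multiplication. The sketch below encodes only the rows shown above; the names UNIT_TABLE and to_canonical are hypothetical, and rejecting unknown units with an error is a design choice made for this example.

```python
# Rows copied from the table above: input unit -> (MInChI code, scale factor).
UNIT_TABLE = {
    "%": ("pp", 1.0),
    "w/w%": ("wf", 0.01),
    "v/v%": ("vf", 0.01),
    "mol/L": ("mr", 1.0),
    "M": ("mr", 1.0),
    "mmol/L": ("mr", 1e-3),
    "g/L": ("wv", 1e-3),
}


def to_canonical(quantity: float, units: str):
    """Scale a quantity and return (scaled value, canonical MInChI unit code)."""
    try:
        code, scale = UNIT_TABLE[units]
    except KeyError:
        raise ValueError(f"no canonical mapping for unit {units!r}") from None
    return quantity * scale, code


value, code = to_canonical(250, "mmol/L")
print(value, code)
```

Failing loudly on an unmapped unit avoids silently writing a concentration in the wrong scale.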
Text Extraction Logic
The paper defines a recursive procedure for parsing plain-text mixture descriptions:
- Input: Raw text string (e.g., “2 M acetone in water”)
- Rule Application: Apply RegEx rules in order:
  - Remove: Delete common filler words (“solution”, “in”)
  - Replace: Substitute known variations
  - Concentration: Extract quantities like “2 M”, “97%”
  - Branch: Split phrases like “A in B” into sub-nodes
- Lookup: Check cleaned name against a custom table (handles cases like “xylenes” or specific structures)
- OPSIN: If no lookup match, send to the OPSIN tool to generate SMILES from the chemical name
- Embed: If structure found, generate 2D coordinates (Molfile) via RDKit
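The rule-based part of this procedure can be approximated with Python's re module. The sketch below is a heavily simplified illustration, not the authors' rule set: the two regular expressions, the small list of units, and the function name parse_phrase are all assumptions, and it branches on " in " before removing filler words so that the separator is still available for splitting. The lookup, OPSIN, and coordinate-embedding steps are left out.

```python
import re

# Assumed pattern: optional relation, a number, then one of a few known units.
CONCENTRATION = re.compile(
    r"(?P<relation>>=|<=|>|<|~)?\s*"
    r"(?P<value>\d+(?:\.\d+)?)\s*"
    r"(?P<units>%|mM|M)(?![A-Za-z])"
)
# Assumed filler list; the source mentions words such as "solution".
FILLER = re.compile(r"\bsolution\b", re.IGNORECASE)


def parse_phrase(text: str) -> dict:
    """Turn a plain-text description into a nested Mixfile-like dict."""
    # Branch: split "A in B" into two sub-nodes and recurse on each side.
    parts = re.split(r"\s+in\s+", text.strip(), maxsplit=1)
    if len(parts) == 2:
        return {"contents": [parse_phrase(parts[0]), parse_phrase(parts[1])]}

    node = {}
    # Concentration: pull out quantities such as "2 M" or ">=97%".
    match = CONCENTRATION.search(text)
    if match:
        node["quantity"] = float(match.group("value"))
        node["units"] = match.group("units")
        if match.group("relation"):
            node["relation"] = match.group("relation")
        text = text[: match.start()] + text[match.end():]
    # Remove: drop filler words, then normalize whitespace to get the name.
    text = FILLER.sub("", text)
    node["name"] = " ".join(text.split())
    return node


print(parse_phrase("2 M acetone in water"))
print(parse_phrase("acetone solution >=97%"))
```

In a fuller implementation, the cleaned name in each node would then be passed to the lookup table and, failing that, to OPSIN as described in the steps above.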
Paper Information
Citation: Clark, A. M., McEwen, L. R., Gedeck, P., & Bunin, B. A. (2019). Capturing mixture composition: An open machine-readable format for representing mixed substances. Journal of Cheminformatics, 11(1), 33. https://doi.org/10.1186/s13321-019-0357-4
Publication: Journal of Cheminformatics (2019)
@article{clark2019capturing,
title={Capturing mixture composition: an open machine-readable format for representing mixed substances},
author={Clark, Alex M and McEwen, Leah R and Gedeck, Peter and Bunin, Barry A},
journal={Journal of cheminformatics},
volume={11},
number={1},
pages={1--14},
year={2019},
publisher={Springer}
}
