Mixfile & MInChI: Machine-Readable Mixture Formats

A Standardized Resource for Chemical Mixtures

This is a Resource and Method paper that introduces two complementary standards for representing chemical mixtures: the detailed Mixfile format for comprehensive mixture descriptions and the compact MInChI (Mixtures InChI) specification for canonical mixture identifiers.

The Missing Format for Complex Formulations

Chemical informatics has a fundamental gap. Current standards (SMILES, InChI, Molfile) excel at representing pure, individual molecules, but a corresponding standard for multi-component mixtures has been missing. This matters because real-world chemistry relies predominantly on mixtures.

Everyday chemical work frequently involves:

Reagents with specified purity (e.g., “$\geq$ 97% pure”)
Solutions and formulations
Complex mixtures like “hexanes” (which contains multiple isomers)
Drug formulations with active ingredients and excipients

Without a machine-readable standard, chemists are forced to describe these mixtures in plain text that software can’t parse or analyze systematically. This creates barriers for automated safety analysis, inventory management, and data sharing.

Dual Design: Comprehensive Mixfiles and Canonical MInChIs

The authors propose a two-part solution:

Mixfile: A detailed, hierarchical JSON format that captures the complete composition of a mixture
MInChI: A compact, canonical string identifier derived from Mixfile data

This dual approach gives you both comprehensive description (Mixfile) and simple identification (MInChI), similar to how you might have both a detailed recipe and a short name for a dish.

What Makes a Good Mixture Format?

The authors identify three essential properties any mixture format must capture:

Compound: What molecules are present?
Quantity: How much of each component?
Hierarchy: How are components organized (e.g., mixtures-of-mixtures)?

The hierarchical aspect is crucial. Consider “hexanes”: it is a named mixture containing specific proportions of n-hexane, 2-methylpentane, 3-methylpentane, etc. Your mixture format needs to represent both the individual isomers and the fact that they’re grouped under the umbrella term “hexanes.”

Mixfile Format Details

Mixfile uses JSON as its foundation, making it both human-readable and easy to parse in modern programming languages. The core structure is a hierarchical tree where each component can contain:

name: Component identifier
molfile/smiles/inchi: Molecular structure (molfile is the primary source of truth)
quantity/units/relation: Concentration data with optional relation operators
contents: Array of sub-components for hierarchical mixtures
identifiers: Database IDs or URLs for additional information

Simple Example

A basic Mixfile might look like:

{
  "mixfileVersion": 0.01,
  "name": "Acetone, ≥99%",
  "contents": [
    {
      "name": "acetone",
      "smiles": "CC(=O)C",
      "quantity": 99,
      "units": "%",
      "relation": ">="
    }
  ]
}

Note that structure and concentration live in separate fields. The paper specifies molfile (the primary source of truth), smiles, inchi, and formula for molecular structure; the example above supplies only smiles. Concentration is split across quantity, units, and relation, so "≥99%" becomes 99, "%", and ">=".

Complex Example: Mixture-of-Mixtures

For something like “ethyl acetate dissolved in hexanes,” you’d have:

{
  "mixfileVersion": 0.01,
  "name": "Ethyl acetate in hexanes",
  "contents": [
    {
      "name": "ethyl acetate",
      "smiles": "CCOC(=O)C",
      "quantity": 10,
      "units": "%"
    },
    {
      "name": "hexanes",
      "contents": [
        {
          "name": "n-hexane",
          "smiles": "CCCCCC",
          "quantity": 60,
          "units": "%"
        },
        {
          "name": "2-methylpentane",
          "smiles": "CC(C)CCC",
          "quantity": 25,
          "units": "%"
        }
      ]
    }
  ]
}

This nesting captures the “recipe” of a complex mixture while remaining machine-readable: “hexanes” is a single entry at the top level, and its isomers sit one level below it.
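Because the format is plain JSON, the tree can be walked with a few lines of code. The following is a minimal sketch using only the Python standard library; the function names `walk` and `leaf_names` are illustrative and are not part of the paper or its tooling.

```python
import json

# The mixture-of-mixtures example from above, as a JSON string.
MIXFILE_TEXT = """
{
  "mixfileVersion": 0.01,
  "name": "Ethyl acetate in hexanes",
  "contents": [
    {"name": "ethyl acetate", "smiles": "CCOC(=O)C", "quantity": 10, "units": "%"},
    {"name": "hexanes", "contents": [
      {"name": "n-hexane", "smiles": "CCCCCC", "quantity": 60, "units": "%"},
      {"name": "2-methylpentane", "smiles": "CC(C)CCC", "quantity": 25, "units": "%"}
    ]}
  ]
}
"""


def walk(component, depth=0):
    """Yield (depth, component) for a node and all of its descendants."""
    yield depth, component
    for child in component.get("contents", []):
        yield from walk(child, depth + 1)


def leaf_names(mixture):
    """Return the names of components that have no sub-components."""
    return [c.get("name") for _, c in walk(mixture) if not c.get("contents")]


mixture = json.loads(MIXFILE_TEXT)

# Print an indented outline of the hierarchy.
for depth, comp in walk(mixture):
    print("  " * depth + comp.get("name", "(unnamed)"))
```

Calling `leaf_names(mixture)` returns only the individual substances, which is the kind of flattening a component search or inventory lookup would start from.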

MInChI: Canonical Mixture Identifiers

While Mixfiles provide comprehensive descriptions, you also need simple identifiers for database storage and searching. This is where MInChI comes in.

A MInChI string is structured as:

MInChI=0.00.1S/<components>/<indexing>/<concentration>

Header: Version information (0.00.1S in the paper’s specification)
Components: Standard InChI for each unique molecule, sorted alphabetically by the InChI strings themselves, then concatenated with &
Indexing: Hierarchical structure using curly braces {} for branches and & for adjacent nodes; uses 1-based integer indices referring to the sorted InChI list
Concentration: Quantitative information for each component, with units converted to canonical codes

Why This Matters

MInChI strings enable simple database searches:

Check if a specific component appears in any mixture
Compare different formulations of the same product
Identify similar mixtures based on string similarity

Validating the Standard Through Practical Tooling

The paper demonstrates the format’s capabilities through several practical applications and a proof-of-concept implementation:

Text Extraction Algorithm

The authors demonstrate a proof-of-concept algorithm that uses regular expressions and chemical name recognition to parse plain-text mixture descriptions into structured Mixfile data. The algorithm:

Applies regex rules to remove filler words and extract concentrations
Looks up cleaned names against a custom chemical database
Falls back to OPSIN for SMILES generation from chemical names
Generates 2D coordinates for molecular structures

Graphical Editor

An open-source editor provides:

Drag-and-drop interface for building hierarchical structures
Chemical structure sketching and editing
Database lookup (e.g., PubChem integration)
Automatic MInChI generation
Import/export capabilities

Example Use Cases

The paper validates the format through real-world applications:

Safety compliance: Automated hazard assessment based on concentration-dependent properties (e.g., solid osmium tetroxide vs. 1% aqueous solution)
Inventory management: Precise, searchable laboratory records
Data extraction: Parsing vendor catalogs and safety data sheets

Outcomes and Future Extensibility

The work introduces the first standardized, machine-readable formats for chemical mixtures. Key achievements:

Comprehensive representation: Mixfile captures component identity, quantity, and hierarchy
Canonical identification: MInChI provides compact, searchable identifiers
Practical tooling: Open-source editor and text extraction demonstrate feasibility
Real-world validation: Format handles diverse use cases from safety to inventory

Limitations and Future Directions

The authors acknowledge areas for improvement:

Machine learning improvements: Better text extraction using modern NLP techniques
Extended coverage: Support for polymers, complex formulations, analytical results
Community adoption: Integration with existing chemical databases and software

The hierarchical design makes Mixfile suitable for both “recipe” descriptions (how to make something) and analytical results (what was found). This flexibility should help drive adoption across different use cases in chemistry and materials science.

Reproducibility Details

Open Source Tooling & Data

The central repository for validating and establishing the MInChI standard is github.com/IUPAC/MInChI. The tools and datasets used to develop the paper’s proofs of concept are hosted elsewhere:

Graphical Editor & App codebase: The Electron application and Mixfile handling codebase (console.js) can be found at github.com/cdd/mixtures.
Text Extraction Data: The 5,615 extracted mixture records mentioned in the text extraction validation can be accessed as training data inside the cdd/mixtures repository under reference/gathering.zip.

Algorithms

This section provides the specific algorithmic logic, schema definitions, and standardization rules needed to replicate the Mixfile parser or MInChI generator.

The Strict Mixfile JSON Schema

To implement the format, a parser must recognize these specific fields:

Root Structure:

{
  "mixfileVersion": 0.01,
  "header": {},
  "contents": []
}

Component Fields:

name: string (required if no structure is provided)
molfile: string (the primary source of truth for molecular structure)
smiles, inchi, formula: derived/transient fields for convenience
quantity: number OR [min, max] array for ranges
units: string (must map to supported ontology)
relation: string (e.g., ">", "~", ">=")
contents: recursive array for hierarchical mixtures
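The field rules above can be checked mechanically. The sketch below is one possible validator, not the reference implementation: the function name `validate_component`, the error messages, and the choice to treat only molfile, smiles, and inchi as "structure" are assumptions made for illustration. It does not check units against any ontology.

```python
# Assumption: these three fields count as "a structure is provided".
STRUCTURE_FIELDS = ("molfile", "smiles", "inchi")


def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_component(comp, path="root"):
    """Return a list of human-readable problems found in a component subtree."""
    problems = []

    # name is required when no structure is given.
    has_structure = any(comp.get(field) for field in STRUCTURE_FIELDS)
    if not has_structure and not comp.get("name"):
        problems.append(f"{path}: needs a name when no structure is given")

    # quantity is a number or a [min, max] pair.
    qty = comp.get("quantity")
    if qty is not None:
        is_range = (
            isinstance(qty, list)
            and len(qty) == 2
            and all(_is_number(v) for v in qty)
        )
        if not (_is_number(qty) or is_range):
            problems.append(f"{path}: quantity must be a number or [min, max]")

    # units and relation are strings when present.
    for key in ("units", "relation"):
        if key in comp and not isinstance(comp[key], str):
            problems.append(f"{path}: {key} must be a string")

    # contents is a list of components, validated recursively.
    contents = comp.get("contents", [])
    if not isinstance(contents, list):
        problems.append(f"{path}: contents must be an array")
    else:
        for i, child in enumerate(contents):
            problems.extend(validate_component(child, f"{path}.contents[{i}]"))

    return problems
```

An empty list means no problems were found; for example, a component with a name, a SMILES string, and a numeric quantity passes, while a nameless component whose quantity is the string "lots" produces two messages.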

MInChI Generation Algorithm

To generate MInChI=0.00.1S/..., the software must follow these steps:

Component Layer:
- Calculate standard InChI for all structures in the mixture
- Sort distinct InChIs alphabetically by the InChI string itself
- Join with & to form the structure layer
Hierarchy & Concentration Layers:
- Traverse the Mixfile tree recursively
- Indexing: Use integer indices (1-based) referring to the sorted InChI list
- Grouping: Use {} to denote hierarchy branches and & to separate nodes at the same level
- Concentration: Convert all quantities to canonical unit codes and apply scaling factors
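The component and indexing steps can be sketched as follows. This is a simplified illustration, not the official generator. It assumes every structured component already carries a standard InChI in its `inchi` field (no InChI calculation is done here), that the `InChI=1S/` prefix is dropped inside the component layer, and that a component without a structure contributes an empty index. The concentration layer and the final assembly of the full string are deliberately left out; the exact serialization should be taken from the specification.

```python
INCHI_PREFIX = "InChI=1S/"


def collect_inchis(node, found=None):
    """Gather the distinct InChI strings in a Mixfile tree."""
    found = set() if found is None else found
    if node.get("inchi"):
        found.add(node["inchi"])
    for child in node.get("contents", []):
        collect_inchis(child, found)
    return found


def index_layer(node, order):
    """Serialize the hierarchy: {} for branches, & between siblings."""
    children = node.get("contents", [])
    if children:
        return "{" + "&".join(index_layer(c, order) for c in children) + "}"
    inchi = node.get("inchi")
    # Assumption: a leaf without a structure gets an empty index.
    return str(order[inchi]) if inchi else ""


def minchi_layers(mixture):
    """Return (component_layer, index_layer) for a Mixfile tree."""
    # Sort by the InChI string itself (plain code-point order here).
    sorted_inchis = sorted(collect_inchis(mixture))
    # 1-based indices into the sorted list.
    order = {inchi: i for i, inchi in enumerate(sorted_inchis, start=1)}
    bodies = [
        s[len(INCHI_PREFIX):] if s.startswith(INCHI_PREFIX) else s
        for s in sorted_inchis
    ]
    return "&".join(bodies), index_layer(mixture, order)


example = {
    "name": "acetone in water",
    "contents": [
        {"name": "water", "inchi": "InChI=1S/H2O/h1H2"},
        {"name": "acetone", "inchi": "InChI=1S/C3H6O/c1-3(2)4/h1-2H3"},
    ],
}
components, indexing = minchi_layers(example)
print(components)  # C3H6O/c1-3(2)4/h1-2H3&H2O/h1H2
print(indexing)    # {2&1}
```

The example lists water first in the Mixfile, yet acetone receives index 1 because indices follow the sorted InChI list, not the order of the tree. That separation is what makes the component layer canonical while the indexing layer still records the original hierarchy.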

Unit Standardization Table

Replication requires mapping input units to these canonical MInChI codes:

Input Unit	MInChI Code	Scale Factor
%	pp	1
w/w%	wf	0.01
v/v%	vf	0.01
mol/L (M)	mr	1
mmol/L	mr	$10^{-3}$
g/L	wv	$10^{-3}$
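In code, the table reduces to a lookup followed by a multiplication. The sketch below covers only the rows listed above; the names `UNIT_TABLE` and `to_canonical` are illustrative, and treating "M" as a separate key for mol/L is a reading of the "mol/L (M)" row.

```python
# Input unit -> (canonical MInChI code, scale factor), from the table above.
UNIT_TABLE = {
    "%": ("pp", 1.0),
    "w/w%": ("wf", 0.01),
    "v/v%": ("vf", 0.01),
    "mol/L": ("mr", 1.0),
    "M": ("mr", 1.0),
    "mmol/L": ("mr", 1e-3),
    "g/L": ("wv", 1e-3),
}


def to_canonical(quantity, units):
    """Convert a quantity (number or [min, max]) to (code, scaled value)."""
    code, scale = UNIT_TABLE[units]  # raises KeyError for unsupported units
    if isinstance(quantity, (list, tuple)):
        return code, [q * scale for q in quantity]
    return code, quantity * scale


print(to_canonical(99, "%"))        # ('pp', 99.0)
print(to_canonical(250, "mmol/L"))  # code 'mr', value close to 0.25
```

Both mol/L and mmol/L map to the same code, so two Mixfiles that express the same concentration in different units end up with the same canonical value.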

Text Extraction Logic

The paper defines a recursive procedure for parsing plain-text mixture descriptions:

Input: Raw text string (e.g., “2 M acetone in water”)
Rule Application: Apply RegEx rules in order:
- Remove: Delete common filler words (“solution”, “in”)
- Replace: Substitute known variations
- Concentration: Extract quantities like “2 M”, “97%”
- Branch: Split phrases like “A in B” into sub-nodes
Lookup: Check cleaned name against a custom table (handles cases like “xylenes” or specific structures)
OPSIN: If no lookup match, send to the OPSIN tool to generate SMILES from the chemical name
Embed: If structure found, generate 2D coordinates (Molfile) via RDKit
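A stripped-down version of this pipeline can be sketched with the standard library alone. Everything specific in the sketch is a stand-in: the regular expressions, the two-entry `LOOKUP` table, and the function names are hypothetical, and the paper's actual rule set is considerably larger. The sketch splits on " in " before cleaning each part, which is an ordering assumption, and it omits the OPSIN and RDKit steps entirely.

```python
import re

# Hypothetical concentration rule: optional relation, number, unit.
CONCENTRATION = re.compile(
    r"(?P<rel>>=|<=|>|<|~)?\s*(?P<qty>\d+(?:\.\d+)?)\s*(?P<units>%|mM|M)(?![A-Za-z])"
)
# Hypothetical filler-word rule.
FILLER = re.compile(r"\b(?:solution|of)\b", re.IGNORECASE)
# Hypothetical stand-in for the custom name-to-structure table.
LOOKUP = {"acetone": "CC(=O)C", "water": "O"}


def parse_phrase(text):
    """Parse one component phrase into a Mixfile-style dict."""
    node = {}
    match = CONCENTRATION.search(text)
    if match:
        node["quantity"] = float(match["qty"])
        node["units"] = match["units"]
        if match["rel"]:
            node["relation"] = match["rel"]
        # Cut the concentration out of the remaining name text.
        text = text[:match.start()] + " " + text[match.end():]
    # Drop filler words, collapse whitespace, trim stray punctuation.
    name = " ".join(FILLER.sub(" ", text).split()).strip(" ,")
    node["name"] = name
    if name.lower() in LOOKUP:
        node["smiles"] = LOOKUP[name.lower()]
    return node


def parse_mixture(text):
    """Recursively split 'A in B' into sub-nodes, then parse each part."""
    solute, sep, solvent = text.partition(" in ")
    if not sep:
        return parse_phrase(text)
    return {"name": text, "contents": [parse_phrase(solute), parse_mixture(solvent)]}


result = parse_mixture("2 M acetone in water")
print(result)
```

For the input "2 M acetone in water", the result is a two-component tree: acetone with quantity 2.0 and units "M", and water with no concentration. Names that miss the lookup table are kept as plain names, which is the point where a full implementation would fall back to OPSIN.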

Paper Information

Citation: Clark, A. M., McEwen, L. R., Gedeck, P., & Bunin, B. A. (2019). Capturing mixture composition: An open machine-readable format for representing mixed substances. Journal of Cheminformatics, 11(1), 33. https://doi.org/10.1186/s13321-019-0357-4

Publication: Journal of Cheminformatics (2019)

@article{clark2019capturing,
  title={Capturing mixture composition: an open machine-readable format for representing mixed substances},
  author={Clark, Alex M and McEwen, Leah R and Gedeck, Peter and Bunin, Barry A},
  journal={Journal of cheminformatics},
  volume={11},
  number={1},
  pages={1--14},
  year={2019},
  publisher={Springer}
}

Additional Resources:

Official MInChI GitHub repository: github.com/IUPAC/MInChI
