Paper Information

Citation: Clark, A. M., McEwen, L. R., Gedeck, P., & Bunin, B. A. (2019). Capturing mixture composition: An open machine-readable format for representing mixed substances. Journal of Cheminformatics, 11(1), 33. https://doi.org/10.1186/s13321-019-0357-4

Publication: Journal of Cheminformatics (2019)

What kind of paper is this?

This is a format specification and methods paper that introduces two complementary standards for representing chemical mixtures: the detailed Mixfile format for comprehensive mixture descriptions and the compact MInChI (Mixtures InChI) specification for canonical mixture identifiers.

What is the motivation?

Here’s a fundamental gap in chemical informatics: we have excellent standards for representing individual molecules (SMILES, InChI, Molfile), but no widely accepted format for representing mixtures. This is a major problem because most real chemistry involves mixtures, not pure compounds.

Think about everyday chemical work: you’re rarely dealing with a perfectly pure substance. Instead, you have:

  • Reagents with specified purity (e.g., “≥97% pure”)
  • Solutions and formulations
  • Complex mixtures like “hexanes” (which contains multiple isomers)
  • Drug formulations with active ingredients and excipients

Without a machine-readable standard, chemists are forced to describe these mixtures in plain text that software can’t parse or analyze systematically. This creates barriers for automated safety analysis, inventory management, and data sharing.

What is the novelty here?

The authors propose a two-part solution:

  1. Mixfile: A detailed, hierarchical JSON format that captures the complete composition of a mixture
  2. MInChI: A compact, canonical string identifier derived from Mixfile data

This dual approach gives you both comprehensive description (Mixfile) and simple identification (MInChI), similar to how you might have both a detailed recipe and a short name for a dish.

What Makes a Good Mixture Format?

The authors identify three essential properties any mixture format must capture:

  1. Compound: What molecules are present?
  2. Quantity: How much of each component?
  3. Hierarchy: How are components organized (e.g., mixtures-of-mixtures)?

The hierarchical aspect is crucial. Consider “hexanes”: it is not just a list of isomers but a named mixture containing specific proportions of n-hexane, 2-methylpentane, 3-methylpentane, and other isomers. A mixture format needs to represent both the individual isomers and the fact that they are grouped under the umbrella term “hexanes.”

Mixfile Format Details

Mixfile uses JSON as its foundation, making it both human-readable and easy to parse in modern programming languages. The core structure is a hierarchical tree where each component can contain:

  • name: Component identifier
  • molfile/smiles/inchi: Molecular structure (molfile is the primary source of truth)
  • quantity/units/relation: Concentration data with optional relation operators
  • contents: Array of sub-components for hierarchical mixtures
  • identifiers: Database IDs or URLs for additional information

Simple Example

A basic Mixfile might look like:

{
  "mixfileVersion": 0.01,
  "name": "Acetone, ≥99%",
  "contents": [
    {
      "name": "acetone",
      "smiles": "CC(=O)C",
      "quantity": 99,
      "units": "%",
      "relation": ">="
    }
  ]
}

Note that the paper specifies distinct fields for molecular structures: molfile (the primary source of truth), smiles, inchi, and formula. Concentration data uses separate quantity, units, and relation fields rather than a nested object.
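
Reading such a file takes little more than a JSON parser. The following is a minimal sketch, assuming only Python's standard `json` module; the helper `describe_component` is a hypothetical name used for illustration, not part of any Mixfile library.

```python
import json

# The simple example from above, embedded as a string for illustration.
MIXFILE_TEXT = """
{
  "mixfileVersion": 0.01,
  "name": "Acetone, ≥99%",
  "contents": [
    {
      "name": "acetone",
      "smiles": "CC(=O)C",
      "quantity": 99,
      "units": "%",
      "relation": ">="
    }
  ]
}
"""


def describe_component(component: dict) -> str:
    """Render one component as 'name relation quantity+units' (hypothetical helper)."""
    parts = [component.get("name", "(unnamed)")]
    if "quantity" in component:
        # Assumption: a missing relation is read as plain equality.
        relation = component.get("relation", "=")
        parts.append(f"{relation} {component['quantity']}{component.get('units', '')}")
    return " ".join(parts)


mixture = json.loads(MIXFILE_TEXT)
descriptions = [describe_component(c) for c in mixture["contents"]]
print(descriptions)
```

Because quantity, units, and relation are flat sibling fields, each can be read with a plain dictionary lookup and an absent field can simply be skipped.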

Complex Example: Mixture-of-Mixtures

For something like “ethyl acetate dissolved in hexanes,” you’d have:

{
  "mixfileVersion": 0.01,
  "name": "Ethyl acetate in hexanes",
  "contents": [
    {
      "name": "ethyl acetate",
      "smiles": "CCOC(=O)C",
      "quantity": 10,
      "units": "%"
    },
    {
      "name": "hexanes",
      "contents": [
        {
          "name": "n-hexane",
          "smiles": "CCCCCC",
          "quantity": 60,
          "units": "%"
        },
        {
          "name": "2-methylpentane", 
          "smiles": "CC(C)CCC",
          "quantity": 25,
          "units": "%"
        }
      ]
    }
  ]
}

This hierarchical structure elegantly captures the “recipe” of complex mixtures while remaining machine-readable.
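
Since every component may carry its own contents array, a short recursive function is enough to visit the whole tree. The sketch below assumes the mixture has already been loaded into a Python dictionary; the function name `walk` is illustrative only.

```python
from typing import Iterator, Tuple

# The mixture-of-mixtures example from above as a Python dictionary.
mixture = {
    "mixfileVersion": 0.01,
    "name": "Ethyl acetate in hexanes",
    "contents": [
        {"name": "ethyl acetate", "smiles": "CCOC(=O)C", "quantity": 10, "units": "%"},
        {
            "name": "hexanes",
            "contents": [
                {"name": "n-hexane", "smiles": "CCCCCC", "quantity": 60, "units": "%"},
                {"name": "2-methylpentane", "smiles": "CC(C)CCC", "quantity": 25, "units": "%"},
            ],
        },
    ],
}


def walk(component: dict, depth: int = 0) -> Iterator[Tuple[int, str]]:
    """Yield (depth, name) for a component and all of its descendants, depth-first."""
    yield depth, component.get("name", "(unnamed)")
    for child in component.get("contents", []):
        yield from walk(child, depth + 1)


tree = list(walk(mixture))
for depth, name in tree:
    print("  " * depth + name)
```

The printed outline shows n-hexane and 2-methylpentane indented beneath hexanes, which is the grouping that a flat component list would lose.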

MInChI: Canonical Mixture Identifiers

While Mixfiles provide comprehensive descriptions, you also need simple identifiers for database storage and searching. That’s where MInChI comes in.

A MInChI string is structured as:

MInChI=0.00.1S/<components>/<indexing>/<concentration>
  • Header: Version information (0.00.1S in the paper’s specification)
  • Components: Standard InChI for each unique molecule, sorted alphabetically by the InChI strings themselves, then concatenated with &
  • Indexing: Hierarchical structure using curly braces {} for branches and & for adjacent nodes; uses 1-based integer indices referring to the sorted InChI list
  • Concentration: Quantitative information for each component, with units converted to canonical codes

Why This Matters

MInChI strings enable simple database searches:

  • Check if a specific component appears in any mixture
  • Compare different formulations of the same product
  • Identify similar mixtures based on string similarity
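
The first of these searches can be illustrated with an inverted index from component InChI to mixture names. This is a sketch under stated assumptions: the `MIXTURES` dictionary is a hypothetical stand-in for a real database, and the component InChIs are supplied directly instead of being parsed out of MInChI strings.

```python
from collections import defaultdict

# Hypothetical records: mixture name -> standard InChIs of its components.
MIXTURES = {
    "acetone in water": [
        "InChI=1S/C3H6O/c1-3(2)4/h1-2H3",
        "InChI=1S/H2O/h1H2",
    ],
    "aqueous ethanol": [
        "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
        "InChI=1S/H2O/h1H2",
    ],
}


def build_index(mixtures: dict) -> dict:
    """Map each component InChI to the set of mixture names that contain it."""
    index = defaultdict(set)
    for name, inchis in mixtures.items():
        for inchi in inchis:
            index[inchi].add(name)
    return index


index = build_index(MIXTURES)
water_hits = sorted(index["InChI=1S/H2O/h1H2"])
acetone_hits = sorted(index["InChI=1S/C3H6O/c1-3(2)4/h1-2H3"])
print(water_hits)
print(acetone_hits)
```

Matching whole InChI strings as dictionary keys avoids the false positives that naive substring search could produce when one identifier happens to be a prefix of another.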

What experiments were performed?

The paper demonstrates the format’s capabilities through several practical applications and a proof-of-concept implementation:

Text Extraction Algorithm

The authors demonstrate a proof-of-concept algorithm that uses regular expressions and chemical name recognition to parse plain-text mixture descriptions into structured Mixfile data. The algorithm:

  1. Applies regex rules to remove filler words and extract concentrations
  2. Looks up cleaned names against a custom chemical database
  3. Falls back to OPSIN for SMILES generation from chemical names
  4. Generates 2D coordinates for molecular structures

Graphical Editor

An open-source editor provides:

  • Drag-and-drop interface for building hierarchical structures
  • Chemical structure sketching and editing
  • Database lookup (e.g., PubChem integration)
  • Automatic MInChI generation
  • Import/export capabilities

Example Use Cases

The paper illustrates the format with real-world applications:

  • Safety compliance: Automated hazard assessment based on concentration-dependent properties (e.g., solid osmium tetroxide vs. 1% aqueous solution)
  • Inventory management: Precise, searchable laboratory records
  • Data extraction: Parsing vendor catalogs and safety data sheets

What outcomes/conclusions?

The work introduces open, machine-readable formats for chemical mixtures, filling the gap left by single-molecule standards. Key achievements:

  • Comprehensive representation: Mixfile captures component identity, quantity, and hierarchy
  • Canonical identification: MInChI provides compact, searchable identifiers
  • Practical tooling: Open-source editor and text extraction demonstrate feasibility
  • Real-world validation: Format handles diverse use cases from safety to inventory

Limitations and Future Directions

The authors acknowledge areas for improvement:

  • Machine learning improvements: Better text extraction using modern NLP techniques
  • Extended coverage: Support for polymers, complex formulations, analytical results
  • Community adoption: Integration with existing chemical databases and software

The hierarchical design makes Mixfile suitable for both “recipe” descriptions (how to make something) and analytical results (what was found). This flexibility should help drive adoption across different use cases in chemistry and materials science.

Reproducibility Details

Implementation Details

This section provides the specific algorithmic logic, schema definitions, and standardization rules needed to replicate the Mixfile parser or MInChI generator.

The Strict Mixfile JSON Schema

To implement the format, a parser must recognize these specific fields:

Root Structure:

{
  "mixfileVersion": 0.01,
  "header": {},
  "contents": []
}

Component Fields:

  • name: string (required if no structure is provided)
  • molfile: string - the primary source of truth for molecular structure
  • smiles, inchi, formula: derived/transient fields for convenience
  • quantity: number OR [min, max] array for ranges
  • units: string (must map to supported ontology)
  • relation: string (e.g., ">", "~", ">=")
  • contents: recursive array for hierarchical mixtures
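
The field rules above can be checked with a small recursive validator. This is a rough sketch, not the authors' schema: it assumes that molfile, smiles, and inchi count as "structure" for the name requirement, and it does not check units or relation values.

```python
from numbers import Real

# Assumption: these three fields count as "a structure is provided".
STRUCTURE_FIELDS = ("molfile", "smiles", "inchi")


def _is_number(value) -> bool:
    """True for ints and floats, but not for booleans."""
    return isinstance(value, Real) and not isinstance(value, bool)


def validate_component(component: dict) -> list:
    """Return a list of problems found in a component and its sub-components."""
    problems = []
    has_structure = any(field in component for field in STRUCTURE_FIELDS)
    if not has_structure and "name" not in component:
        problems.append("component needs a name when no structure is given")
    quantity = component.get("quantity")
    if quantity is not None:
        is_range = (
            isinstance(quantity, list)
            and len(quantity) == 2
            and all(_is_number(q) for q in quantity)
        )
        if not (_is_number(quantity) or is_range):
            problems.append("quantity must be a number or a [min, max] pair")
    # Recurse into hierarchical contents.
    for child in component.get("contents", []):
        problems.extend(validate_component(child))
    return problems


print(validate_component({"name": "acetone", "quantity": 99, "units": "%"}))
print(validate_component({"quantity": "lots"}))
```

Returning a list of problems, as opposed to raising on the first one, lets a caller report every issue in a deeply nested mixture in one pass.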

MInChI Generation Algorithm

To generate MInChI=0.00.1S/..., the software must follow these steps:

  1. Component Layer:

    • Calculate standard InChI for all structures in the mixture
    • Sort distinct InChIs alphabetically by the InChI string itself
    • Join with & to form the structure layer
  2. Hierarchy & Concentration Layers:

    • Traverse the Mixfile tree recursively
    • Indexing: Use integer indices (1-based) referring to the sorted InChI list
    • Grouping: Use {} to denote hierarchy branches and & to separate nodes at the same level
    • Concentration: Convert all quantities to canonical unit codes and apply scaling factors
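
The component and indexing steps can be sketched as follows. Several things here are assumptions and not taken from the specification: each leaf already carries a precomputed standard InChI in its inchi field, Python's default string ordering stands in for "alphabetical", InChI strings are kept whole, the root is wrapped in braces like any other branch, and the concentration layer is left out entirely. The example mixture is hypothetical.

```python
# Hypothetical mixture whose leaves already carry standard InChI strings.
mixture = {
    "name": "acetone in aqueous ethanol",
    "contents": [
        {"name": "acetone", "inchi": "InChI=1S/C3H6O/c1-3(2)4/h1-2H3"},
        {
            "name": "aqueous ethanol",
            "contents": [
                {"name": "ethanol", "inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"},
                {"name": "water", "inchi": "InChI=1S/H2O/h1H2"},
            ],
        },
    ],
}


def collect_inchis(node: dict) -> set:
    """Gather the distinct InChI strings found anywhere in the tree."""
    found = {node["inchi"]} if "inchi" in node else set()
    for child in node.get("contents", []):
        found |= collect_inchis(child)
    return found


def index_layer(node: dict, positions: dict) -> str:
    """Braces mark a branch, '&' separates siblings, leaves become 1-based indices."""
    children = node.get("contents", [])
    if children:
        return "{" + "&".join(index_layer(c, positions) for c in children) + "}"
    # Assumption: a leaf without a structure contributes an empty slot.
    return str(positions[node["inchi"]]) if "inchi" in node else ""


sorted_inchis = sorted(collect_inchis(mixture))
positions = {inchi: i for i, inchi in enumerate(sorted_inchis, start=1)}
component_layer = "&".join(sorted_inchis)
indexing = index_layer(mixture, positions)
print(component_layer)
print(indexing)
```

In this example ethanol sorts first, acetone second, and water third, so the tree is encoded as `{2&{1&3}}`: acetone at the top level next to a branch holding ethanol and water.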

Unit Standardization Table

Replication requires mapping input units to these canonical MInChI codes:

| Input Unit | MInChI Code | Scale Factor |
|---|---|---|
| % | pp | 1 |
| w/w% | wf | 0.01 |
| v/v% | vf | 0.01 |
| mol/L (M) | mr | 1 |
| mmol/L | mr | $10^{-3}$ |
| g/L | wv | $10^{-3}$ |
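
In code, the table above reduces to a lookup followed by a multiplication. The sketch below transcribes the rows as listed; the names `UNIT_TABLE` and `to_canonical` are illustrative, and treating "M" and "mol/L" as separate keys for the same row is an assumption about how input is spelled.

```python
# Input unit -> (MInChI code, scale factor), transcribed from the table above.
UNIT_TABLE = {
    "%": ("pp", 1.0),
    "w/w%": ("wf", 0.01),
    "v/v%": ("vf", 0.01),
    "mol/L": ("mr", 1.0),
    "M": ("mr", 1.0),
    "mmol/L": ("mr", 1e-3),
    "g/L": ("wv", 1e-3),
}


def to_canonical(quantity: float, units: str) -> tuple:
    """Return (scaled quantity, MInChI unit code); raise on unsupported units."""
    if units not in UNIT_TABLE:
        raise ValueError(f"unsupported unit: {units!r}")
    code, scale = UNIT_TABLE[units]
    return quantity * scale, code


print(to_canonical(2, "M"))
print(to_canonical(50, "mmol/L"))
print(to_canonical(97, "w/w%"))
```

Failing loudly on an unknown unit is a deliberate choice in this sketch: silently passing the value through would produce an identifier that looks canonical but is not.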

Text Extraction Logic

The paper defines a recursive procedure for parsing plain-text mixture descriptions:

  1. Input: Raw text string (e.g., “2 M acetone in water”)
  2. Rule Application: Apply RegEx rules in order:
    • Remove: Delete common filler words (“solution”, “in”)
    • Replace: Substitute known variations
    • Concentration: Extract quantities like “2 M”, “97%”
    • Branch: Split phrases like “A in B” into sub-nodes
  3. Lookup: Check cleaned name against a custom table (handles cases like “xylenes” or specific structures)
  4. OPSIN: If no lookup match, send to the OPSIN tool to generate SMILES from the chemical name
  5. Embed: If structure found, generate 2D coordinates (Molfile) via RDKit
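
A heavily simplified sketch of steps 1 to 3 is shown below. It is not the authors' rule set: the regular expressions, the filler-word list, and the `LOOKUP` table are hypothetical, the phrase is split on "in" before any filler removal, the two parts are returned as sibling components, and the OPSIN and RDKit steps are omitted.

```python
import re

# Hypothetical concentration pattern: optional relation, number, then a unit.
CONCENTRATION = re.compile(
    r"(?P<relation>>=|<=|>|<|~)?\s*(?P<value>\d+(?:\.\d+)?)\s*"
    r"(?P<units>%|mM|M)(?![A-Za-z])"
)
FILLER = re.compile(r"\b(?:solution|of)\b", re.IGNORECASE)
LOOKUP = {"acetone": "CC(=O)C", "water": "O"}  # hypothetical custom table


def parse_part(text: str) -> dict:
    """Turn one phrase such as '2 M acetone' into a component dictionary."""
    node = {}
    match = CONCENTRATION.search(text)
    if match:
        node["quantity"] = float(match.group("value"))
        node["units"] = match.group("units")
        if match.group("relation"):
            node["relation"] = match.group("relation")
        # Cut the concentration out of the remaining name text.
        text = text[: match.start()] + text[match.end():]
    name = " ".join(FILLER.sub(" ", text).split())
    node["name"] = name
    if name.lower() in LOOKUP:
        node["smiles"] = LOOKUP[name.lower()]
    return node


def parse_mixture(text: str) -> dict:
    """Split 'A in B' into two components under a root named after the input."""
    parts = re.split(r"\s+in\s+", text.strip(), maxsplit=1)
    return {"name": text.strip(), "contents": [parse_part(p) for p in parts]}


result = parse_mixture("2 M acetone in water")
print(result)
```

For the example input, the first component carries the quantity 2.0 with units "M" plus the looked-up SMILES for acetone, while water is recorded with a name and structure only.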