Overview

SMILES (Simplified Molecular Input Line Entry System) is a line notation that represents a chemical structure as a one-dimensional string. It is a linearized, serialized representation of the molecular graph (atoms and their connectivity, not 3D coordinates), and it reads like a depth-first traversal of that graph. Like a connection table, a SMILES string identifies the nodes (atoms) and edges (bonds) of the molecular graph.

For example, the simple molecule ethanol (C2H6O) can be represented as CCO, while the more complex caffeine molecule becomes CN1C=NC2=C1C(=O)N(C(=O)N2C)C.
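To make the "string as graph traversal" idea concrete, the sketch below splits a SMILES string into its atoms, bonds, branch markers, and ring labels. It is a minimal illustration that assumes only the organic subset, bracket atoms, and the common bond symbols; the names SMILES_TOKEN and tokenize are hypothetical and not part of any library.

```python
import re

# Toy token pattern (an assumption for illustration, not a full SMILES grammar).
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"      # bracket atom, e.g. [Cu+2]
    r"|Br|Cl"          # two-letter organic-subset atoms
    r"|[BCNOPSFI]"     # one-letter organic-subset atoms
    r"|[bcnops]"       # aromatic atoms
    r"|[-=#:/\\.]"     # bond symbols and the dot
    r"|[()]"           # branch open / close
    r"|%\d{2}|\d"      # ring-closure labels
)

def tokenize(smiles):
    """Split a SMILES string into tokens, rejecting unknown characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in {smiles!r}")
    return tokens

caffeine = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
heavy_atoms = [t for t in tokenize(caffeine) if t.isalpha()]
print(tokenize("CCO"), len(heavy_atoms))
```

Each alphabetic token is one heavy atom, so counting them gives the heavy-atom count of the molecule.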

Key Characteristics

  • Human-readable: Designed to be read and written by people (in contrast to InChI, which is hierarchical and aimed at machine processing)
  • Compact: More compact than other representations (3D coordinates, connection tables)
  • Simple syntax: A small set of rules that chemists and researchers can learn quickly
  • Flexible: Both linear and cyclic structures can be represented in many different valid ways

Limitations

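  • Non-uniqueness: Different SMILES strings can represent the same molecule, for example by starting the traversal at a different atom (CCO, OCC, and C(O)C are all ethanol) or by writing a different Kekulé form of an aromatic ring.
  • Non-robustness: SMILES strings can be written that do not correspond to any valid molecular structure.
    • Strings that are syntactically invalid (e.g., an unclosed ring label or parenthesis).
    • Strings that violate basic chemical rules (e.g., more bonds on an atom than is physically possible).
  • Information loss: SMILES encodes connectivity and, optionally, stereochemistry, but not 3D coordinates or conformations, so that information is lost when a structure is converted to SMILES.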

For a more robust alternative in which every string corresponds to a valid molecule, see SELFIES (Self-Referencing Embedded Strings).
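The non-robustness point can be illustrated with a small syntax check. The sketch below is a toy, not a validator: it only looks for unbalanced parentheses and unclosed single-digit ring labels, and it does not check valences. The function name syntax_problems is hypothetical.

```python
def syntax_problems(smiles):
    """Return a list of simple syntax problems found in a SMILES string."""
    problems = []
    depth = 0            # current branch nesting depth
    open_rings = set()   # ring labels opened but not yet closed
    in_bracket = False
    for ch in smiles:
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            in_bracket = False
        elif in_bracket:
            continue  # digits inside brackets are isotopes, H counts, or charges
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                problems.append("unmatched ')'")
                depth = 0
        elif ch.isdigit():
            # A ring label opens on first use and closes on second use.
            open_rings ^= {ch}
    if depth > 0:
        problems.append("unclosed '('")
    if open_rings:
        problems.append("unclosed ring label(s): " + ", ".join(sorted(open_rings)))
    return problems

print(syntax_problems("C1CCCCC1"))  # cyclohexane: no problems
print(syntax_problems("C1CCCC"))    # ring label 1 is never closed
```

A string that passes this check can still be chemically invalid, which is the second kind of failure listed above.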

Basic Syntax

Atomic Symbols

SMILES uses standard atomic symbols with implied hydrogen atoms:

  • C (methane, CH4)
  • N (ammonia, NH3)
  • O (water, H2O)
  • P (phosphine, PH3)
  • S (hydrogen sulfide, H2S)
  • Cl (hydrogen chloride, HCl)

Bracket notation: Elements outside the organic subset must be shown in brackets, e.g., [Pt] for elemental platinum. The organic subset (B, C, N, O, P, S, F, Cl, Br, and I) can omit brackets.
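The implied hydrogen counts in the list above follow from each element's normal valence. As a rough sketch, assuming the normal valences given for the organic subset in the OpenSMILES specification, an atom outside brackets gets enough hydrogens to reach its lowest normal valence that is not below the sum of its bond orders. The names NORMAL_VALENCES and implicit_hydrogens are hypothetical.

```python
# Normal valences for organic-subset atoms (assumed from the OpenSMILES specification).
NORMAL_VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,), "P": (3, 5),
    "S": (2, 4, 6), "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
}

def implicit_hydrogens(symbol, bond_order_sum=0):
    """Hydrogens implied for an unbracketed atom with the given explicit bonds."""
    for valence in NORMAL_VALENCES[symbol]:
        if valence >= bond_order_sum:
            return valence - bond_order_sum
    return 0  # bond orders already exceed every normal valence

for symbol in ("C", "N", "O", "P", "S", "Cl"):
    print(symbol, implicit_hydrogens(symbol))
```

With no explicit bonds this reproduces CH4, NH3, H2O, PH3, H2S, and HCl. Atoms written in brackets do not get implied hydrogens; their hydrogen count is stated explicitly.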

Bond Representation

Bonds are represented by symbols:

  • Single bond: - (usually omitted)
    • Ethane (C2H6), SMILES: CC
  • Double bond: =
    • Methyl isocyanate (C2H3NO), SMILES: CN=C=O
  • Triple bond: #
    • Hydrogen cyanide (HCN), SMILES: C#N
  • Aromatic bond: : (usually omitted, because lowercase atom symbols already mark aromatic atoms)
    • Vanillin (C8H8O3), SMILES: O=Cc1ccc(O)c(OC)c1
  • No bond (disconnected parts): . separates components that are not covalently bonded to each other, such as the ions of a salt
    • Copper(II) sulfate (CuSO4), SMILES: [Cu+2].[O-]S(=O)(=O)[O-]
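In the copper(II) sulfate example, the dot splits the string into two parts that are written separately, and the bracketed charges balance out. The sketch below splits on the dot and sums the charges; it is an illustration that assumes charges are written as a sign followed by an optional number (as in [Cu+2] and [O-]). The names components and net_charge are hypothetical.

```python
import re

# Sign and optional magnitude at the end of a bracket atom, e.g. "+2" in [Cu+2].
BRACKET_CHARGE = re.compile(r"\[[^\]]*?([+-])(\d*)\]")

def components(smiles):
    """Split a SMILES string into its dot-separated parts."""
    return smiles.split(".")

def net_charge(smiles):
    """Sum the formal charges written on bracket atoms."""
    total = 0
    for sign, digits in BRACKET_CHARGE.findall(smiles):
        magnitude = int(digits) if digits else 1
        total += magnitude if sign == "+" else -magnitude
    return total

copper_sulfate = "[Cu+2].[O-]S(=O)(=O)[O-]"
print(components(copper_sulfate), net_charge(copper_sulfate))
```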

Structural Features

  • Branches: Enclosed in parentheses and can be nested. For example, CC(C)C represents isobutane: propane with a methyl branch on the middle carbon.
    • 3-Propyl-4-isopropyl-1-heptene (C13H26), SMILES: C=CC(CCC)C(C(C)C)CCC
  • Cyclic structures: Written by breaking bonds and using numbers to indicate bond connections. For example, C1CCCCC1 represents cyclohexane (the 1 connects the first and last carbon).
  • Aromaticity: Lower case letters are used for atoms in aromatic rings. For example, benzene is written as c1ccccc1.
  • Formal charges: Indicated by placing the charge in brackets after the atom symbol, e.g., [C+], [C-], or [C-2]
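Branches and ring closures are what turn a flat string back into a graph. The following sketch rebuilds atoms and bonds using a stack for branches and a dictionary for open ring labels. It is a toy parser under stated assumptions: only unbracketed, non-aromatic organic-subset atoms, the bond symbols -, =, and #, and single-digit ring labels are handled. The name parse_graph is hypothetical.

```python
import re

TOKEN = re.compile(r"Br|Cl|[BCNOPSFI]|[-=#]|[()]|\d")
BOND_ORDER = {"-": 1, "=": 2, "#": 3}

def parse_graph(smiles):
    """Return (atoms, bonds); bonds are (atom index, atom index, order)."""
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError("string is outside the toy subset")
    atoms, bonds = [], []
    prev = None          # atom the next atom will bond to
    pending = 1          # bond order for the next bond
    branch_stack = []    # atoms to return to when a branch closes
    ring_open = {}       # ring label -> (atom index, bond order)
    for tok in tokens:
        if tok in BOND_ORDER:
            pending = BOND_ORDER[tok]
        elif tok == "(":
            branch_stack.append(prev)
        elif tok == ")":
            prev = branch_stack.pop()
        elif tok.isdigit():
            if tok in ring_open:   # second use of the label closes the ring
                start, order = ring_open.pop(tok)
                bonds.append((start, prev, max(order, pending)))
            else:                  # first use of the label opens the ring
                ring_open[tok] = (prev, pending)
            pending = 1
        else:                      # an atom symbol
            atoms.append(tok)
            if prev is not None:
                bonds.append((prev, len(atoms) - 1, pending))
            prev = len(atoms) - 1
            pending = 1
    return atoms, bonds

print(parse_graph("CC(C)C"))    # isobutane: atom 1 has three neighbors
print(parse_graph("C1CCCCC1"))  # cyclohexane: the ring label adds the sixth bond
```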

Stereochemistry and Isomers

Isotope Notation

Isotope notation specifies the exact isotope of an element and comes before the element within square brackets, e.g., [13C] for carbon-13.
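Inside square brackets the fields appear in a fixed order: isotope, element symbol, chirality, hydrogen count, then charge. The sketch below pulls these apart with a regular expression. It is a simplified illustration that assumes only @ and @@ chirality tags and ignores atom classes and two-letter aromatic symbols; the names BRACKET_ATOM and parse_bracket_atom are hypothetical.

```python
import re

BRACKET_ATOM = re.compile(
    r"\[(?P<isotope>\d+)?"
    r"(?P<symbol>[A-Z][a-z]?|[a-z])"
    r"(?P<chiral>@@?)?"
    r"(?:H(?P<hcount>\d*))?"
    r"(?P<charge>[+-]\d*)?\]"
)

def _charge_value(text):
    """Convert '+', '-', '+2', '-2' (or None) to an integer charge."""
    if text is None:
        return 0
    return int(text) if len(text) > 1 else int(text + "1")

def parse_bracket_atom(text):
    """Split a bracket atom such as [13C] or [Cu+2] into its fields."""
    m = BRACKET_ATOM.fullmatch(text)
    if m is None:
        raise ValueError(f"not a supported bracket atom: {text!r}")
    hcount = m.group("hcount")
    return {
        "isotope": int(m.group("isotope")) if m.group("isotope") else None,
        "symbol": m.group("symbol"),
        "chiral": m.group("chiral"),
        # A bare H means one hydrogen; no H field means none.
        "hcount": 0 if hcount is None else int(hcount or 1),
        "charge": _charge_value(m.group("charge")),
    }

print(parse_bracket_atom("[13C]"))
print(parse_bracket_atom("[Cu+2]"))
```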

Double Bond Stereochemistry

Directional bonds can be specified using \ and / symbols to indicate the stereochemistry of double bonds:

  • C/C=C/C represents (E)-2-butene (trans configuration)
  • C/C=C\C represents (Z)-2-butene (cis configuration)

The direction of the slashes indicates which side of the double bond each substituent is on.
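For the simplest case, one directional bond on each side of the double bond, the rule reduces to comparing the two slashes: matching directions put the substituents on opposite sides (trans), and differing directions put them on the same side (cis). The sketch below applies only that rule to strings of the form A/B=C/D; the name alkene_geometry is hypothetical, and more complex strings are rejected.

```python
import re

# Matches only the simple pattern A/B=C/D with one directional bond on each side.
SIMPLE_ALKENE = re.compile(
    r"^[A-Za-z]+([/\\])[A-Za-z]+=[A-Za-z]+([/\\])[A-Za-z]+$"
)

def alkene_geometry(smiles):
    """Classify a simple A/B=C/D SMILES string as 'cis' or 'trans'."""
    m = SIMPLE_ALKENE.match(smiles)
    if m is None:
        raise ValueError("only the simple pattern A/B=C/D is supported")
    first, second = m.groups()
    # Same slash on both sides: opposite sides of the double bond (trans).
    # Different slashes: same side of the double bond (cis).
    return "trans" if first == second else "cis"

print(alkene_geometry("C/C=C/C"))
print(alkene_geometry("C/C=C\\C"))
```

For 2-butene, trans corresponds to (E) and cis to (Z); for substituents with different priorities, E/Z assignment needs the CIP priority rules, which this sketch does not implement.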

Tetrahedral Chirality

Chirality around tetrahedral centers uses @ and @@ symbols:

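  • N[C@](C)(F)C(=O)O and N[C@@](F)(C)C(=O)O describe the same enantiomer: swapping two neighbors (here C and F) and switching @ to @@ cancel each other out
  • @ means that, viewed from the first neighbor, the remaining neighbors are listed anticlockwise; @@ means they are listed clockwise
  • @ and @@ are shorthand for @TH1 and @TH2, respectively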
  • Example: glucose (C6H12O6), SMILES: OC[C@@H](O1)[C@@H](O)[C@H](O)[C@@H](O)[C@H](O)1
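Because @ and @@ are defined relative to the order in which neighbors are written, two strings can list the neighbors differently and still describe the same center. Swapping any two neighbors inverts the sense, so an odd reordering needs the opposite tag and an even reordering keeps the same tag. The sketch below checks this with a permutation-parity count; the function names and the neighbor labels are hypothetical.

```python
def permutation_is_even(original, reordered):
    """Return True if `reordered` is an even permutation of `original`."""
    perm = [original.index(item) for item in reordered]
    inversions = sum(
        1
        for i in range(len(perm))
        for j in range(i + 1, len(perm))
        if perm[i] > perm[j]
    )
    return inversions % 2 == 0

def same_chirality(order_a, tag_a, order_b, tag_b):
    """Compare two @/@@ specifications of one tetrahedral center."""
    if sorted(order_a) != sorted(order_b):
        raise ValueError("both orders must list the same neighbors")
    if permutation_is_even(order_a, order_b):
        return tag_a == tag_b
    return tag_a != tag_b

# Neighbor orders as written in N[C@](C)(F)C(=O)O and N[C@@](F)(C)C(=O)O.
first = ["N", "CH3", "F", "COOH"]
second = ["N", "F", "CH3", "COOH"]
print(same_chirality(first, "@", second, "@@"))
```

The neighbor labels must be distinct for the comparison to be meaningful.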

Advanced Stereochemistry

More general notation for other stereocenters:

  • @AL1, @AL2 for allene-type stereocenters
  • @SP1, @SP2, @SP3 for square-planar stereocenters
  • @TB1 through @TB20 for trigonal bipyramidal stereocenters
  • @OH1 through @OH30 for octahedral stereocenters

SMILES allows partial specification since it relies on local chirality instead of absolute chirality.

Practical Applications

SMILES notation is widely used in:

  • Chemical databases: Storage and retrieval of molecular structures
  • Machine learning: Input representation for molecular property prediction
  • Chemical informatics: Substructure searching and similarity analysis
  • Drug discovery: High-throughput virtual screening
  • Chemical reaction databases: Representing reactants and products

For a hands-on tutorial on visualizing SMILES strings as 2D molecular images, see Converting SMILES Strings to 2D Molecular Images.

Variants and Standards

Canonical SMILES

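Canonical SMILES seeks a single, unique string for each molecule, so that the same structure always yields the same string. The result depends on the canonicalization algorithm, so canonical SMILES produced by different software implementations are not guaranteed to match.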
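The idea behind canonicalization can be shown on a deliberately tiny case. An unbranched, acyclic chain written without parentheses can start from either end, so picking the lexicographically smaller of the two directions gives one agreed string. This is only a toy under that assumption; real toolkits use graph canonicalization algorithms that handle branches, rings, and stereochemistry. The name canonical_chain is hypothetical.

```python
import re

ATOM_OR_BOND = re.compile(r"Br|Cl|[BCNOPSFI]|[-=#]")

def canonical_chain(smiles):
    """Pick one of the two end-to-end writings of an unbranched chain."""
    tokens = ATOM_OR_BOND.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError("only unbranched, acyclic organic-subset chains are supported")
    forward = "".join(tokens)
    backward = "".join(reversed(tokens))  # reverse by token so Cl and Br stay intact
    return min(forward, backward)

print(canonical_chain("CCO"), canonical_chain("OCC"))
```

Both writings of ethanol map to the same string, which is the property a canonical form is meant to provide.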

OpenSMILES vs. Proprietary

  • Proprietary: The original SMILES specification is owned by Daylight Chemical Information Systems and is not an open standard, which can cause compatibility issues between different groups/labs
  • OpenSMILES: A community-developed open specification of SMILES, created to address these compatibility concerns

Isomeric SMILES

Isomeric SMILES incorporates isotopes and stereochemistry information, providing more detailed molecular representations.

References