Overview

SMILES (Simplified Molecular Input Line Entry System) is a line notation that encodes a chemical structure as a single string. The string is a linearized, serialized form of the molecular graph (atoms and their connectivity, not 3D coordinates), and it reads like a depth-first traversal of that graph. Like a connection table, SMILES identifies the nodes (atoms) and edges (bonds) of the molecular graph.

Key Characteristics

  • Human-readable: Designed to be read and written by people (in contrast to InChI, which is hierarchical and aimed at machine processing)
  • Compact: Shorter than other representations such as 3D coordinates or connection tables
  • Simple syntax: A small set of rules that chemists and researchers can learn quickly
  • Flexible: Linear and cyclic structures are both supported, and the same molecule can usually be written as many different valid strings
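
To make the depth-first picture concrete, here is a minimal Python sketch that writes a SMILES string for an acyclic molecule by walking its graph. The function name write_smiles and the list-plus-dictionary input format are assumptions made for this illustration, not part of any toolkit. The sketch handles only single bonds between uncharged organic-subset atoms, and it wraps side branches in parentheses (see Structural Features).

```python
def write_smiles(symbols, neighbors, start=0):
    """Write a SMILES string for an acyclic, single-bonded molecule.

    symbols   -- list of element symbols, one per atom
    neighbors -- dict mapping an atom index to the indices bonded to it
    start     -- index of the atom where the traversal begins
    """
    visited = set()

    def visit(atom):
        visited.add(atom)
        children = [n for n in neighbors[atom] if n not in visited]
        text = symbols[atom]
        # Every child except the last becomes a parenthesized branch.
        for child in children[:-1]:
            text += "(" + visit(child) + ")"
        # The last child continues the main chain.
        if children:
            text += visit(children[-1])
        return text

    return visit(start)


# Isobutane: a central carbon (index 1) bonded to three methyl carbons.
symbols = ["C", "C", "C", "C"]
neighbors = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
print(write_smiles(symbols, neighbors, start=0))  # CC(C)C
print(write_smiles(symbols, neighbors, start=1))  # C(C)(C)C
```

Starting the traversal from a different atom gives a different string for the same molecule, which is why one structure can have many valid SMILES.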

Basic Syntax

Atomic Symbols

SMILES uses standard atomic symbols with implied hydrogen atoms:

  • C (methane, CH4)
  • N (ammonia, NH3)
  • O (water, H2O)
  • P (phosphine, PH3)
  • S (hydrogen sulfide, H2S)
  • Cl (hydrogen chloride, HCl)

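Bracket notation: Elements outside the organic subset must be written in brackets, e.g., [Pt] for elemental platinum. Atoms in the organic subset (B, C, N, O, P, S, F, Cl, Br, and I) can be written without brackets, in which case their hydrogens are implied. Inside brackets, hydrogens and charges must be written explicitly.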
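
The implied-hydrogen rule can be sketched in a few lines of Python. The table below uses the normal valences that the OpenSMILES specification lists for the organic subset. The function name implicit_hydrogens is an assumption for this illustration, and the sketch covers only non-aromatic atoms written without brackets.

```python
# Normal valences for organic-subset atoms written without brackets.
NORMAL_VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,),
    "P": (3, 5), "S": (2, 4, 6),
    "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
}


def implicit_hydrogens(symbol, bond_order_sum=0):
    """Return the implied hydrogen count for an unbracketed atom.

    bond_order_sum is the total order of the bonds written explicitly
    (a double bond counts as 2). The atom is filled up to the smallest
    normal valence that is at least that sum.
    """
    for valence in NORMAL_VALENCES[symbol]:
        if valence >= bond_order_sum:
            return valence - bond_order_sum
    return 0


print(implicit_hydrogens("C"))      # 4, so "C" is methane
print(implicit_hydrogens("N"))      # 3, so "N" is ammonia
print(implicit_hydrogens("C", 3))   # 1, the carbon in C#N
```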

Bond Representation

Bonds are represented by symbols:

  • Single bond: - (usually omitted)
    Example: ethane (C2H6), SMILES: CC
  • Double bond: =
    Example: methyl isocyanate (C2H3NO), SMILES: CN=C=O
  • Triple bond: #
    Example: hydrogen cyanide (HCN), SMILES: C#N
  • Aromatic bond: : (usually omitted, because aromatic atoms are written in lower case)
    Example: vanillin (C8H8O3), SMILES: O=Cc1ccc(O)c(OC)c1
  • No bond (disconnection): . separates parts of a structure that are not covalently bonded to each other, such as the ions of a salt
    Example: copper(II) sulfate (CuSO4), SMILES: [Cu+2].[O-]S(=O)(=O)[O-]
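
As a small illustration of how these symbols appear in a string, the following Python sketch lists the bond symbols written explicitly and splits a string at each dot, which marks the absence of a bond between the atoms on either side. The helper names explicit_bonds and components are assumptions for this example. Bracket atoms are removed before scanning because a - inside brackets is a charge, not a bond. The split is only a textual one: a ring-closure digit can still join atoms on opposite sides of a dot.

```python
import re

BOND_SYMBOLS = {"-": "single", "=": "double", "#": "triple", ":": "aromatic"}


def explicit_bonds(smiles):
    """List the bond symbols written out in a SMILES string."""
    # Drop bracket atoms so that charges such as [O-] are not read as bonds.
    outside_brackets = re.sub(r"\[[^\]]*\]", "", smiles)
    return [BOND_SYMBOLS[ch] for ch in outside_brackets if ch in BOND_SYMBOLS]


def components(smiles):
    """Split a SMILES string at each dot."""
    return smiles.split(".")


print(explicit_bonds("CN=C=O"))                   # ['double', 'double']
print(components("[Cu+2].[O-]S(=O)(=O)[O-]"))     # ['[Cu+2]', '[O-]S(=O)(=O)[O-]']
```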

Structural Features

  • Branches: Enclosed in parentheses and can be nested
    Example: 3-propyl-4-isopropyl-1-heptene (C13H26), SMILES: C=CC(CCC)C(C(C)C)CCC
  • Cyclic structures: Written by breaking one bond in each ring and giving the two atoms a matching ring-closure number, e.g., C1CCCCC1 for cyclohexane
  • Aromaticity: Lower case letters denote aromatic atoms in rings, e.g., c1ccccc1 for benzene
  • Formal charges: Written inside the brackets after the atom symbol, e.g., [C+], [C-], or [C-2]
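
The rules for branches and ring closures can be sketched as a small parser that turns a SMILES string into lists of atoms and bonds. This is a simplified illustration in Python, not a complete or validating parser. The names parse_smiles, is_aromatic, and bond_order are assumptions for this example, bracket atoms are kept as opaque tokens, and the stereo marks / and \ are read as plain single bonds.

```python
import re

# Bracket atoms first, then two-letter symbols, one-letter, aromatic, wildcard.
ATOM_RE = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|\*")
BOND_ORDERS = {"-": 1, "=": 2, "#": 3, ":": 1.5}


def is_aromatic(token):
    """True if the atom token starts with a lower-case element symbol."""
    return token.lstrip("[0123456789")[:1].islower()


def bond_order(atoms, a, b, explicit):
    """Use the written bond symbol, else aromatic or single by default."""
    if explicit is not None:
        return explicit
    return 1.5 if is_aromatic(atoms[a]) and is_aromatic(atoms[b]) else 1


def parse_smiles(smiles):
    """Return (atoms, bonds); bonds are (index, index, order) tuples."""
    atoms, bonds = [], []
    branch_stack = []   # atoms to return to when a branch closes
    open_rings = {}     # ring label -> (atom index, bond written at opening)
    previous = None     # atom that the next atom bonds to
    pending = None      # explicit bond order waiting for its second atom
    i = 0
    while i < len(smiles):
        match = ATOM_RE.match(smiles, i)
        if match:
            atoms.append(match.group())
            current = len(atoms) - 1
            if previous is not None:
                bonds.append((previous, current,
                              bond_order(atoms, previous, current, pending)))
            previous, pending = current, None
            i = match.end()
            continue
        ch = smiles[i]
        if ch.isdigit() or ch == "%":
            width = 3 if ch == "%" else 1        # %nn is a two-digit label
            label = smiles[i:i + width].lstrip("%")
            if label in open_rings:              # second occurrence closes the ring
                partner, opening = open_rings.pop(label)
                explicit = pending if pending is not None else opening
                bonds.append((partner, previous,
                              bond_order(atoms, partner, previous, explicit)))
            else:                                # first occurrence opens it
                open_rings[label] = (previous, pending)
            pending = None
            i += width
            continue
        if ch == "(":
            branch_stack.append(previous)
        elif ch == ")":
            previous = branch_stack.pop()
        elif ch == ".":
            previous = None                      # next atom starts a new part
        elif ch in BOND_ORDERS:
            pending = BOND_ORDERS[ch]
        elif ch not in "/\\":
            raise ValueError(f"unsupported character {ch!r} at position {i}")
        i += 1
    return atoms, bonds


atoms, bonds = parse_smiles("C=CC(CCC)C(C(C)C)CCC")
print(len(atoms), len(bonds))   # 13 12
```

An acyclic molecule with n atoms has n - 1 bonds, and each ring closure adds one more, so cyclohexane (C1CCCCC1) parses to 6 atoms and 6 bonds.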

Stereochemistry and Isomers

Isotope Notation

An isotope is specified by writing its mass number before the element symbol inside square brackets, e.g., [13C] for carbon-13.
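
The fields inside a bracket atom (isotope, symbol, chirality mark, hydrogen count, charge) appear in a fixed order, which a regular expression can pick apart. The Python sketch below is a simplified reading of that layout. The names BRACKET_ATOM, charge_value, and parse_bracket_atom are assumptions for this example, and it does not handle atom class labels or check that the symbol is a real element.

```python
import re

BRACKET_ATOM = re.compile(
    r"\[(?P<isotope>\d+)?"                     # optional mass number, e.g. 13
    r"(?P<symbol>[A-Z][a-z]?|[a-z]{1,2}|\*)"   # element, aromatic symbol, or wildcard
    r"(?P<chirality>@(?:@|TH[12]|AL[12]|SP[1-3]|TB\d{1,2}|OH\d{1,2})?)?"
    r"(?P<hcount>H\d?)?"                       # explicit hydrogens, e.g. H or H4
    r"(?P<charge>[+-]\d+|\+{1,2}|-{1,2})?"     # formal charge, e.g. +, -2, ++
    r"\]"
)


def charge_value(text):
    """Convert a charge field such as '+', '--', or '-2' to an integer."""
    if not text:
        return 0
    sign = 1 if text[0] == "+" else -1
    digits = text[1:]
    if digits.isdigit():
        return sign * int(digits)
    return sign * len(text)   # '+' is 1, '++' is 2


def parse_bracket_atom(token):
    """Split a bracket atom token into its fields."""
    match = BRACKET_ATOM.fullmatch(token)
    if match is None:
        raise ValueError(f"not a supported bracket atom: {token!r}")
    fields = match.groupdict()
    hcount = fields["hcount"]
    return {
        "isotope": int(fields["isotope"]) if fields["isotope"] else None,
        "symbol": fields["symbol"],
        "chirality": fields["chirality"],
        "hcount": 0 if hcount is None else int(hcount[1:] or 1),
        "charge": charge_value(fields["charge"]),
    }


print(parse_bracket_atom("[13C]"))
print(parse_bracket_atom("[C@@H]"))
```

Because bracket atoms carry no implied hydrogens, [13C] parses with a hydrogen count of 0; carbon-13 methane would be written [13CH4].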

Double Bond Stereochemistry

Directional bonds can be specified using \ and / symbols to indicate the stereochemistry of double bonds:

  • C/C=C\C (cis, or Z, 2-butene) vs C/C=C/C (trans, or E, 2-butene)
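
For the simple pattern shown above, with one mark before the first double-bond carbon and one after the second, matching marks mean the two marked substituents lie on opposite sides (trans) and differing marks mean the same side (cis). The Python sketch below encodes only that special case. The function name double_bond_geometry is an assumption for this example, and strings with branches or more than one double bond are rejected rather than interpreted, because the rule changes when a mark sits inside a branch.

```python
import re

# One substituent, a mark, C=C, a mark, one substituent; no branches allowed.
SIMPLE_ALKENE = re.compile(
    r"^[^/\\=()]+([/\\])[^/\\=()]+=[^/\\=()]+([/\\])[^/\\=()]+$"
)


def double_bond_geometry(smiles):
    """Return 'cis' or 'trans' for strings shaped like X/C=C/Y."""
    match = SIMPLE_ALKENE.match(smiles)
    if match is None:
        raise ValueError(f"not a simple X/C=C/Y pattern: {smiles!r}")
    first, second = match.groups()
    return "trans" if first == second else "cis"


print(double_bond_geometry(r"C/C=C\C"))   # cis
print(double_bond_geometry(r"C/C=C/C"))   # trans
```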

Tetrahedral Chirality

Chirality around tetrahedral centers uses @ and @@ symbols:

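  • @ means the remaining neighbors are listed anti-clockwise when viewed from the first neighbor toward the chiral atom; @@ means they are listed clockwise
  • N[C@](C)(F)C(=O)O and N[C@@](F)(C)C(=O)O describe the same stereocenter: swapping two neighbors and switching @ to @@ cancel each other out
  • @ and @@ are shorthand for @TH1 and @TH2, respectively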
    Example: glucose (C6H12O6), SMILES: OC[C@@H](O1)[C@@H](O)[C@H](O)[C@@H](O)[C@H](O)1
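
Because @ and @@ are defined relative to the order in which the neighbors are written, reordering the neighbors can change the mark without changing the molecule. Each swap of two neighbors inverts the sense, so two descriptions agree when the marks are equal and the reordering is an even permutation, or when the marks differ and it is odd. The Python sketch below checks this. The function names and the neighbor labels are assumptions for this illustration, and the labels must be unique.

```python
def permutation_is_odd(original, reordered):
    """True if turning original into reordered takes an odd number of swaps."""
    order = [original.index(item) for item in reordered]
    inversions = sum(
        1
        for i in range(len(order))
        for j in range(i + 1, len(order))
        if order[i] > order[j]
    )
    return inversions % 2 == 1


def same_chirality(mark_a, neighbors_a, mark_b, neighbors_b):
    """Compare two tetrahedral centers written with @ or @@.

    neighbors_a and neighbors_b list the same four neighbors in the
    order each SMILES string writes them.
    """
    flipped = permutation_is_odd(neighbors_a, neighbors_b)
    return (mark_a == mark_b) != flipped


# N[C@](C)(F)C(=O)O compared with N[C@@](F)(C)C(=O)O
first = ["NH2", "CH3", "F", "COOH"]
second = ["NH2", "F", "CH3", "COOH"]
print(same_chirality("@", first, "@@", second))   # True
print(same_chirality("@", first, "@", second))    # False
```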

Advanced Stereochemistry

More general notation for other stereocenters:

  • @AL1, @AL2 for allene-type stereocenters
  • @SP1, @SP2, @SP3 for square-planar stereocenters
  • @TB1 through @TB20 for trigonal bipyramidal stereocenters
  • @OH1 through @OH30 for octahedral stereocenters

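Because SMILES describes chirality locally (by the order of neighbors around each stereocenter) rather than with absolute descriptors, stereochemistry can be partially specified: some centers can be marked while others are left unspecified.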

Variants and Standards

Canonical SMILES

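Canonical SMILES is the single string that a canonicalization algorithm selects for a molecule out of all its valid SMILES, so that the same structure always yields the same string. Different software packages use different canonicalization algorithms, so canonical strings are generally comparable only when they come from the same implementation.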

OpenSMILES vs. Proprietary

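  • Proprietary: The original SMILES specification, developed at Daylight Chemical Information Systems, is proprietary rather than an open standard, which can cause compatibility issues between different groups and labs
  • OpenSMILES: An openly developed, community-maintained specification intended to address these compatibility concerns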

Isomeric SMILES

Isomeric SMILES incorporates isotopes and stereochemistry information, providing more detailed molecular representations.

References