Introduction
David Weininger’s 1988 paper introduced SMILES (Simplified Molecular Input Line Entry System) - a notation system that would fundamentally change how we represent chemical structures in computational chemistry. The goal was ambitious: create a system that could represent any molecule as a simple text string, making it both human-readable and machine-efficient.
Before SMILES, the field faced a real problem. As computers became central to chemical information processing, existing line notations were either too complex for chemists to use practically or too limited for computational applications. Previous systems often required extensive training to write correctly and were prone to errors.
Weininger’s insight was to separate the problem into two parts: create simple, flexible rules that chemists could easily learn for input, while having the computer handle the complex task of generating a unique, canonical representation. This division of labor made SMILES both practical and powerful.
Core Design Philosophy
The SMILES system was built around three key objectives:
- Uniquely describe molecular graphs - Every molecule should have a precise representation that captures its atoms and bonds
- User-friendly input - Chemists should be able to write SMILES strings intuitively without extensive training
- Machine-optimized processing - Computers should be able to generate canonical (unique) SMILES efficiently
The brilliant design choice was recognizing that these goals required different approaches. The input rules could be simple and forgiving, allowing multiple valid ways to write the same molecule. The computer would then handle the complex task of converting any valid input into a single, standardized form.
This separation meant chemists could write CCO or OCC for ethanol (both perfectly valid), while the system would internally generate a consistent canonical form for database storage and comparison.
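The idea of many accepted spellings collapsing to one stored form can be illustrated with a deliberately tiny sketch. This is not Weininger's canonicalization algorithm: it assumes unbranched, ring-free chains built only from organic-subset atoms and the = and # bond symbols, where the only two possible spellings are the string and its reverse. The function name canonical_chain is hypothetical.

```python
import re

# Toy subset (an assumption of this sketch): organic-subset atoms
# plus double and triple bond symbols, no branches, rings, or brackets.
_TOKEN = re.compile(r"Cl|Br|[BCNOPSFI]|[=#]")

def canonical_chain(smiles: str) -> str:
    """Return one fixed spelling for an unbranched, ring-free chain."""
    tokens = _TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"outside the toy subset: {smiles!r}")
    forward = "".join(tokens)
    backward = "".join(reversed(tokens))
    # A linear chain can only be read left-to-right or right-to-left,
    # so choosing the lexicographically smaller spelling is enough here.
    return min(forward, backward)

print(canonical_chain("CCO"), canonical_chain("OCC"))
```

Both inputs come back as the same string, which is the property a database needs for exact-match lookup. Real canonicalization has to handle branches and rings, where the number of valid spellings grows quickly.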
The Basic Rules
Weininger designed SMILES around six fundamental rules that remain the foundation of the system today:
1. Atoms
Atoms use their standard chemical symbols. Elements in the “organic subset” (B, C, N, O, P, S, F, Cl, Br, I) can be written without brackets when they have their lowest normal valence - so C automatically means a carbon with enough implicit hydrogens to satisfy its valence.
Everything else goes in square brackets: [Au] for gold, [NH4+] for ammonium ion, or [13C] for carbon-13. Aromatic atoms get lowercase letters: c for aromatic carbon in benzene.
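The three kinds of atom token described above can be told apart with a single regular expression. The sketch below is illustrative only: the pattern covers the examples in this section (isotope, hydrogen count, and a simple charge inside brackets) and is not a full SMILES grammar. The names ATOM and describe_atom are hypothetical.

```python
import re

ATOM = re.compile(
    r"\[(?P<isotope>\d+)?(?P<symbol>[A-Z][a-z]?|[a-z])"
    r"(?P<hcount>H\d*)?(?P<charge>[+-]\d*)?\]"   # bracket atom
    r"|(?P<organic>Cl|Br|[BCNOPSFI])"            # organic subset, no brackets
    r"|(?P<aromatic>[cnos])"                     # aromatic, lowercase
)

def describe_atom(token: str) -> dict:
    """Classify a single atom token as organic, aromatic, or bracket."""
    m = ATOM.fullmatch(token)
    if m is None:
        raise ValueError(f"not an atom token this sketch understands: {token!r}")
    if m.group("organic"):
        return {"kind": "organic", "symbol": m.group("organic")}
    if m.group("aromatic"):
        return {"kind": "aromatic", "symbol": m.group("aromatic")}
    return {
        "kind": "bracket",
        "symbol": m.group("symbol"),
        "isotope": m.group("isotope"),
        "hcount": m.group("hcount"),
        "charge": m.group("charge"),
    }

for token in ["C", "c", "[Au]", "[NH4+]", "[13C]"]:
    print(token, describe_atom(token))
```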
2. Bonds
Bond notation is straightforward:
- for single bonds (usually omitted)
= for double bonds
# for triple bonds
: for aromatic bonds (also usually omitted)
So CC and C-C both represent ethane, while C=C is ethylene.
3. Branches
Branches use parentheses, just like in mathematical expressions. Isobutyric acid becomes CC(C)C(=O)O - the main chain is CCC(=O)O, with a methyl branch (C) on the second carbon.
4. Rings
This is where SMILES gets clever. Instead of trying to represent rings directly, you break one bond and mark both ends with the same digit. Cyclohexane becomes C1CCCCC1 - the 1 connects the first and last carbon, closing the ring.
You can reuse digits for different rings in the same molecule, making complex structures manageable.
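The branch and ring rules map naturally onto a stack and a small lookup table. The following is a minimal sketch, assuming only unbracketed organic-subset atoms, the -, = and # bond symbols, parentheses, and single-digit ring closures. It is not a complete SMILES parser, and the names parse, TOKEN and BOND_ORDER are hypothetical.

```python
import re

TOKEN = re.compile(r"Cl|Br|[BCNOPSFI]|[-=#]|[()]|\d")
BOND_ORDER = {"-": 1, "=": 2, "#": 3}

def parse(smiles: str):
    """Return (atoms, bonds); bonds are (i, j, order) index tuples."""
    atoms, bonds = [], []
    prev = None        # atom the next atom will bond to
    pending = None     # explicit bond symbol waiting for its second atom
    stack = []         # branch points saved at "("
    open_rings = {}    # ring digit -> (atom index, bond symbol)
    pos = 0
    for m in TOKEN.finditer(smiles):
        if m.start() != pos:
            raise ValueError(f"unexpected character at position {pos}")
        pos = m.end()
        tok = m.group()
        if tok == "(":
            stack.append(prev)            # remember where the branch starts
        elif tok == ")":
            prev = stack.pop()            # return to the branch point
        elif tok in BOND_ORDER:
            pending = tok
        elif tok.isdigit():
            if tok in open_rings:         # second occurrence closes the ring
                start, symbol = open_rings.pop(tok)
                order = BOND_ORDER[pending or symbol or "-"]
                bonds.append((start, prev, order))
            else:                         # first occurrence opens it
                open_rings[tok] = (prev, pending)
            pending = None
        else:                             # an atom symbol
            atoms.append(tok)
            index = len(atoms) - 1
            if prev is not None:
                bonds.append((prev, index, BOND_ORDER[pending or "-"]))
            prev, pending = index, None
    if pos != len(smiles) or stack or open_rings:
        raise ValueError("unbalanced or truncated input")
    return atoms, bonds

print(parse("C1CCCCC1"))      # cyclohexane
print(parse("CC(C)C(=O)O"))   # isobutyric acid
```

Because a digit is removed from open_rings as soon as its ring closes, the same digit can be used again later in the string, which is how digit reuse works in this sketch.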
5. Disconnected Parts
Salts and other disconnected structures use periods. Sodium phenoxide is written [Na+].[O-]c1ccccc1 and the order doesn’t matter - you’re just listing the separate components.
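Since the period simply separates components, splitting on it recovers each fragment, and summing bracket charges gives a quick sanity check that a salt is neutral overall. This is a rough sketch that assumes charges appear only inside square brackets in the forms +, -, ++, or +2. The helper names are hypothetical.

```python
import re

# Charge at the end of a bracket atom, e.g. [Na+], [O-], [Fe+2], [Fe++].
CHARGE = re.compile(r"\[[^\]+-]*([+-]+)(\d*)\]")

def components(smiles: str) -> list:
    """Split a SMILES string into its disconnected parts."""
    return smiles.split(".")

def net_charge(smiles: str) -> int:
    """Sum the formal charges written inside bracket atoms."""
    total = 0
    for signs, digits in CHARGE.findall(smiles):
        magnitude = int(digits) if digits else len(signs)
        total += magnitude if signs[0] == "+" else -magnitude
    return total

salt = "[Na+].[O-]c1ccccc1"
for part in components(salt):
    print(part, net_charge(part))
print("overall:", net_charge(salt))
```

Comparing sorted(components(a)) with sorted(components(b)) treats [Na+].[O-]c1ccccc1 and [O-]c1ccccc1.[Na+] as the same list of parts, reflecting that component order carries no meaning.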
6. Aromaticity
Aromatic rings can be written directly with lowercase letters. Benzoic acid becomes c1ccccc1C(=O)O. The system can also detect aromaticity automatically from Kekulé structures, so C1=CC=CC=C1C(=O)O works just as well.
Simplified Rules for Organic Chemistry
Weininger recognized that most chemists work primarily with organic compounds, so he defined a simplified subset that covers the vast majority of cases. For organic molecules, you only need four rules:
- Atoms: Use standard symbols (C, N, O, etc.)
- Multiple bonds: Use = and # for double and triple bonds
- Branches: Use parentheses ()
- Rings: Use matching digits
This “basic SMILES” covers probably 90% of what most chemists encounter daily, making the system immediately accessible without having to learn all the edge cases.
Important Design Decisions
The original paper established several key conventions that remain important today:
Hydrogen Handling
Hydrogens are usually implicit - the system calculates how many each atom needs based on standard valences. So C represents CH₄, N represents NH₃, and so on. This keeps strings compact and readable.
Explicit hydrogens only appear in special cases: when hydrogen connects to multiple atoms, when you need to specify an exact count, or in isotopic specifications like [2H] for deuterium.
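The implicit-hydrogen calculation can be sketched as a table lookup: take the smallest normal valence that accommodates the bonds already drawn, and fill the remainder with hydrogens. The valence table below follows the usual SMILES conventions for the organic subset. The function name implicit_hydrogens is hypothetical, and aromatic atoms are left out of this sketch.

```python
# Normal valences for unbracketed organic-subset atoms.
NORMAL_VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,),
    "P": (3, 5), "S": (2, 4, 6),
    "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
}

def implicit_hydrogens(symbol: str, bond_order_sum: int = 0) -> int:
    """Hydrogens implied for an unbracketed atom with the given bond orders."""
    for valence in NORMAL_VALENCES[symbol]:
        if valence >= bond_order_sum:
            return valence - bond_order_sum
    return 0  # bonds already exceed every normal valence

print(implicit_hydrogens("C"))       # lone C: methane
print(implicit_hydrogens("N"))       # lone N: ammonia
print(implicit_hydrogens("O", 1))    # the O in CCO: one hydrogen
```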
Bond Representation
The paper made an important choice about how to represent bonds in ambiguous cases. For example, a nitro group could be written in charge-separated form, as in C[N+](=O)[O-] for nitromethane, or with two double bonds, as in CN(=O)=O. Weininger chose to prefer covalent bonds when possible, keeping the topology symmetric and avoiding unusual charges.
However, when covalent representation would require unusual valences, charge separation is preferred. Diazomethane becomes C=[N+]=[N-] rather than forcing a nitrogen into an unusual valence state.
Tautomers
SMILES doesn’t try to be too clever about tautomers - it represents exactly what you specify. So 2-pyridone can be written as either the enol form (2-hydroxypyridine) Oc1ncccc1 or the keto form O=c1[nH]cccc1, and the system won’t automatically convert between them.
This explicit approach means you need to decide which tautomeric form to represent, but it also means the notation precisely captures what you intend.
Aromaticity Detection
One of the most sophisticated parts of the original system was automatic aromaticity detection. The algorithm uses an extended Hückel rule: a ring is aromatic if all atoms are sp² hybridized and it contains 4N+2 π-electrons.
This means you can input benzene as the Kekulé structure C1=CC=CC=C1 and the system will automatically recognize it as aromatic and convert it to c1ccccc1. The algorithm handles complex cases like tropone (O=c1cccccc1), a seven-membered ring with an exocyclic carbonyl, and correctly identifies them as aromatic.
Aromatic Nitrogen
The system makes an important distinction for nitrogen in aromatic rings. Pyridine-type nitrogen (like in pyridine itself) is written as n and has no attached hydrogens. Pyrrole-type nitrogen has an attached hydrogen that must be specified explicitly: [nH]1cccc1 for pyrrole.
This distinction captures the fundamental difference in electron contribution between these two nitrogen types in aromatic systems.
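The 4N+2 test and the two nitrogen types can be put together in a small electron-counting sketch. The per-atom contributions in the table are assumptions of this example, chosen to match the common convention (one electron from an sp² ring carbon or pyridine-type nitrogen, two from a pyrrole-type nitrogen, none from a ring carbon bearing an exocyclic C=O). The label "c=O" is a made-up shorthand for this sketch, not SMILES syntax, and the ring is supplied as a hand-written list of atom types rather than perceived from a structure.

```python
# Assumed pi-electron contribution of each ring atom type (illustrative).
PI_ELECTRONS = {
    "c": 1,      # sp2 ring carbon
    "n": 1,      # pyridine-type nitrogen
    "[nH]": 2,   # pyrrole-type nitrogen, lone pair in the ring
    "o": 2,
    "s": 2,
    "c=O": 0,    # ring carbon with an exocyclic carbonyl (hypothetical label)
}

def is_huckel(count: int) -> bool:
    """True when the count has the form 4N+2 for some integer N >= 0."""
    return count >= 2 and (count - 2) % 4 == 0

def ring_is_aromatic(ring_atoms: list) -> bool:
    """Apply the 4N+2 test to a ring given as a list of atom-type labels."""
    return is_huckel(sum(PI_ELECTRONS[atom] for atom in ring_atoms))

rings = {
    "benzene": ["c"] * 6,
    "pyridine": ["n"] + ["c"] * 5,
    "pyrrole": ["[nH]"] + ["c"] * 4,
    "tropone": ["c=O"] + ["c"] * 6,
    "cyclobutadiene": ["c"] * 4,
}
for name, atoms in rings.items():
    print(name, ring_is_aromatic(atoms))
```

Pyridine and pyrrole both reach six electrons, but by different routes: the pyrrole nitrogen has to donate its lone pair, which is why its hydrogen is written explicitly.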
Why This Paper Mattered
The 1988 SMILES paper solved a fundamental problem in computational chemistry: how to represent molecular structures in a way that’s both human-friendly and computationally efficient. The solution was elegant - create simple input rules for chemists while letting computers handle the complex canonicalization.
The performance gains were dramatic. A SMILES string for morphine takes about 40 characters, compared with 1000-2000 characters for a traditional connection table, and processing time dropped by a factor of 100. But the real impact was in enabling new applications.
SMILES became the foundation for:
- Database storage - Compact, searchable molecular representations
- Substructure searching - Pattern matching in chemical databases
- Property prediction - Input format for QSAR models
- Chemical informatics - Standard exchange format between software
- Modern ML - Text-based representation for neural networks
The Legacy
Nearly four decades later, SMILES remains one of the most important molecular representations in computational chemistry. While newer approaches like SELFIES have addressed some limitations (like the possibility of invalid strings), SMILES’ combination of simplicity and power has made it enduringly useful.
The original paper established not just a notation system, but a design philosophy: chemical informatics tools should be powerful enough for computers while remaining accessible to working chemists. That balance is still relevant today as we develop new molecular representations for machine learning and AI applications.
Further Reading
For modern perspectives on SMILES, see:
- SMILES notation overview - My summary of current SMILES usage
- SELFIES - A more recent, ML-friendly alternative
- Converting SMILES to 2D images - Practical tutorial for visualization
The original paper: Weininger, D. (1988). SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences, 28(1), 31–36.