Introduction
In my previous post on Converting SMILES Strings to 2D Molecular Images, we explored how to visualize chemical structures from SMILES notation. While SMILES is widely used, it has a significant limitation for machine learning applications: randomly generated SMILES strings usually don’t represent valid molecules.
This post explores SELFIES (SELF-referencIng Embedded Strings) - a molecular representation that claims to be 100% robust. Let’s test this claim through direct experimentation to see whether every SELFIES string really does decode to a valid chemical structure, and compare this with SMILES performance.
I’ll walk through building visualization tools, running some experiments, and looking at what this means for computational chemistry work.
Testing SELFIES vs SMILES Robustness
Let me test the robustness claims by comparing random string generation between SELFIES and SMILES representations.
Experiment 1: Random SELFIES Generation
Let’s start by looking at the SELFIES alphabet and generating some random strings:
import selfies as sf
import random
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors

# Get the complete SELFIES alphabet
alphabet = sf.get_semantic_robust_alphabet()
print(f'SELFIES alphabet contains {len(alphabet)} symbols')
print('First 20 symbols:', list(alphabet)[:20])

# Generate completely random SELFIES strings
random.seed(42)  # For reproducibility
for i in range(5):
    # Pick 8 random symbols from the alphabet
    random_selfies = ''.join(random.choices(list(alphabet), k=8))
    # Decode to SMILES and validate
    smiles = sf.decoder(random_selfies)
    mol = Chem.MolFromSmiles(smiles)
    formula = rdMolDescriptors.CalcMolFormula(mol) if mol else 'Invalid'
    print(f'{i+1}. SELFIES: {random_selfies}')
    print(f'   SMILES: {smiles}')
    print(f'   Formula: {formula}')
    print(f'   Valid: {mol is not None}')
    print()
This gives us 69 chemical symbols including atoms, bonds, charges, and structural operators like [Branch1] and [Ring1]. When I run this code, here’s what happens:
SELFIES alphabet contains 69 symbols
First 20 symbols: ['[=Branch3]', '[B+1]', '[=N]', '[=S]', '[=Branch1]', '[=O]', '[Br]', '[C-1]', '[=C]', '[#N]', '[#C]', '[Cl]', '[O-1]', '[=P]', '[#P+1]', '[#P-1]', '[S-1]', '[#Branch2]', '[O+1]', '[#Branch1]']
Random SELFIES generation:
========================================
1. SELFIES: [S][B+1][O+1][#P-1][#S+1][#B][Ring1][=O]
SMILES: S[B+1][O+1]=[P-1]#[S+1]=B
Formula: H2B2OPS2+2
Valid: True
2. SELFIES: [#C-1][=N][#P-1][P][B+1][=P][S][=P-1]
SMILES: [C-1]=N[P-1]P[B+1]PS=[P-1]
Formula: CH3BNP4S-2
Valid: True
3. SELFIES: [#P-1][N][=C+1][=Branch3][=C+1][=O+1][P-1][#C]
SMILES: [P-1]N=[C+1][C+1]=[O+1][P-1]#C
Formula: C3HNOP2+
Valid: True
4. SELFIES: [#B-1][P-1][Br][Br][=B+1][=Ring1][=C+1][#S+1]
SMILES: [B-1][P-1]Br
Formula: BBrP-2
Valid: True
5. SELFIES: [C][Branch2][I][#S][B-1][#O+1][Ring2][#Branch3]
SMILES: C([B-1])#[O+1]
Formula: CBO
Valid: True
Results: 5/5 strings produced valid molecules (100% success rate). Each random string decoded to a chemically valid structure with proper molecular formulas.
Let me visualize these unusual but chemically valid structures:

[S][B+1][O+1][#P-1][#S+1][#B][Ring1][=O]
→ H₂B₂OPS₂⁺² - A charged boron-phosphorus-sulfur cluster that would never occur naturally but is chemically valid
[#C-1][=N][#P-1][P][B+1][=P][S][=P-1]
→ CH₃BNP₄S⁻² - A complex organophosphorus compound with multiple formal charges
[#P-1][N][=C+1][=Branch3][=C+1][=O+1][P-1][#C]
→ C₃HNOP₂⁺ - An unusual carbon-nitrogen-phosphorus framework with formal charges
[#B-1][P-1][Br][Br][=B+1][=Ring1][=C+1][#S+1]
→ BBrP⁻² - A simple but exotic boron-phosphorus-bromide structure
[C][Branch2][I][#S][B-1][#O+1][Ring2][#Branch3]
→ CBO - The simplest of our random molecules, but still an unusual carbon-boron-oxygen combination
These visualizations are consistent with what SELFIES promises: every random string in this sample decoded to a valid chemical structure. While these molecules are chemically unusual and would likely be unstable in reality, they all satisfy the valence rules that SELFIES and RDKit enforce and represent legitimate points in chemical space.
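Five samples is a small experiment, so it is worth asking how much a clean 5/5 run can tell us on its own. The sketch below is my own back-of-the-envelope check, not part of the SELFIES library: assuming independent trials, it solves (1 - p)^n = 1 - confidence for the largest failure rate p that would still be compatible with n failure-free trials. The function name `failure_rate_upper_bound` is made up for this illustration.

```python
def failure_rate_upper_bound(n_trials: int, confidence: float = 0.95) -> float:
    """Largest failure rate still compatible with n_trials clean trials.

    Assumes independent trials with zero observed failures and solves
    (1 - p) ** n_trials = 1 - confidence for p.
    """
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n_trials)


# Compare the 5-sample run with larger hypothetical experiments
for n in (5, 100, 10_000):
    bound = failure_rate_upper_bound(n)
    print(f"{n:>6} clean trials -> failure rate below {bound:.4f}")
```

With only five trials the bound is about 45%, so the experiment above illustrates the guarantee more than it proves it. The guarantee itself comes from how the SELFIES grammar is built; a larger random sample is just a cheap sanity check on top of that.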
Experiment 2: Random SMILES Generation
Now let me test SMILES under the same conditions:
import random
from rdkit import Chem

# Common SMILES tokens including bonds and rings
smiles_chars = ['C', 'N', 'O', 'S', 'P', 'F', 'Cl', 'Br',
                '(', ')', '[', ']', '=', '#', '1', '2', '3', '4', '5']

random.seed(42)  # Same seed for fair comparison
valid_count = 0
for i in range(20):
    # Generate a random SMILES string from 8 tokens ('Cl' and 'Br' count as one token each)
    random_smiles = ''.join(random.choices(smiles_chars, k=8))
    mol = Chem.MolFromSmiles(random_smiles)
    valid = mol is not None
    if valid:
        valid_count += 1
        print(f'{i+1}. SMILES: {random_smiles} ✓ VALID')
    else:
        print(f'{i+1}. SMILES: {random_smiles} ✗ Invalid')
print(f'Success rate: {valid_count}/20 ({valid_count/20*100:.1f}%)')
Result: 0% success rate! None of the random SMILES strings with the full character set parsed successfully.
This isn’t surprising when you think about it. Even with just simple atoms (C, N, O, F, Cl) and no structural symbols, success rates only reach about 70%. With the full SMILES syntax including bonds, rings, and brackets, the odds of random generation creating valid syntax become vanishingly small.
Analysis of SMILES Failure Modes
The fundamental issue is that a SMILES string has to be correct on several levels at once, some syntactic and some chemical:
- Balanced parentheses for branches
- Proper ring closure numbering
- Valid bond orders and atom valences
- Correct stereochemistry notation
Random sampling violates these constraints almost immediately, leading to parse errors or chemically impossible structures.
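To make the first two failure modes concrete, here is a deliberately simplified checker that only looks at structure: parentheses, square brackets, and single-digit ring-closure labels. It is a sketch of my own, not how RDKit parses SMILES, and it ignores valence, aromaticity, stereochemistry, and two-digit `%nn` ring closures. The function name `smiles_syntax_problems` is hypothetical.

```python
def smiles_syntax_problems(smiles: str) -> list[str]:
    """Return structural problems found in a SMILES-like string.

    Simplified on purpose: only parentheses, square brackets, and
    single-digit ring-closure labels are checked.
    """
    problems = []
    depth = 0            # current branch nesting level
    in_bracket = False   # True while inside a [...] atom specification
    open_rings = set()   # ring labels seen an odd number of times so far

    for ch in smiles:
        if ch == "[":
            if in_bracket:
                problems.append("nested '['")
            in_bracket = True
        elif ch == "]":
            if not in_bracket:
                problems.append("']' without '['")
            in_bracket = False
        elif in_bracket:
            continue  # digits inside brackets are charges, isotopes, or H counts
        elif ch == "(":
            depth += 1
        elif ch == ")":
            if depth == 0:
                problems.append("')' without '('")
            else:
                depth -= 1
        elif ch.isdigit():
            # First occurrence opens a ring, the second one closes it
            open_rings.symmetric_difference_update({ch})

    if depth:
        problems.append(f"{depth} unclosed '('")
    if in_bracket:
        problems.append("unclosed '['")
    if open_rings:
        problems.append(f"unpaired ring closures: {sorted(open_rings)}")
    return problems


print(smiles_syntax_problems("CC(=O)OC1=CC=CC=C1C(=O)O"))  # aspirin
print(smiles_syntax_problems("C(C1"))                      # broken branch and ring
```

Even this toy check has three pieces of state that must all end up balanced, which is why uniformly random tokens so rarely survive. A string that passes it can still fail in RDKit on valence grounds.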
Building a SELFIES Visualization Tool
Having confirmed SELFIES’ robustness, let me build a tool to visualize these structures. I’ll create a command-line utility that converts any SELFIES string into a molecular image.
Setting Up the Core Converter
The process is similar to the SMILES visualizer I built previously but handles SELFIES-specific formatting:
#!/usr/bin/env python3
import selfies as sf
from rdkit import Chem
from rdkit.Chem import Draw, rdDepictor, rdMolDescriptors
from PIL import Image, ImageDraw, ImageFont
from pathlib import Path
def selfies_to_png(selfies_string: str, output_file: str, size: int = 500):
"""Convert a SELFIES string to a PNG image with molecular formula legend."""
# Input validation
if not selfies_string or not selfies_string.strip():
raise ValueError("SELFIES string cannot be empty")
if size <= 0:
raise ValueError(f"Image size must be positive, got: {size}")
# Ensure output directory exists
output_path = Path(output_file)
output_path.parent.mkdir(parents=True, exist_ok=True)
# Decode SELFIES to SMILES with comprehensive error handling
try:
smiles_string = sf.decoder(selfies_string.strip())
except Exception as e:
raise ValueError(f"Invalid SELFIES string: '{selfies_string}'. SELFIES decoding error: {e}")
if not smiles_string:
raise ValueError(f"SELFIES string '{selfies_string}' decoded to empty SMILES.")
# Create RDKit molecule
mol = Chem.MolFromSmiles(smiles_string)
if mol is None:
raise ValueError(f"Invalid SELFIES: decoded to invalid SMILES '{smiles_string}'")
# Generate the image with legend
img = create_molecule_image(mol, selfies_string, size)
img.save(output_file, "PNG", optimize=True)
def create_molecule_image(mol: Chem.Mol, selfies_string: str, size: int = 500):
"""Create molecule image with SELFIES string in legend and dynamic sizing."""
# Calculate dynamic sizes based on image size
sizes = _calculate_dynamic_sizes(size)
# Generate 2D coordinates and molecular properties
rdDepictor.Compute2DCoords(mol)
molecular_formula = rdMolDescriptors.CalcMolFormula(mol)
# Create base molecular image
mol_img = Draw.MolToImage(mol, size=(size, size))
if mol_img.mode != "RGBA":
mol_img = mol_img.convert("RGBA")
# Calculate required legend height and width for long SELFIES strings
legend_height = _calculate_legend_height(molecular_formula, selfies_string, sizes, size)
legend_height = max(legend_height, int(size * 0.15)) # Ensure minimum height
# Calculate if we need extra width for the SELFIES text
final_width = _calculate_required_width(selfies_string, sizes, size)
# Create final image with appropriate dimensions
total_height = size + legend_height
final_img = Image.new("RGBA", (final_width, total_height), "white")
# Center the molecule image if we made the canvas wider
mol_x_offset = (final_width - size) // 2
final_img.paste(mol_img, (mol_x_offset, 0))
# Add formatted legend with molecular formula and SELFIES
draw = ImageDraw.Draw(final_img)
font_regular, font_small = _load_fonts(sizes['regular_font_size'], sizes['subscript_font_size'])
_draw_molecular_formula(draw, molecular_formula, font_regular, font_small, sizes, size, mol_x_offset)
_draw_selfies_legend(draw, selfies_string, font_regular, sizes, molecular_formula, size, final_width)
return final_img
Handling Long SELFIES Strings
One challenge is that SELFIES strings can be much longer than their SMILES equivalents, so the legend needs text wrapping that breaks at token (bracket) boundaries instead of mid-token:
def _draw_wrapped_text(
draw: ImageDraw.Draw, text: str, start_x: int, start_y: int, max_width: int, font, sizes: dict
) -> None:
"""Draw text with word wrapping at bracket boundaries for SELFIES strings."""
# For SELFIES, we want to break at bracket boundaries to maintain readability
# Split SELFIES into tokens (brackets and their contents)
tokens = []
current_token = ""
for char in text:
if char == '[':
if current_token:
tokens.append(current_token)
current_token = ""
current_token = char
elif char == ']':
current_token += char
tokens.append(current_token)
current_token = ""
else:
current_token += char
if current_token:
tokens.append(current_token)
# Now draw tokens, wrapping as needed at logical boundaries
x_pos = start_x
y_pos = start_y
line_height = sizes['regular_font_size'] + 4 # Generous line spacing
for token in tokens:
token_width = int(draw.textlength(token, font=font))
# If this token would exceed the line, wrap to next line
if x_pos + token_width > start_x + max_width and x_pos > start_x:
y_pos += line_height
x_pos = start_x
draw.text((x_pos, y_pos), token, fill="black", font=font)
x_pos += token_width
def _calculate_dynamic_sizes(image_size: int):
"""Calculate dynamic sizing values based on image size for responsive design."""
return {
'legend_height': int(image_size * 0.08),
'legend_y_offset': int(image_size * 0.02),
'legend_x_offset': int(image_size * 0.02),
'subscript_y_offset': int(image_size * 0.006),
'regular_font_size': int(image_size * 0.028),
'subscript_font_size': int(image_size * 0.02),
}
This approach ensures SELFIES strings remain readable even when they wrap across multiple lines.
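The wrapping logic is easier to see without the drawing code around it. The sketch below keeps the same idea (split into bracket tokens, then fill lines greedily) but measures width in characters instead of pixels, so it runs without Pillow. The regex tokenizer stands in for the character loop above, and `split_selfies_tokens` and `wrap_tokens` are names I made up for this illustration.

```python
import re


def split_selfies_tokens(selfies: str) -> list[str]:
    """Split a SELFIES string into its bracketed tokens."""
    return re.findall(r"\[[^\[\]]*\]", selfies)


def wrap_tokens(tokens: list[str], max_chars: int) -> list[str]:
    """Greedily pack tokens into lines without ever splitting a token."""
    lines = []
    current = ""
    for token in tokens:
        # Start a new line only if the current one already holds something
        if current and len(current) + len(token) > max_chars:
            lines.append(current)
            current = ""
        current += token
    if current:
        lines.append(current)
    return lines


aspirin = ("[C][C][Branch1][C][=O][O][C][=C][C][=C][C][=C]"
           "[Ring1][=Branch1][C][Branch1][C][=O][O]")
for line in wrap_tokens(split_selfies_tokens(aspirin), max_chars=30):
    print(line)
```

The `if current and ...` guard mirrors the `x_pos > start_x` check in the drawing version: a token wider than the whole line still gets placed instead of looping forever.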
Testing the Tool: From Simple to Complex
Let me put the visualization tool through its paces, starting with simple molecules and building up to complex pharmaceuticals.
Simple Molecules First
# Ethanol - a simple alcohol
python selfies2png.py "[C][C][O]" -o ethanol_selfies.png --size 600
# Acetone - demonstrates branching
python selfies2png.py "[C][C][Branch1][C][=O][C]" -o acetone_selfies.png --size 600

[C][C][O]
- Ethanol, a simple alcohol demonstrating basic SELFIES syntax
[C][C][Branch1][C][=O][C]
- Acetone, showing how SELFIES handles branching with explicit branch notation
Aromatic Systems

[C][=C][C][=C][C][=C][Ring1][=Branch1]
- Benzene: ring closure using local indexing
[C][C][=C][C][=C][C][=C][Ring1][=Branch1]
- Toluene: a substituted aromatic ring
Complex Pharmaceutical Molecules

[C][C][Branch1][C][=O][O][C][=C][C][=C][C][=C][Ring1][=Branch1][C][Branch1][C][=O][O]
- Aspirin, showing SELFIES handling of multiple functional groups and intelligent text wrapping in the legend
Notice how SELFIES handles complexity gracefully. The bracket notation is more verbose than SMILES, but this verbosity is what enables the robustness guarantee.
Practical Applications
SELFIES work well for several applications:
Molecular Generation: Train a language model on SELFIES and every generated sequence is a valid molecule. With SMILES, you need additional filtering and validation steps.
Chemical Space Exploration: Random walks in SELFIES space explore valid chemical structures, which can be useful for diversity-oriented synthesis planning.
Property Optimization: Bayesian optimization and genetic algorithms can work directly in SELFIES space without worrying about validity constraints.
Getting Started
The tools I’ve built provide a foundation for SELFIES-based projects. The complete selfies2png.py script is included at the end of this post, along with all the experimental code shown above.
If you’re working on machine learning projects involving molecules, SELFIES might be worth considering. The guaranteed validity can eliminate complexity and make model training more straightforward.
Advanced SELFIES Features
Let me explore some additional SELFIES capabilities through hands-on examples:
Debugging with Attribution
If you’re curious how SELFIES tokens map to the final SMILES, let’s trace the conversion for a complex molecule:
import selfies as sf

# A molecule with branching and rings - nicotine
selfies_str = "[C][N][C][Branch1][C][C][C][C][Ring1][=Branch1][C][=C][C][N][=C][Ring1][Branch1]"
smiles_str, attribution = sf.decoder(selfies_str, attribute=True)
print(f"Original SELFIES: {selfies_str}")
print(f"Decoded SMILES: {smiles_str}")
print("\nHow each SMILES token was built:")
for i, attr_map in enumerate(attribution):
    # attribution may be empty for a token, so fall back to an empty list
    contributing_tokens = [a.token for a in (attr_map.attribution or [])]
    print(f"  #{i+1}: '{attr_map.token}' ← {contributing_tokens}")
Running this shows exactly how the ring closures and branches are constructed:
Original SELFIES: [C][N][C][Branch1][C][C][C][C][Ring1][=Branch1][C][=C][C][N][=C][Ring1][Branch1]
Decoded SMILES: CN1C(C)CCC1c1cccnc1
How each SMILES token was built:
#1: 'C' ← ['[C]']
#2: 'N' ← ['[N]']
#3: '1' ← ['[Ring1]', '[=Branch1]']
#4: 'C' ← ['[C]']
#5: '(' ← ['[Branch1]']
#6: 'C' ← ['[C]']
#7: ')' ← ['[C]', '[C]', '[Ring1]']
...
This attribution can be useful when debugging molecular transformations or understanding why certain SELFIES patterns produce specific structures.
Customizing Chemical Constraints
For exploring unusual chemistry, SELFIES lets you enable hypervalent atoms and other exotic structures:
import selfies as sf
from rdkit import Chem
from rdkit.Chem import Draw

# Create a hypervalent iodine compound (common in organic synthesis)
hypervalent_smiles = 'O=I(O)(O)(O)(O)O'  # Periodic acid
print(f"Original SMILES: {hypervalent_smiles}")

# Under the default constraints, strict encoding rejects the expanded valence
try:
    standard_selfies = sf.encoder(hypervalent_smiles, strict=True)
    print(f"Standard SELFIES: {standard_selfies}")
except sf.EncoderError as e:
    print(f"Strict encoding failed under default constraints: {e}")

# strict=False skips the constraint check while encoding
relaxed_selfies = sf.encoder(hypervalent_smiles, strict=False)
print(f"Relaxed SELFIES: {relaxed_selfies}")

# Now decode with hypervalent constraints enabled
sf.set_semantic_constraints("hypervalent")
decoded_hypervalent = sf.decoder(relaxed_selfies)
print(f"Hypervalent decode: {decoded_hypervalent}")

# Visualize the hypervalent structure (skipped if RDKit rejects the valence)
mol_hypervalent = Chem.MolFromSmiles(decoded_hypervalent)
if mol_hypervalent:
    img = Draw.MolToImage(mol_hypervalent, size=(300, 300))
    img.save("hypervalent_iodine.png")

# Constraints are global state, so restore the defaults when done
sf.set_semantic_constraints("default")
This produces molecules with expanded valence shells - structures that the default SELFIES constraints would otherwise not allow.
Building ML-Ready Datasets
Here’s how to prepare SELFIES data for machine learning models:
import selfies as sf
import numpy as np
from collections import Counter
# Example: preparing a small drug dataset
drug_smiles = [
"CC(=O)OC1=CC=CC=C1C(=O)O", # Aspirin
"CN1C=NC2=C1C(=O)N(C(=O)N2C)C", # Caffeine
"CC(C)CC1=CC=C(C=C1)C(C)C(=O)O", # Ibuprofen
"O=C(O)C1=CC=CC=C1", # Benzoic acid
]
# Convert to SELFIES
drug_selfies = [sf.encoder(smiles) for smiles in drug_smiles]
print("SELFIES representations:")
for i, selfies_str in enumerate(drug_selfies):
print(f"{i+1}. {selfies_str}")
# Build vocabulary from our dataset
alphabet = sf.get_alphabet_from_selfies(drug_selfies)
alphabet.add("[nop]") # Padding token for variable-length sequences
alphabet = sorted(alphabet)
print(f"\nVocabulary size: {len(alphabet)}")
print(f"Tokens: {alphabet[:10]}...") # Show first 10 tokens
# Create encoding dictionaries
token_to_idx = {token: i for i, token in enumerate(alphabet)}
idx_to_token = {i: token for token, i in token_to_idx.items()}
# Encode one molecule for neural network input
caffeine_selfies = drug_selfies[1]
tokens = list(sf.split_selfies(caffeine_selfies))
print(f"\nCaffeine tokens: {tokens}")
# Convert to numerical representation
indices = [token_to_idx[token] for token in tokens]
print(f"As indices: {indices}")
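# Create padded label and one-hot encodings
max_len = max(sf.len_selfies(s) for s in drug_selfies)  # Pad to the longest molecule in the dataset
label_encoding, one_hot = sf.selfies_to_encoding(
    caffeine_selfies,
    vocab_stoi=token_to_idx,
    pad_to_len=max_len,
    enc_type="both"
)
# selfies_to_encoding returns plain Python lists, so wrap them in arrays to get .shape
label_encoding = np.array(label_encoding)
one_hot = np.array(one_hot)
print(f"Label encoding shape: {label_encoding.shape}")
print(f"One-hot encoding shape: {one_hot.shape}")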
This produces ML-ready numerical representations that can feed directly into transformer models or RNNs.
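If you want to see what the padding and one-hot steps amount to, here is a dependency-free sketch of the same idea. It is my own simplified stand-in, not the selfies implementation: the regex tokenizer, the `build_vocab` and `encode` helpers, and the two-molecule toy dataset are all assumptions for illustration.

```python
import re

PAD = "[nop]"  # Padding token, matching the one added to the vocabulary above


def tokenize(selfies: str) -> list[str]:
    """Split a SELFIES string into bracketed tokens."""
    return re.findall(r"\[[^\[\]]*\]", selfies)


def build_vocab(selfies_list: list[str]) -> dict[str, int]:
    """Map every token in the dataset (plus padding) to an integer index."""
    tokens = {tok for s in selfies_list for tok in tokenize(s)}
    tokens.add(PAD)
    return {tok: i for i, tok in enumerate(sorted(tokens))}


def encode(selfies: str, vocab: dict[str, int], pad_to_len: int):
    """Return (label encoding, one-hot encoding) padded to pad_to_len."""
    tokens = tokenize(selfies)
    if len(tokens) > pad_to_len:
        raise ValueError("pad_to_len is shorter than the molecule")
    tokens += [PAD] * (pad_to_len - len(tokens))
    labels = [vocab[tok] for tok in tokens]
    one_hot = [[1 if j == idx else 0 for j in range(len(vocab))] for idx in labels]
    return labels, one_hot


dataset = ["[C][C][O]", "[C][C][Branch1][C][=O][C]"]  # ethanol, acetone
vocab = build_vocab(dataset)
max_len = max(len(tokenize(s)) for s in dataset)
labels, one_hot = encode("[C][C][O]", vocab, max_len)
print(vocab)
print(labels)
print(len(one_hot), "x", len(one_hot[0]))
```

Deriving `max_len` from the dataset avoids the failure mode of a hard-coded length that is shorter than the longest molecule.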
SELFIES in Practice: Some Real Applications
Let me show SELFIES solving actual problems in computational chemistry:
Molecular Generation That Works
Traditional SMILES-based generation requires validity checking. With SELFIES, every output is guaranteed valid:
import selfies as sf
import random
from rdkit import Chem
from rdkit.Chem import Descriptors
def generate_random_drug_like_molecules(n=5):
"""Generate random molecules and filter for drug-like properties."""
alphabet = list(sf.get_semantic_robust_alphabet())
# Focus on common drug atoms
drug_atoms = [token for token in alphabet if any(atom in token for atom in ['C', 'N', 'O', 'S', 'F'])]
molecules = []
for i in range(n):
# Generate random 12-token SELFIES
random_selfies = ''.join(random.choices(drug_atoms, k=12))
# Decode and analyze
smiles = sf.decoder(random_selfies)
mol = Chem.MolFromSmiles(smiles)
# Calculate drug-like properties
mw = Descriptors.MolWt(mol)
logp = Descriptors.MolLogP(mol)
hbd = Descriptors.NumHDonors(mol)
hba = Descriptors.NumHAcceptors(mol)
# Check Lipinski's Rule of Five
lipinski_pass = (mw <= 500 and logp <= 5 and hbd <= 5 and hba <= 10)
molecules.append({
'selfies': random_selfies,
'smiles': smiles,
'mw': mw,
'logp': logp,
'lipinski': lipinski_pass
})
print(f"Molecule {i+1}:")
print(f" SELFIES: {random_selfies}")
print(f" SMILES: {smiles}")
print(f" MW: {mw:.1f}, LogP: {logp:.1f}")
print(f" Lipinski compliant: {lipinski_pass}")
print()
return molecules
# Generate some random drug-like molecules
random.seed(42)
drug_molecules = generate_random_drug_like_molecules(3)
The output looks like this (exact strings will depend on your selfies version and on the ordering of the alphabet set):
Molecule 1:
SELFIES: [O+1][C][F][=N][O][C-1][N][S-1][C][#N][C-1][O]
SMILES: [O+1]CF=N[O][C-1]N[S-1]C#N[C-1]O
MW: 168.1, LogP: -0.3
Lipinski compliant: True
Molecule 2:
SELFIES: [=N][#C][S][=C][C][#N][=O][F][#N][C][S][C]
SMILES: N#CS=CC#N=OF#NCS
MW: 183.2, LogP: 1.2
Lipinski compliant: True
...
Every generated molecule is valid, so no separate validity filtering is required. The only filter left is the drug-likeness check.
Chemical Space Exploration
SELFIES enable exploration of chemical space without a separate validity check after every step:
import selfies as sf
from rdkit import Chem
from rdkit.Chem import Descriptors
import random
def mutate_molecule(selfies_str, mutation_rate=0.2):
"""Mutate a SELFIES string by randomly changing tokens."""
alphabet = list(sf.get_semantic_robust_alphabet())
tokens = list(sf.split_selfies(selfies_str))
mutated_tokens = []
for token in tokens:
if random.random() < mutation_rate:
# Replace with random token
mutated_tokens.append(random.choice(alphabet))
else:
mutated_tokens.append(token)
return ''.join(mutated_tokens)
def explore_chemical_space(starting_molecule, generations=5):
"""Perform a random walk through chemical space."""
current_selfies = sf.encoder(starting_molecule)
print(f"Starting molecule: {starting_molecule}")
print(f"Starting SELFIES: {current_selfies}")
print()
for gen in range(generations):
# Mutate the molecule
mutated_selfies = mutate_molecule(current_selfies)
mutated_smiles = sf.decoder(mutated_selfies)
mol = Chem.MolFromSmiles(mutated_smiles)
# Calculate properties
if mol:
mw = Descriptors.MolWt(mol)
logp = Descriptors.MolLogP(mol)
print(f"Generation {gen+1}:")
print(f" SELFIES: {mutated_selfies}")
print(f" SMILES: {mutated_smiles}")
print(f" MW: {mw:.1f}, LogP: {logp:.2f}")
print()
current_selfies = mutated_selfies
# Start with aspirin and explore
random.seed(123)
explore_chemical_space("CC(=O)OC1=CC=CC=C1C(=O)O", generations=3)
This shows how molecular properties change as we walk through chemical space - every step guaranteed to be a valid molecule.
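The walk above only ever replaces tokens, so the string length never changes. A common variation is to allow insertions and deletions as well. The sketch below works purely at the token level so it runs without selfies or RDKit; the small `TOY_ALPHABET` and the `point_mutation` helper are hypothetical, and in a real run you would draw from the robust alphabet and pass the result to the decoder.

```python
import random
import re

# Hypothetical mini-alphabet for illustration only
TOY_ALPHABET = ["[C]", "[N]", "[O]", "[=C]", "[=O]", "[Branch1]", "[Ring1]"]


def point_mutation(selfies: str, rng: random.Random, alphabet=TOY_ALPHABET) -> str:
    """Apply one random replace, insert, or delete to a SELFIES token list."""
    tokens = re.findall(r"\[[^\[\]]*\]", selfies)
    op = rng.choice(["replace", "insert", "delete"])
    if op == "delete" and len(tokens) > 1:
        # Never delete the last remaining token
        del tokens[rng.randrange(len(tokens))]
    elif op == "insert" or not tokens:
        tokens.insert(rng.randrange(len(tokens) + 1), rng.choice(alphabet))
    else:
        # Plain replacement (also the fallback for "delete" on a single token)
        tokens[rng.randrange(len(tokens))] = rng.choice(alphabet)
    return "".join(tokens)


rng = random.Random(123)  # Local RNG keeps the walk repeatable
current = "[C][C][O]"
for step in range(5):
    current = point_mutation(current, rng)
    print(f"Step {step + 1}: {current}")
```

Because any token sequence over the robust alphabet decodes to something valid, all three operations are safe to apply blindly, which is exactly what makes this kind of walk awkward to do in SMILES.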
Performance Tips
When working with SELFIES in production systems:
Batch Processing for Speed
import random
import time
from concurrent.futures import ProcessPoolExecutor

import selfies as sf
from rdkit import Chem
from rdkit.Chem import Descriptors

def process_single_selfies(selfies_str):
    """Process one SELFIES string and return properties."""
    try:
        smiles = sf.decoder(selfies_str)
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            # Always return a dict so callers never see None
            return {'selfies': selfies_str, 'valid': False,
                    'error': 'RDKit could not parse the decoded SMILES'}
        return {
            'selfies': selfies_str,
            'smiles': smiles,
            'mw': Descriptors.MolWt(mol),
            'logp': Descriptors.MolLogP(mol),
            'valid': True
        }
    except Exception as e:
        return {'selfies': selfies_str, 'valid': False, 'error': str(e)}

def benchmark_processing(selfies_list, use_parallel=True):
    """Time serial or parallel processing of a list of SELFIES strings."""
    start_time = time.perf_counter()
    if use_parallel:
        # Parallel processing
        with ProcessPoolExecutor(max_workers=4) as executor:
            results = list(executor.map(process_single_selfies, selfies_list))
        label = "Parallel"
    else:
        # Serial processing
        results = [process_single_selfies(s) for s in selfies_list]
        label = "Serial"
    elapsed = time.perf_counter() - start_time
    print(f"{label} processing: {elapsed:.2f}s for {len(selfies_list)} molecules")
    return results

# The guard is needed on platforms that start worker processes by re-importing this module
if __name__ == "__main__":
    # Generate test dataset
    alphabet = list(sf.get_semantic_robust_alphabet())
    test_selfies = [''.join(random.choices(alphabet, k=10)) for _ in range(100)]
    # Benchmark both approaches
    serial_results = benchmark_processing(test_selfies, use_parallel=False)
    parallel_results = benchmark_processing(test_selfies, use_parallel=True)
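For larger datasets, parallel processing can speed up SELFIES operations. For a batch as small as the 100 short strings here, process start-up and pickling overhead may well outweigh the gain, so measure before committing to it.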
Building Custom SELFIES Applications
Let me create some practical tools that show SELFIES capabilities:
Smart Molecular Filter
import selfies as sf
from rdkit import Chem
from rdkit.Chem import Descriptors, Crippen
class SmartMolecularFilter:
"""Filter molecules based on drug-like properties."""
def __init__(self):
self.lipinski_rules = {
'mw_max': 500,
'logp_max': 5,
'hbd_max': 5,
'hba_max': 10
}
def check_lipinski(self, mol):
"""Check Lipinski's Rule of Five."""
if not mol:
return False
mw = Descriptors.MolWt(mol)
logp = Crippen.MolLogP(mol)
hbd = Descriptors.NumHDonors(mol)
hba = Descriptors.NumHAcceptors(mol)
return (mw <= self.lipinski_rules['mw_max'] and
logp <= self.lipinski_rules['logp_max'] and
hbd <= self.lipinski_rules['hbd_max'] and
hba <= self.lipinski_rules['hba_max'])
def analyze_selfies(self, selfies_str):
"""Analyze a SELFIES string for drug-likeness."""
try:
smiles = sf.decoder(selfies_str)
mol = Chem.MolFromSmiles(smiles)
if not mol:
return {'valid': False, 'reason': 'Invalid molecule'}
# Calculate properties
props = {
'mw': Descriptors.MolWt(mol),
'logp': Crippen.MolLogP(mol),
'hbd': Descriptors.NumHDonors(mol),
'hba': Descriptors.NumHAcceptors(mol),
'tpsa': Descriptors.TPSA(mol),
'rotatable_bonds': Descriptors.NumRotatableBonds(mol)
}
# Check filters
lipinski_pass = self.check_lipinski(mol)
return {
'valid': True,
'selfies': selfies_str,
'smiles': smiles,
'properties': props,
'lipinski_compliant': lipinski_pass,
'drug_like_score': self._calculate_drug_score(props)
}
except Exception as e:
return {'valid': False, 'reason': f'Error: {str(e)}'}
def _calculate_drug_score(self, props):
"""Simple drug-likeness score (0-1)."""
score = 0
if props['mw'] <= 500: score += 0.2
if 0 <= props['logp'] <= 5: score += 0.2
if props['hbd'] <= 5: score += 0.2
if props['hba'] <= 10: score += 0.2
if props['tpsa'] <= 140: score += 0.2
return score
# Test the filter
filter_tool = SmartMolecularFilter()
# Test with some SELFIES
test_molecules = [
"[C][C][Branch1][C][=O][O]", # Acetic acid
"[C][C][Branch1][C][=O][O][C][=C][C][=C][C][=C][Ring1][=Branch1][C][Branch1][C][=O][O]", # Aspirin
"[C][N][C][=O]" # N-methylformamide
]
for selfies_str in test_molecules:
result = filter_tool.analyze_selfies(selfies_str)
if result['valid']:
props = result['properties']
print(f"SELFIES: {selfies_str}")
print(f"SMILES: {result['smiles']}")
print(f"MW: {props['mw']:.1f}, LogP: {props['logp']:.2f}")
print(f"Drug score: {result['drug_like_score']:.1f}/1.0")
print(f"Lipinski: {'✓' if result['lipinski_compliant'] else '✗'}")
print()
This tool shows how SELFIES can be integrated into molecular analysis workflows.
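One design choice worth flagging: the filter above requires all four Lipinski criteria to pass, while the rule as usually stated tolerates a single violation. The sketch below separates counting violations from the pass/fail decision so the threshold becomes a parameter. It works on plain numbers so it runs without RDKit; the `Properties` dataclass, the helper names, and the example values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Properties:
    """Descriptor values, e.g. as computed by the filter class above."""
    mw: float    # molecular weight
    logp: float  # octanol-water partition coefficient
    hbd: int     # hydrogen-bond donors
    hba: int     # hydrogen-bond acceptors


def lipinski_violations(p: Properties) -> list[str]:
    """List which Rule of Five criteria a molecule breaks."""
    checks = {
        "MW > 500": p.mw > 500,
        "LogP > 5": p.logp > 5,
        "HBD > 5": p.hbd > 5,
        "HBA > 10": p.hba > 10,
    }
    return [name for name, violated in checks.items() if violated]


def passes_rule_of_five(p: Properties, max_violations: int = 1) -> bool:
    """Pass if the number of violations does not exceed max_violations."""
    return len(lipinski_violations(p)) <= max_violations


# Made-up values for a molecule that is slightly too heavy
example = Properties(mw=612.0, logp=4.2, hbd=2, hba=8)
print(lipinski_violations(example))
print(passes_rule_of_five(example))                    # tolerant reading
print(passes_rule_of_five(example, max_violations=0))  # strict reading used above
```

Returning the list of violations instead of a bare boolean also makes it easier to report why a candidate was rejected.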
Comparing SELFIES and SMILES
Let me wrap up with a direct comparison of the same molecule in both representations:
Aspirin Example
SMILES: CC(=O)OC1=CC=CC=C1C(=O)O
SELFIES: [C][C][Branch1][C][=O][O][C][=C][C][=C][C][=C][Ring1][=Branch1][C][Branch1][C][=O][O]

Trade-offs Summary
| Aspect | SMILES | SELFIES |
|---|---|---|
| Compactness | ✅ More compact | ❌ More verbose |
| Readability | ✅ Familiar to chemists | ❌ Requires learning |
| Robustness | ❌ Many invalid strings | ✅ 100% valid |
| ML Suitability | ❌ Low success rates | ✅ Perfect for generation |
| Tool Support | ✅ Universal support | ✅ Growing rapidly |
What’s Next
The SELFIES ecosystem continues to grow. Some interesting developments to watch:
New Research Directions
Current work includes polymer representations for materials science, reaction modeling for synthetic planning, and stereochemistry extensions for pharmaceutical applications. The robustness guarantee is encouraging adoption in production ML systems where validity matters.
Getting Involved
Consider trying SELFIES in your computational chemistry projects. The guarantee of valid molecules can reduce complexity and debugging time. Whether you’re building generative models, exploring chemical space, or creating educational tools, SELFIES provide a reliable foundation.
Conclusion
Through hands-on experiments and practical examples, I’ve explored how SELFIES address the robustness problem in molecular representations. Every random string I generated decoded to a valid molecule, and every mutation in the chemical space walk produced a parseable structure. The same guarantee carries over to generative models: any sequence of tokens from the robust alphabet decodes to a valid molecular graph.
This has practical benefits. Instead of spending time filtering invalid molecules, researchers can focus on optimizing properties and discovering new compounds. The robustness guarantee can eliminate a class of bugs and edge cases from computational workflows.
The tools I built show SELFIES in action: from basic visualization to molecular analysis. While SELFIES are more verbose than SMILES, this verbosity provides something valuable - the certainty that every generated string decodes to a valence-valid molecule, even if many of those molecules are chemically exotic.
As molecular AI becomes more sophisticated, SELFIES provide a reliable foundation for these applications. They’re worth trying in your next project if you want the peace of mind that comes with guaranteed molecular validity.
Try It Yourself
If you want to explore SELFIES, here’s a simple experiment to get started:
# Install dependencies
pip install selfies rdkit pillow
# Generate some random molecules
python -c "
import selfies as sf
import random
alphabet = list(sf.get_semantic_robust_alphabet())
for i in range(3):
    random_selfies = ''.join(random.choices(alphabet, k=8))
    smiles = sf.decoder(random_selfies)
    print(f'{i+1}. {random_selfies} → {smiles}')
"
# Visualize your favorite molecule (using the script below)
python selfies2png.py "[C][=C][C][=C][C][=C][Ring1][=Branch1]" -o benzene.png
Every random generation should work - that’s the SELFIES guarantee in action.
Complete SELFIES Visualization Script
Here’s the complete selfies2png.py script that powers the visualizations in this post. Save this as selfies2png.py and you’ll have a tool for converting any SELFIES string into molecular images:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
SELFIES to PNG Converter
=======================
A command-line utility to render SELFIES strings as 2D molecular images with
molecular formulas and proper subscript formatting.
This script demonstrates how to:
- Parse SELFIES strings and decode them to SMILES
- Generate 2D molecular coordinates using RDKit
- Create publication-quality molecular images
- Add custom legends with molecular formulas
- Handle font rendering and subscripts
Example Usage:
python selfies2png.py "[C][C][O]" # Ethanol
python selfies2png.py "[C][C][Branch1][C][C][C][C]" -o isobutane.png # Isobutane
python selfies2png.py "[C][=C][C][=C][C][=C][Ring1][=Branch1]" --size 800 # Benzene, larger image
Author: Hunter Heidenreich
Website: https://hunterheidenreich.com
"""
import argparse
import hashlib
import sys
from pathlib import Path
# SELFIES import
import selfies as sf
# RDKit imports
from rdkit import Chem
from rdkit.Chem import Draw, rdDepictor, rdMolDescriptors
# PIL imports for image manipulation
from PIL import Image, ImageDraw, ImageFont
# Constants for image configuration
DEFAULT_IMAGE_SIZE = 500
LEGEND_HEIGHT_RATIO = 0.08 # Legend height as ratio of image size
LEGEND_Y_OFFSET_RATIO = 0.02 # Y offset as ratio of image size
LEGEND_X_OFFSET_RATIO = 0.02 # X offset as ratio of image size
SUBSCRIPT_Y_OFFSET_RATIO = 0.006 # Subscript offset as ratio of image size
# Font size ratios based on image size
REGULAR_FONT_RATIO = 0.028 # Regular font size as ratio of image size
SUBSCRIPT_FONT_RATIO = 0.02 # Subscript font size as ratio of image size
# Font paths for different operating systems
FONT_PATHS = [
"/System/Library/Fonts/Arial.ttf", # macOS
"/usr/share/fonts/truetype/arial.ttf", # Linux
"C:/Windows/Fonts/arial.ttf", # Windows
]
def _calculate_dynamic_sizes(image_size: int):
"""Calculate dynamic sizing values based on image size."""
return {
'legend_height': int(image_size * LEGEND_HEIGHT_RATIO),
'legend_y_offset': int(image_size * LEGEND_Y_OFFSET_RATIO),
'legend_x_offset': int(image_size * LEGEND_X_OFFSET_RATIO),
'subscript_y_offset': int(image_size * SUBSCRIPT_Y_OFFSET_RATIO),
'regular_font_size': int(image_size * REGULAR_FONT_RATIO),
'subscript_font_size': int(image_size * SUBSCRIPT_FONT_RATIO),
}
def _load_fonts(regular_size: int, subscript_size: int):
"""Load system fonts for text rendering, with fallback to default font."""
font_regular = None
font_small = None
for font_path in FONT_PATHS:
try:
font_regular = ImageFont.truetype(font_path, regular_size)
font_small = ImageFont.truetype(font_path, subscript_size)
break
except (OSError, IOError):
continue
if font_regular is None:
font_regular = ImageFont.load_default()
font_small = ImageFont.load_default()
return font_regular, font_small
def create_molecule_image(
    mol: Chem.Mol, selfies_string: str, size: int = DEFAULT_IMAGE_SIZE
) -> Image.Image:
    """
    Creates a molecule image with a legend showing molecular formula and SELFIES string.
    Args:
        mol: RDKit molecule object (already validated)
        selfies_string: Original SELFIES string for legend display
        size: Width and height of the molecule drawing in pixels; the legend
            adds height, and a long SELFIES string may widen the canvas
    Returns:
        PIL Image object with molecule structure and formatted legend
    """
    # Calculate dynamic sizes based on image size
    sizes = _calculate_dynamic_sizes(size)
    rdDepictor.Compute2DCoords(mol)
    molecular_formula = rdMolDescriptors.CalcMolFormula(mol)
    mol_img = Draw.MolToImage(mol, size=(size, size))
    if mol_img.mode != "RGBA":
        mol_img = mol_img.convert("RGBA")
    # Calculate required legend height based on content - be more generous
    legend_height = _calculate_legend_height(molecular_formula, selfies_string, sizes, size)
    # Add extra safety margin to ensure text doesn't get cut off
    legend_height = max(legend_height, int(size * 0.15))  # At least 15% of image height
    # Load the fonts once; they are used both for measuring and for drawing
    font_regular, font_small = _load_fonts(sizes['regular_font_size'], sizes['subscript_font_size'])
    # Calculate if we need extra width for the SELFIES text
    temp_img = Image.new("RGBA", (1, 1), "white")
    temp_draw = ImageDraw.Draw(temp_img)
    selfies_label = "SELFIES: "
    label_width = int(temp_draw.textlength(selfies_label, font=font_regular))
    selfies_width = int(temp_draw.textlength(selfies_string, font=font_regular))
    total_selfies_width = label_width + selfies_width + (sizes['legend_x_offset'] * 2)
    # Make image wider if needed to accommodate the full SELFIES string
    final_width = max(size, total_selfies_width)
    total_height = size + legend_height
    final_img = Image.new("RGBA", (final_width, total_height), "white")
    # Center the molecule image if we made the canvas wider
    mol_x_offset = (final_width - size) // 2
    final_img.paste(mol_img, (mol_x_offset, 0))
    draw = ImageDraw.Draw(final_img)
    _draw_molecular_formula(draw, molecular_formula, font_regular, font_small, sizes, size, mol_x_offset)
    _draw_selfies_legend(draw, selfies_string, font_regular, sizes, molecular_formula, size, final_width)
    return final_img
def _draw_molecular_formula(
    draw: ImageDraw.Draw,
    formula: str,
    font_regular,
    font_small,
    sizes: dict,
    image_size: int,
    mol_x_offset: int = 0,
) -> int:
    """Draw molecular formula with proper subscript formatting."""
    y_pos = image_size + sizes['legend_y_offset']
    x_pos = sizes['legend_x_offset'] + mol_x_offset
    draw.text((x_pos, y_pos), "Formula: ", fill="black", font=font_regular)
    x_pos += int(draw.textlength("Formula: ", font=font_regular))
    for char in formula:
        if char.isdigit():
            draw.text(
                (x_pos, y_pos + sizes['subscript_y_offset']), char, fill="black", font=font_small
            )
            x_pos += int(draw.textlength(char, font=font_small))
        else:
            draw.text((x_pos, y_pos), char, fill="black", font=font_regular)
            x_pos += int(draw.textlength(char, font=font_regular))
    return x_pos
def _tokenize_selfies(selfies: str) -> list:
    """Split a SELFIES string into bracketed symbols such as '[C]' or '[=O]'."""
    tokens = []
    current_token = ""
    for char in selfies:
        if char == '[':
            if current_token:
                tokens.append(current_token)
            current_token = char
        elif char == ']':
            current_token += char
            tokens.append(current_token)
            current_token = ""
        else:
            current_token += char
    if current_token:
        tokens.append(current_token)
    return tokens
def _calculate_legend_height(formula: str, selfies: str, sizes: dict, image_size: int) -> int:
    """Calculate the required height for the legend based on content."""
    # Create a temporary draw object to measure text
    temp_img = Image.new("RGBA", (1, 1), "white")
    temp_draw = ImageDraw.Draw(temp_img)
    font_regular, _ = _load_fonts(sizes['regular_font_size'], sizes['subscript_font_size'])
    base_height = sizes['legend_y_offset'] * 2 + sizes['regular_font_size']
    line_height = sizes['regular_font_size'] + 6  # Matches the formula-to-SELFIES gap in _draw_selfies_legend
    # The legend always draws the SELFIES string on its own line below the formula,
    # so the height must always include that line. `formula` is no longer needed
    # for the calculation and is kept only so existing callers keep working.
    selfies_width = int(temp_draw.textlength(selfies, font=font_regular))
    selfies_label_width = int(temp_draw.textlength("SELFIES: ", font=font_regular))
    available_width = image_size - sizes['legend_x_offset'] * 2 - selfies_label_width
    if selfies_width <= available_width:
        # SELFIES fits on one line below the formula
        return base_height + line_height + 15
    # SELFIES needs wrapping - simulate the wrap token by token
    lines_needed = 1
    current_line_width = 0
    for token in _tokenize_selfies(selfies):
        token_width = int(temp_draw.textlength(token, font=font_regular))
        if current_line_width + token_width > available_width and current_line_width > 0:
            lines_needed += 1
            current_line_width = token_width
        else:
            current_line_width += token_width
    # One line for the SELFIES row, plus any wrapped lines, plus generous padding
    return base_height + line_height + (line_height * (lines_needed - 1)) + 25
def _draw_wrapped_text(
    draw: ImageDraw.Draw, text: str, start_x: int, start_y: int, max_width: int, font, sizes: dict
) -> None:
    """Draw text with word wrapping at bracket boundaries for SELFIES strings."""
    # Break only at symbol boundaries so no bracketed symbol is split across lines
    x_pos = start_x
    y_pos = start_y
    # Slightly tighter than the +6 used in the height estimate, so the estimate errs tall
    line_height = sizes['regular_font_size'] + 4
    for token in _tokenize_selfies(text):
        token_width = int(draw.textlength(token, font=font))
        # If this token would exceed the line, wrap to next line
        if x_pos + token_width > start_x + max_width and x_pos > start_x:
            y_pos += line_height
            x_pos = start_x
        draw.text((x_pos, y_pos), token, fill="black", font=font)
        x_pos += token_width
def _draw_selfies_legend(
    draw: ImageDraw.Draw, selfies: str, font_regular, sizes: dict, formula: str,
    original_mol_size: int, final_image_width: int
) -> None:
    """Add SELFIES string to the image legend with proper text wrapping."""
    y_pos = original_mol_size + sizes['legend_y_offset']
    # ALWAYS put SELFIES on its own line - no more trying to fit on same line as formula
    selfies_y_pos = y_pos + sizes['regular_font_size'] + 6
    selfies_label = "SELFIES: "
    label_width = int(draw.textlength(selfies_label, font=font_regular))
    draw.text((sizes['legend_x_offset'], selfies_y_pos), selfies_label, fill="black", font=font_regular)
    # Use wrapped text drawing for ALL SELFIES strings
    selfies_x_pos = sizes['legend_x_offset'] + label_width
    available_width_for_selfies = final_image_width - selfies_x_pos - sizes['legend_x_offset']
    _draw_wrapped_text(
        draw, selfies, selfies_x_pos, selfies_y_pos,
        available_width_for_selfies, font_regular, sizes
    )
def selfies_to_png(
    selfies_string: str, output_file: str, size: int = DEFAULT_IMAGE_SIZE
) -> None:
    """
    Convert a SELFIES string to a PNG image with molecular formula legend.
    Args:
        selfies_string: Valid SELFIES string representing a molecule
        output_file: Path where the PNG image will be saved
        size: Square image dimensions in pixels
    Raises:
        ValueError: If SELFIES string is invalid or size is non-positive
        IOError: If file cannot be saved to the specified location
    """
    if not selfies_string or not selfies_string.strip():
        raise ValueError("SELFIES string cannot be empty")
    if size <= 0:
        raise ValueError(f"Image size must be positive, got: {size}")
    output_path = Path(output_file)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    # Decode SELFIES to SMILES
    try:
        smiles_string = sf.decoder(selfies_string.strip())
    except Exception as e:
        # Chain the original exception so the decoder traceback is preserved
        raise ValueError(
            f"Invalid SELFIES string: '{selfies_string}'. "
            f"SELFIES decoding error: {e}"
        ) from e
    if not smiles_string:
        raise ValueError(
            f"SELFIES string '{selfies_string}' decoded to empty SMILES. "
            f"Please check the SELFIES syntax."
        )
    mol = Chem.MolFromSmiles(smiles_string)
    if mol is None:
        raise ValueError(
            f"Invalid SELFIES string: '{selfies_string}'. "
            f"Decoded to SMILES '{smiles_string}' which is not valid. "
            f"Please check the syntax and try again."
        )
    img = create_molecule_image(mol, selfies_string.strip(), size)
    try:
        img.save(output_file, "PNG", optimize=True)
        print(f"Image successfully saved to: {output_file}")
    except Exception as e:
        raise IOError(f"Failed to save image to '{output_file}': {e}") from e
def create_safe_filename(selfies_string: str) -> str:
    """
    Generate a filesystem-safe filename from a SELFIES string using MD5 hash.
    Args:
        selfies_string: The input SELFIES string
    Returns:
        A safe filename ending with .png
    """
    clean_selfies = selfies_string.strip()
    hasher = hashlib.md5(clean_selfies.encode("utf-8"))
    return f"{hasher.hexdigest()}.png"
def main() -> None:
    """Command-line interface for the SELFIES to PNG converter."""
    parser = argparse.ArgumentParser(
        description="Convert SELFIES strings to publication-quality PNG images with molecular formulas.",
        epilog="""
Examples:
  %(prog)s "[C][C][O]" # Ethanol with auto-generated filename
  %(prog)s "[C][C][Branch1][C][C][C]" # Isobutane with auto-generated filename
  %(prog)s "[C][=C][C][=C][C][=C][Ring1][=Branch1]" -o benzene.png # Benzene with custom filename
  %(prog)s "[C][C][O]" --size 800 # Ethanol with larger image size
Common SELFIES patterns:
  [C][C][O] - Ethanol
  [C][C][Branch1][C][=O][O] - Acetic acid
  [C][=C][C][=C][C][=C][Ring1][=Branch1] - Benzene
  [C][C][Branch1][C][C][C] - Isobutane
  [N][C][Branch1][C][=O][C][=C][C][=C][C][=C][Ring1][=Branch1] - Benzamide
""",
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    parser.add_argument(
        "selfies",
        type=str,
        help="SELFIES string of the molecule to visualize (e.g., '[C][C][O]' for ethanol)",
    )
    parser.add_argument(
        "-o",
        "--output",
        type=str,
        metavar="FILE",
        help="Output PNG filename. If not provided, generates a unique filename "
        "based on the SELFIES string hash. Extension .png will be added if missing.",
    )
    parser.add_argument(
        "-s",
        "--size",
        type=int,
        default=DEFAULT_IMAGE_SIZE,
        metavar="PIXELS",
        help=f"Square image size in pixels (default: {DEFAULT_IMAGE_SIZE}). "
        f"Typical values: 300 (small), 500 (medium), 800 (large).",
    )
    parser.add_argument(
        "-v",
        "--verbose",
        action="store_true",
        help="Enable verbose output for debugging",
    )
    args = parser.parse_args()
    if args.verbose:
        print(f"Input SELFIES: {args.selfies}")
        print(f"Image size: {args.size}x{args.size} pixels")
    if args.output:
        output_filename = (
            args.output
            if args.output.lower().endswith(".png")
            else f"{args.output}.png"
        )
        if args.verbose:
            print(f"Using custom filename: {output_filename}")
    else:
        output_filename = create_safe_filename(args.selfies)
        if args.verbose:
            print(f"Generated filename: {output_filename}")
    try:
        selfies_to_png(args.selfies, output_filename, args.size)
        if args.verbose:
            # Decode and show the SMILES for reference
            try:
                decoded_smiles = sf.decoder(args.selfies.strip())
                print(f"Decoded SMILES: {decoded_smiles}")
            except Exception:
                pass
            print("Conversion completed successfully!")
    except ValueError as e:
        print(f"Input Error: {e}", file=sys.stderr)
        print("Tip: Check your SELFIES string syntax", file=sys.stderr)
        sys.exit(1)
    except IOError as e:
        print(f"File Error: {e}", file=sys.stderr)
        print("Tip: Check file permissions and disk space", file=sys.stderr)
        sys.exit(2)
    except ImportError as e:
        print(f"Dependencies Error: {e}", file=sys.stderr)
        print(
            "Tip: Install required packages with 'pip install rdkit pillow selfies'",
            file=sys.stderr,
        )
        sys.exit(3)
    except Exception as e:
        print(f"Unexpected Error: {e}", file=sys.stderr)
        print("Tip: Please report this issue if it persists", file=sys.stderr)
        sys.exit(4)
if __name__ == "__main__":
    main()
This script covers the full path from a SELFIES string to a molecular image. For long SELFIES strings it widens the canvas and can wrap the legend at symbol boundaries, it renders the digits of the molecular formula as subscripts, and it reports input errors and file errors with distinct messages and exit codes.
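To make the wrapping behaviour concrete, here is a minimal sketch of breaking a SELFIES string only at symbol boundaries. It is not the script's own code: `split_symbols` and `wrap_symbols` are hypothetical names, and width is counted in characters instead of being measured in pixels with `ImageDraw.textlength`, so the example runs without Pillow.

```python
import re


def split_symbols(selfies: str) -> list:
    """Split into bracketed symbols, keeping any text found outside brackets."""
    return re.findall(r"\[[^\[\]]*\]|[^\[\]]+", selfies)


def wrap_symbols(selfies: str, max_chars: int) -> list:
    """Greedily pack whole symbols into lines of at most max_chars characters."""
    lines = []
    current = ""
    for symbol in split_symbols(selfies):
        # Start a new line only if the current one already holds something,
        # so an over-long symbol still gets a line of its own
        if current and len(current) + len(symbol) > max_chars:
            lines.append(current)
            current = symbol
        else:
            current += symbol
    if current:
        lines.append(current)
    return lines


benzene = "[C][=C][C][=C][C][=C][Ring1][=Branch1]"
for line in wrap_symbols(benzene, max_chars=16):
    print(line)
```

Because a line break can only fall between symbols, joining the wrapped lines back together always reproduces the original string.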
Usage Examples
# Basic usage - generates auto-named file
python selfies2png.py "[C][C][O]"
# Custom filename and size
python selfies2png.py "[C][=C][C][=C][C][=C][Ring1][=Branch1]" -o benzene.png --size 800
# Complex molecules with branching
python selfies2png.py "[C][C][Branch1][C][=O][O][C][=C][C][=C][C][=C][Ring1][=Branch1][C][Branch1][C][=O][O]" -o aspirin.png
The script hands SELFIES decoding to the selfies library and coordinate generation and drawing to RDKit, then uses Pillow to add the legend, which makes it straightforward to create a visualization of any SELFIES-encoded molecule.
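When no `-o` option is given, the output name is the MD5 hex digest of the stripped SELFIES string plus `.png`, so it can be predicted ahead of time, for example to locate the image from another script. The sketch below mirrors `create_safe_filename` from the listing; `expected_filename` is a hypothetical helper name and not part of the script.

```python
import hashlib


def expected_filename(selfies: str) -> str:
    """Mirror create_safe_filename: MD5 of the stripped string, plus '.png'."""
    digest = hashlib.md5(selfies.strip().encode("utf-8")).hexdigest()
    return f"{digest}.png"


name = expected_filename("[C][C][O]")
print(name)

# Surrounding whitespace is stripped before hashing, so it does not change the name
print(name == expected_filename("  [C][C][O]\n"))
```

MD5 serves here only as a short, stable identifier, not as a security measure. The hash is taken over the string itself, so two different SELFIES strings that encode the same molecule get different filenames.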