Converting SELFIES Strings to 2D Molecular Images

Introduction

In my previous post on Converting SMILES Strings to 2D Molecular Images, we explored how to visualize chemical structures from SMILES notation. While SMILES is widely used, it has a significant limitation for machine learning applications: randomly generated SMILES strings usually don’t represent valid molecules.

This post explores SELFIES (SELF-referencIng Embedded Strings), a molecular representation designed to be 100% robust: every string of SELFIES symbols is meant to decode to a valid molecule. Let’s test this claim through direct experimentation to see whether randomly assembled SELFIES strings really do correspond to valid chemical structures, and compare this with SMILES performance.

I’ll walk through building visualization tools, running some experiments, and looking at what this means for computational chemistry work.

Testing SELFIES vs SMILES Robustness

Let me test the robustness claims by comparing random string generation between SELFIES and SMILES representations.

Experiment 1: Random SELFIES Generation

Let’s start by looking at the SELFIES alphabet and generating some random strings:

import selfies as sf
import random
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors

# Get the complete SELFIES alphabet
alphabet = sf.get_semantic_robust_alphabet()
print(f'SELFIES alphabet contains {len(alphabet)} symbols')
print('First 20 symbols:', list(alphabet)[:20])

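# Generate completely random SELFIES strings
# The seed fixes the random draws, but the alphabet is a set, so its iteration
# order (and therefore the output) can still differ between runs unless
# PYTHONHASHSEED is fixed as well
random.seed(42)
for i in range(5):
    # Pick 8 random symbols from the alphabet
    random_selfies = ''.join(random.choices(list(alphabet), k=8))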
    
    # Decode to SMILES and validate
    smiles = sf.decoder(random_selfies)
    mol = Chem.MolFromSmiles(smiles)
    formula = rdMolDescriptors.CalcMolFormula(mol) if mol else 'Invalid'
    
    print(f'{i+1}. SELFIES: {random_selfies}')
    print(f'   SMILES:  {smiles}')
    print(f'   Formula: {formula}')
    print(f'   Valid:   {mol is not None}')
    print()

This gives us 69 chemical symbols including atoms, bonds, charges, and structural operators like [Branch1] and [Ring1]. When I run this code, here’s what happens:

SELFIES alphabet contains 69 symbols
First 20 symbols: ['[=Branch3]', '[B+1]', '[=N]', '[=S]', '[=Branch1]', '[=O]', '[Br]', '[C-1]', '[=C]', '[#N]', '[#C]', '[Cl]', '[O-1]', '[=P]', '[#P+1]', '[#P-1]', '[S-1]', '[#Branch2]', '[O+1]', '[#Branch1]']

Random SELFIES generation:
========================================
1. SELFIES: [S][B+1][O+1][#P-1][#S+1][#B][Ring1][=O]
   SMILES:  S[B+1][O+1]=[P-1]#[S+1]=B
   Formula: H2B2OPS2+2
   Valid:   True

2. SELFIES: [#C-1][=N][#P-1][P][B+1][=P][S][=P-1]
   SMILES:  [C-1]=N[P-1]P[B+1]PS=[P-1]
   Formula: CH3BNP4S-2
   Valid:   True

3. SELFIES: [#P-1][N][=C+1][=Branch3][=C+1][=O+1][P-1][#C]
   SMILES:  [P-1]N=[C+1][C+1]=[O+1][P-1]#C
   Formula: C3HNOP2+
   Valid:   True

4. SELFIES: [#B-1][P-1][Br][Br][=B+1][=Ring1][=C+1][#S+1]
   SMILES:  [B-1][P-1]Br
   Formula: BBrP-2
   Valid:   True

5. SELFIES: [C][Branch2][I][#S][B-1][#O+1][Ring2][#Branch3]
   SMILES:  C([B-1])#[O+1]
   Formula: CBO
   Valid:   True

Results: 5/5 strings produced valid molecules (100% success rate). Each random string decoded to a chemically valid structure with proper molecular formulas.

Let me visualize these unusual but chemically valid structures:

**Random Molecule 1**: `[S][B+1][O+1][#P-1][#S+1][#B][Ring1][=O]` → H₂B₂OPS₂⁺² - A charged boron-phosphorus-sulfur cluster that would never occur naturally but is chemically valid

**Random Molecule 2**: `[#C-1][=N][#P-1][P][B+1][=P][S][=P-1]` → CH₃BNP₄S⁻² - A complex organophosphorus compound with multiple formal charges

**Random Molecule 3**: `[#P-1][N][=C+1][=Branch3][=C+1][=O+1][P-1][#C]` → C₃HNOP₂⁺ - An unusual carbon-nitrogen-phosphorus framework with formal charges

**Random Molecule 4**: `[#B-1][P-1][Br][Br][=B+1][=Ring1][=C+1][#S+1]` → BBrP⁻² - A simple but exotic boron-phosphorus-bromide structure

**Random Molecule 5**: `[C][Branch2][I][#S][B-1][#O+1][Ring2][#Branch3]` → CBO - The simplest of our random molecules, but still an unusual carbon-boron-oxygen combination

These visualizations are consistent with what SELFIES promises: every random string decoded to a structure that RDKit accepts. The molecules are chemically unusual and would likely be unstable in reality, but they all satisfy the valence rules that SELFIES enforces and represent legitimate points in chemical space.
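Five samples are a small basis for a 100% claim, so it helps to put a number on what the experiment can and cannot rule out. A minimal sketch, assuming independent draws: if all n random strings decode successfully, the true failure rate p must still satisfy (1 - p)^n ≥ 0.05 at 95% confidence, which gives p ≤ 1 - 0.05^(1/n). The function name `max_failure_rate` is my own and is not part of any library.

```python
def max_failure_rate(n_successes: int, confidence: float = 0.95) -> float:
    """Upper bound on the failure rate after n successes and zero failures.

    Solves (1 - p) ** n = 1 - confidence for p, assuming independent trials.
    """
    if n_successes <= 0:
        raise ValueError("need at least one trial")
    return 1.0 - (1.0 - confidence) ** (1.0 / n_successes)


# Show how the bound tightens as the sample grows
for n in (5, 100, 10_000):
    print(f"{n:>6} successes -> failure rate below {max_failure_rate(n):.4f}")
```

With only 5 samples the bound is loose. The real guarantee comes from how the SELFIES grammar is constructed, and larger random samples only corroborate it.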

Experiment 2: Random SMILES Generation

Now let me test SMILES under the same conditions:

import random
from rdkit import Chem

# Common SMILES characters including bonds and rings
smiles_chars = ['C', 'N', 'O', 'S', 'P', 'F', 'Cl', 'Br', 
               '(', ')', '[', ']', '=', '#', '1', '2', '3', '4', '5']

random.seed(42)  # Same seed for fair comparison
valid_count = 0

for i in range(20):
    # Generate a random SMILES string from 8 tokens ('Cl' and 'Br' count as one token each)
    random_smiles = ''.join(random.choices(smiles_chars, k=8))
    mol = Chem.MolFromSmiles(random_smiles)
    valid = mol is not None
    
    if valid:
        valid_count += 1
        print(f'{i+1}. SMILES: {random_smiles} ✓ VALID')
    else:
        print(f'{i+1}. SMILES: {random_smiles} ✗ Invalid')

print(f'Success rate: {valid_count}/20 ({valid_count/20*100:.1f}%)')

Result: 0% success rate! None of the random SMILES strings with the full character set parsed successfully.

This isn’t surprising when you think about it. Even with just simple atoms (C, N, O, F, Cl) and no structural symbols, success rates only reach about 70%. With the full SMILES syntax including bonds, rings, and brackets, the odds of random generation creating valid syntax become vanishingly small.

Analysis of SMILES Failure Modes

The fundamental issue is that SMILES requires syntactic correctness across multiple dimensions:

Balanced parentheses for branches
Proper ring closure numbering
Valid bond orders and atom valences
Correct stereochemistry notation

Random sampling violates these constraints almost immediately, leading to parse errors or chemically impossible structures.
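To see how quickly random strings break the pairing rules alone, here is a rough pre-check written for this post. It is a sketch and not a SMILES parser: it only looks at parentheses, square brackets, and single-digit ring labels, and it knows nothing about valence, stereochemistry, or two-digit ring labels such as `%10`. The name `smiles_syntax_problems` is hypothetical.

```python
def smiles_syntax_problems(smiles: str) -> list[str]:
    """Return rough delimiter-pairing problems in a SMILES-like string."""
    problems = []
    depth = 0            # currently open '(' branches
    in_bracket = False   # inside a [...] atom specification
    open_rings = set()   # ring digits seen an odd number of times

    for ch in smiles:
        if in_bracket:
            if ch == "]":
                in_bracket = False
            elif ch == "[":
                problems.append("'[' inside a bracket atom")
            continue  # digits inside brackets are not ring labels
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            problems.append("']' without '['")
        elif ch == "(":
            depth += 1
        elif ch == ")":
            if depth == 0:
                problems.append("')' without '('")
            else:
                depth -= 1
        elif ch.isdigit():
            open_rings ^= {ch}  # toggle: first sighting opens, second closes

    if depth:
        problems.append("unclosed '('")
    if in_bracket:
        problems.append("unclosed '['")
    if open_rings:
        problems.append(f"unclosed ring labels: {sorted(open_rings)}")
    return problems


print(smiles_syntax_problems("CC(=O)OC1=CC=CC=C1C(=O)O"))  # aspirin
print(smiles_syntax_problems("C1CC(N]"))                   # several pairing errors
```

A string that passes this check can still fail in RDKit on valence grounds, so the check only explains one layer of the failures seen above.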

Building a SELFIES Visualization Tool

Having confirmed SELFIES’ robustness, let me build a tool to visualize these structures. I’ll create a command-line utility that converts any SELFIES string into a molecular image.

Setting Up the Core Converter

The process is similar to the SMILES visualizer I built previously but handles SELFIES-specific formatting:

#!/usr/bin/env python3
import selfies as sf
from rdkit import Chem
from rdkit.Chem import Draw, rdDepictor, rdMolDescriptors
from PIL import Image, ImageDraw, ImageFont
from pathlib import Path

def selfies_to_png(selfies_string: str, output_file: str, size: int = 500):
    """Convert a SELFIES string to a PNG image with molecular formula legend."""
    
    # Input validation
    if not selfies_string or not selfies_string.strip():
        raise ValueError("SELFIES string cannot be empty")
    
    if size <= 0:
        raise ValueError(f"Image size must be positive, got: {size}")

    # Ensure output directory exists
    output_path = Path(output_file)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    
    # Decode SELFIES to SMILES with comprehensive error handling
    try:
        smiles_string = sf.decoder(selfies_string.strip())
    except Exception as e:
        # Chain the original exception so the full traceback is preserved
        raise ValueError(f"Invalid SELFIES string: '{selfies_string}'. SELFIES decoding error: {e}") from e
    
    if not smiles_string:
        raise ValueError(f"SELFIES string '{selfies_string}' decoded to empty SMILES.")
    
    # Create RDKit molecule
    mol = Chem.MolFromSmiles(smiles_string)
    if mol is None:
        raise ValueError(f"Invalid SELFIES: decoded to invalid SMILES '{smiles_string}'")
    
    # Generate the image with legend
    img = create_molecule_image(mol, selfies_string, size)
    img.save(output_file, "PNG", optimize=True)
    
def create_molecule_image(mol: Chem.Mol, selfies_string: str, size: int = 500):
    """Create molecule image with SELFIES string in legend and dynamic sizing."""
    
    # Calculate dynamic sizes based on image size
    sizes = _calculate_dynamic_sizes(size)
    
    # Generate 2D coordinates and molecular properties
    rdDepictor.Compute2DCoords(mol)
    molecular_formula = rdMolDescriptors.CalcMolFormula(mol)
    
    # Create base molecular image
    mol_img = Draw.MolToImage(mol, size=(size, size))
    if mol_img.mode != "RGBA":
        mol_img = mol_img.convert("RGBA")
    
    # Calculate required legend height and width for long SELFIES strings
    legend_height = _calculate_legend_height(molecular_formula, selfies_string, sizes, size)
    legend_height = max(legend_height, int(size * 0.15))  # Ensure minimum height
    
    # Calculate if we need extra width for the SELFIES text
    final_width = _calculate_required_width(selfies_string, sizes, size)
    
    # Create final image with appropriate dimensions
    total_height = size + legend_height
    final_img = Image.new("RGBA", (final_width, total_height), "white")
    
    # Center the molecule image if we made the canvas wider
    mol_x_offset = (final_width - size) // 2
    final_img.paste(mol_img, (mol_x_offset, 0))
    
    # Add formatted legend with molecular formula and SELFIES
    draw = ImageDraw.Draw(final_img)
    font_regular, font_small = _load_fonts(sizes['regular_font_size'], sizes['subscript_font_size'])
    
    _draw_molecular_formula(draw, molecular_formula, font_regular, font_small, sizes, size, mol_x_offset)
    _draw_selfies_legend(draw, selfies_string, font_regular, sizes, molecular_formula, size, final_width)
    
    return final_img

Handling Long SELFIES Strings

The main challenge is handling SELFIES notation in the legend: SELFIES strings can be much longer than SMILES, especially for complex molecules. I need intelligent text wrapping that breaks at logical boundaries:

def _draw_wrapped_text(
    draw: ImageDraw.ImageDraw, text: str, start_x: int, start_y: int, max_width: int, font, sizes: dict
) -> None:
    """Draw text with word wrapping at bracket boundaries for SELFIES strings."""
    # For SELFIES, we want to break at bracket boundaries to maintain readability
    # Split SELFIES into tokens (brackets and their contents)
    tokens = []
    current_token = ""
    
    for char in text:
        if char == '[':
            if current_token:
                tokens.append(current_token)
                current_token = ""
            current_token = char
        elif char == ']':
            current_token += char
            tokens.append(current_token)
            current_token = ""
        else:
            current_token += char
    
    if current_token:
        tokens.append(current_token)
    
    # Now draw tokens, wrapping as needed at logical boundaries
    x_pos = start_x
    y_pos = start_y
    line_height = sizes['regular_font_size'] + 4  # Generous line spacing
    
    for token in tokens:
        token_width = int(draw.textlength(token, font=font))
        
        # If this token would exceed the line, wrap to next line
        if x_pos + token_width > start_x + max_width and x_pos > start_x:
            y_pos += line_height
            x_pos = start_x
        
        draw.text((x_pos, y_pos), token, fill="black", font=font)
        x_pos += token_width

def _calculate_dynamic_sizes(image_size: int):
    """Calculate dynamic sizing values based on image size for responsive design."""
    return {
        'legend_height': int(image_size * 0.08),
        'legend_y_offset': int(image_size * 0.02),
        'legend_x_offset': int(image_size * 0.02),
        'subscript_y_offset': int(image_size * 0.006),
        'regular_font_size': int(image_size * 0.028),
        'subscript_font_size': int(image_size * 0.02),
    }

This approach ensures SELFIES strings remain readable even when they wrap across multiple lines.
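The wrapping rule can be tried without Pillow by swapping pixel widths for a character budget. The sketch below makes that substitution, so it illustrates the idea and is not the code used in the script; `split_selfies_tokens` and `wrap_selfies` are hypothetical names, and the regex stands in for the `split_selfies` helper that the `selfies` package provides.

```python
import re


def split_selfies_tokens(selfies: str) -> list[str]:
    """Split a SELFIES string into its bracketed symbols."""
    return re.findall(r"\[[^\[\]]*\]", selfies)


def wrap_selfies(selfies: str, max_chars: int) -> list[str]:
    """Greedily pack whole tokens into lines of at most max_chars characters.

    A single token longer than max_chars still gets a line of its own.
    """
    lines, current = [], ""
    for token in split_selfies_tokens(selfies):
        if current and len(current) + len(token) > max_chars:
            lines.append(current)   # line is full: start a new one
            current = token
        else:
            current += token
    if current:
        lines.append(current)
    return lines


benzene = "[C][=C][C][=C][C][=C][Ring1][=Branch1]"
for line in wrap_selfies(benzene, max_chars=16):
    print(line)
```

Because breaks only happen between tokens, joining the lines always gives back the original string.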

Testing the Tool: From Simple to Complex

Let me put the visualization tool through its paces, starting with simple molecules and building up to complex pharmaceuticals.

Simple Molecules First

# Ethanol - simplest alcohol
python selfies2png.py "[C][C][O]" -o ethanol_selfies.png --size 600

# Acetone - demonstrates branching
python selfies2png.py "[C][C][=Branch1][C][=O][C]" -o acetone_selfies.png --size 600

**Ethanol**: `[C][C][O]` - The simplest alcohol demonstrating basic SELFIES syntax

**Acetone**: `[C][C][=Branch1][C][=O][C]` - Shows how SELFIES handles branching with explicit branch notation. The `=` on `[=Branch1]` lets the branch open with a double bond; with a plain `[Branch1]` the decoder caps the branch at a single bond and the string decodes to 2-propanol instead.

Aromatic Systems

**Benzene**: `[C][=C][C][=C][C][=C][Ring1][=Branch1]` - Ring closure using local indexing

**Toluene**: `[C][C][=C][C][=C][C][=C][Ring1][=Branch1]` - Substituted aromatic ring
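The `[Ring1][=Branch1]` pair at the end of the benzene string reads as "close a ring, and the next symbol is a number". As I understand the SELFIES 2.x grammar, the symbol after a ring or branch token is looked up in a 16-entry index table and read as a base-16 digit. `[=Branch1]` is 4, so the ring bond reaches back 4 + 1 = 5 atoms, which closes the six-membered ring. The sketch below hard-codes that table as I recall it from the SELFIES documentation, so treat the table as an assumption and check it against your installed version. `index_value` is a hypothetical helper, not a library function.

```python
# Index table as documented for SELFIES 2.x (an assumption to verify locally).
INDEX_SYMBOLS = [
    "[C]", "[Ring1]", "[Ring2]",
    "[Branch1]", "[=Branch1]", "[#Branch1]",
    "[Branch2]", "[=Branch2]", "[#Branch2]",
    "[O]", "[N]", "[=N]", "[=C]", "[#C]", "[S]", "[P]",
]
INDEX_OF = {symbol: i for i, symbol in enumerate(INDEX_SYMBOLS)}


def index_value(symbols: list[str]) -> int:
    """Read one or more index symbols as a base-16 number."""
    value = 0
    for symbol in symbols:
        # Symbols outside the table count as 0 in this sketch.
        value = value * 16 + INDEX_OF.get(symbol, 0)
    return value


q = index_value(["[=Branch1]"])
print(f"[Ring1][=Branch1] bonds the current atom to the atom {q + 1} positions back")
```

The same table explains branch lengths: in `[Branch1][C]`, the `[C]` reads as 0, so the branch holds 0 + 1 = 1 symbol.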

Complex Pharmaceutical Molecules

**Aspirin**: `[C][C][=Branch1][C][=O][O][C][=C][C][=C][C][=C][Ring1][=Branch1][C][=Branch1][C][=O][O]` - A complex molecule showing SELFIES handling of multiple functional groups and intelligent text wrapping in the legend

Notice how SELFIES handles complexity gracefully. The bracket notation is more verbose than SMILES, but this verbosity is what enables the robustness guarantee.

Practical Applications

SELFIES work well for several applications:

Molecular Generation: Train a language model on SELFIES and every generated sequence is a valid molecule. With SMILES, you need additional filtering and validation steps.

Chemical Space Exploration: Random walks in SELFIES space explore valid chemical structures, which can be useful for diversity-oriented synthesis planning.

Property Optimization: Bayesian optimization and genetic algorithms can work directly in SELFIES space without worrying about validity constraints.

Getting Started

The tools I’ve built provide a foundation for SELFIES-based projects. The complete selfies2png.py script is included at the end of this post, along with all the experimental code shown above.

If you’re working on machine learning projects involving molecules, SELFIES might be worth considering. The guaranteed validity can eliminate complexity and make model training more straightforward.

Advanced SELFIES Features

Let me explore some additional SELFIES capabilities through hands-on examples:

Debugging with Attribution

If you’re curious how SELFIES tokens map to the final SMILES, let’s trace the conversion for a complex molecule:

import selfies as sf

# A molecule with branching and rings - nicotine
selfies_str = "[C][N][C][Branch1][C][C][C][C][Ring1][=Branch1][C][=C][C][N][=C][Ring1][Branch1]"
smiles_str, attribution = sf.decoder(selfies_str, attribute=True)

print(f"Original SELFIES: {selfies_str}")
print(f"Decoded SMILES:   {smiles_str}")
print("\nHow each SMILES token was built:")
for i, attr_map in enumerate(attribution):
    contributing_tokens = [a.token for a in attr_map.attribution]
    print(f"  #{i+1}: '{attr_map.token}' ← {contributing_tokens}")

Running this shows exactly how the ring closures and branches are constructed:

Original SELFIES: [C][N][C][Branch1][C][C][C][C][Ring1][=Branch1][C][=C][C][N][=C][Ring1][Branch1]
Decoded SMILES:   CN1C(C)CCC1c1cccnc1

How each SMILES token was built:
  #1: 'C' ← ['[C]']
  #2: 'N' ← ['[N]']
  #3: '1' ← ['[Ring1]', '[=Branch1]']
  #4: 'C' ← ['[C]']
  #5: '(' ← ['[Branch1]']
  #6: 'C' ← ['[C]']
  #7: ')' ← ['[C]', '[C]', '[Ring1]']
  ...

This attribution can be useful when debugging molecular transformations or understanding why certain SELFIES patterns produce specific structures.

Customizing Chemical Constraints

For exploring unusual chemistry, SELFIES lets you enable hypervalent atoms and other exotic structures:

import selfies as sf
from rdkit import Chem
from rdkit.Chem import Draw

# Create a hypervalent iodine compound (common in organic synthesis)
hypervalent_smiles = 'O=I(O)(O)(O)(O)O'  # Periodic acid
mol = Chem.MolFromSmiles(hypervalent_smiles)

if mol:
    # strict=True checks the molecule against the active constraints, so the
    # default rules are expected to reject heptavalent iodine
    try:
        standard_selfies = sf.encoder(hypervalent_smiles, strict=True)
    except sf.EncoderError as e:
        standard_selfies = f"rejected ({e})"
    # strict=False skips that check
    relaxed_selfies = sf.encoder(hypervalent_smiles, strict=False)
    
    print(f"Original SMILES: {hypervalent_smiles}")
    print(f"Standard SELFIES: {standard_selfies}")
    print(f"Relaxed SELFIES:  {relaxed_selfies}")
    
    # Now decode with hypervalent constraints enabled
    sf.set_semantic_constraints("hypervalent")
    decoded_hypervalent = sf.decoder(relaxed_selfies)
    
    print(f"Hypervalent decode: {decoded_hypervalent}")
    
    # Visualize the hypervalent structure
    mol_hypervalent = Chem.MolFromSmiles(decoded_hypervalent)
    if mol_hypervalent:
        img = Draw.MolToImage(mol_hypervalent, size=(300, 300))
        img.save("hypervalent_iodine.png")
    
    # Restore the default constraints so later code is unaffected
    sf.set_semantic_constraints()

This produces molecules with expanded valence shells - structures that the default SELFIES constraints would not allow.

Building ML-Ready Datasets

Here’s how to prepare SELFIES data for machine learning models:

import selfies as sf
import numpy as np
from collections import Counter

# Example: preparing a small drug dataset
drug_smiles = [
    "CC(=O)OC1=CC=CC=C1C(=O)O",  # Aspirin
    "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",  # Caffeine  
    "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O",  # Ibuprofen
    "O=C(O)C1=CC=CC=C1",  # Benzoic acid
]

# Convert to SELFIES
drug_selfies = [sf.encoder(smiles) for smiles in drug_smiles]
print("SELFIES representations:")
for i, selfies_str in enumerate(drug_selfies):
    print(f"{i+1}. {selfies_str}")

# Build vocabulary from our dataset
alphabet = sf.get_alphabet_from_selfies(drug_selfies)
alphabet.add("[nop]")  # Padding token for variable-length sequences
alphabet = sorted(alphabet)

print(f"\nVocabulary size: {len(alphabet)}")
print(f"Tokens: {alphabet[:10]}...")  # Show first 10 tokens

# Create encoding dictionaries
token_to_idx = {token: i for i, token in enumerate(alphabet)}
idx_to_token = {i: token for token, i in token_to_idx.items()}

# Encode one molecule for neural network input
caffeine_selfies = drug_selfies[1]
tokens = list(sf.split_selfies(caffeine_selfies))
print(f"\nCaffeine tokens: {tokens}")

# Convert to numerical representation
indices = [token_to_idx[token] for token in tokens]
print(f"As indices: {indices}")

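# Create padded label and one-hot encodings
max_len = 25  # Pad to fixed length
label_encoding, one_hot = sf.selfies_to_encoding(
    caffeine_selfies, 
    vocab_stoi=token_to_idx, 
    pad_to_len=max_len,
    enc_type="both"
)

# selfies_to_encoding returns plain Python lists; wrap them to get .shape
label_encoding = np.array(label_encoding)
one_hot = np.array(one_hot)

print(f"Label encoding shape: {label_encoding.shape}")
print(f"One-hot encoding shape: {one_hot.shape}")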

This produces ML-ready numerical representations that can feed directly into transformer models or RNNs.
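To make the two encodings concrete, here is a dependency-free sketch of what a padded label encoding and one-hot encoding look like for a tiny vocabulary. It is a hand-rolled illustration and not the library’s implementation; `encode_selfies` is a hypothetical name, and the three-token vocabulary is made up for the example.

```python
import re


def encode_selfies(selfies: str, vocab: dict[str, int], pad_to_len: int,
                   pad_token: str = "[nop]") -> tuple[list[int], list[list[int]]]:
    """Return (label encoding, one-hot encoding), right-padded to pad_to_len."""
    tokens = re.findall(r"\[[^\[\]]*\]", selfies)
    tokens += [pad_token] * max(0, pad_to_len - len(tokens))
    # Label encoding: one integer per token
    labels = [vocab[token] for token in tokens]
    # One-hot encoding: one row per token, one column per vocabulary entry
    one_hot = [[1 if i == label else 0 for i in range(len(vocab))] for label in labels]
    return labels, one_hot


# Toy vocabulary; sorting keeps the token-to-index mapping stable between runs
vocab = {token: i for i, token in enumerate(sorted({"[C]", "[O]", "[nop]"}))}
labels, one_hot = encode_selfies("[C][C][O]", vocab, pad_to_len=5)
print(labels)
print(one_hot)
```

Each row of the one-hot matrix has exactly one 1, and the padding positions all point at the `[nop]` column.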

SELFIES in Practice: Some Real Applications

Let me show SELFIES solving actual problems in computational chemistry:

Molecular Generation That Works

Traditional SMILES-based generation requires validity checking. With SELFIES, every output is guaranteed valid:

import selfies as sf
import random
from rdkit import Chem
from rdkit.Chem import Descriptors

def generate_random_drug_like_molecules(n=5):
    """Generate random molecules and flag which ones pass Lipinski's Rule of Five."""
    alphabet = list(sf.get_semantic_robust_alphabet())
    
    # Focus on common drug atoms. The substring test also keeps '[Cl]' (it contains 'C')
    # and drops every [Branch] and [Ring] token, so the results are linear chains.
    drug_atoms = [token for token in alphabet if any(atom in token for atom in ['C', 'N', 'O', 'S', 'F'])]
    
    molecules = []
    for i in range(n):
        # Generate random 12-token SELFIES
        random_selfies = ''.join(random.choices(drug_atoms, k=12))
        
        # Decode and analyze
        smiles = sf.decoder(random_selfies)
        mol = Chem.MolFromSmiles(smiles)
        
        # Calculate drug-like properties
        mw = Descriptors.MolWt(mol)
        logp = Descriptors.MolLogP(mol)
        hbd = Descriptors.NumHDonors(mol)
        hba = Descriptors.NumHAcceptors(mol)
        
        # Check Lipinski's Rule of Five
        lipinski_pass = (mw <= 500 and logp <= 5 and hbd <= 5 and hba <= 10)
        
        molecules.append({
            'selfies': random_selfies,
            'smiles': smiles,
            'mw': mw,
            'logp': logp,
            'lipinski': lipinski_pass
        })
        
        print(f"Molecule {i+1}:")
        print(f"  SELFIES: {random_selfies}")
        print(f"  SMILES:  {smiles}")
        print(f"  MW: {mw:.1f}, LogP: {logp:.1f}")
        print(f"  Lipinski compliant: {lipinski_pass}")
        print()
    
    return molecules

# Generate some random drug-like molecules
random.seed(42)
drug_molecules = generate_random_drug_like_molecules(3)

The output looks like this (the molecules are unusual, and the exact strings depend on the SELFIES version and alphabet ordering):

Molecule 1:
  SELFIES: [O+1][C][F][=N][O][C-1][N][S-1][C][#N][C-1][O]
  SMILES:  [O+1]CF=N[O][C-1]N[S-1]C#N[C-1]O
  MW: 168.1, LogP: -0.3
  Lipinski compliant: True

Molecule 2:
  SELFIES: [=N][#C][S][=C][C][#N][=O][F][#N][C][S][C]
  SMILES:  N#CS=CC#N=OF#NCS
  MW: 183.2, LogP: 1.2
  Lipinski compliant: True
...

Every generated molecule is valid, so no validity filtering is required. Property filters such as Lipinski’s rules still decide which of them are drug-like.

Chemical Space Exploration

Because every string decodes, SELFIES let you explore chemical space without a validity check at each step:

import selfies as sf
from rdkit import Chem
from rdkit.Chem import Descriptors
import random

def mutate_molecule(selfies_str, mutation_rate=0.2):
    """Mutate a SELFIES string by randomly changing tokens."""
    alphabet = list(sf.get_semantic_robust_alphabet())
    tokens = list(sf.split_selfies(selfies_str))
    
    mutated_tokens = []
    for token in tokens:
        if random.random() < mutation_rate:
            # Replace with random token
            mutated_tokens.append(random.choice(alphabet))
        else:
            mutated_tokens.append(token)
    
    return ''.join(mutated_tokens)

def explore_chemical_space(starting_molecule, generations=5):
    """Perform a random walk through chemical space."""
    current_selfies = sf.encoder(starting_molecule)
    
    print(f"Starting molecule: {starting_molecule}")
    print(f"Starting SELFIES:  {current_selfies}")
    print()
    
    for gen in range(generations):
        # Mutate the molecule
        mutated_selfies = mutate_molecule(current_selfies)
        mutated_smiles = sf.decoder(mutated_selfies)
        mol = Chem.MolFromSmiles(mutated_smiles)
        
        # Calculate properties
        if mol:
            mw = Descriptors.MolWt(mol)
            logp = Descriptors.MolLogP(mol)
            
            print(f"Generation {gen+1}:")
            print(f"  SELFIES: {mutated_selfies}")
            print(f"  SMILES:  {mutated_smiles}")
            print(f"  MW: {mw:.1f}, LogP: {logp:.2f}")
            print()
            
            current_selfies = mutated_selfies

# Start with aspirin and explore
random.seed(123)
explore_chemical_space("CC(=O)OC1=CC=CC=C1C(=O)O", generations=3)

This shows how molecular properties change as we walk through chemical space - every step guaranteed to be a valid molecule.
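When reading the output of such a walk, it helps to know which tokens were touched in each generation. The sketch below is a variant of the mutation step that works on a token list, takes an explicit random generator for reproducibility, and reports the resampled positions. It assumes nothing about SELFIES itself; `mutate_tokens` and the four-symbol alphabet are hypothetical stand-ins.

```python
import random


def mutate_tokens(tokens: list[str], alphabet: list[str], rate: float,
                  rng: random.Random) -> tuple[list[str], list[int]]:
    """Resample each token with probability `rate`.

    Returns the new token list and the positions that were resampled
    (a resampled token can, by chance, equal the original).
    """
    mutated, resampled = [], []
    for position, token in enumerate(tokens):
        if rng.random() < rate:
            mutated.append(rng.choice(alphabet))
            resampled.append(position)
        else:
            mutated.append(token)
    return mutated, resampled


parent = ["[C]", "[C]", "[O]"]
toy_alphabet = ["[C]", "[N]", "[O]", "[F]"]  # stand-in for the real SELFIES alphabet
child, positions = mutate_tokens(parent, toy_alphabet, rate=0.5, rng=random.Random(0))
print(child, positions)
```

Passing a dedicated `random.Random` instance keeps the walk repeatable even when other code also draws random numbers.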

Performance Tips

When working with SELFIES in production systems:

Batch Processing for Speed

import random
import selfies as sf
from rdkit import Chem
from rdkit.Chem import Descriptors
from concurrent.futures import ProcessPoolExecutor
import time

def process_single_selfies(selfies_str):
    """Process one SELFIES string and return properties."""
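    try:
        smiles = sf.decoder(selfies_str)
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            # Return a record instead of falling through and returning None
            return {'selfies': selfies_str, 'valid': False, 'error': 'RDKit could not parse the decoded SMILES'}
        return {
            'selfies': selfies_str,
            'smiles': smiles,
            'mw': Descriptors.MolWt(mol),
            'logp': Descriptors.MolLogP(mol),
            'valid': True
        }
    except Exception as e:
        return {'selfies': selfies_str, 'valid': False, 'error': str(e)}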

def benchmark_processing(selfies_list, use_parallel=True):
    """Compare serial vs parallel processing."""
    
    if use_parallel:
        # Parallel processing
        start_time = time.time()
        with ProcessPoolExecutor(max_workers=4) as executor:
            results = list(executor.map(process_single_selfies, selfies_list))
        parallel_time = time.time() - start_time
        print(f"Parallel processing: {parallel_time:.2f}s for {len(selfies_list)} molecules")
        return results
    else:
        # Serial processing  
        start_time = time.time()
        results = [process_single_selfies(s) for s in selfies_list]
        serial_time = time.time() - start_time
        print(f"Serial processing: {serial_time:.2f}s for {len(selfies_list)} molecules")
        return results

# The guard is needed for ProcessPoolExecutor on platforms that spawn worker processes
if __name__ == "__main__":
    # Generate test dataset
    alphabet = list(sf.get_semantic_robust_alphabet())
    test_selfies = [''.join(random.choices(alphabet, k=10)) for _ in range(100)]

    # Benchmark both approaches
    serial_results = benchmark_processing(test_selfies, use_parallel=False)
    parallel_results = benchmark_processing(test_selfies, use_parallel=True)

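For larger datasets, parallel processing can speed up SELFIES operations. For a small batch like the 100 strings here, the cost of starting worker processes will likely outweigh the gain, so measure before committing to either approach.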
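One way to reduce the per-item overhead of a process pool is to hand each worker a whole batch instead of a single string. The sketch below shows that pattern under stated assumptions: `chunked`, `process_batch`, and `run_in_batches` are hypothetical names, and `process_batch` is a placeholder that only counts tokens, standing in for real decoding work.

```python
from concurrent.futures import ProcessPoolExecutor
from typing import Callable, Iterator, Sequence


def chunked(items: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_batch(batch: Sequence[str]) -> list[int]:
    """Placeholder worker: returns token counts; swap in real decoding here."""
    return [s.count("[") for s in batch]


def run_in_batches(items: Sequence[str], worker: Callable, batch_size: int = 1000,
                   max_workers: int = 4) -> list:
    """Send whole batches to worker processes and flatten the results in order."""
    results = []
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        # executor.map preserves input order, so results line up with items
        for batch_result in executor.map(worker, chunked(items, batch_size)):
            results.extend(batch_result)
    return results


# The batching helper on its own, without starting any processes
print(list(chunked(["a", "b", "c", "d", "e"], 2)))
```

A call such as `run_in_batches(test_selfies, process_batch, batch_size=25)` belongs under an `if __name__ == "__main__":` guard, and the worker must be a top-level function so it can be pickled. `executor.map` also accepts a `chunksize` argument, which achieves a similar effect without manual batching.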

Building Custom SELFIES Applications

Let me create some practical tools that show SELFIES capabilities:

Smart Molecular Filter

import selfies as sf
from rdkit import Chem
from rdkit.Chem import Descriptors, Crippen

class SmartMolecularFilter:
    """Filter molecules based on drug-like properties."""
    
    def __init__(self):
        self.lipinski_rules = {
            'mw_max': 500,
            'logp_max': 5,
            'hbd_max': 5,
            'hba_max': 10
        }
    
    def check_lipinski(self, mol):
        """Check Lipinski's Rule of Five."""
        if not mol:
            return False
        
        mw = Descriptors.MolWt(mol)
        logp = Crippen.MolLogP(mol)
        hbd = Descriptors.NumHDonors(mol)
        hba = Descriptors.NumHAcceptors(mol)
        
        return (mw <= self.lipinski_rules['mw_max'] and
                logp <= self.lipinski_rules['logp_max'] and
                hbd <= self.lipinski_rules['hbd_max'] and
                hba <= self.lipinski_rules['hba_max'])
    
    def analyze_selfies(self, selfies_str):
        """Analyze a SELFIES string for drug-likeness."""
        try:
            smiles = sf.decoder(selfies_str)
            mol = Chem.MolFromSmiles(smiles)
            
            if not mol:
                return {'valid': False, 'reason': 'Invalid molecule'}
            
            # Calculate properties
            props = {
                'mw': Descriptors.MolWt(mol),
                'logp': Crippen.MolLogP(mol),
                'hbd': Descriptors.NumHDonors(mol),
                'hba': Descriptors.NumHAcceptors(mol),
                'tpsa': Descriptors.TPSA(mol),
                'rotatable_bonds': Descriptors.NumRotatableBonds(mol)
            }
            
            # Check filters
            lipinski_pass = self.check_lipinski(mol)
            
            return {
                'valid': True,
                'selfies': selfies_str,
                'smiles': smiles,
                'properties': props,
                'lipinski_compliant': lipinski_pass,
                'drug_like_score': self._calculate_drug_score(props)
            }
            
        except Exception as e:
            return {'valid': False, 'reason': f'Error: {str(e)}'}
    
    def _calculate_drug_score(self, props):
        """Simple drug-likeness score (0-1)."""
        score = 0
        if props['mw'] <= 500: score += 0.2
        if 0 <= props['logp'] <= 5: score += 0.2
        if props['hbd'] <= 5: score += 0.2
        if props['hba'] <= 10: score += 0.2
        if props['tpsa'] <= 140: score += 0.2
        return score

# Test the filter
filter_tool = SmartMolecularFilter()

# Test with some SELFIES
test_molecules = [
    "[C][C][=Branch1][C][=O][O]",  # Acetic acid
    "[C][C][=Branch1][C][=O][O][C][=C][C][=C][C][=C][Ring1][=Branch1][C][=Branch1][C][=O][O]",  # Aspirin
    "[C][N][C][=O]"  # N-methylformamide
]

for selfies_str in test_molecules:
    result = filter_tool.analyze_selfies(selfies_str)
    if result['valid']:
        props = result['properties']
        print(f"SELFIES: {selfies_str}")
        print(f"SMILES:  {result['smiles']}")
        print(f"MW: {props['mw']:.1f}, LogP: {props['logp']:.2f}")
        print(f"Drug score: {result['drug_like_score']:.1f}/1.0")
        print(f"Lipinski: {'✓' if result['lipinski_compliant'] else '✗'}")
        print()

This tool shows how SELFIES can be integrated into molecular analysis workflows.

Comparing SELFIES and SMILES

Let me wrap up with a direct comparison of the same molecule in both representations:

Aspirin Example

SMILES: CC(=O)OC1=CC=CC=C1C(=O)O
SELFIES: [C][C][=Branch1][C][=O][O][C][=C][C][=C][C][=C][Ring1][=Branch1][C][=Branch1][C][=O][O]

**Aspirin rendered from SELFIES**: The verbose but robust SELFIES notation guarantees a valid molecular structure

Trade-offs Summary

Aspect	SMILES	SELFIES
Compactness	✅ More compact	❌ More verbose
Readability	✅ Familiar to chemists	❌ Requires learning
Robustness	❌ Many invalid strings	✅ 100% valid
ML Suitability	❌ Low success rates	✅ Perfect for generation
Tool Support	✅ Universal support	✅ Growing rapidly

What’s Next

The SELFIES ecosystem continues to grow. Some interesting developments to watch:

New Research Directions

Current work includes polymer representations for materials science, reaction modeling for synthetic planning, and stereochemistry extensions for pharmaceutical applications. The robustness guarantee is encouraging adoption in production ML systems where validity matters.

Getting Involved

Consider trying SELFIES in your computational chemistry projects. The guarantee of valid molecules can reduce complexity and debugging time. Whether you’re building generative models, exploring chemical space, or creating educational tools, SELFIES provide a reliable foundation.

Conclusion

Through hands-on experiments and practical examples, I’ve explored how SELFIES address the robustness problem in molecular representations. Every random string I generated became a valid molecule. Every mutation in chemical space exploration produced a parseable structure. By the same mechanism, a machine learning model that emits SELFIES tokens will always decode to a valid molecular graph.

This has practical benefits. Instead of spending time filtering invalid molecules, researchers can focus on optimizing properties and discovering new compounds. The robustness guarantee can eliminate a class of bugs and edge cases from computational workflows.

The tools I built show SELFIES in action: from basic visualization to molecular analysis. While SELFIES are more verbose than SMILES, this verbosity provides something valuable - the certainty that every generated molecule is chemically meaningful.

As molecular AI becomes more sophisticated, SELFIES provide a reliable foundation for these applications. They’re worth trying in your next project if you want the peace of mind that comes with guaranteed molecular validity.

Try It Yourself

If you want to explore SELFIES, here’s a simple experiment to get started:

# Install dependencies
pip install selfies rdkit pillow

# Generate some random molecules
python -c "
import selfies as sf
import random
alphabet = list(sf.get_semantic_robust_alphabet())
for i in range(3):
    random_selfies = ''.join(random.choices(alphabet, k=8))
    smiles = sf.decoder(random_selfies)
    print(f'{i+1}. {random_selfies} → {smiles}')
"

# Visualize your favorite molecule (using the script below)
python selfies2png.py "[C][=C][C][=C][C][=C][Ring1][=Branch1]" -o benzene.png

Every random generation should work - that’s the SELFIES guarantee in action.

Complete SELFIES Visualization Script

Here’s the complete selfies2png.py script that powers the visualizations in this post. Save this as selfies2png.py and you’ll have a tool for converting any SELFIES string into molecular images:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""
SELFIES to PNG Converter
=======================

A command-line utility to render SELFIES strings as 2D molecular images with
molecular formulas and proper subscript formatting.

This script demonstrates how to:
- Parse SELFIES strings and decode them to SMILES
- Generate 2D molecular coordinates using RDKit
- Create publication-quality molecular images
- Add custom legends with molecular formulas
- Handle font rendering and subscripts

Example Usage:
    python selfies2png.py "[C][C][O]"  # Ethanol
    python selfies2png.py "[C][C][Branch1][C][C][C]" -o isobutane.png  # Isobutane
    python selfies2png.py "[C][=C][C][=C][C][=C][Ring1][=Branch1]" --size 800  # Benzene, larger image

Author: Hunter Heidenreich
Website: https://hunterheidenreich.com
"""

import argparse
import hashlib
import sys
from pathlib import Path

# SELFIES import
import selfies as sf

# RDKit imports
from rdkit import Chem
from rdkit.Chem import Draw, rdDepictor, rdMolDescriptors

# PIL imports for image manipulation
from PIL import Image, ImageDraw, ImageFont

# Constants for image configuration
DEFAULT_IMAGE_SIZE = 500
LEGEND_HEIGHT_RATIO = 0.08  # Legend height as ratio of image size
LEGEND_Y_OFFSET_RATIO = 0.02  # Y offset as ratio of image size
LEGEND_X_OFFSET_RATIO = 0.02  # X offset as ratio of image size
SUBSCRIPT_Y_OFFSET_RATIO = 0.006  # Subscript offset as ratio of image size

# Font size ratios based on image size
REGULAR_FONT_RATIO = 0.028  # Regular font size as ratio of image size
SUBSCRIPT_FONT_RATIO = 0.02  # Subscript font size as ratio of image size

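# Font paths for different operating systems (first match wins)
FONT_PATHS = [
    "/System/Library/Fonts/Supplemental/Arial.ttf",  # macOS (newer releases)
    "/System/Library/Fonts/Arial.ttf",  # macOS
    "/usr/share/fonts/truetype/arial.ttf",  # Linux
    "/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf",  # Linux fallback (Debian/Ubuntu)
    "C:/Windows/Fonts/arial.ttf",  # Windows
]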


def _calculate_dynamic_sizes(image_size: int):
    """Calculate dynamic sizing values based on image size."""
    return {
        'legend_height': int(image_size * LEGEND_HEIGHT_RATIO),
        'legend_y_offset': int(image_size * LEGEND_Y_OFFSET_RATIO),
        'legend_x_offset': int(image_size * LEGEND_X_OFFSET_RATIO),
        'subscript_y_offset': int(image_size * SUBSCRIPT_Y_OFFSET_RATIO),
        'regular_font_size': int(image_size * REGULAR_FONT_RATIO),
        'subscript_font_size': int(image_size * SUBSCRIPT_FONT_RATIO),
    }


def _load_fonts(regular_size: int, subscript_size: int):
    """Load system fonts for text rendering, with fallback to default font."""
    font_regular = None
    font_small = None

    for font_path in FONT_PATHS:
        try:
            font_regular = ImageFont.truetype(font_path, regular_size)
            font_small = ImageFont.truetype(font_path, subscript_size)
            break
        except (OSError, IOError):
            continue

    if font_regular is None:
        font_regular = ImageFont.load_default()
        font_small = ImageFont.load_default()

    return font_regular, font_small


def create_molecule_image(
    mol: Chem.Mol, selfies_string: str, size: int = DEFAULT_IMAGE_SIZE
) -> Image.Image:
    """
    Creates a molecule image with a legend showing molecular formula and SELFIES string.

    Args:
        mol: RDKit molecule object (already validated)
        selfies_string: Original SELFIES string for legend display
        size: Image size in pixels (square image)

    Returns:
        PIL Image object with molecule structure and formatted legend
    """
    # Calculate dynamic sizes based on image size
    sizes = _calculate_dynamic_sizes(size)
    
    rdDepictor.Compute2DCoords(mol)
    molecular_formula = rdMolDescriptors.CalcMolFormula(mol)

    mol_img = Draw.MolToImage(mol, size=(size, size))
    if mol_img.mode != "RGBA":
        mol_img = mol_img.convert("RGBA")

    # Calculate required legend height based on content - be more generous
    legend_height = _calculate_legend_height(molecular_formula, selfies_string, sizes, size)
    
    # Add extra safety margin to ensure text doesn't get cut off
    legend_height = max(legend_height, int(size * 0.15))  # At least 15% of image height
    
    # Calculate if we need extra width for the SELFIES text
    temp_img = Image.new("RGBA", (1, 1), "white")
    temp_draw = ImageDraw.Draw(temp_img)
    font_regular, _ = _load_fonts(sizes['regular_font_size'], sizes['subscript_font_size'])
    
    selfies_label = "SELFIES: "
    label_width = int(temp_draw.textlength(selfies_label, font=font_regular))
    selfies_width = int(temp_draw.textlength(selfies_string, font=font_regular))
    total_selfies_width = label_width + selfies_width + (sizes['legend_x_offset'] * 2)
    
    # Make image wider if needed to accommodate the full SELFIES string
    final_width = max(size, total_selfies_width)
    
    total_height = size + legend_height
    final_img = Image.new("RGBA", (final_width, total_height), "white")
    
    # Center the molecule image if we made the canvas wider
    mol_x_offset = (final_width - size) // 2
    final_img.paste(mol_img, (mol_x_offset, 0))

    draw = ImageDraw.Draw(final_img)
    font_regular, font_small = _load_fonts(sizes['regular_font_size'], sizes['subscript_font_size'])

    _draw_molecular_formula(draw, molecular_formula, font_regular, font_small, sizes, size, mol_x_offset)
    _draw_selfies_legend(draw, selfies_string, font_regular, sizes, molecular_formula, size, final_width)

    return final_img


def _draw_molecular_formula(
    draw: ImageDraw.Draw, formula: str, font_regular, font_small, sizes: dict, image_size: int, mol_x_offset: int = 0
) -> int:
    """Draw molecular formula with proper subscript formatting."""
    y_pos = image_size + sizes['legend_y_offset']
    x_pos = sizes['legend_x_offset'] + mol_x_offset

    draw.text((x_pos, y_pos), "Formula: ", fill="black", font=font_regular)
    x_pos += int(draw.textlength("Formula: ", font=font_regular))

    for char in formula:
        if char.isdigit():
            draw.text(
                (x_pos, y_pos + sizes['subscript_y_offset']), char, fill="black", font=font_small
            )
            x_pos += int(draw.textlength(char, font=font_small))
        else:
            draw.text((x_pos, y_pos), char, fill="black", font=font_regular)
            x_pos += int(draw.textlength(char, font=font_regular))

    return x_pos


def _calculate_legend_height(formula: str, selfies: str, sizes: dict, image_size: int) -> int:
    """Calculate the required height for the legend based on content."""
    # Create a temporary draw object to measure text
    temp_img = Image.new("RGBA", (1, 1), "white")
    temp_draw = ImageDraw.Draw(temp_img)
    font_regular, _ = _load_fonts(sizes['regular_font_size'], sizes['subscript_font_size'])
    
    # Calculate if SELFIES will fit on the same line as formula
    formula_text = f"Formula: {formula}"
    separator = " | SELFIES: "
    total_prefix_width = int(temp_draw.textlength(formula_text, font=font_regular)) + int(temp_draw.textlength(separator, font=font_regular))
    available_width_same_line = image_size - sizes['legend_x_offset'] - total_prefix_width - sizes['legend_x_offset']
    selfies_width = int(temp_draw.textlength(selfies, font=font_regular))
    
    base_height = sizes['legend_y_offset'] * 2 + sizes['regular_font_size']
    line_height = sizes['regular_font_size'] + 6  # Line spacing to match drawing function
    
    if selfies_width <= available_width_same_line:
        # Single line layout - add extra padding
        return base_height + 15
    else:
        # SELFIES goes on new line - calculate width available for SELFIES line
        selfies_label_width = int(temp_draw.textlength("SELFIES: ", font=font_regular))
        available_width_new_line = image_size - sizes['legend_x_offset'] - selfies_label_width - sizes['legend_x_offset']
        
        if selfies_width <= available_width_new_line:
            # SELFIES fits on one new line
            return base_height + line_height + 15
        else:
            # SELFIES needs wrapping - parse into tokens for accurate calculation
            tokens = []
            current_token = ""
            
            for char in selfies:
                if char == '[':
                    if current_token:
                        tokens.append(current_token)
                        current_token = ""
                    current_token = char
                elif char == ']':
                    current_token += char
                    tokens.append(current_token)
                    current_token = ""
                else:
                    current_token += char
            
            if current_token:
                tokens.append(current_token)
            
            # Simulate line wrapping
            lines_needed = 1
            current_line_width = 0
            
            for token in tokens:
                token_width = int(temp_draw.textlength(token, font=font_regular))
                if current_line_width + token_width > available_width_new_line and current_line_width > 0:
                    lines_needed += 1
                    current_line_width = token_width
                else:
                    current_line_width += token_width
            
            # Add extra height for the second line (SELFIES line) plus wrapped lines plus extra padding
            total_height = base_height + line_height + (line_height * (lines_needed - 1)) + 25  # Extra generous padding
            return total_height


def _draw_wrapped_text(
    draw: ImageDraw.Draw, text: str, start_x: int, start_y: int, max_width: int, font, sizes: dict
) -> None:
    """Draw text with word wrapping at bracket boundaries for SELFIES strings."""
    # For SELFIES, we want to break at bracket boundaries to maintain readability
    # Split SELFIES into tokens (brackets and their contents)
    tokens = []
    current_token = ""
    
    for char in text:
        if char == '[':
            if current_token:
                tokens.append(current_token)
                current_token = ""
            current_token = char
        elif char == ']':
            current_token += char
            tokens.append(current_token)
            current_token = ""
        else:
            current_token += char
    
    if current_token:
        tokens.append(current_token)
    
    # Now draw tokens, wrapping as needed
    x_pos = start_x
    y_pos = start_y
    line_height = sizes['regular_font_size'] + 6  # Same spacing as _calculate_legend_height
    
    for token in tokens:
        token_width = int(draw.textlength(token, font=font))
        
        # If this token would exceed the line, wrap to next line
        if x_pos + token_width > start_x + max_width and x_pos > start_x:
            y_pos += line_height
            x_pos = start_x
        
        draw.text((x_pos, y_pos), token, fill="black", font=font)
        x_pos += token_width


def _draw_selfies_legend(
    draw: ImageDraw.Draw, selfies: str, font_regular, sizes: dict, formula: str, 
    original_mol_size: int, final_image_width: int
) -> None:
    """Add SELFIES string to the image legend with proper text wrapping."""
    y_pos = original_mol_size + sizes['legend_y_offset']
    
    # ALWAYS put SELFIES on its own line - no more trying to fit on same line as formula
    selfies_y_pos = y_pos + sizes['regular_font_size'] + 6
    selfies_label = "SELFIES: "
    label_width = int(draw.textlength(selfies_label, font=font_regular))
    
    draw.text((sizes['legend_x_offset'], selfies_y_pos), selfies_label, fill="black", font=font_regular)
    
    # Use wrapped text drawing for ALL SELFIES strings
    selfies_x_pos = sizes['legend_x_offset'] + label_width
    available_width_for_selfies = final_image_width - selfies_x_pos - sizes['legend_x_offset']
    _draw_wrapped_text(
        draw, selfies, selfies_x_pos, selfies_y_pos, 
        available_width_for_selfies, font_regular, sizes
    )


def selfies_to_png(
    selfies_string: str, output_file: str, size: int = DEFAULT_IMAGE_SIZE
) -> None:
    """
    Convert a SELFIES string to a PNG image with molecular formula legend.

    Args:
        selfies_string: Valid SELFIES string representing a molecule
        output_file: Path where the PNG image will be saved
        size: Square image dimensions in pixels

    Raises:
        ValueError: If SELFIES string is invalid or size is non-positive
        IOError: If file cannot be saved to the specified location
    """
    if not selfies_string or not selfies_string.strip():
        raise ValueError("SELFIES string cannot be empty")

    if size <= 0:
        raise ValueError(f"Image size must be positive, got: {size}")

    output_path = Path(output_file)
    output_path.parent.mkdir(parents=True, exist_ok=True)

    # Decode SELFIES to SMILES
    try:
        smiles_string = sf.decoder(selfies_string.strip())
    except Exception as e:
        # Chain the original exception so the decoder traceback is preserved
        raise ValueError(
            f"Invalid SELFIES string: '{selfies_string}'. "
            f"SELFIES decoding error: {e}"
        ) from e

    if not smiles_string:
        raise ValueError(
            f"SELFIES string '{selfies_string}' decoded to empty SMILES. "
            f"Please check the SELFIES syntax."
        )

    mol = Chem.MolFromSmiles(smiles_string)
    if mol is None:
        raise ValueError(
            f"Invalid SELFIES string: '{selfies_string}'. "
            f"Decoded to SMILES '{smiles_string}' which is not valid. "
            f"Please check the syntax and try again."
        )

    img = create_molecule_image(mol, selfies_string.strip(), size)

    try:
        img.save(output_file, "PNG", optimize=True)
        print(f"Image successfully saved to: {output_file}")
    except Exception as e:
        raise IOError(f"Failed to save image to '{output_file}': {e}") from e


def create_safe_filename(selfies_string: str) -> str:
    """
    Generate a filesystem-safe filename from a SELFIES string using MD5 hash.

    Args:
        selfies_string: The input SELFIES string

    Returns:
        A safe filename ending with .png
    """
    clean_selfies = selfies_string.strip()
    hasher = hashlib.md5(clean_selfies.encode("utf-8"))
    return f"{hasher.hexdigest()}.png"


def main() -> None:
    """Command-line interface for the SELFIES to PNG converter."""
    parser = argparse.ArgumentParser(
        description="Convert SELFIES strings to publication-quality PNG images with molecular formulas.",
        epilog="""
Examples:
  %(prog)s "[C][C][O]"                                    # Ethanol with auto-generated filename
  %(prog)s "[C][C][Branch1][C][C][C]"                     # Isobutane with auto-generated filename
  %(prog)s "[C][=C][C][=C][C][=C][Ring1][=Branch1]" -o benzene.png   # Benzene with custom filename  
  %(prog)s "[C][C][O]" --size 800                         # Ethanol with larger image size

Common SELFIES patterns:
  [C][C][O]                                  - Ethanol
  [C][C][=Branch1][C][=O][O]                 - Acetic acid
  [C][=C][C][=C][C][=C][Ring1][=Branch1]     - Benzene
  [C][C][Branch1][C][C][C]                   - Isobutane
  [N][C][=Branch1][C][=O][C][=C][C][=C][C][=C][Ring1][=Branch1]  - Benzamide
        """,
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )

    parser.add_argument(
        "selfies",
        type=str,
        help="SELFIES string of the molecule to visualize (e.g., '[C][C][O]' for ethanol)",
    )

    parser.add_argument(
        "-o",
        "--output",
        type=str,
        metavar="FILE",
        help="Output PNG filename. If not provided, generates a unique filename "
        "based on the SELFIES string hash. Extension .png will be added if missing.",
    )

    parser.add_argument(
        "-s",
        "--size",
        type=int,
        default=DEFAULT_IMAGE_SIZE,
        metavar="PIXELS",
        help=f"Square image size in pixels (default: {DEFAULT_IMAGE_SIZE}). "
        f"Typical values: 300 (small), 500 (medium), 800 (large).",
    )

    parser.add_argument(
        "-v",
        "--verbose",
        action="store_true",
        help="Enable verbose output for debugging",
    )

    args = parser.parse_args()

    if args.verbose:
        print(f"Input SELFIES: {args.selfies}")
        print(f"Image size: {args.size}x{args.size} pixels")

    if args.output:
        output_filename = (
            args.output
            if args.output.lower().endswith(".png")
            else f"{args.output}.png"
        )
        if args.verbose:
            print(f"Using custom filename: {output_filename}")
    else:
        output_filename = create_safe_filename(args.selfies)
        if args.verbose:
            print(f"Generated filename: {output_filename}")

    try:
        selfies_to_png(args.selfies, output_filename, args.size)

        if args.verbose:
            # Decode and show the SMILES for reference
            try:
                decoded_smiles = sf.decoder(args.selfies.strip())
                print(f"Decoded SMILES: {decoded_smiles}")
            except Exception:
                pass
            print("Conversion completed successfully!")

    except ValueError as e:
        print(f"Input Error: {e}", file=sys.stderr)
        print("Tip: Check your SELFIES string syntax", file=sys.stderr)
        sys.exit(1)

    except IOError as e:
        print(f"File Error: {e}", file=sys.stderr)
        print("Tip: Check file permissions and disk space", file=sys.stderr)
        sys.exit(2)

    except ImportError as e:
        print(f"Dependencies Error: {e}", file=sys.stderr)
        print(
            "Tip: Install required packages with 'pip install rdkit pillow selfies'",
            file=sys.stderr,
        )
        sys.exit(3)

    except Exception as e:
        print(f"Unexpected Error: {e}", file=sys.stderr)
        print("Tip: Please report this issue if it persists", file=sys.stderr)
        sys.exit(4)


if __name__ == "__main__":
    main()

This script covers the full path from a SELFIES string to a labeled molecular image. It wraps long SELFIES strings at symbol (bracket) boundaries, renders digits in the molecular formula as subscripts, and reports input, file, and dependency errors with distinct exit codes.
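
The wrapping logic is easier to follow outside the drawing code. Here is a minimal sketch of the same greedy, bracket-aware approach, assuming character counts as a stand-in for the pixel widths that `ImageDraw.textlength` returns in the script. The helper names `tokenize_selfies` and `wrap_tokens` are mine, not part of the script, and the regex only keeps bracketed symbols (the script's character loop also keeps any stray text outside brackets).

```python
from __future__ import annotations

import re


def tokenize_selfies(selfies: str) -> list[str]:
    """Split a SELFIES string into its bracketed symbols."""
    return re.findall(r"\[[^\[\]]*\]", selfies)


def wrap_tokens(tokens: list[str], max_chars: int) -> list[str]:
    """Greedily pack tokens into lines of at most max_chars characters.

    Character counts are an assumption made for this sketch; the real
    script measures pixel widths with the loaded font instead.
    """
    lines: list[str] = []
    current = ""
    for token in tokens:
        # Only wrap when the line already holds something, so an
        # oversized token still gets a line of its own.
        if current and len(current) + len(token) > max_chars:
            lines.append(current)
            current = token
        else:
            current += token
    if current:
        lines.append(current)
    return lines


benzene = "[C][=C][C][=C][C][=C][Ring1][=Branch1]"
for line in wrap_tokens(tokenize_selfies(benzene), max_chars=16):
    print(line)
# [C][=C][C][=C]
# [C][=C][Ring1]
# [=Branch1]
```

If you already depend on the selfies package, `sf.split_selfies` yields the same symbols and could replace the regex here.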

Usage Examples

# Basic usage - generates auto-named file
python selfies2png.py "[C][C][O]"

# Custom filename and size
python selfies2png.py "[C][=C][C][=C][C][=C][Ring1][=Branch1]" -o benzene.png --size 800

# Complex molecules with branching
python selfies2png.py "[C][C][=Branch1][C][=O][O][C][=C][C][=C][C][=C][Ring1][=Branch1][C][=Branch1][C][=O][O]" -o aspirin.png
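
The auto-generated name in the first command comes from an MD5 digest of the stripped input string. The sketch below mirrors what `create_safe_filename` does so you can predict the output path from your own code; `hashed_png_name` is a hypothetical stand-in for that function, not an extra API.

```python
import hashlib


def hashed_png_name(selfies_string: str) -> str:
    """Return '<md5 hex digest>.png' for the whitespace-stripped input."""
    clean = selfies_string.strip()
    digest = hashlib.md5(clean.encode("utf-8")).hexdigest()
    return f"{digest}.png"


name = hashed_png_name("[C][C][O]")
print(len(name))                                  # 36 (32 hex chars + '.png')
print(name == hashed_png_name("  [C][C][O]  "))   # True: whitespace is ignored
print(name == hashed_png_name("[C][C][N]"))       # False: different string
```

Because the hash is taken over the string and not the molecule, two different SELFIES spellings of the same structure will land in two different files.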

The script handles the complexity of SELFIES parsing, molecular coordinate generation, and image rendering, making it straightforward to create visualizations of any SELFIES-encoded molecule.
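
One rendering detail can be checked without drawing anything: how the formula legend decides what to subscript. The script as written subscripts every digit. The sketch below makes the same digit-versus-letter split, and additionally separates a trailing charge so it could be drawn as a superscript instead. The `formula_runs` helper and its style labels are hypothetical, and the charge format (a trailing `-`, `+`, or `+2`) is an assumption about how the formula string is written for ions.

```python
import re
from typing import List, Tuple


def formula_runs(formula: str) -> List[Tuple[str, str]]:
    """Split a formula into (text, style) runs.

    Style is 'regular', 'sub', or 'super'. A trailing charge such as
    '-' or '+2' is assumed to sit at the very end of the string.
    """
    match = re.fullmatch(r"(.*?)([+-]\d*)?", formula)
    body, charge = match.group(1), match.group(2)

    runs: List[Tuple[str, str]] = []
    # Alternate between digit groups (atom counts) and everything else.
    for piece in re.findall(r"\d+|\D+", body):
        runs.append((piece, "sub" if piece.isdigit() else "regular"))
    if charge:
        runs.append((charge, "super"))
    return runs


print(formula_runs("C9H8O4"))
# [('C', 'regular'), ('9', 'sub'), ('H', 'regular'), ('8', 'sub'), ('O', 'regular'), ('4', 'sub')]
print(formula_runs("Ca+2"))
# [('Ca', 'regular'), ('+2', 'super')]
```

Grouping digits into runs also means a count like the `10` in `C10H22` is measured and drawn once, not character by character.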
