Image-to-sequence models reframe OCSR as an image captioning problem: an encoder (typically a CNN or Vision Transformer) extracts visual features, and an autoregressive decoder generates a string representation of the molecule, most commonly SMILES, InChI, or SELFIES. DECIMER pioneered this approach with CNN-based encoders trained on millions of synthetic images; subsequent work explored Transformer encoders (SwinOCSR, ICMDT), improved training objectives (MolSight’s reinforcement learning for stereochemistry), and alternative string targets (RFL’s ring-free language). These models benefit from large-scale synthetic data and generally handle diverse drawing styles better than rule-based predecessors, though they can hallucinate tokens for structures outside their training distribution.
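To make the autoregressive decoding step concrete, the sketch below shows the token-by-token generation loop these models share. It is a toy illustration only: real systems such as DECIMER use a CNN or Transformer encoder and a learned decoder whose next-token scores come from attention over visual features, whereas here a hand-written score table (keyed on the decoded prefix) stands in for the trained network, and the tiny SMILES vocabulary is a placeholder.

```python
# Toy sketch of greedy autoregressive SMILES decoding, illustrating
# the generation loop used by image-to-sequence OCSR models.
# The SCORES table is a hand-written stand-in for a learned decoder;
# a trained model would compute these probabilities from the
# encoder's visual features at every step.

# Placeholder next-token distributions, conditioned on the SMILES
# prefix decoded so far. Values mimic softmax probabilities.
SCORES = {
    "":    {"C": 0.9, "O": 0.1},
    "C":   {"C": 0.6, "<eos>": 0.4},
    "CC":  {"O": 0.7, "<eos>": 0.3},
    "CCO": {"<eos>": 0.9, "C": 0.1},
}

def greedy_decode(max_len=16):
    """Generate a SMILES string one token at a time until <eos>."""
    prefix = ""
    for _ in range(max_len):
        # Unknown prefixes fall back to ending the sequence.
        dist = SCORES.get(prefix, {"<eos>": 1.0})
        nxt = max(dist, key=dist.get)  # greedy: take the argmax token
        if nxt == "<eos>":
            break
        prefix += nxt
    return prefix

print(greedy_decode())  # decodes "CCO" (ethanol) under this toy table
```

Replacing the greedy argmax with beam search over the same per-step distributions is the usual refinement in practice, since a single wrong token early in the sequence can otherwise derail the whole SMILES string.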