Transformers

A colored molecule with annotations, representing the diverse drawing styles found in scientific papers that OCSR models must handle.

MolParser-7M & WildMol: Large-Scale OCSR Datasets

The MolParser project introduces two key datasets: MolParser-7M, the largest training dataset for Optical Chemical Structure Recognition (OCSR) with 7.7M pairs of images and E-SMILES strings, and WildMol, a new 20k-sample benchmark for evaluating models on challenging real-world data. The training data uniquely combines millions of diverse synthetic molecules with 400,000 manually annotated in-the-wild samples.

Optical Chemical Structure Recognition

MolParser: End-to-End Molecular Structure Recognition

A 2025 end-to-end OCSR system addressing both technical and data challenges, introducing MolParser-7M (7M+ image-text pairs) and MolDet (YOLO-based detector) for extracting and recognizing molecular structures from real-world documents with diverse quality and styles.

Molecular Simulation

Adaptive grid merging visualization for benzene molecule showing multi-resolution spatial discretization

Beyond Atoms: 3D Space Modeling for Molecular Pretraining

ICML 2025 paper introducing SpaceFormer, a Transformer architecture that challenges the atom-centric paradigm by modeling the continuous 3D space surrounding molecules using adaptive multi-resolution grids, ranking first in 10 of 15 molecular property prediction tasks.

Document Processing

Stream accuracy versus relative throughput for Mistral-7B and XGBoost models

LLMs for Insurance Document Automation

We explore LLM applications for page stream segmentation in insurance document processing, demonstrating that parameter-efficient fine-tuning achieves strong accuracy but revealing significant calibration challenges that limit deployment confidence.

Document Processing

Diagram showing page stream segmentation workflow: an input stream of pages is processed through binary classification of page pairs to predict document breaks, producing segmented output documents

LLMs for Page Stream Segmentation

We create TabMe++, an enhanced page stream segmentation benchmark with commercial-grade OCR, and show that parameter-efficiently fine-tuned decoder-based LLMs like Mistral-7B achieve 80% straight-through processing rates, outperforming encoder-based models.

Natural Language Processing

Word vector illustration showing text classification and NLP concepts

Sarcasm Detection with Transformers: A Cautionary Tale

What happens when you achieve 99.8% accuracy on sarcasm detection? You might have accidentally built a domain classifier. A cautionary ML tale about dataset bias.

Computational Social Science

Diagram of the Universal Message schema showing fields like ID, Text, Author, and Reply Sets that normalize data across platforms

Look, Don't Tweet: Unified Data Models for Social NLP

Bachelor’s thesis introducing PyConversations, an open-source library that normalizes over 308 million posts from Twitter, Reddit, Facebook, and 4chan into a unified data model for cross-platform social media research.

Computational Social Science

PyConversations: Social Media Conversational Analysis

Research project that investigated how different NLP models perform on social media data, finding that domain-specific approaches often outperform large pre-trained models. Includes PyConversations, a Python module for analyzing conversations across social media platforms.

Natural Language Processing

A nonsensical trigger sequence 'WTC theoriesclimate Flat Hubbard Principle' is fed into GPT-2, which then generates Flat Earth conspiracy text

GPT-2 Susceptibility to Universal Adversarial Triggers

We demonstrate that universal adversarial triggers can control both the topic and stance of GPT-2’s generated text, revealing security vulnerabilities in deployed language models and proposing constructive applications for bias auditing.