ZINC-22: Multi-Billion Molecule Database
Dataset Details
Authors: Benjamin I. Tingle, Khanh G. Tang, Mar Castanon, John J. Gutierrez, Munkhzul Khurelbaatar, Chinzorig Dandarchuluun, Yurii S. Moroz, John J. Irwin
Paper Title: ZINC-22─A Free Multi-Billion-Scale Database of Tangible Compounds for Ligand Discovery
Institutions: University of California, San Francisco; Taras Shevchenko National University of Kyiv; Chemspace LLC
Published In: Journal of Chemical Information and Modeling
Category: Computational Chemistry
Format: SMILES, SDF, mol2, db2, vendor catalogs
Size: Molecules 2D: 37,000,000,000+; Molecules 3D: 4,500,000,000+; Scaffolds: 680,000,000+
Date: September 2025
Year: 2023
Figure: ZINC-22’s Tranche Browser, displaying the distribution of 96.4 billion molecules organized by heavy atom count and lipophilicity (LogP).

Key Contribution

ZINC-22 addresses the critical infrastructure challenges of managing multi-billion-scale libraries of make-on-demand chemical compounds through scalable database architecture, intuitive web interfaces, and cloud distribution systems that enable modern virtual screening at unprecedented scale.

Dataset Information

Format

SMILES, SDF, mol2, db2, vendor catalogs

Size

Type            Count
Molecules 2D    37,000,000,000+
Molecules 3D    4,500,000,000+
Scaffolds       680,000,000+

Dataset Examples

ZINC-22’s 2D Tranche Browser showing the organization of 96.4 billion molecules by physicochemical properties

Dataset Subsets

Subset            Count      Description
2D Database       37B+       Complete 2D chemical structures from make-on-demand catalogs
3D Database       4.5B+      Ready-to-dock 3D conformations with pre-calculated properties
Custom Tranches   Variable   User-selected molecular subsets via Tranche Browser

Strengths

  • Unprecedented scale with 37+ billion purchasable compounds
  • Integrated CartBlanche GUI makes complex searches accessible
  • Scalable federated database architecture supports trillion-molecule growth
  • Cloud distribution via AWS, Oracle OCI, and UCSF servers
  • Advanced search tools (SmallWorld, Arthor) for similarity and substructure
  • Organized tranche system for targeted chemical space selection
  • Free and open access to entire database
  • Parallel public/private servers protect undisclosed catalogs
  • Chemical diversity maintained despite massive scale growth
  • Ready-to-dock 3D conformations with pre-calculated properties

Limitations

  • Data transfer bottlenecks due to petabyte-scale storage requirements
  • Interactive search limited to 20,000 hits for server performance
  • Individual 3D molecule downloads no longer available directly
  • Difficulty removing discontinued vendor molecules from federated structure
  • Cloud-based computation nearly mandatory due to file sizes
  • Asynchronous service required for large queries and 3D rebuilding

Technical Notes

Technical Implementation

Database Architecture

ZINC-22 uses 174 independent PostgreSQL databases in a sharded architecture to manage billions of molecules. Molecules are organized into tranches by heavy atom count, lipophilicity (LogP), charge, and file format, so that tranches can be processed in parallel.
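
The tranche-and-shard idea can be illustrated with a minimal sketch. Only the shard count (174) and the four tranche properties come from the text; the label format, the LogP bin width, and the hash-based shard mapping are assumptions made for illustration, and the function names `tranche_key` and `shard_for` are hypothetical rather than part of ZINC-22.

```python
import zlib

N_SHARDS = 174  # database count stated in the text; the mapping below is an assumption


def tranche_key(heavy_atoms: int, logp: float, charge: int, fmt: str) -> str:
    """Build a hypothetical tranche label from the four properties named in the text."""
    # Bin LogP in steps of 0.1; the bin width is an illustrative choice.
    logp_bin = round(logp * 10)
    sign = "M" if logp_bin < 0 else "P"  # M = minus, P = plus
    return f"H{heavy_atoms:02d}{sign}{abs(logp_bin):03d}-Q{charge:+d}-{fmt}"


def shard_for(key: str, n_shards: int = N_SHARDS) -> int:
    """Map a tranche label to a shard index with a stable hash (not ZINC-22's actual scheme)."""
    return zlib.crc32(key.encode("utf-8")) % n_shards


key = tranche_key(heavy_atoms=24, logp=3.14, charge=0, fmt="db2")
print(key, "-> shard", shard_for(key))
```

Because every molecule in a tranche shares one label, a whole tranche lands on a single shard in this sketch, which is what would let tranches be loaded and queried independently.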

Search Infrastructure

Integrates SmallWorld for ultra-fast similarity searching and Arthor for substructure searching, wrapped in the CartBlanche web interface for user-friendly access to billion-scale chemical libraries.

3D Database & Distribution

The 3D database is reorganized into smaller federated databases distributed across Amazon AWS, Oracle OCI, and UCSF servers. Nearly a petabyte of storage holds the 4.5 billion ready-to-dock conformations.
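
A back-of-the-envelope calculation shows why storage and transfer dominate at this scale. The sketch below treats "nearly a petabyte" as exactly 10^15 bytes, so the results are rough upper bounds; the link speeds are hypothetical and ignore protocol overhead.

```python
PETABYTE = 10**15               # bytes, decimal definition (assumed upper bound)
N_MOLECULES_3D = 4_500_000_000  # 3D molecule count stated in the text

# Average storage per 3D entry: about 222 kB under the assumptions above.
bytes_per_molecule = PETABYTE / N_MOLECULES_3D


def transfer_days(total_bytes: float, link_gbit_per_s: float) -> float:
    """Days needed to move total_bytes over a link running at full rated speed."""
    seconds = total_bytes * 8 / (link_gbit_per_s * 1e9)
    return seconds / 86_400


print(f"{bytes_per_molecule / 1e3:.0f} kB per molecule")
print(f"{transfer_days(PETABYTE, 1):.1f} days at 1 Gbit/s")
print(f"{transfer_days(PETABYTE, 10):.1f} days at 10 Gbit/s")
```

Even a sustained 10 Gbit/s link would need more than a week for the full 3D set under these assumptions, which is consistent with distributing the data through cloud providers so that computation can run next to it.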

User Interface Features

The Tranche Browser enables graphical selection of molecular subsets based on physicochemical properties. A shopping cart system supports collecting molecules, requesting price quotes, and purchasing from vendors.
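
The selection the Tranche Browser performs graphically can be thought of as a range filter over a grid of tranche counts. The sketch below is an illustration only: the grid layout, the `select_tranches` helper, and the counts are invented for the example and are not ZINC-22 data or its API.

```python
from typing import Dict, Tuple

# (heavy atom count, LogP bin) -> molecule count; toy numbers, not ZINC-22 data
TrancheGrid = Dict[Tuple[int, float], int]


def select_tranches(
    grid: TrancheGrid,
    hac_range: Tuple[int, int],
    logp_range: Tuple[float, float],
) -> TrancheGrid:
    """Keep tranches whose heavy atom count and LogP bin fall inside both inclusive ranges."""
    lo_h, hi_h = hac_range
    lo_p, hi_p = logp_range
    return {
        key: count
        for key, count in grid.items()
        if lo_h <= key[0] <= hi_h and lo_p <= key[1] <= hi_p
    }


toy_grid: TrancheGrid = {
    (20, 2.0): 1_000,
    (21, 2.5): 2_500,
    (25, 3.0): 4_000,
    (30, 4.5): 8_000,
}
picked = select_tranches(toy_grid, hac_range=(20, 25), logp_range=(2.0, 3.0))
total = sum(picked.values())
print(sorted(picked), total)
```

Summing the counts of the selected cells gives the size of the subset before anything is downloaded, which matters when a single tranche can hold a very large number of molecules.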

Database Content & Access

Chemical Diversity

Analysis using Bemis-Murcko scaffolds shows chemical diversity grows logarithmically with database size, with most new scaffolds appearing in higher heavy atom count compounds. 680M+ unique scaffolds maintained.
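
A diversity curve of this kind can be built by counting distinct scaffolds as molecules are added. The sketch below assumes the Bemis-Murcko scaffold of each molecule has already been computed elsewhere (for example with a cheminformatics toolkit) and is supplied as a SMILES string; the `scaffold_growth` helper and the toy input are hypothetical.

```python
from typing import Iterable, List, Tuple


def scaffold_growth(scaffolds: Iterable[str], checkpoint: int) -> List[Tuple[int, int]]:
    """Return (molecules seen, unique scaffolds seen) at every `checkpoint` molecules."""
    seen = set()
    curve = []
    for i, scaffold in enumerate(scaffolds, start=1):
        seen.add(scaffold)
        if i % checkpoint == 0:
            curve.append((i, len(seen)))
    return curve


# Toy stream of precomputed scaffold SMILES, one per molecule (not real ZINC-22 data).
toy = ["c1ccccc1", "c1ccccc1", "c1ccncc1", "c1ccccc1", "C1CCCCC1", "c1ccncc1"]
curve = scaffold_growth(toy, checkpoint=2)
print(curve)
```

Plotting the second value against the logarithm of the first would show whether scaffold counts grow roughly linearly in log(database size), which is what logarithmic growth means here.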

Vendor Integration

Focuses on large make-on-demand catalogs from major vendors including Enamine, WuXi, and Mcule. Emphasizes commercially available compounds over theoretical chemical space.

Access Control

Parallel public and private search servers protect patentability of undisclosed chemical matter while enabling comprehensive searching after user authentication. Supports both academic and commercial needs.

Related Datasets

Dataset    Relationship     Link
ZINC-15    Predecessor      N/A
ChEMBL     Complementary    N/A
PubChem    Complementary    N/A