AtomWorks is an open-source platform that maximizes research velocity for biomolecular modeling tasks. Much like TorchData enables rapid prototyping in the vision and language domains, AtomWorks aims to accelerate development and experimentation in biomolecular modeling.
⚠️ Notice: We are currently finalizing some cleanup work within our repositories. Please expect the APIs (e.g., function and class names, inputs and outputs) to stabilize within the next two weeks. Thank you for your patience!
If you're looking for the models themselves (e.g., RF3, MPNN) that integrate with AtomWorks, rather than the underlying framework, check out ModelForge.
AtomWorks is composed of two symbiotic libraries:
- atomworks.io: A universal Python toolkit for parsing, cleaning, manipulating, and converting biological data (structures, sequences, small molecules). Built on the biotite API, it seamlessly loads and exports between standard formats like mmCIF, PDB, FASTA, SMILES, MOL, and more.
- atomworks.ml: Advanced dataset featurization and sampling for deep learning workflows, using atomworks.io as its structural backbone. We provide a comprehensive, pre-built, and well-tested set of Transforms for common tasks that can be easily composed into full deep-learning pipelines; users may also create their own Transforms for custom operations.
For more detail on the motivation for and applications of AtomWorks, please see the preprint.
AtomWorks is built atop biotite: we are grateful to the Biotite developers for maintaining such a high-quality and flexible toolkit, and hope that our package will prove a helpful addition to the broader biotite community.
A general-purpose Python toolkit for cleaning up, standardizing, and working with biomolecular files - based on biotite
atomworks.io lets you:
- Parse, convert, and clean any common biological file (structure or sequence). For example: identifying and removing leaving groups, correcting bond order after nucleophilic addition, fixing charges, parsing covalent geometries, and appropriately treating structures with multiple occupancies and ligands at symmetry centers
- Transform all data to a consistent AtomArray representation for further analysis or machine learning applications, regardless of initial source
- Model missing atoms (those implied by the sequence but not represented in the coordinates) and initialize entity- and instance-level annotations (see the glossary for more detail on our composable naming conventions)
We have found atomworks.io to be useful to a general bioinformatics and protein design audience; in many cases, atomworks.io can replace bespoke scripts and manual curation, enabling researchers to spend more time testing hypotheses and less time juggling dozens of tools and dependencies.
Modular, component-based library for dataset featurization within biomolecular deep learning workflows
atomworks.ml provides:
- A library of pre-built, well-tested Transforms that can be slotted into novel pipelines
- An extensible framework, integrated with atomworks.io, for writing Transforms for arbitrary use cases
- Scripts to pre-process the PDB or other databases into dataframes appropriate for network training
- Efficient sampling and batching utilities for training machine learning models
Within the AtomWorks paradigm, the output of each Transform is not an opaque dictionary of model-specific tensors but an updated version of our atom-level structural representation (Biotite's AtomArray). Operations within – and between – pipelines thus maintain a common vocabulary of inputs and outputs.
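To make this concrete, here is a minimal sketch of the idea. The Transform, RemoveWaters, and Compose classes below are hypothetical stand-ins for illustration, not the actual atomworks.ml API; see the Transform catalog in the documentation for the real building blocks.

# Illustrative sketch of the Transform paradigm -- NOT the actual
# atomworks.ml API. Each step consumes and returns an AtomArray-centric
# state, so pipelines share a common vocabulary of inputs and outputs.
from biotite.structure import AtomArray

class Transform:
    """Hypothetical base class for illustration."""
    def __call__(self, atoms: AtomArray) -> AtomArray:
        raise NotImplementedError

class RemoveWaters(Transform):
    """Hypothetical transform: drop water residues, return a new AtomArray."""
    def __call__(self, atoms: AtomArray) -> AtomArray:
        return atoms[atoms.res_name != "HOH"]

class Compose(Transform):
    """Hypothetical composition: chain transforms over the shared representation."""
    def __init__(self, steps):
        self.steps = steps
    def __call__(self, atoms: AtomArray) -> AtomArray:
        for step in self.steps:
            atoms = step(atoms)  # each step reads and writes the same representation
        return atoms

To get started, install AtomWorks with pip: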
pip install atomworks # base installation without torch (atomworks.io only)
pip install "atomworks[ml]" # with torch and ML dependencies (atomworks.io plus atomworks.ml)
pip install "atomworks[dev]" # with development dependencies
pip install "atomworks[ml,dev]" # with all dependencies
If you are using uv for package management, you can install atomworks with:
uv pip install "atomworks[ml,openbabel,dev]"
For more advanced setup options (including how to run workflows via Apptainer containers), see the full documentation.
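Once installed, a quick smoke test confirms the environment (a minimal sketch; drop the atomworks.ml import for a base, io-only installation):

# Import smoke test: atomworks.ml is only available with the [ml] extra
import atomworks.io
import atomworks.ml  # comment out for a base (io-only) install

print("AtomWorks imported successfully")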
- Use atomworks.io when you:
  - Need to parse/clean/convert between biological file formats (mmCIF, PDB, FASTA, etc.)
  - Want a unified structural representation to plug into any downstream analysis or modeling
  - Need structural operations like adding missing atoms, filtering ligands/solvents, or assembly generation
- Use atomworks.ml when you:
  - Need to featurize entire datasets for deep learning
  - Want ready-made sampling and batching utilities for training pipelines
  - Already use atomworks.io and want a seamless bridge to ML-ready feature engineering
To parse a PDB file (parse = load, clean, and annotate relevant metadata such as entities, molecules, etc.), use the parse function:
from biotite.structure import AtomArrayStack  # for the type annotations below

from atomworks.io.parser import parse

result = parse(filename="3nez.cif.gz")

asym_unit: AtomArrayStack = result["asym_unit"]
assemblies: dict[str, AtomArrayStack] = result["assemblies"]

for chain_id, info in result["chain_info"].items():
    print(chain_id, info["sequence"])
The output of parse includes:
- chain_info — Sequences/metadata for each chain
- ligand_info — Ligand annotations & metrics
- asym_unit — Structure (an AtomArrayStack)
- assemblies — Built biological assemblies (each its own AtomArrayStack)
- metadata — Experimental and source information
See usage examples for more details.
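As a small follow-on to the example above (the top-level keys are documented in the list; exact sub-fields vary by entry), you can inspect the parsed outputs directly:

# Inspect the parse() outputs from the example above
for assembly_id, assembly in result["assemblies"].items():
    # array_length() = number of atoms; stack_depth() = number of models
    print(f"Assembly {assembly_id}: {assembly.array_length()} atoms, {assembly.stack_depth()} model(s)")

print(result["metadata"])  # experimental and source information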
If you just want to load a file, you can use the load_any function:
from biotite.structure import AtomArray
from atomworks.io.utils.io_utils import load_any

# model=1 loads the first model, rather than a stack of all models in the file
atom_array: AtomArray = load_any("3nez.cif.gz", model=1)
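The returned AtomArray supports all standard biotite operations, e.g. boolean indexing over annotation arrays (shown here for orientation; this is plain biotite, not atomworks-specific):

# Annotation arrays support boolean indexing, so filtering is a one-liner
calphas = atom_array[atom_array.atom_name == "CA"]
print(f"{len(calphas)} C-alpha atoms")

polymer_only = atom_array[~atom_array.hetero]  # drop HETATM records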
⚠️ Disclaimer: Documentation for this section is currently under construction. Please check back soon for updates!
Step 1 — Mirror the PDB (mmCIFs)
To train on the PDB, you first need access to the structures from the PDB. We use mmCIF files, the strongly recommended format for training.
For convenience, we provide a command to mirror the PDB:
# Full mirror (~100 GB). This creates a carbon copy of the PDB, dated today,
# in the specified directory, downloading the .mmcif files in the same sharding
# pattern as the original PDB and keeping them gzipped for efficiency.
atomworks pdb sync /path/to/pdb_mirror

# If, for some reason, you only want to download specific IDs, the CLI also supports this:
# atomworks pdb sync /path/to/pdb_mirror --pdb-id 1A0I --pdb-id 7XYZ

# Or download the PDB IDs listed in a file, one ID (e.g. '6lyz') per line:
# atomworks pdb sync /path/to/pdb_mirror --pdb-ids-file /path/to/ids.txt
Once the mirror is created, set the environment variable:
export PDB_MIRROR_PATH=/path/to/pdb_mirror
To make this permanent, you can add it to a .env file in your home directory. Here is an example of a .env file structure that you can copy, rename to .env, and edit with your own paths.
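As a quick sanity check (a minimal sketch using only the standard library; it assumes the mirrored files are stored gzipped as *.cif.gz, per the sync command above), you can confirm the mirror is visible from Python:

# Confirm PDB_MIRROR_PATH is set and the mirror is populated.
# Assumes gzipped mmCIF files (*.cif.gz), as produced by `atomworks pdb sync`.
import os
from pathlib import Path

mirror = Path(os.environ["PDB_MIRROR_PATH"])
n_cifs = sum(1 for _ in mirror.rglob("*.cif.gz"))
print(f"Found {n_cifs} gzipped mmCIF files under {mirror}")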
Step 2 — Get PDB metadata (PN units and interfaces)

To calculate sampling probabilities and filter examples for splits, we pre-process the PDB with metadata for each PDB entry. To save you the work, we provide pre-computed metadata (dated July 15, 2025) for download:
atomworks setup metadata /path/to/metadata # This will download the metadata (as .tar.gz) and extract it to the specified directory.
This produces parquet files at:
- /path/to/metadata/pn_units_df.parquet — Contains metadata for each PN unit in the PDB. The term "PN unit" is shorthand for polymer XOR non-polymer unit and behaves for almost all purposes like the chain in a PDB file. The only difference is that a ligand composed of multiple covalently bonded components is considered a single PN unit (whilst it would be multiple chains in a PDB file). Effectively, this .parquet is a large table of all individual chains, ligands, etc. in the PDB (to be precise, one entry per PN unit), with helpful metadata for filtering and sampling.
- /path/to/metadata/interfaces_df.parquet — Contains metadata for each interface in the PDB. This .parquet is a large table of all binary interfaces in the PDB, listing each interface as a (pn_unit_1, pn_unit_2) pair, with helpful metadata for filtering and sampling.

Alternatively, you can generate fresher metadata yourself (scripts will be uploaded in the coming weeks).
Step 3 — Configure an AF3-style dataset (example: train only on polypeptides)

Next, we use the metadata to configure a dataset to sample from. This includes, e.g., the training cut-off, the filters, and the transforms to apply. Here's a simple example that:
- Filters to D-polypeptide and L-polypeptide chains only (POLYPEPTIDE_D and POLYPEPTIDE_L); to include additional chain types, replace the lists with the appropriate IDs (see the mapping in the config comments below).
- Excludes ligands in the AF3 list of excluded ligands, available at atomworks.io.constants.AF3_EXCLUDED_LIGANDS_REGEX.
# NOTE: The below is a Hydra config; the _target_ fields are Hydra's syntax for instantiating a class.
# You can use this without Hydra, but you will then need to provide the corresponding arguments
# to the _target_ objects directly.

# Chain type IDs used below (from atomworks.enums.ChainType):
# 0=CyclicPseudoPeptide, 1=OtherPolymer, 2=PeptideNucleicAcid,
# 3=DNA, 4=DNA_RNA_HYBRID, 5=POLYPEPTIDE_D, 6=POLYPEPTIDE_L, 7=RNA,
# 8=NON_POLYMER, 9=WATER, 10=BRANCHED, 11=MACROLIDE

af3_pdb_dataset:
  _target_: atomworks.ml.datasets.datasets.ConcatDatasetWithID
  datasets:
    # Single PN units
    - _target_: atomworks.ml.datasets.datasets.StructuralDatasetWrapper
      dataset_parser:
        _target_: atomworks.ml.datasets.parsers.PNUnitsDFParser
      transform:
        _target_: atomworks.ml.pipelines.af3.build_af3_transform_pipeline
        is_inference: false
        n_recycles: 5  # Subsample 5 random sets from the MSA for each example
        crop_size: 256
        crop_contiguous_probability: 0.3333333333333333
        crop_spatial_probability: 0.6666666666666666
        diffusion_batch_size: 32
        # Optional templates (if available)
        template_lookup_path: ${paths.shared}/template_lookup.csv
        template_base_dir: ${paths.shared}/template
        # Optional MSAs (see Step 4)
        # protein_msa_dirs:
        #   - { dir: /path/to/msa, extension: .a3m.gz, directory_depth: 2 }
        # rna_msa_dirs:
        #   - { dir: /path/to/msa, extension: .afa, directory_depth: 0 }
      dataset:
        _target_: atomworks.ml.datasets.datasets.PandasDataset
        name: pn_units
        id_column: example_id
        data: /path/to/metadata/pn_units_df.parquet
        filters:
          - "deposition_date < '2022-01-01'"
          - "resolution < 5.0 and ~method.str.contains('NMR')"
          - "num_polymer_pn_units <= 20"
          - "cluster.notnull()"
          - "method in ['X-RAY_DIFFRACTION', 'ELECTRON_MICROSCOPY']"
          # Train only on polypeptides (D and L):
          - "q_pn_unit_type in [5, 6]"  # 5 = POLYPEPTIDE_D, 6 = POLYPEPTIDE_L
          # Exclude ligands from the AF3 excluded set:
          - "~(q_pn_unit_non_polymer_res_names.notnull() and q_pn_unit_non_polymer_res_names.str.contains('${af3_excluded_ligands_regex}', regex=True))"
        columns_to_load: null
      save_failed_examples_to_dir: null
    # Binary interfaces
    - _target_: atomworks.ml.datasets.datasets.StructuralDatasetWrapper
      dataset_parser:
        _target_: atomworks.ml.datasets.parsers.InterfacesDFParser
      transform:
        _target_: atomworks.ml.pipelines.af3.build_af3_transform_pipeline
        is_inference: false
        n_recycles: 5
        crop_size: 256
        crop_spatial_probability: 1.0
        crop_contiguous_probability: 0.0
        diffusion_batch_size: 32
        template_lookup_path: ${paths.shared}/template_lookup.csv
        template_base_dir: ${paths.shared}/template
        # Optional MSAs (see Step 4)
        # protein_msa_dirs:
        #   - { dir: /path/to/msa, extension: .a3m.gz, directory_depth: 2 }
        # rna_msa_dirs:
        #   - { dir: /path/to/msa, extension: .afa, directory_depth: 0 }
      dataset:
        _target_: atomworks.ml.datasets.datasets.PandasDataset
        name: interfaces
        id_column: example_id
        data: /path/to/metadata/interfaces_df.parquet
        filters:
          - "deposition_date < '2022-01-01'"
          - "resolution < 5.0 and ~method.str.contains('NMR')"
          - "num_polymer_pn_units <= 20"
          - "cluster.notnull()"
          - "method in ['X-RAY_DIFFRACTION', 'ELECTRON_MICROSCOPY']"
          # Train only on polypeptide interfaces (D and L):
          - "pn_unit_1_type in [5, 6]"  # 5 = POLYPEPTIDE_D, 6 = POLYPEPTIDE_L
          - "pn_unit_2_type in [5, 6]"  # 5 = POLYPEPTIDE_D, 6 = POLYPEPTIDE_L
          - "~(pn_unit_1_non_polymer_res_names.notnull() and pn_unit_1_non_polymer_res_names.str.contains('${af3_excluded_ligands_regex}', regex=True))"
          - "~(pn_unit_2_non_polymer_res_names.notnull() and pn_unit_2_non_polymer_res_names.str.contains('${af3_excluded_ligands_regex}', regex=True))"
        columns_to_load: null
      cif_parser_args:
        cache_dir: null
      save_failed_examples_to_dir: null
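To turn this config into a live dataset object, you can instantiate it with Hydra's utilities (a minimal sketch; it assumes the YAML above is saved as af3_pdb_dataset.yaml and that the ${paths.shared} and ${af3_excluded_ligands_regex} interpolations resolve, e.g. because the file is composed into a larger Hydra config):

from hydra.utils import instantiate
from omegaconf import OmegaConf

# Load the YAML above and recursively instantiate the _target_ classes
cfg = OmegaConf.load("af3_pdb_dataset.yaml")
dataset = instantiate(cfg.af3_pdb_dataset)

print(len(dataset), "training examples")
example = dataset[0]  # a featurized example, ready for a model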
Step 4 — MSAs (optional)

We are working on making the MSAs publicly accessible, but the large storage requirements (multiple TB) make this nontrivial. If your organization has the interest and capacity to host the MSAs, please contact us. In the meantime, if you have MSAs (e.g., from OpenProteinSet), you can configure the pipeline to use them like so:
protein_msa_dirs:
  - { dir: /path/to/msa, extension: .a3m.gz, directory_depth: 2 }
rna_msa_dirs:
  - { dir: /path/to/msa, extension: .afa, directory_depth: 0 }
Alternatively, you can run the pipeline without MSAs.
Step 5 — Train a model

You now have a full-fledged dataset that you can use to train models! If you want to try this out without downloading the whole PDB and the metadata, you can instead run our tests, which include a mini mock-up of the pipeline with real PDB files, metadata, distillation data, templates, and MSAs for the AF3 example. You can download all the relevant test data via the atomworks CLI:
atomworks setup tests # This will download the test pack to `tests/data` and unpack it there (~500 MB)
You will now have a mini PDB at tests/data/pdb and a mini custom CCD at tests/data/ccd. MSA and template data are in tests/data/shared, and the distillation data and metadata are in data/ml/af2_distillation, data/ml/pdb_pn_units, and data/ml/pdb_interfaces. An example dataset config that uses all of these is available here.
To run the tests for the various datasets, you can run the following command:
# Make sure you have the correct environment activated, and set your paths correctly in the .env file / shell environment variables (see points above)
pytest tests/ml/test_data_loading_pipelines.py
We welcome improvements!
Please see the full documentation for contribution guidelines.
If you make use of AtomWorks in your research, please cite:
N. Corley*, S. Mathis*, R. Krishna*, M. S. Bauer, T. R. Thompson, W. Ahern, M. W. Kazman, R. I. Brent, K. Didi, A. Kubaney, L. McHugh, A. Nagle, A. Favor, M. Kshirsagar, P. Sturmfels, Y. Li, J. Butcher, B. Qiang, L. L. Schaaf, R. Mitra, K. Campbell, O. Zhang, R. Weissman, I. R. Humphreys, Q. Cong, J. Funk, S. Sonthalia, P. Lio, D. Baker, F. DiMaio, "Accelerating Biomolecular Modeling with AtomWorks and RF3," bioRxiv, August 2025. doi: 10.1101/2025.08.14.670328
If you use BibTeX, here's the Google Scholar-formatted citation:
@article{corley2025accelerating,
  title={Accelerating Biomolecular Modeling with AtomWorks and RF3},
  author={Corley, Nathaniel and Mathis, Simon and Krishna, Rohith and Bauer, Magnus S and Thompson, Tuscan R and Ahern, Woody and Kazman, Maxwell W and Brent, Rafael I and Didi, Kieran and Kubaney, Andrew and others},
  journal={bioRxiv},
  pages={2025--08},
  year={2025},
  publisher={Cold Spring Harbor Laboratory}
}