
Introduction#

AlphaFold 3 is available on the Maestro cluster under the module name alphafold/3.0.0. You can find its source code here.
AlphaFold 3 databases are hosted on Maestro (environment variable ALPHAFOLD3_DATA) and managed via Biomaj for automatic updates.
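
After loading the module you can check where the databases live; the path below is the default shown in the wrapper's help output further down this page:

Code Block (bash)

$ module load alphafold/3.0.0
$ echo "$ALPHAFOLD3_DATA"
/opt/gensoft/data/alphafold/3.0.0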

Due to the Google DeepMind AlphaFold 3 usage policy, you must request your own copy of the AlphaFold 3 model parameters (weights) using this form. Do not share the models with others.

The input#

AlphaFold 3 uses a custom JSON input format, which allows:

  • Specifying protein, RNA, and DNA chains, including modified residues.
  • Specifying custom multiple sequence alignment (MSA) for protein and RNA chains.
  • Specifying custom structural templates for protein chains, in mmCIF format.
  • Specifying ligands using Chemical Component Dictionary (CCD) codes.
  • Specifying ligands using SMILES.
  • Specifying ligands by defining them using the CCD mmCIF format and supplying them via the user-provided CCD.
  • Specifying covalent bonds between entities.
  • Specifying multiple random seeds.

Here is an example of a single input JSON file:

Code Block (json)

{
  "name": "2PV7",
  "sequences": [
    {
      "protein": {
        "id": ["A", "B"],
        "sequence": "GMRESYANENQFGFKTINSDIHKIVIVGGYGKLGGLFARYLRASGYPISILDREDWAVAESILANADVVIVSVPINLTLETIERLKPYLTENMLLADLTSVKREPLAKMLEVHTGAVLGLHPMFGADIASMAKQVVVRCDGRFPERYEWLLEQIQIWGAKIYQTNATEHDHNMTYIQALRHFSTFANGLHLSKQPINLANLLALSSPIYRLELAMIGRLFAQDAELYADIIMDKSENLAVIETLKQTYDEALTFFENNDRQGFIDAFHKVRDWFGDYSEQFLKESRQLLQQANDLKQG"
      }
    }
  ],
  "modelSeeds": [1],
  "dialect": "alphafold3",
  "version": 1
}

For more details on the structure of the input JSON format, refer to JSON input format.
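
As an illustration of the richer options listed above, here is a sketch of an input combining a protein chain, a ligand given by its CCD code, a ligand given by a SMILES string, and two seeds. The sequence, the ATP code, and the acetate SMILES are illustrative assumptions, not values validated on Maestro:

Code Block (bash)

# write a hypothetical multi-entity input file (CCD ligand + SMILES ligand + two seeds)
$ cat > toy_complex.json <<'EOF'
{
  "name": "toy_complex",
  "sequences": [
    {"protein": {"id": ["A"], "sequence": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"}},
    {"ligand": {"id": ["B"], "ccdCodes": ["ATP"]}},
    {"ligand": {"id": ["C"], "smiles": "CC(=O)[O-]"}}
  ],
  "modelSeeds": [1, 2],
  "dialect": "alphafold3",
  "version": 1
}
EOF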

Running#

The full AlphaFold 3 pipeline consists of two parts:

  • the CPU part: the data pipeline, which computes the MSAs
  • the GPU part: model inference (JAX) with the weights you provided

On the Maestro cluster, we provide the AlphaFold wrapper alphafold3_runner.sh, which has the required databases already set up. It only needs the JSON input file(s) and YOUR model location.
Let's go over its options:

Code Block (bash)

$ alphafold3_runner.sh -h
Usage: alphafold3_runner.sh [-m MODEL_DIR] [options] alphafold3_input
Wrapper script for alphafold with default values preset
  -h | --help                      ... display this message and exit.
  -d | --db_dir <dir>              ... Use <dir> for alphafold data location.
                                       (default /opt/gensoft/data/alphafold/3.0.0)
  -m | --model_dir <dir>           ... MANDATORY by default or when -I | --run_inference is set
                                       (Path to the model to use for inference)
  -j | --jackhmmer_n_cpu <int>     ... Number of CPUs to use for Jackhmmer.
                                       (default 8)
  -n | --nhmmer_n_cpu <int>        ... Number of CPUs to use for Nhmmer.
                                       (default 8)
  -o | --out <dir>                 ... Use <dir> for OUTDIR.
                                       (default current working directory)
                                       will be created if does not exist
  -D | --run_data_pipeline         ... Only run the data pipeline.
  -I | --run_inference             ... Only run inference pipeline

alphafold3_input is
    either the path to a single JSON file
    or the path to a directory of JSON files

Using the -D | --run_data_pipeline option runs only the first step, i.e. the data pipeline (CPU intensive // no GPU usage), while the -I | --run_inference option runs only the second step, i.e. the inference pipeline (no CPU usage // GPU intensive).

If neither option is provided, the full pipeline is run.

NB: -m | --model_dir <dir> is required when no step option is given or when -I | --run_inference is used.
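
In short, the three ways to invoke the wrapper (the model path and input names below are placeholders):

Code Block (bash)

# full pipeline (data pipeline + inference); the model is required
$ alphafold3_runner.sh -m /path/to/models input.json

# data pipeline only; no model needed
$ alphafold3_runner.sh --run_data_pipeline input.json

# inference only; needs the model and the *_data.json produced by the data pipeline
$ alphafold3_runner.sh -m /path/to/models --run_inference input_data.json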

Performance consideration: please do not run the complete pipeline on GPU nodes; you would be blocking a GPU node while not using the GPU. Use the Snakemake template below instead.

Data pipeline (CPU part)#

This first step computes the MSAs.

with srun#

Code Block (bash)

$ module load alphafold/3.0.0    # modify this version to the latest one
$ srun --qos=fast --cpus-per-task=8 alphafold3_runner.sh --run_data_pipeline alphafold3_input.json

with sbatch#

Code Block (bash)

#!/bin/bash

#SBATCH -N 1
#SBATCH --partition=common,dedicated       # NON GPU partition that fits your needs
#SBATCH --qos=fast                         # qos allowed on the chosen partition
#SBATCH --cpus-per-task=8                  # nhmmer // jackhmmer default requirement
#SBATCH -J 2PV7_data_pipeline

module load alphafold/3.0.0    # modify this version to the latest one
INPUT_JSON=2pv7.json
OUTPUT_DIR=2pv7_data_pipeline

alphafold3_runner.sh --run_data_pipeline --out ${OUTPUT_DIR}  ${INPUT_JSON}
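
Save the script (the filename below is arbitrary) and submit it:

Code Block (bash)

$ sbatch 2pv7_data_pipeline.sbatch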

This will create the subdirectory 2pv7_data_pipeline/2pv7 containing a single file, 2pv7_data.json, with all the MSAs. Note the lowercase: AF3 lowercases the name even though you provided it in capital letters. This is now your input file for the second stage.

Note that if you provide the original input JSON when requesting inference, AF3 will destroy the computed MSAs and overwrite the output, forcing you to restart the data pipeline from the beginning.
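
A simple precaution (the backup filename is just a suggestion) is to check the data pipeline output and keep a copy of the data JSON before running inference:

Code Block (bash)

$ ls 2pv7_data_pipeline/2pv7/
2pv7_data.json
$ cp 2pv7_data_pipeline/2pv7/2pv7_data.json 2pv7_data.json.bak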

Inference pipeline (GPU part)#

Here we will use previously computed MSAs to infer the folding.

with srun#

Code Block (bash)

$ module load alphafold/3.0.0    # modify this version to the latest one
$ srun --partition=gpu --qos=gpu --gres=gpu:1 --mem=80G alphafold3_runner.sh --run_inference -m model_dir 2pv7_data_pipeline/2pv7/2pv7_data.json

with sbatch#

Code Block (bash)

#!/bin/bash

#SBATCH -N 1
#SBATCH --partition=gpu                     # GPU partition that fits your needs
#SBATCH --qos=gpu                           # qos allowed on the GPU partition
#SBATCH --gres=gpu:1                        # remember 1 GPU
#SBATCH --mem=80G
#SBATCH -J 2PV7_inference_pipeline

module load alphafold/3.0.0                 # modify this version to the latest one
INPUT_JSON=2pv7_data_pipeline/2pv7/2pv7_data.json
OUTPUT_DIR=2pv7_inference_pipeline
MODEL_DIR=/path/to/modeldir

alphafold3_runner.sh --run_inference -m ${MODEL_DIR} --out ${OUTPUT_DIR} ${INPUT_JSON}

This will create the 2pv7_inference_pipeline/2pv7/ directory containing the final results.

If 2pv7_inference_pipeline/2pv7/ already exists and is non-empty, AF3 will create another directory named 2pv7_inference_pipeline/2pv7-<currentdatetime>.
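
The top-ranked structure is written as <name>_model.cif; this is also the file the Snakemake template below keys on, so a quick sanity check could be:

Code Block (bash)

$ ls 2pv7_inference_pipeline/2pv7/2pv7_model.cif
2pv7_inference_pipeline/2pv7/2pv7_model.cif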

Snakemake template#

To simplify this process, we provide a Snakemake template at /opt/gensoft/exe/alphafold/3.0.0/share/Snakefile.template.

Please copy this file to a file called Snakefile in the directory where your input files are. Here is what the template contains:

Code Block (py)

import json
from glob import glob

INPUT_JSON_FILES = glob('*.json')
samples_dict = {}

for json_file in INPUT_JSON_FILES:
    with open(json_file) as f:
        data = json.load(f)
        sample_name = data['name']
        sample_name_lower = sample_name.lower()
        samples_dict[sample_name_lower] = {
            'original_name': sample_name,
            'json_file': json_file
        }

SAMPLES_LOWER = list(samples_dict.keys())

rule all:
    input:
        expand('outdir_results/{sample_name}/{sample_name}_model.cif', sample_name=SAMPLES_LOWER)

rule datapipeline:
    input:
        lambda wildcards: samples_dict[wildcards.sample_name]['json_file']
    output:
        'outdir/{sample_name}/{sample_name}_data.json'
    threads: 8
    resources:
        mem_mb=4000,
        slurm_partition="common,dedicated",
        slurm_extra= "-q fast"
    shell:
        r"""
    module load alphafold/3.0.0
        alphafold3_runner.sh -o outdir --run_data_pipeline {input}
    """

rule inference:
    input:
        'outdir/{sample_name}/{sample_name}_data.json'
    output:
        'outdir_results/{sample_name}/{sample_name}_model.cif'
    params:
        models='models'
    threads: 1
    resources:
        mem_mb=80000,
        slurm_partition="gpu",
        slurm_extra= "-q gpu --gres=gpu:1 -t 120"
    shell:
        r"""
    module load alphafold/3.0.0
        alphafold3_runner.sh -m {params.models} -o outdir_results --run_inference {input}
    """

Now you can do:

Code Block (bash)

$ module load alphafold/3.0.0    # modify this version to the latest one
$ module load snakemake
$ snakemake --executor=slurm --slurm-requeue -j 8

It will look for all JSON files in the current directory and run the full pipeline for you. The results will be located in the outdir_results directory, labeled by the name of each sample. You may need to adjust the value of the models parameter above. Do not forget --executor=slurm. You should manually clean outdir_results/<samplename> between runs; if you do not, AF3 will start creating sub-folders with names like outdir_results/2pv7-<currentdatetime>, which will cause Snakemake to think the run failed even when the folding succeeded.
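
For example, before re-running the 2PV7 sample:

Code Block (bash)

$ rm -rf outdir_results/2pv7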

Benchmarks#

Here are the execution times for several samples on an A100 GPU card:

Sample                        CPU data pipeline   GPU inference
2PV7 (298 aa)                 14m                 2m
hellofold                     33m                 1m30
Q4J5ES (1380 aa)              28m                 15m
multimer (268 aa // 193 aa)   30m                 2m

As you can see, the CPU part always takes longer than the GPU part, so it is highly recommended to run the two steps separately using the Snakemake template above. Note that the samples above are quite short; for longer ones you may need to wait longer. If your jobs cannot fit into 2h, replace "fast" with "normal" in the Snakefile above, as shown below.
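
One way to make that change, assuming you have not otherwise edited the template:

Code Block (bash)

$ sed -i 's/-q fast/-q normal/' Snakefile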

Troubleshooting#

The data pipeline usually completes fine. For the inference step, you may get an out-of-memory error. The most common cause is forgetting to specify RAM with "--mem=80G". If the error message mentions "cuda", "jax" or "xla", it means that you need more VRAM on the GPU. You can request it with "--gres=gpu:1,gmem:50G". Please do not use this option for smaller multimers, as we have only one server with 80GB of GPU memory:

Code Block (bash)

(0)->srun -p gpu -q fast --gres=gpu:1,gmem:50G   hostname
maestro-3010

Final notes#

You may disregard all warnings mentioning "rocm", "tpu", "jax" and "cuda" versions, such as:

Code Block (plain)

Unable to initialize "rocm/tpu"

or

Code Block (plain)

The NVIDIA driver's CUDA version is 12.2 which is older than the PTX compiler version 12.6.77.

They do not influence the folding.

The code is evolving fast, and we will be updating it regularly.