Alphafold
You should normally use the most recent version of alphafold available on the maestro cluster via "module load alphafold".
If you need a particular version, or reproducibility, choose one of the versions listed by running
"module av alphafold"
alphafold/2.0.1 was previously available; the update to version 2.3.2 took place on Monday 09 October 2023.
a warning message will remind you of this point when you load the alphafold module. see:
Code Block (text)
maestro-submit:~ > module load alphafold/2.3.2
******************** WARNING ********************
alphafold DATA are locally available at Pasteur
through ALPHAFOLD_DATA env var
see 'module show alphafold/2.3.0' for exported variables
******************** WARNING ********************
PLEASE. DO NOT duplicate this 2.2 TB data set.
data are currently hosted on /opt/gensoft/data/alphafold/
some other environment variables are set when the alphafold module is loaded: as usual, see the output of the "module show modulename" command to display the environment changes made by loading the module
Code Block (text)
maestro-submit:~ > module show alphafold/2.3.2
-------------------------------------------------------------------
/opt/gensoft/modules/alphafold/2.3.2:
module-whatis {Set environnement for alphafold (2.3.2)}
module-whatis topic_0130
module-whatis operation_2415
module-whatis operation_0479
module-whatis operation_0481
module-whatis operation_0480
prepend-path PATH /opt/gensoft/exe/alphafold/2.3.2/bin
setenv ALPHAFOLD_DATA /opt/gensoft/data/alphafold/2.3.2
setenv XLA_PYTHON_CLIENT_MEM_FRACTION 4.0
setenv ALPHAFOLD_JACKHMMER_N_CPU 8
setenv ALPHAFOLD_HHBLITS_N_CPU 4
setenv TF_FORCE_UNIFIED_MEMORY 0
setenv OPENMM_PLATFORM CUDA
setenv OPENMM_CPU_THREADS 8
-------------------------------------------------------------------
- ALPHAFOLD_DATA: data location
- XLA_PYTHON_CLIENT_MEM_FRACTION: makes JAX (used by the alphafold pipeline) preallocate the given percentage of currently-available GPU memory, instead of the default 90%. Lowering the amount preallocated can fix OOMs that occur when the JAX program starts. the default on maestro is 40%
- ALPHAFOLD_JACKHMMER_N_CPU: number of threads jackhmmer will be run with (default 8)
- ALPHAFOLD_HHBLITS_N_CPU: number of threads hhblits will be run with (default 4)
- TF_FORCE_UNIFIED_MEMORY: when set to 1 (default 0 on maestro), tensorflow will use the memory of additional GPUs as well as CPU memory, allowing it to handle large structure predictions. If your job fails with "unable to allocate memory", consider setting TF_FORCE_UNIFIED_MEMORY to 1
- OPENMM_PLATFORM: platform to use for the amber minimization step. default is CUDA (GPU). when OPENMM_PLATFORM is set to CPU, all alphafold steps are run in CPU mode, meaning that alphafold can be run on standard compute nodes. note that this mode is much slower: see running in cpu mode
- OPENMM_CPU_THREADS: number of threads used by openmm when run in CPU mode; it is ignored in GPU mode (the default)
you can change any of these environment variables to suit your needs, BUT keep in mind that on maestro any value of ALPHAFOLD_JACKHMMER_N_CPU or ALPHAFOLD_HHBLITS_N_CPU (and OPENMM_CPU_THREADS when running in CPU mode) greater than N, the number of cores requested by your allocation (ie the -c N / --cpus-per-task=N value), will automatically be reduced to N, and a warning will inform you about this change.
eg: if you forgot the --cpus-per-task on your allocation, ie requested only 1 (one) core, you will get
Code Block (bash)
WARNING: ALPHAFOLD_JACKHMMER_N_CPU greater than SLURM_CPUS_PER_TASK, reduced to: 1
WARNING: ALPHAFOLD_HHBLITS_N_CPU greater than SLURM_CPUS_PER_TASK, reduced to: 1
same applies to OPENMM_CPU_THREADS in CPU mode.
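conversely, to give the MSA stage more threads, export the variable and request a matching number of cores. a minimal sketch (the values are illustrative, adjust to your needs):
Code Block (bash)
maestro-submit:~ > module load alphafold
maestro-submit:~ > export ALPHAFOLD_JACKHMMER_N_CPU=16
maestro-submit:~ > srun -p gpu --qos=gpu --gres=gpu:A100:1 --cpus-per-task=16 \
alphafold ... # same arguments as in the monomer example below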
NB our installation is run through a singularity container.
by default the container will bind mount the following paths:
- /pasteur: giving access to the pasteur tree, eg: projects, scratch, homes, etc.
- $HOME
- /local/databases: giving access to data banks
- /local/scratch: bind mounted to /tmp in the container, to respect the temporary file location policy used on maestro
- ${ALPHAFOLD_DATA}: of course the alphafold data are bind mounted in the container
NB you can bind mount other volumes in the container using the SINGULARITY_BINDPATH environment variable.
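for example, a sketch with a hypothetical host directory (replace with the path you actually need):
Code Block (bash)
maestro-submit:~ > export SINGULARITY_BINDPATH="/my/extra/volume:/my/extra/volume"
maestro-submit:~ > srun ... alphafold ... # the path is now visible inside the container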
IMPORTANT NOTES#
- alphafold grabs all available GPUs but uses just one, so be sure to request only 1 (ONE) GPU for your job allocations. see below: GPU usage
- remember alphafold requires at least 8 cpus for the jackhmmer and hhblits steps.
- alphafold MUST be run through a slurm allocation (sbatch, srun or salloc); see the example after this list.
- NB to avoid problems while running the container, please use absolute paths for the input file (--fasta_paths) argument AND the results output directory (--output_dir) argument
- we had some rare (and hard to diagnose) conflicts when running alphafold while a conda environment is active, so please DEACTIVATE any conda environment before running alphafold
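for example, a minimal interactive allocation matching these constraints (one GPU, eight cores):
Code Block (bash)
maestro-submit:~ > salloc -p gpu --qos=gpu --gres=gpu:A100:1 --cpus-per-task=8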
Command line option change#
WARNING since alphafold/2.3.0 the command line option --uniclust30_database_path changed to --uniref30_database_path.
the current documentation was updated to reflect this change, so adapt the example command lines given below if you use an alphafold version < 2.3.0; see the example after this note.
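for example, with alphafold < 2.3.0 the uniref30 line of the commands below would instead read as follows (this is the uniclust30 path also used in the 2.1.1 benchmark script at the bottom of this page):
Code Block (bash)
--uniclust30_database_path ${ALPHAFOLD_DATA}/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \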
running alphafold via srun for monomer prediction#
Code Block (bash)
maestro-submit:~ > module load alphafold
maestro-submit:~ > srun -p gpu --qos=gpu --gres=gpu:A100:1 --cpus-per-task=8 \
alphafold --fasta_paths /pasteur/appa/scratch/public/edeveaud/1STU.fasta \
--output_dir /pasteur/appa/scratch/public/edeveaud/1STU_out \
--max_template_date 2020-05-14 \
--model_preset monomer \
--data_dir ${ALPHAFOLD_DATA} \
--uniref90_database_path ${ALPHAFOLD_DATA}/uniref90/uniref90.fasta \
--mgnify_database_path ${ALPHAFOLD_DATA}/mgnify/mgy_clusters.fa \
--pdb70_database_path ${ALPHAFOLD_DATA}/pdb70/pdb70 \
--template_mmcif_dir ${ALPHAFOLD_DATA}/pdb_mmcif/mmcif_files \
--obsolete_pdbs_path ${ALPHAFOLD_DATA}/pdb_mmcif/obsolete.dat \
--bfd_database_path ${ALPHAFOLD_DATA}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--uniref30_database_path ${ALPHAFOLD_DATA}/uniref30/UniRef30_2021_03 \
--db_preset=full_dbs \
--verbosity 1 2>&1 | tee 1STU.log
Note the use of the ${ALPHAFOLD_DATA} env var to point to the data.
for an explanation of the options used, see the alphafold documentation.
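for instance, after loading the module you can check where this variable points (the value comes from the module show output above):
Code Block (bash)
maestro-submit:~ > echo ${ALPHAFOLD_DATA}
/opt/gensoft/data/alphafold/2.3.2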
running alphafold via sbatch for monomer prediction#
Code Block (bash)
#!/bin/bash
#SBATCH -N 1
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1 # remember 1 GPU
#SBATCH --cpus-per-task=8 # jackhmmer default requirement
#SBATCH --constraint='A100|V100|P100'
#---- Job Name
#SBATCH -J 1STU_af2_job
module load alphafold # provides the alphafold command and the ${ALPHAFOLD_DATA} variable
INPUT_FASTA=/pasteur/appa/scratch/public/edeveaud/1STU.fasta
OUTPUT_DIR=/pasteur/appa/scratch/public/edeveaud/1STU_out
PRESET_MODEL=monomer # <monomer|monomer_casp14|monomer_ptm> AKA monomer model, monomer model with extra ensembling, monomer model with pTM head
PRESET_DB=full_dbs # <reduced_dbs|full_dbs>: Choose preset MSA database configuration
alphafold --fasta_paths ${INPUT_FASTA} \
--output_dir ${OUTPUT_DIR} \
--max_template_date 2020-05-14 \
--model_preset ${PRESET_MODEL} \
--data_dir ${ALPHAFOLD_DATA} \
--uniref90_database_path ${ALPHAFOLD_DATA}/uniref90/uniref90.fasta \
--mgnify_database_path ${ALPHAFOLD_DATA}/mgnify/mgy_clusters.fa \
--pdb70_database_path ${ALPHAFOLD_DATA}/pdb70/pdb70 \
--template_mmcif_dir ${ALPHAFOLD_DATA}/pdb_mmcif/mmcif_files \
--obsolete_pdbs_path ${ALPHAFOLD_DATA}/pdb_mmcif/obsolete.dat \
--bfd_database_path ${ALPHAFOLD_DATA}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--uniref30_database_path ${ALPHAFOLD_DATA}/uniref30/UniRef30_2021_03 \
--db_preset ${PRESET_DB} \
--verbosity 1
running alphafold via srun for multimer prediction#
warning: when running multimer prediction
- --pdb70_database_path must be unset
- --uniprot_database_path must be set
- --pdb_seqres_database_path must be set
Code Block (bash)
maestro-submit:~ > module load alphafold
maestro-submit:~ > srun -p gpu --qos=gpu --gres=gpu:A100:1 --cpus-per-task=8 \
alphafold --fasta_paths /pasteur/appa/scratch/public/edeveaud/multimer.fasta \
--max_template_date 2020-05-14 \
--output_dir /pasteur/appa/scratch/public/edeveaud/multimer \
--model_preset multimer \
--data_dir ${ALPHAFOLD_DATA} \
--uniref90_database_path ${ALPHAFOLD_DATA}/uniref90/uniref90.fasta \
--mgnify_database_path ${ALPHAFOLD_DATA}/mgnify/mgy_clusters.fa \
--template_mmcif_dir ${ALPHAFOLD_DATA}/pdb_mmcif/mmcif_files \
--obsolete_pdbs_path ${ALPHAFOLD_DATA}/pdb_mmcif/obsolete.dat \
--bfd_database_path ${ALPHAFOLD_DATA}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--uniref30_database_path ${ALPHAFOLD_DATA}/uniref30/UniRef30_2021_03 \
--db_preset full_dbs \
--uniprot_database_path ${ALPHAFOLD_DATA}/uniprot/uniprot.fa \
--pdb_seqres_database_path ${ALPHAFOLD_DATA}/pdb_seqres/pdb_seqres.txt \
2>&1 | tee multimer.log
running alphafold via sbatch for multimer prediction#
warning: when running multimer prediction
- --pdb70_database_path must be unset
- --uniprot_database_path must be set
- --pdb_seqres_database_path must be set
Code Block (bash)
#!/bin/bash
#SBATCH -N 1
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1 # remember 1 GPU
#SBATCH --cpus-per-task=8 # jackhmmer default requirement
#SBATCH --constraint='A100|V100|P100'
#---- Job Name
#SBATCH -J 1STU_af2_job
module load alphafold # provides the alphafold command and the ${ALPHAFOLD_DATA} variable
INPUT_FASTA=/pasteur/appa/scratch/public/edeveaud/multimer.fasta
OUTPUT_DIR=/pasteur/appa/scratch/public/edeveaud/multimer_out
PRESET_MODEL=multimer # multimer model
PRESET_DB=full_dbs # <reduced_dbs|full_dbs>: Choose preset MSA database configuration
alphafold --fasta_paths ${INPUT_FASTA} \
--max_template_date 2020-05-14 \
--output_dir ${OUTPUT_DIR} \
--model_preset ${PRESET_MODEL} \
--data_dir ${ALPHAFOLD_DATA} \
--uniref90_database_path ${ALPHAFOLD_DATA}/uniref90/uniref90.fasta \
--mgnify_database_path ${ALPHAFOLD_DATA}/mgnify/mgy_clusters.fa \
--template_mmcif_dir ${ALPHAFOLD_DATA}/pdb_mmcif/mmcif_files \
--obsolete_pdbs_path ${ALPHAFOLD_DATA}/pdb_mmcif/obsolete.dat \
--bfd_database_path ${ALPHAFOLD_DATA}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--uniref30_database_path ${ALPHAFOLD_DATA}/uniref30/UniRef30_2021_03 \
--db_preset ${PRESET_DB} \
--uniprot_database_path ${ALPHAFOLD_DATA}/uniprot/uniprot.fa \
--pdb_seqres_database_path ${ALPHAFOLD_DATA}/pdb_seqres/pdb_seqres.txt \
--verbosity 1
systems with multiple conformation states#
are not officially supported by AF2. You can try modifying the MSAs (--use_precomputed_msas). There is more information available here
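a sketch of the rerun pattern, assuming the --use_precomputed_msas behaviour of recent alphafold versions (MSAs found under <output_dir>/<target_name>/msas/ are reused instead of being recomputed):
Code Block (bash)
# run alphafold once to generate the MSAs, optionally edit them in
# <output_dir>/<target_name>/msas/, then rerun with:
alphafold --fasta_paths /abs/path/to/target.fasta \
          --output_dir /abs/path/to/out \
          --use_precomputed_msas=true \
          ... # remaining database arguments as in the examples above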
Template modelling scores#
The template modelling score or TM-score (a measure of similarity between two protein structures) is not calculated with the default model. To get pTM scored models you need to use the pTM specific models, ie change the --model_preset argument to monomer_ptm; see the example below.
this option is only available for monomer prediction.
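for example, starting from the srun monomer command above, only the preset line changes:
Code Block (bash)
--model_preset monomer_ptm \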
running in cpu mode only#
in order to run in CPU-only mode (aka slow, slow mode) you need to toggle OPENMM_PLATFORM to CPU, and you may want to set OPENMM_CPU_THREADS to a value that suits your needs.
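a minimal sketch of a CPU-only run (the thread count is illustrative; adjust it to your allocation):
Code Block (bash)
maestro-submit:~ > module load alphafold
maestro-submit:~ > export OPENMM_PLATFORM=CPU
maestro-submit:~ > export OPENMM_CPU_THREADS=8
maestro-submit:~ > srun --cpus-per-task=8 \
alphafold ... # same arguments as the monomer example above, no GPU requested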
in that case the container is run without the --nv context, which will lead to the following warnings:
Code Block (text)
2021-10-15 10:42:11.383112: W external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64:/.singularity.d/libs
2021-10-15 10:42:11.383146: W external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
I1015 10:42:11.383237 139628216674112 xla_bridge.py:232] Unable to initialize backend 'gpu': Failed precondition: No visible GPU devices.
I1015 10:42:11.383634 139628216674112 xla_bridge.py:232] Unable to initialize backend 'tpu': Invalid argument: TpuPlatform is not available.
W1015 10:42:11.383754 139628216674112 xla_bridge.py:236] No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
later during the run you will see more warnings about the slow execution time. This warning pops up for each model requested, during the "Running predict with shape" stage.
Code Block (text)
2021-10-15 12:33:31.794150: E external/org_tensorflow/tensorflow/compiler/xla/service/slow_operation_alarm.cc:55]
********************************
Very slow compile? If you want to file a bug, run with envvar XLA_FLAGS=--xla_dump_to=/tmp/foo and attach the results.
Compiling module jit_apply_fn__2.101482
for comparison, here are the timings of alphafold runs performed using the 1STU.fasta (68 residues), T1061.fasta (949 residues) and S-layer.fasta (1380 residues) sequences in both modes.
Runs were done on a 56 cpu machine and compared to a run on a Quadro RTX 5000 16 GB.
- for CPU mode, the command was run with OPENMM_PLATFORM=CPU and OPENMM_CPU_THREADS=56, all other environment variables left at their default values
- for GPU mode, the command was run with all environment variables left at their default values
| | CPU mode | GPU mode |
|---|---|---|
| 1STU.fasta | real 113m5.287s user 261m10.399s sys 19m36.303s | real 37m28.122s user 43m50.137s sys 2m15.193s |
| T1061.fasta | real 1188m51.609s user 11780m36.541s sys 274m50.460s | real 262m33.751s user 245m11.956s sys 29m53.662s |
| S-layer.fasta | real 2375m34.827s user 33601m2.526s sys 406m14.152s | real 107m51.450s user 389m59.774s sys 17m14.930s |
GPU memory#
Memory is going to be an issue with large protein sizes. The original publication suggests some things to try:
"Inferencing large proteins can easily exceed the memory of a single GPU. For a V100 with 16 GB of memory, we can predict the structure of proteins up to ~1,300 residues without ensembling and the 256- and 384-residue inference times are using a single GPU’s memory. "
"The memory usage is approximately quadratic in the number of residues, so a 2,500 residue protein involves using unified memory so that we can greatly exceed the memory of a single V100. In our cloud setup, a single V100 is used for computation on a 2,500 residue protein but we requested four GPUs to have sufficient memory."
The following environment variable settings may help with larger polypeptide calculations (> 1,200 aa).
Code Block (text)
TF_FORCE_UNIFIED_MEMORY=1
XLA_PYTHON_CLIENT_MEM_FRACTION=0.5
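for example, exported before one of the srun commands shown above:
Code Block (bash)
maestro-submit:~ > export TF_FORCE_UNIFIED_MEMORY=1
maestro-submit:~ > export XLA_PYTHON_CLIENT_MEM_FRACTION=0.5
maestro-submit:~ > srun -p gpu --qos=gpu --gres=gpu:A100:1 --cpus-per-task=8 \
alphafold ... # as in the monomer example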
the maestro cluster provides the following GPU cards:
- A100 with 40 GB RAM
- V100 with 32 GB RAM
- P100 with 16 GB RAM
alphafold_runner.sh#
On the maestro cluster we provide an alphafold wrapper, alphafold_runner.sh, that eases running alphafold with default parameters, requiring just the fasta file as input.
this script must be run through a slurm allocation (srun / sbatch / salloc), for example: srun -p gpu --qos=gpu --gres=gpu:A100:1 --cpus-per-task=8
see:
Code Block (bash)
Usage: alphafold_runner.sh [options] fasta_file
Wrapper script for alphafold with default values preset,
then will use specified ones
 -h | --help ... display this message and exit.
-V | --verbose ... Toggle verbosity on
(default OFF)
-d | --data_path <dir> ... Use <dir> for alphafold data location.
(default /opt/gensoft/data/alphafold/2.3.0)
-i | --is_prokaryote_list <values>... Optional for multimer system
<values> list should contain a boolean for each fasta specifying true where the target complex is from a prokaryote, and false where it is not, or where the origin is unknown
-m | --models_preset <monomer|monomer_casp14|monomer_ptm|multimer>
... Choose preset model configuration
monomer model, monomer model with extra ensembling, monomer model with pTM head, or multimer model
(default 'monomer')
-o | --out <dir> ... Use <dir> for OUTDIR.
(default current working directory)
will be created if does not exist
-p | --db_preset <full_dbs|reduced_dbs>
... Choose preset MSA database configuration.
(default 'full_dbs')
preset may be:
reduced_dbs: smaller genetic database config.
full_dbs: full genetic database config.
-r | --relax_gpu ... Relax on GPU, default NO
-t | --template_date <template> ... Use <template> as template date.
(default '2020-05-14')
NB to avoid problems, use the full path to the fasta sequence file and to the results directory location.
running for example the following command
Code Block (text)
maestro-submit:~ > alphafold_runner.sh $HOME/foo.fasta
will in fact run the following alphafold command.
Code Block (bash)
alphafold --fasta_paths /pasteur/appa/homes/edeveaud/foo.fasta \
--max_template_date 2020-05-14 \
--output_dir /pasteur/appa/homes/edeveaud \
--model_preset monomer \
--data_dir ${ALPHAFOLD_DATA} \
--uniref90_database_path ${ALPHAFOLD_DATA}/uniref90/uniref90.fasta \
--mgnify_database_path ${ALPHAFOLD_DATA}/mgnify/mgy_clusters.fa \
--pdb70_database_path ${ALPHAFOLD_DATA}/pdb70/pdb70 \
--template_mmcif_dir ${ALPHAFOLD_DATA}/pdb_mmcif/mmcif_files \
--obsolete_pdbs_path ${ALPHAFOLD_DATA}/pdb_mmcif/obsolete.dat \
--bfd_database_path ${ALPHAFOLD_DATA}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--uniref30_database_path ${ALPHAFOLD_DATA}/uniref30/UniRef30_2021_03 \
--db_preset full_dbs \
--use_gpu_relax=False
while
Code Block (text)
maestro-submit:~ > alphafold_runner.sh -m multimer $HOME/foo.fasta
will run the following alphafold command.
Code Block (bash)
alphafold --fasta_paths /pasteur/appa/homes/edeveaud/foo.fasta \
--max_template_date 2020-05-14 \
--output_dir /pasteur/appa/homes/edeveaud \
--model_preset multimer \
--data_dir ${ALPHAFOLD_DATA} \
--uniref90_database_path ${ALPHAFOLD_DATA}/uniref90/uniref90.fasta \
--mgnify_database_path ${ALPHAFOLD_DATA}/mgnify/mgy_clusters.fa \
--template_mmcif_dir ${ALPHAFOLD_DATA}/pdb_mmcif/mmcif_files \
--obsolete_pdbs_path ${ALPHAFOLD_DATA}/pdb_mmcif/obsolete.dat \
--bfd_database_path ${ALPHAFOLD_DATA}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--uniref30_database_path ${ALPHAFOLD_DATA}/uniref30/UniRef30_2021_03 \
--db_preset full_dbs \
--use_gpu_relax=False \
--uniprot_database_path ${ALPHAFOLD_DATA}/uniprot/uniprot.fa \
--pdb_seqres_database_path ${ALPHAFOLD_DATA}/pdb_seqres/pdb_seqres.txt
by playing with the options you can change any default value of your alphafold command, eg change the template date and use template modelling scores
Code Block (text)
maestro-submit:~ > alphafold_runner.sh -t 2021-01-01 -m monomer_ptm $HOME/foo.fasta
will run
Code Block (bash)
alphafold --fasta_paths /pasteur/appa/homes/edeveaud/foo.fasta \
--max_template_date 2021-01-01 \
--output_dir /home/edeveaud \
--model_preset monomer_ptm \
--data_dir ${ALPHAFOLD_DATA} \
--uniref90_database_path ${ALPHAFOLD_DATA}/uniref90/uniref90.fasta \
--mgnify_database_path ${ALPHAFOLD_DATA}/mgnify/mgy_clusters.fa \
--pdb70_database_path ${ALPHAFOLD_DATA}/pdb70/pdb70 \
--template_mmcif_dir ${ALPHAFOLD_DATA}/pdb_mmcif/mmcif_files \
--obsolete_pdbs_path ${ALPHAFOLD_DATA}/pdb_mmcif/obsolete.dat \
--bfd_database_path ${ALPHAFOLD_DATA}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--uniref30_database_path ${ALPHAFOLD_DATA}/uniref30/UniRef30_2021_03 \
--db_preset full_dbs \
--use_gpu_relax=False
Alphafold references:#
- alphafold home: https://github.com/deepmind/alphafold
- alphafold installed versions: alphafold/2.0.1 alphafold/2.1.1 alphafold/2.2.0 alphafold/2.3.0 alphafold/2.3.2
- alphafold reference: Jumper, J., Evans, R., Pritzel, A. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Alphafold available data#
| data | version | content |
|---|---|---|
| ${ALPHAFOLD_DATA}/params | 2021-07-14 | CASP14/pTM models |
| ${ALPHAFOLD_DATA}/mgnify | 2019_05 | MGnify protein sequence database |
| ${ALPHAFOLD_DATA}/uniref30 | 2018_08 | databases cluster UniProtKB sequences at the level of 30% pairwise sequence identity |
| ${ALPHAFOLD_DATA}/uniref90 | 2021_03 | databases cluster UniProtKB sequences at the level of 90% pairwise sequence identity |
| ${ALPHAFOLD_DATA}/pdb70 | 210901 | PDB70 database |
| ${ALPHAFOLD_DATA}/pdb_mmcif | NA | PDB in mmCIF format, see: https://www.ebi.ac.uk/pdbe/docs/documentation/mmcif.html |
| ${ALPHAFOLD_DATA}/small_bfd | NA | small Big Fantastic Database, see: https://bfd.mmseqs.com/ |
| ${ALPHAFOLD_DATA}/bfd | NA | full Big Fantastic Database, see: https://bfd.mmseqs.com/ |
| ${ALPHAFOLD_DATA}/uniprot | NA | concatenation of uniprot_trembl and uniprot_sprot |
| ${ALPHAFOLD_DATA}/pdb_seqres | NA | PDB SEQRES records |
NB pdb_mmcif is regularly updated and will be kept up to date.
GPU usage#
as stated before, alphafold will run by default on all available GPUs but will only use one.
here are some details that emphasize this point.
when CUDA_VISIBLE_DEVICES is not set, alphafold will grab all available GPUs on the host without using them; just the first one is used. see the nvidia-smi output:
Code Block (text)
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1596461 C python 14919MiB |
| 1 N/A N/A 1596461 C python 255MiB |
| 2 N/A N/A 1596461 C python 255MiB |
| 3 N/A N/A 1596461 C python 255MiB |
| 4 N/A N/A 1596461 C python 255MiB |
| 5 N/A N/A 1596461 C python 255MiB |
| 6 N/A N/A 1596461 C python 255MiB |
| 7 N/A N/A 1596461 C python 255MiB |
+-----------------------------------------------------------------------------+
and here is the memory consumption per GPU graph.
you can note that GPU memory, like GPU usage, is only handled on 1 GPU unit. So there is no need to request more than 1 (one) GPU in your allocation.
Alphafold data repository structure#
the database directory looks like this:
Code Block (text)
├── bfd
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffdata
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffindex
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffdata
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata
│ └── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex
├── mgnify
│ └── mgy_clusters.fa
├── params
│ ├── LICENSE
│ ├── params_model_1.npz
│ ├── params_model_multimer_1.npz
│ ├── params_model_1_ptm.npz
│ ├── params_model_2.npz
│ ├── params_model_multimer_2.npz
│ ├── params_model_2_ptm.npz
│ ├── params_model_3.npz
│ ├── params_model_multimer_3.npz
│ ├── params_model_3_ptm.npz
│ ├── params_model_4.npz
│ ├── params_model_multimer_4.npz
│ ├── params_model_4_ptm.npz
│ ├── params_model_5.npz
│ ├── params_model_multimer_5.npz
│ └── params_model_5_ptm.npz
├── pdb70
│ ├── md5sum
│ ├── pdb70_a3m.ffdata
│ ├── pdb70_a3m.ffindex
│ ├── pdb70_clu.tsv
│ ├── pdb70_cs219.ffdata
│ ├── pdb70_cs219.ffindex
│ ├── pdb70_hhm.ffdata
│ ├── pdb70_hhm.ffindex
│ └── pdb_filter.dat
├── pdb_mmcif
│ ├── mmcif_files
│ │ └── *.cif # warning: zillions of files
│ └── obsolete.dat
├── pdb_seqres
│ └── pdb_seqres.txt
├── small_bfd
│ └── bfd-first_non_consensus_sequences.fasta
├── uniclust30
│ └── uniclust30_2018_08
│ ├── uniclust30_2018_08.cs219
│ ├── uniclust30_2018_08.cs219.sizes
│ ├── uniclust30_2018_08_a3m.ffdata
│ ├── uniclust30_2018_08_a3m.ffindex
│ ├── uniclust30_2018_08_a3m_db@
│ ├── uniclust30_2018_08_a3m_db.index
│ ├── uniclust30_2018_08_cs219.ffdata
│ ├── uniclust30_2018_08_cs219.ffindex
│ ├── uniclust30_2018_08_hhm.ffdata
│ ├── uniclust30_2018_08_hhm.ffindex
│ ├── uniclust30_2018_08_hhm_db@
│ ├── uniclust30_2018_08_hhm_db.index
│ └── uniclust30_2018_08_md5sum
├── uniprot
│ └── uniprot.fasta
└── uniref90
└── uniref90.fasta
monomer benchmarks#
we ran alphafold on the casp14 T1050 target on 4 different GPU devices: A100, P100, V100 and RTX 5000, using the following script.
NB these tests were run with alphafold/2.1.1
Code Block (bash)
#!/bin/bash
TARGET=$1 #T1050.fasta
GPU_NAME=$(nvidia-smi --query-gpu=gpu_name --format=csv,noheader | head -n 1 | tr ' ' '_')
OUTDIR=$(basename $TARGET .fasta).${GPU_NAME}
mkdir -p ${OUTDIR}
time alphafold --fasta_paths `pwd`/${TARGET} \
--max_template_date 2020-05-14 \
--output_dir `pwd`/${OUTDIR} \
--model_preset monomer \
--data_dir ${ALPHAFOLD_DATA} \
--uniref90_database_path ${ALPHAFOLD_DATA}/uniref90/uniref90.fasta \
--mgnify_database_path ${ALPHAFOLD_DATA}/mgnify/mgy_clusters.fa \
--pdb70_database_path ${ALPHAFOLD_DATA}/pdb70/pdb70 \
--template_mmcif_dir ${ALPHAFOLD_DATA}/pdb_mmcif/mmcif_files \
--obsolete_pdbs_path ${ALPHAFOLD_DATA}/pdb_mmcif/obsolete.dat \
--bfd_database_path ${ALPHAFOLD_DATA}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--uniref30_database_path ${ALPHAFOLD_DATA}/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
--db_preset=full_dbs \
--stderrthreshold debug 2>&1 | tee ${OUTDIR}/log
input = T1050.fasta from CASP14
input and outdir on appa /pasteur/appa/scratch/public/edeveaud/T1050
environment:
CUDA_VISIBLE_DEVICES=0 # only one GPU
ALPHAFOLD_DATA=/opt/gensoft/data/alphafold/2.1.1
XLA_PYTHON_CLIENT_MEM_FRACTION, TF_FORCE_UNIFIED_MEMORY: see above
| | Tesla P100 | Tesla V100 | Tesla A100 | Quadro RTX 5000 (maestro-builder) |
|---|---|---|---|---|
| XLA_PYTHON_CLIENT_MEM_FRACTION=4.0 TF_FORCE_UNIFIED_MEMORY=1 | real 166m56.273s user 386m28.066s sys 21m21.025s | real 130m45.294s user 302m26.913s sys 14m49.230s | real 84m0.195s user 262m15.817s sys 2m45.695s | real 164m38.699s user 304m52.392s sys 27m18.801s |
impact of TF_FORCE_UNIFIED_MEMORY#
as stated above, sometimes your sequence does not "fit in GPU memory", and in this case we recommend setting TF_FORCE_UNIFIED_MEMORY to 1. here are some examples
using Q4J6E5.fasta (1380 residues)
Code Block (text)
>sp|Q4J6E5|SLAA_SULAC S-layer protein A OS=Sulfolobus acidocaldarius (strain ATCC 33909 / DSM 639 / JCM 8929 / NBRC 15157 / NCIMB 11770) OX=330779 GN=slaA PE=1 SV=1
MNKLVGLLVSSLFLASILIGIAPAITTTALTPPVSAGGIQAYLLTGSGAPASGLVLFVVNVSNIQVSSSNVTNVISTVVSNIQINAKTENAQTGATTGSVTVRFPTSGYNAYYDSVDKVVFVVVSFLYPYTTTSVNIPLSYLSKYLPGLLTAQPYDETGAQVTSVSSTPFGSLIDTSTGQQILGTNPVLTSYNSYTTQANTNMQEGVVSGTLTSFTLGGQSFSGSTVPVILYAPFIFSNSPYQAGLYNPMQVNGNLGSLSSEAYYHPVIWGRALINTTLIDTYASGSVPFTFQLNYSVPGPLTINMAQLAWIASINNLPTSFTYLSYKFSNGYESFLGIISNSTQLTAGALTINPSGNFTINGKKFYVYLLVVGSTNSTTPVEYVTKLVVEYPSSTNFLPQGVTVTTSSNKYTLPVYEIGGPAGTTITLTGNWYSTPYTVQITVGSTPTLTNYVSQILLKAVAYEGINVSTTQSPYYSTAILSTPPSEISITGSSTITAQGKLTATSASATVNLLTNATLTYENIPLTQYSFNGIIVTPGYAAINGTTAMAYVIGALYNKTSDYVLSFAGSQEPMQVMNNNLTEVTTLAPFGLTLLAPSVPATETGTSPLQLEFFTVPSTSYIALVDFGLWGNLTSVTVSAYDTVNNKLSVNLGYFYGIVIPPSISTAPYNYQNFICPNNYVTVTIYDPDAVLDPYPSGSFTTSSLPLKYGNMNITGAVIFPGSSVYNPSGVFGYSNFNKGAAVTTFTYTAQSGPFSPVALTGNTNYLSQYADNNPTDNYYFIQTVNGMPVLMGGLSIVASPVSASLPSSTSSPGFMYLLPSAAQVPSPLPGMATPNYNLNIYITYKIDGATVGNNMINGLYVASQNTLIYVVPNGSFVGSNIKLTYTTTDYAVLHYFYSTGQYKVFKTVSVPNVTANLYFPSSTTPLYQLSVPLYLSEPYYGSPLPTYIGLGTNGTSLWNSPNYVLFGVSAVQQYLGFIKSISVTLSNGTTVVIPLTTSNMQTLFPQLVGQELQACNGTFQFGISITGLEKLLNLNVQQLNNSILSVTYHDYVTGETLTATTKLVALSTLSLVAKGAGVVEFLLTAYPYTGNITFAPPWFIAENVVKQPFMTYSDLQFAKTNPSAILSLSTVNITVVGLGGKASVYYNSTSGQTVITNIYGQTVATLSGNVLPTLTELAAGNGTFTGSLQFTIVPNNTVVQIPSSLTKTSFAVYTNGSLAIVLNGKAYSLGPAGLFLLPFVTYTGSAIGANATAIITVSDGVGTSTTQVPITAENFTPIRLAPFQVPAQVPLPNAPKLKYEYNGSIVITPQQQVLKIYVTSILPYPQEFQIQAFVYEASQFNVHTGSPTAAPVYFSYSAVRAYPALGIGTSVPNLLVYV
we ran alphafold on 2 different cards
- Quadro RTX 5000, 16 GB memory
- Tesla A100, 40 GB memory
| | RTX 5000 | A100 |
|---|---|---|
| TF_FORCE_UNIFIED_MEMORY=0 | Failure RuntimeError: Resource exhausted: Out of memory while trying to allocate 14752930272 bytes. | real 107m51.450s user 389m59.774s sys 17m14.930s |
| TF_FORCE_UNIFIED_MEMORY=1 | real 283m30.444s user 350m40.434s sys 136m7.659s | real 108m7.859s user 389m27.946s sys 17m33.175s |
NB: the time to look at is the real one.
the first run shows that the RTX 5000 does not have enough memory to handle this structure modelling, while the A100 has sufficient memory.
the second run shows that setting TF_FORCE_UNIFIED_MEMORY=1 allows the run to complete successfully. One can note that the impact on the A100 card is negligible.
multimer benchmarks#
we ran multimer prediction using the following sequences under the same conditions as the monomer benchmarks
max2.fasta (text)
>chain A
SKAWNRYRLPNTLKPDSYRVTLRPYLTPNDRGLYVFKGSSTVRFTCKEATDVIIIHSKKLNYTLSQGHRVVLRGVGGSQPPDIDKTELVEPTEYLVVHLKGSLVKDSQYEMDSEFEGELADDLAGFYRSEYMEGNVRKVVATTQMQAADARKSFPCFDEPAMKAEFNITLIHPKDLTALSNMLPKGPSTPLPEDPNWNVTEFHTTPKMSTYLLAFIVSEFDYVEKQASNGVLIRIWARPSAIAAGHGDYALNVTGPILNFFAGHYDTPYPLPKSDQIGLPDFNAGAMENWGLVTYRENSLLFDPLSSSSSNKERVVTVIAHELAHQWFGNLVTIEWWNDLWLNEGFASYVEYLGADYAEPTWNLKDLMVLNDVYRVMAVDALASSHPLSTPASEINTPAQISELFDAISYSKGASVLRMLSSFLSEDVFKQGLASYLHTFAYQNTIYLNLWDHLQEAVNNRSIQLPTTVRDIMNRWTLQMGFPVITVDTSTGTLSQEHFLLDPDSNVTRPSEFNYVWIVPITSIRDGRQQQDYWLIDVRAQNDLFSTSGNEWVLLNLNVTGYYRVNYDEENWRKIQTQLQRDHSAIPVINRAQIINDAFNLASAHKVPVTLALNNTLFLIEERQYMPWEAALSSLSYFKLMFDRSEVYGPMKNYLKKQVTPLFIHFRNNTNNWREIPENLMDQYSEVNAISTACSNGVPECEEMVSGLFKQWMENPNNNPIHPNLRSTVYCNAIAQGGEEEWDFAWEQFRNATLVNEADKLRAALACSKELWILNRYLSYTLNPDLIRKQDATSTIISITNNVIGQGLVWDFVQSNWKKLFNDYGGGSFSFSNLIQAVTRRFSTEYELQQLEQFKKDNEETGFGSGTRALEQALEKTKANIKWVKENKEVVLQWFTENSK
>chain B
KHMFIVLYVNFKLRSGVGRCYNCRPAVVNITLANFNETKGPLCVDTSHFTTQFVGAKFDRWSASINTGNCPFSFGKVNNFVKFGSVCFSLKDIPGGCAMPIMANLANLNSHTIGTLYVSWSDGDGITGVPQP
sequences were taken from: NATURE COMMUNICATIONS | 8: 1735 | DOI: 10.1038/s41467-017-01706-x
| | Tesla P100 | Tesla V100 | Tesla A100 | Quadro RTX 5000 (maestro-builder) |
|---|---|---|---|---|
| XLA_PYTHON_CLIENT_MEM_FRACTION=4.0 TF_FORCE_UNIFIED_MEMORY=1 | real 195m51.152s user 415m53.242s sys 29m30.718s | real 149m38.843s user 311m14.591s sys 19m35.380s | real 153m51.796s user 426m15.276s sys 13m49.630s | real 186m18.490s user 347m39.079s sys 39m4.976s |
the input, outdir and script used to run the alphafold multimer prediction are available on appa: /pasteur/appa/scratch/public/edeveaud/max2
How to view predicted structures#
once run, alphafold will provide you with N models that you may want to take a look at.
various options provide visualisation capabilities, for example:
- software that requires local installation
- pymol: https://pymol.org/2/ should run on any computer
- chimeraX: https://www.cgl.ucsf.edu/chimerax/ available for MacOS, Windows and various linux flavours
- VTX: https://vtx.drugdesign.fr/ available for Windows and various linux flavours
- online web based tools
- https://nglviewer.org/ngl/
- https://www.rcsb.org/3d-view
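for example, to open the best-ranked model with pymol on your workstation (this assumes the standard alphafold output layout, <output_dir>/<target_name>/ranked_0.pdb):
Code Block (bash)
pymol /pasteur/appa/scratch/public/edeveaud/1STU_out/1STU/ranked_0.pdb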
here is a view of the 5 models predicted for CASP14 T1031 using alphafold. images were generated using ChimeraX
| T1031_ranked_0 | T1031_ranked_1 | T1031_ranked_2 | T1031_ranked_3 | T1031_ranked_4 |
|---|---|---|---|---|
accuracy#
out of curiosity we ran an alphafold prediction on the Galactose-1-phosphate uridylyltransferase homodimer and compared the prediction with the known X-Ray structure available in the PDB.
both structures, the reference ones and the predictions, were loaded in ChimeraX and then "aligned" with MatchMaker.
here you will find MatchMaker's RMS for each alphafold ranked prediction vs pdb1hxp.ent and pdb1hxq.ent
| | pdb1hxp.ent | pdb1hxq.ent |
|---|---|---|
| ranked_0.pdb | Matchmaker pdb1hxp.ent, chain A (#1) with ranked_0.pdb, chain A (#2), sequence alignment score = 1806.7 RMSD between 331 pruned atom pairs is 0.302 angstroms; (across all 340 pairs: 1.458) | Matchmaker pdb1hxq.ent, chain B (#1) with ranked_0.pdb, chain A (#2), sequence alignment score = 1810.6 RMSD between 330 pruned atom pairs is 0.428 angstroms; (across all 332 pairs: 0.536) |
| ranked_1.pdb | Matchmaker pdb1hxp.ent, chain A (#1) with ranked_1.pdb, chain A (#3), sequence alignment score = 1794.7 RMSD between 331 pruned atom pairs is 0.288 angstroms; (across all 340 pairs: 1.448) | Matchmaker pdb1hxq.ent, chain A (#1) with ranked_1.pdb, chain A (#3), sequence alignment score = 1819.6 RMSD between 332 pruned atom pairs is 0.369 angstroms; (across all 340 pairs: 1.760) |
| ranked_2.pdb | Matchmaker pdb1hxp.ent, chain A (#1) with ranked_2.pdb, chain A (#4), sequence alignment score = 1806.7 RMSD between 331 pruned atom pairs is 0.313 angstroms; (across all 340 pairs: 1.458) | Matchmaker pdb1hxq.ent, chain A (#1) with ranked_2.pdb, chain B (#4), sequence alignment score = 1819.6 RMSD between 332 pruned atom pairs is 0.388 angstroms; (across all 340 pairs: 1.799) |
| ranked_3.pdb | Matchmaker pdb1hxp.ent, chain A (#1) with ranked_3.pdb, chain B (#5), sequence alignment score = 1803.1 RMSD between 331 pruned atom pairs is 0.305 angstroms; (across all 340 pairs: 1.474) | Matchmaker pdb1hxq.ent, chain A (#1) with ranked_3.pdb, chain A (#5), sequence alignment score = 1819.6 RMSD between 332 pruned atom pairs is 0.388 angstroms; (across all 340 pairs: 1.828) |
| ranked_4.pdb | Matchmaker pdb1hxp.ent, chain A (#1) with ranked_4.pdb, chain A (#6), sequence alignment score = 1806.7 RMSD between 331 pruned atom pairs is 0.281 angstroms; (across all 340 pairs: 1.430) | Matchmaker pdb1hxq.ent, chain B (#1) with ranked_4.pdb, chain A (#6), sequence alignment score = 1810.6 RMSD between 331 pruned atom pairs is 0.417 angstroms; (across all 332 pairs: 0.505) |
just for the pleasure of the eyes, below is an overview of the spinning MatchMaker result (click to view)
blue: alphafold ranked_4.pdb prediction
brown: pdb1hxq.ent X-Ray structure