Alphafold
You should normally use the most recent version of alphafold available on the maestro cluster via "module load alphafold".
If you need a particular version, or reproducibility, choose one of the versions listed by running
"module av alphafold"
alphafold/2.0.1 was previously available; the update to version 2.3.2 took place on Monday 09 October 2023.
a warning message will remind you of this point when you load the alphafold module. see:
Code Block (text)
maestro-submit:~ > module load alphafold/2.3.2
******************** WARNING ********************
alphafold DATA are locally available at Pasteur
through ALPHAFOLD_DATA env var
see 'module show alphafold/2.3.0' for exported variables
******************** WARNING ********************
PLEASE. DO NOT duplicate this 2.2 TB data set.
data are currently hosted on /opt/gensoft/data/alphafold/
some other environment variables are set when the alphafold module is loaded: as usual, see the output of the "module show modulename" command to display the environment changes made by loading the module
Code Block (text)
maestro-submit:~ > module show alphafold/2.3.2
-------------------------------------------------------------------
/opt/gensoft/modules/alphafold/2.3.2:
module-whatis {Set environnement for alphafold (2.3.2)}
module-whatis topic_0130
module-whatis operation_2415
module-whatis operation_0479
module-whatis operation_0481
module-whatis operation_0480
prepend-path PATH /opt/gensoft/exe/alphafold/2.3.2/bin
setenv ALPHAFOLD_DATA /opt/gensoft/data/alphafold/2.3.2
setenv XLA_PYTHON_CLIENT_MEM_FRACTION 4.0
setenv ALPHAFOLD_JACKHMMER_N_CPU 8
setenv ALPHAFOLD_HHBLITS_N_CPU 4
setenv TF_FORCE_UNIFIED_MEMORY 0
setenv OPENMM_PLATFORM CUDA
setenv OPENMM_CPU_THREADS 8
-------------------------------------------------------------------
- ALPHAFOLD_DATA: data location
- XLA_PYTHON_CLIENT_MEM_FRACTION: makes JAX (used by the alphafold pipeline) preallocate the given percentage of currently-available GPU memory, instead of the default 90%. Lowering the amount preallocated can fix OOMs that occur when the JAX program starts. the default on maestro is 40%
- ALPHAFOLD_JACKHMMER_N_CPU: number of threads jackhmmer will be run with (default 8)
- ALPHAFOLD_HHBLITS_N_CPU: number of threads hhblits will be run with (default 4)
- TF_FORCE_UNIFIED_MEMORY: when set to 1 (default 0 on maestro), tensorflow will use the memory of additional GPUs as well as CPU memory, allowing it to handle large structure predictions. If your job fails with "unable to allocate memory", consider setting TF_FORCE_UNIFIED_MEMORY to 1
- OPENMM_PLATFORM: platform to use for the amber minimization step. default is CUDA (GPU). when OPENMM_PLATFORM is set to CPU, all alphafold steps are run in CPU mode, meaning that alphafold can be run on standard compute nodes. note that this mode is much slower: see running in cpu mode
- OPENMM_CPU_THREADS: number of threads used by openmm when run in CPU mode; it is ignored in GPU mode (the default)
you can change any of these environment variables to suit your needs, BUT keep in mind that on maestro any value of ALPHAFOLD_JACKHMMER_N_CPU or ALPHAFOLD_HHBLITS_N_CPU (and OPENMM_CPU_THREADS when running in CPU mode) greater than N, the number of cores requested by your allocation (ie the -c N / --cpus-per-task=N value), will automatically be reduced to N, and a warning will inform you about this change.
eg: if you forgot the --cpus-per-task on your allocation, ie requested only 1 (one) core, you will get
Code Block (bash)
WARNING: ALPHAFOLD_JACKHMMER_N_CPU greater than SLURM_CPUS_PER_TASK, reduced to: 1
WARNING: ALPHAFOLD_HHBLITS_N_CPU greater than SLURM_CPUS_PER_TASK, reduced to: 1
same applies to OPENMM_CPU_THREADS in CPU mode.
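conversely, to give the MSA stage more threads, export the variable and request a matching number of cores. a minimal sketch (the values are illustrative, adjust to your needs):
Code Block (bash)
maestro-submit:~ > module load alphafold
maestro-submit:~ > export ALPHAFOLD_JACKHMMER_N_CPU=16
maestro-submit:~ > srun -p gpu --qos=gpu --gres=gpu:A100:1 --cpus-per-task=16 \
alphafold ... # same arguments as in the monomer example below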
NB our installation is run through a singularity container.
by default the container will bind mount the following paths:
- /pasteur: giving access to the pasteur tree, eg: projects, scratch, homes, etc.
- $HOME
- /local/databases: giving access to data banks
- /local/scratch: bind mounted to /tmp in the container, to respect the temporary file location policy used on maestro
- ${ALPHAFOLD_DATA}: of course the alphafold data are bind mounted in the container
NB you can bind mount other volumes in the container using the SINGULARITY_BINDPATH environment variable.
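for example, a sketch with a hypothetical host directory (replace with the path you actually need):
Code Block (bash)
maestro-submit:~ > export SINGULARITY_BINDPATH="/my/extra/volume:/my/extra/volume"
maestro-submit:~ > srun ... alphafold ... # the path is now visible inside the container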
IMPORTANT NOTES#
- alphafold grabs all available GPUs but uses just one, so be sure to request only 1 (ONE) GPU for your job allocations. see below: GPU usage
- remember alphafold requires at least 8 cpus for the jackhmmer and hhblits steps.
- alphafold MUST be run through a slurm allocation (sbatch, srun or salloc); see the example after this list.
- NB to avoid problems while running the container, please use absolute paths for the input file (--fasta_paths) argument AND the results output directory (--output_dir) argument
- we had some rare (and hard to diagnose) conflicts when running alphafold while a conda environment is active, so please DEACTIVATE any conda environment before running alphafold
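for example, a minimal interactive allocation matching these constraints (one GPU, eight cores):
Code Block (bash)
maestro-submit:~ > salloc -p gpu --qos=gpu --gres=gpu:A100:1 --cpus-per-task=8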
Command line option change#
WARNING since alphafold/2.3.0 the command line option --uniclust30_database_path changed to --uniref30_database_path.
the current documentation was updated to reflect this change, so adapt the example command lines given below if you use an alphafold version < 2.3.0; see the example after this note.
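for example, with alphafold < 2.3.0 the uniref30 line of the commands below would instead read as follows (this is the uniclust30 path also used in the 2.1.1 benchmark script at the bottom of this page):
Code Block (bash)
--uniclust30_database_path ${ALPHAFOLD_DATA}/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \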
running alphafold via srun for monomer prediction#
Code Block (bash)
maestro-submit:~ > module load alphafold
maestro-submit:~ > srun -p gpu --qos=gpu --gres=gpu:A100:1 --cpus-per-task=8 \
alphafold --fasta_paths /pasteur/appa/scratch/public/edeveaud/1STU.fasta \
--output_dir /pasteur/appa/scratch/public/edeveaud/1STU_out \
--max_template_date 2020-05-14 \
--model_preset monomer \
--data_dir ${ALPHAFOLD_DATA} \
--uniref90_database_path ${ALPHAFOLD_DATA}/uniref90/uniref90.fasta \
--mgnify_database_path ${ALPHAFOLD_DATA}/mgnify/mgy_clusters.fa \
--pdb70_database_path ${ALPHAFOLD_DATA}/pdb70/pdb70 \
--template_mmcif_dir ${ALPHAFOLD_DATA}/pdb_mmcif/mmcif_files \
--obsolete_pdbs_path ${ALPHAFOLD_DATA}/pdb_mmcif/obsolete.dat \
--bfd_database_path ${ALPHAFOLD_DATA}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--uniref30_database_path ${ALPHAFOLD_DATA}/uniref30/UniRef30_2021_03 \
--db_preset=full_dbs \
--verbosity 1 2>&1 | tee 1STU.log
Note the use of the ${ALPHAFOLD_DATA} env var to point to the data.
for an explanation of the options used, see the alphafold documentation.
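for instance, after loading the module you can check where this variable points (the value comes from the module show output above):
Code Block (bash)
maestro-submit:~ > echo ${ALPHAFOLD_DATA}
/opt/gensoft/data/alphafold/2.3.2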
running alphafold via sbatch for monomer prediction#
Code Block (bash)
#!/bin/bash
#SBATCH -N 1
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1 # remember 1 GPU
#SBATCH --cpus-per-task=8 # jackhmmer default requirement
#SBATCH --constraint='A100|V100|P100'
#---- Job Name
#SBATCH -J 1STU_af2_job
module load alphafold # provides the alphafold command and the ${ALPHAFOLD_DATA} variable
INPUT_FASTA=/pasteur/appa/scratch/public/edeveaud/1STU.fasta
OUTPUT_DIR=/pasteur/appa/scratch/public/edeveaud/1STU_out
PRESET_MODEL=monomer # <monomer|monomer_casp14|monomer_ptm> AKA monomer model, monomer model with extra ensembling, monomer model with pTM head
PRESET_DB=full_dbs # <reduced_dbs|full_dbs>: Choose preset MSA database configuration
alphafold --fasta_paths ${INPUT_FASTA} \
--output_dir ${OUTPUT_DIR} \
--max_template_date 2020-05-14 \
--model_preset ${PRESET_MODEL} \
--data_dir ${ALPHAFOLD_DATA} \
--uniref90_database_path ${ALPHAFOLD_DATA}/uniref90/uniref90.fasta \
--mgnify_database_path ${ALPHAFOLD_DATA}/mgnify/mgy_clusters.fa \
--pdb70_database_path ${ALPHAFOLD_DATA}/pdb70/pdb70 \
--template_mmcif_dir ${ALPHAFOLD_DATA}/pdb_mmcif/mmcif_files \
--obsolete_pdbs_path ${ALPHAFOLD_DATA}/pdb_mmcif/obsolete.dat \
--bfd_database_path ${ALPHAFOLD_DATA}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--uniref30_database_path ${ALPHAFOLD_DATA}/uniref30/UniRef30_2021_03 \
--db_preset ${PRESET_DB} \
--verbosity 1
running alphafold via srun for multimer prediction#
warning: when running multimer prediction
- --pdb70_database_path must be unset
- --uniprot_database_path must be set
- --pdb_seqres_database_path must be set
Code Block (bash)
maestro-submit:~ > module load alphafold
maestro-submit:~ > srun -p gpu --qos=gpu --gres=gpu:A100:1 --cpus-per-task=8 \
alphafold --fasta_paths /pasteur/appa/scratch/public/edeveaud/multimer.fasta \
--max_template_date 2020-05-14 \
--output_dir /pasteur/appa/scratch/public/edeveaud/multimer \
--model_preset multimer \
--data_dir ${ALPHAFOLD_DATA} \
--uniref90_database_path ${ALPHAFOLD_DATA}/uniref90/uniref90.fasta \
--mgnify_database_path ${ALPHAFOLD_DATA}/mgnify/mgy_clusters.fa \
--template_mmcif_dir ${ALPHAFOLD_DATA}/pdb_mmcif/mmcif_files \
--obsolete_pdbs_path ${ALPHAFOLD_DATA}/pdb_mmcif/obsolete.dat \
--bfd_database_path ${ALPHAFOLD_DATA}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--uniref30_database_path ${ALPHAFOLD_DATA}/uniref30/UniRef30_2021_03 \
--db_preset full_dbs \
--uniprot_database_path ${ALPHAFOLD_DATA}/uniprot/uniprot.fa \
--pdb_seqres_database_path ${ALPHAFOLD_DATA}/pdb_seqres/pdb_seqres.txt \
2>&1 | tee multimer.log
running alphafold via sbatch for multimer prediction#
warning: when running multimer prediction
- --pdb70_database_path must be unset
- --uniprot_database_path must be set
- --pdb_seqres_database_path must be set
Code Block (bash)
#!/bin/bash
#SBATCH -N 1
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1 # remember 1 GPU
#SBATCH --cpus-per-task=8 # jackhmmer default requirement
#SBATCH --constraint='A100|V100|P100'
#---- Job Name
#SBATCH -J 1STU_af2_job
module load alphafold # provides the alphafold command and the ${ALPHAFOLD_DATA} variable
INPUT_FASTA=/pasteur/appa/scratch/public/edeveaud/multimer.fasta
OUTPUT_DIR=/pasteur/appa/scratch/public/edeveaud/multimer_out
PRESET_MODEL=multimer # multimer model
PRESET_DB=full_dbs # <reduced_dbs|full_dbs>: Choose preset MSA database configuration
alphafold --fasta_paths ${INPUT_FASTA} \
--max_template_date 2020-05-14 \
--output_dir ${OUTPUT_DIR} \
--model_preset ${PRESET_MODEL} \
--data_dir ${ALPHAFOLD_DATA} \
--uniref90_database_path ${ALPHAFOLD_DATA}/uniref90/uniref90.fasta \
--mgnify_database_path ${ALPHAFOLD_DATA}/mgnify/mgy_clusters.fa \
--template_mmcif_dir ${ALPHAFOLD_DATA}/pdb_mmcif/mmcif_files \
--obsolete_pdbs_path ${ALPHAFOLD_DATA}/pdb_mmcif/obsolete.dat \
--bfd_database_path ${ALPHAFOLD_DATA}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--uniref30_database_path ${ALPHAFOLD_DATA}/uniref30/UniRef30_2021_03 \
--db_preset ${PRESET_DB} \
--uniprot_database_path ${ALPHAFOLD_DATA}/uniprot/uniprot.fa \
--pdb_seqres_database_path ${ALPHAFOLD_DATA}/pdb_seqres/pdb_seqres.txt \
--verbosity 1
systems with multiple conformation states#
are not officially supported by AF2. You can try modifying the MSAs (--use_precomputed_msas). There is more information available here
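a sketch of the rerun pattern, assuming the --use_precomputed_msas behaviour of recent alphafold versions (MSAs found under <output_dir>/<target_name>/msas/ are reused instead of being recomputed):
Code Block (bash)
# run alphafold once to generate the MSAs, optionally edit them in
# <output_dir>/<target_name>/msas/, then rerun with:
alphafold --fasta_paths /abs/path/to/target.fasta \
          --output_dir /abs/path/to/out \
          --use_precomputed_msas=true \
          ... # remaining database arguments as in the examples above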
Template modelling scores#
The template modelling score or TM-score (a measure of similarity between two protein structures) is not calculated with the default model. To get pTM scored models you need to use the pTM specific models, ie change the --model_preset argument to monomer_ptm; see the example below.
this option is only available for monomer prediction.
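for example, starting from the srun monomer command above, only the preset line changes:
Code Block (bash)
--model_preset monomer_ptm \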
running in cpu mode only#
in order to run in CPU-only mode (aka slow, slow mode) you need to toggle OPENMM_PLATFORM to CPU, and you may want to set OPENMM_CPU_THREADS to a value that suits your needs.
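a minimal sketch of a CPU-only run (the thread count is illustrative; adjust it to your allocation):
Code Block (bash)
maestro-submit:~ > module load alphafold
maestro-submit:~ > export OPENMM_PLATFORM=CPU
maestro-submit:~ > export OPENMM_CPU_THREADS=8
maestro-submit:~ > srun --cpus-per-task=8 \
alphafold ... # same arguments as the monomer example above, no GPU requested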
in that case the container is run without the --nv context, which will lead to the following warnings:
Code Block (text)
2021-10-15 10:42:11.383112: W external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64:/.singularity.d/libs
2021-10-15 10:42:11.383146: W external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
I1015 10:42:11.383237 139628216674112 xla_bridge.py:232] Unable to initialize backend 'gpu': Failed precondition: No visible GPU devices.
I1015 10:42:11.383634 139628216674112 xla_bridge.py:232] Unable to initialize backend 'tpu': Invalid argument: TpuPlatform is not available.
W1015 10:42:11.383754 139628216674112 xla_bridge.py:236] No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
later during the run you will see more warnings about the slow execution time. This warning pops up for each model requested, during the "Running predict with shape" stage.
Code Block (text)
2021-10-15 12:33:31.794150: E external/org_tensorflow/tensorflow/compiler/xla/service/slow_operation_alarm.cc:55]
********************************
Very slow compile? If you want to file a bug, run with envvar XLA_FLAGS=--xla_dump_to=/tmp/foo and attach the results.
Compiling module jit_apply_fn__2.101482
for comparison, here are the timings of alphafold runs performed using the 1STU.fasta (68 residues), T1061.fasta (949 residues) and S-layer.fasta (1380 residues) sequences in both modes.
Runs were done on a 56 cpu machine and compared to a run on a Quadro RTX 5000 16 GB.
- for CPU mode, the command was run with OPENMM_PLATFORM=CPU and OPENMM_CPU_THREADS=56, all other environment variables left at their default values
- for GPU mode, the command was run with all environment variables left at their default values
| | CPU mode | GPU mode |
|---|---|---|
| 1STU.fasta | real 113m5.287s user 261m10.399s sys 19m36.303s | real 37m28.122s user 43m50.137s sys 2m15.193s |
| T1061.fasta | real 1188m51.609s user 11780m36.541s sys 274m50.460s | real 262m33.751s user 245m11.956s sys 29m53.662s |
| S-layer.fasta | real 2375m34.827s user 33601m2.526s sys 406m14.152s | real 107m51.450s user 389m59.774s sys 17m14.930s |
GPU memory#
Memory is going to be an issue with large protein sizes. The original publication suggests some things to try:
"Inferencing large proteins can easily exceed the memory of a single GPU. For a V100 with 16 GB of memory, we can predict the structure of proteins up to ~1,300 residues without ensembling and the 256- and 384-residue inference times are using a single GPU’s memory. "
"The memory usage is approximately quadratic in the number of residues, so a 2,500 residue protein involves using unified memory so that we can greatly exceed the memory of a single V100. In our cloud setup, a single V100 is used for computation on a 2,500 residue protein but we requested four GPUs to have sufficient memory."
The following environment variable settings may help with larger polypeptide calculations (> 1,200 aa).
Code Block (text)
TF_FORCE_UNIFIED_MEMORY=1
XLA_PYTHON_CLIENT_MEM_FRACTION=0.5
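for example, exported before one of the srun commands shown above:
Code Block (bash)
maestro-submit:~ > export TF_FORCE_UNIFIED_MEMORY=1
maestro-submit:~ > export XLA_PYTHON_CLIENT_MEM_FRACTION=0.5
maestro-submit:~ > srun -p gpu --qos=gpu --gres=gpu:A100:1 --cpus-per-task=8 \
alphafold ... # as in the monomer example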
the maestro cluster provides the following GPU cards:
- A100 with 40 GB RAM
- V100 with 32 GB RAM
- P100 with 16 GB RAM
alphafold_runner.sh#
On the maestro cluster we provide an alphafold wrapper, alphafold_runner.sh, that eases running alphafold with default parameters, requiring just the fasta file as input.
this script must be run through a slurm allocation (srun / sbatch / salloc), for example: srun -p gpu --qos=gpu --gres=gpu:A100:1 --cpus-per-task=8
see:
Code Block (bash)
Usage: alphafold_runner.sh [options] fasta_file
Wrapper script for alphafold with default values preset,
then will use specified ones
 -h | --help ... display this message and exit.
-V | --verbose ... Toggle verbosity on
(default OFF)
-d | --data_path <dir> ... Use <dir> for alphafold data location.
(default /opt/gensoft/data/alphafold/2.3.0)
-i | --is_prokaryote_list <values>... Optional for multimer system
<values> list should contain a boolean for each fasta specifying true where the target complex is from a prokaryote, and false where it is not, or where the origin is unknown
-m | --models_preset <monomer|monomer_casp14|monomer_ptm|multimer>
... Choose preset model configuration
monomer model, monomer model with extra ensembling, monomer model with pTM head, or multimer model
(default 'monomer')
-o | --out <dir> ... Use <dir> for OUTDIR.
(default current working directory)
will be created if does not exist
-p | --db_preset <full_dbs|reduced_dbs>
... Choose preset MSA database configuration.
(default 'full_dbs')
preset may be:
reduced_dbs: smaller genetic database config.
full_dbs: full genetic database config.
-r | --relax_gpu ... Relax on GPU, default NO
-t | --template_date <template> ... Use <template> as template date.
(default '2020-05-14')
NB to avoid problems, use the full path to the fasta sequence file and to the results directory location.
running for example the following command
Code Block (text)
maestro-submit:~ > alphafold_runner.sh $HOME/foo.fasta
will in fact run the following alphafold command.
Code Block (bash)
alphafold --fasta_paths /pasteur/appa/homes/edeveaud/foo.fasta \
--max_template_date 2020-05-14 \
--output_dir /pasteur/appa/homes/edeveaud \
--model_preset monomer \
--data_dir ${ALPHAFOLD_DATA} \
--uniref90_database_path ${ALPHAFOLD_DATA}/uniref90/uniref90.fasta \
--mgnify_database_path ${ALPHAFOLD_DATA}/mgnify/mgy_clusters.fa \
--pdb70_database_path ${ALPHAFOLD_DATA}/pdb70/pdb70 \
--template_mmcif_dir ${ALPHAFOLD_DATA}/pdb_mmcif/mmcif_files \
--obsolete_pdbs_path ${ALPHAFOLD_DATA}/pdb_mmcif/obsolete.dat \
--bfd_database_path ${ALPHAFOLD_DATA}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--uniref30_database_path ${ALPHAFOLD_DATA}/uniref30/UniRef30_2021_03 \
--db_preset full_dbs \
--use_gpu_relax=False
while
Code Block (text)
maestro-submit:~ > alphafold_runner.sh -m multimer $HOME/foo.fasta
will run the following alphafold command.
Code Block (bash)
alphafold --fasta_paths /pasteur/appa/homes/edeveaud/foo.fasta \
--max_template_date 2020-05-14 \
--output_dir /pasteur/appa/homes/edeveaud \
--model_preset multimer \
--data_dir ${ALPHAFOLD_DATA} \
--uniref90_database_path ${ALPHAFOLD_DATA}/uniref90/uniref90.fasta \
--mgnify_database_path ${ALPHAFOLD_DATA}/mgnify/mgy_clusters.fa \
--template_mmcif_dir ${ALPHAFOLD_DATA}/pdb_mmcif/mmcif_files \
--obsolete_pdbs_path ${ALPHAFOLD_DATA}/pdb_mmcif/obsolete.dat \
--bfd_database_path ${ALPHAFOLD_DATA}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--uniref30_database_path ${ALPHAFOLD_DATA}/uniref30/UniRef30_2021_03 \
--db_preset full_dbs \
--use_gpu_relax=False \
--uniprot_database_path ${ALPHAFOLD_DATA}/uniprot/uniprot.fa \
--pdb_seqres_database_path ${ALPHAFOLD_DATA}/pdb_seqres/pdb_seqres.txt
by playing with the options you can change any default value of your alphafold command, eg change the template date and use template modelling scores
Code Block (text)
maestro-submit:~ > alphafold_runner.sh -t 2021-01-01 -m monomer_ptm $HOME/foo.fasta
will run
Code Block (bash)
alphafold --fasta_paths /pasteur/appa/homes/edeveaud/foo.fasta \
--max_template_date 2021-01-01 \
--output_dir /home/edeveaud \
--model_preset monomer_ptm \
--data_dir ${ALPHAFOLD_DATA} \
--uniref90_database_path ${ALPHAFOLD_DATA}/uniref90/uniref90.fasta \
--mgnify_database_path ${ALPHAFOLD_DATA}/mgnify/mgy_clusters.fa \
--pdb70_database_path ${ALPHAFOLD_DATA}/pdb70/pdb70 \
--template_mmcif_dir ${ALPHAFOLD_DATA}/pdb_mmcif/mmcif_files \
--obsolete_pdbs_path ${ALPHAFOLD_DATA}/pdb_mmcif/obsolete.dat \
--bfd_database_path ${ALPHAFOLD_DATA}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--uniref30_database_path ${ALPHAFOLD_DATA}/uniref30/UniRef30_2021_03 \
--db_preset full_dbs \
--use_gpu_relax=False
Alphafold references:#
- alphafold home: https://github.com/deepmind/alphafold
- alphafold installed versions: alphafold/2.0.1 alphafold/2.1.1 alphafold/2.2.0 alphafold/2.3.0 alphafold/2.3.2
- alphafold reference: Jumper, J., Evans, R., Pritzel, A. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Alphafold available data#
| data | version | content |
|---|---|---|
| ${ALPHAFOLD_DATA}/params | 2021-07-14 | CASP14/pTM models |
| ${ALPHAFOLD_DATA}/mgnify | 2019_05 | MGnify protein sequence database |
| ${ALPHAFOLD_DATA}/uniref30 | 2018_08 | databases cluster UniProtKB sequences at the level of 30% pairwise sequence identity |
| ${ALPHAFOLD_DATA}/uniref90 | 2021_03 | databases cluster UniProtKB sequences at the level of 90% pairwise sequence identity |
| ${ALPHAFOLD_DATA}/pdb70 | 210901 | PDB70 database |
| ${ALPHAFOLD_DATA}/pdb_mmcif | NA | PDB in mmCIF format, see: https://www.ebi.ac.uk/pdbe/docs/documentation/mmcif.html |
| ${ALPHAFOLD_DATA}/small_bfd | NA | small Big Fantastic Database, see: https://bfd.mmseqs.com/ |
| ${ALPHAFOLD_DATA}/bfd | NA | full Big Fantastic Database, see: https://bfd.mmseqs.com/ |
| ${ALPHAFOLD_DATA}/uniprot | NA | concatenation of uniprot_trembl and uniprot_sprot |
| ${ALPHAFOLD_DATA}/pdb_seqres | NA | PDB SEQRES records |
NB pdb_mmcif is regularly updated and will be kept up to date.
GPU usage#
as stated before, alphafold will run by default on all available GPUs but will only use one.
here are some details that emphasize this point.
when CUDA_VISIBLE_DEVICES is not set, alphafold will grab all available GPUs on the host without using them; just the first one is used. see the nvidia-smi output:
Code Block (text)
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1596461 C python 14919MiB |
| 1 N/A N/A 1596461 C python 255MiB |
| 2 N/A N/A 1596461 C python 255MiB |
| 3 N/A N/A 1596461 C python 255MiB |
| 4 N/A N/A 1596461 C python 255MiB |
| 5 N/A N/A 1596461 C python 255MiB |
| 6 N/A N/A 1596461 C python 255MiB |
| 7 N/A N/A 1596461 C python 255MiB |
+-----------------------------------------------------------------------------+
and here is the memory consumption per GPU graph.
you can note that GPU memory, like GPU usage, is only handled on 1 GPU unit. So there is no need to request more than 1 (one) GPU in your allocation.
Alphafold data repository structure#
the database directory looks like this:
Code Block (text)
├── bfd
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffdata
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffindex
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffdata
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata
│ └── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex
├── mgnify
│ └── mgy_clusters.fa
├── params
│ ├── LICENSE
│ ├── params_model_1.npz
│ ├── params_model_multimer_1.npz
│ ├── params_model_1_ptm.npz
│ ├── params_model_2.npz
│ ├── params_model_multimer_2.npz
│ ├── params_model_2_ptm.npz
│ ├── params_model_3.npz
│ ├── params_model_multimer_3.npz
│ ├── params_model_3_ptm.npz
│ ├── params_model_4.npz
│ ├── params_model_multimer_4.npz
│ ├── params_model_4_ptm.npz
│ ├── params_model_5.npz
│ ├── params_model_multimer_5.npz
│ └── params_model_5_ptm.npz
├── pdb70
│ ├── md5sum
│ ├── pdb70_a3m.ffdata
│ ├── pdb70_a3m.ffindex
│ ├── pdb70_clu.tsv
│ ├── pdb70_cs219.ffdata
│ ├── pdb70_cs219.ffindex
│ ├── pdb70_hhm.ffdata
│ ├── pdb70_hhm.ffindex
│ └── pdb_filter.dat
├── pdb_mmcif
│ ├── mmcif_files
│ │ └── *.cif # warning: zillions of files
│ └── obsolete.dat
├── pdb_seqres
│ └── pdb_seqres.txt
├── small_bfd
│ └── bfd-first_non_consensus_sequences.fasta
├── uniclust30
│ └── uniclust30_2018_08
│ ├── uniclust30_2018_08.cs219
│ ├── uniclust30_2018_08.cs219.sizes
│ ├── uniclust30_2018_08_a3m.ffdata
│ ├── uniclust30_2018_08_a3m.ffindex
│ ├── uniclust30_2018_08_a3m_db@
│ ├── uniclust30_2018_08_a3m_db.index
│ ├── uniclust30_2018_08_cs219.ffdata
│ ├── uniclust30_2018_08_cs219.ffindex
│ ├── uniclust30_2018_08_hhm.ffdata
│ ├── uniclust30_2018_08_hhm.ffindex
│ ├── uniclust30_2018_08_hhm_db@
│ ├── uniclust30_2018_08_hhm_db.index
│ └── uniclust30_2018_08_md5sum
├── uniprot
│ └── uniprot.fasta
└── uniref90
└── uniref90.fasta
monomer benchmarks#
we ran alphafold on the casp14 T1050 target on 4 different GPU devices: A100, P100, V100 and RTX 5000, using the following script.
NB these tests were run with alphafold/2.1.1
Code Block (bash)
#!/bin/bash
TARGET=$1 #T1050.fasta
GPU_NAME=$(nvidia-smi --query-gpu=gpu_name --format=csv,noheader | head -n 1 | tr ' ' '_')
OUTDIR=$(basename $TARGET .fasta).${GPU_NAME}
mkdir -p ${OUTDIR}
time alphafold --fasta_paths `pwd`/${TARGET} \
--max_template_date 2020-05-14 \
--output_dir `pwd`/${OUTDIR} \
--model_preset monomer \
--data_dir ${ALPHAFOLD_DATA} \
--uniref90_database_path ${ALPHAFOLD_DATA}/uniref90/uniref90.fasta \
--mgnify_database_path ${ALPHAFOLD_DATA}/mgnify/mgy_clusters.fa \
--pdb70_database_path ${ALPHAFOLD_DATA}/pdb70/pdb70 \
--template_mmcif_dir ${ALPHAFOLD_DATA}/pdb_mmcif/mmcif_files \
--obsolete_pdbs_path ${ALPHAFOLD_DATA}/pdb_mmcif/obsolete.dat \
--bfd_database_path ${ALPHAFOLD_DATA}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--uniref30_database_path ${ALPHAFOLD_DATA}/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
--db_preset=full_dbs \
--stderrthreshold debug 2>&1 | tee ${OUTDIR}/log
input = T1050.fasta from CASP14
input and outdir on appa /pasteur/appa/scratch/public/edeveaud/T1050
environment:
CUDA_VISIBLE_DEVICES=0 # only one GPU
ALPHAFOLD_DATA=/opt/gensoft/data/alphafold/2.1.1
XLA_PYTHON_CLIENT_MEM_FRACTION, TF_FORCE_UNIFIED_MEMORY: see above
| | Tesla P100 | Tesla V100 | Tesla A100 | Quadro RTX 5000 (maestro-builder) |
|---|---|---|---|---|
| XLA_PYTHON_CLIENT_MEM_FRACTION=4.0 TF_FORCE_UNIFIED_MEMORY=1 | real 166m56.273s user 386m28.066s sys 21m21.025s | real 130m45.294s user 302m26.913s sys 14m49.230s | real 84m0.195s user 262m15.817s sys 2m45.695s | real 164m38.699s user 304m52.392s sys 27m18.801s |
impact of TF_FORCE_UNIFIED_MEMORY#
as stated above, sometimes your sequence does not "fit in GPU memory", and in this case we recommend setting TF_FORCE_UNIFIED_MEMORY to 1. here are some examples
using Q4J6E5.fasta (1380 residues)
Code Block (text)
>sp|Q4J6E5|SLAA_SULAC S-layer protein A OS=Sulfolobus acidocaldarius (strain ATCC 33909 / DSM 639 / JCM 8929 / NBRC 15157 / NCIMB 11770) OX=330779 GN=slaA PE=1 SV=1
MNKLVGLLVSSLFLASILIGIAPAITTTALTPPVSAGGIQAYLLTGSGAPASGLVLFVVNVSNIQVSSSNVTNVISTVVSNIQINAKTENAQTGATTGSVTVRFPTSGYNAYYDSVDKVVFVVVSFLYPYTTTSVNIPLSYLSKYLPGLLTAQPYDETGAQVTSVSSTPFGSLIDTSTGQQILGTNPVLTSYNSYTTQANTNMQEGVVSGTLTSFTLGGQSFSGSTVPVILYAPFIFSNSPYQAGLYNPMQVNGNLGSLSSEAYYHPVIWGRALINTTLIDTYASGSVPFTFQLNYSVPGPLTINMAQLAWIASINNLPTSFTYLSYKFSNGYESFLGIISNSTQLTAGALTINPSGNFTINGKKFYVYLLVVGSTNSTTPVEYVTKLVVEYPSSTNFLPQGVTVTTSSNKYTLPVYEIGGPAGTTITLTGNWYSTPYTVQITVGSTPTLTNYVSQILLKAVAYEGINVSTTQSPYYSTAILSTPPSEISITGSSTITAQGKLTATSASATVNLLTNATLTYENIPLTQYSFNGIIVTPGYAAINGTTAMAYVIGALYNKTSDYVLSFAGSQEPMQVMNNNLTEVTTLAPFGLTLLAPSVPATETGTSPLQLEFFTVPSTSYIALVDFGLWGNLTSVTVSAYDTVNNKLSVNLGYFYGIVIPPSISTAPYNYQNFICPNNYVTVTIYDPDAVLDPYPSGSFTTSSLPLKYGNMNITGAVIFPGSSVYNPSGVFGYSNFNKGAAVTTFTYTAQSGPFSPVALTGNTNYLSQYADNNPTDNYYFIQTVNGMPVLMGGLSIVASPVSASLPSSTSSPGFMYLLPSAAQVPSPLPGMATPNYNLNIYITYKIDGATVGNNMINGLYVASQNTLIYVVPNGSFVGSNIKLTYTTTDYAVLHYFYSTGQYKVFKTVSVPNVTANLYFPSSTTPLYQLSVPLYLSEPYYGSPLPTYIGLGTNGTSLWNSPNYVLFGVSAVQQYLGFIKSISVTLSNGTTVVIPLTTSNMQTLFPQLVGQELQACNGTFQFGISITGLEKLLNLNVQQLNNSILSVTYHDYVTGETLTATTKLVALSTLSLVAKGAGVVEFLLTAYPYTGNITFAPPWFIAENVVKQPFMTYSDLQFAKTNPSAILSLSTVNITVVGLGGKASVYYNSTSGQTVITNIYGQTVATLSGNVLPTLTELAAGNGTFTGSLQFTIVPNNTVVQIPSSLTKTSFAVYTNGSLAIVLNGKAYSLGPAGLFLLPFVTYTGSAIGANATAIITVSDGVGTSTTQVPITAENFTPIRLAPFQVPAQVPLPNAPKLKYEYNGSIVITPQQQVLKIYVTSILPYPQEFQIQAFVYEASQFNVHTGSPTAAPVYFSYSAVRAYPALGIGTSVPNLLVYV
we ran alphafold on 2 different cards
- Quadro RTX 5000, 16 GB memory
- Tesla A100, 40 GB memory
| | RTX 5000 | A100 |
|---|---|---|
| TF_FORCE_UNIFIED_MEMORY=0 | Failure RuntimeError: Resource exhausted: Out of memory while trying to allocate 14752930272 bytes. | real 107m51.450s user 389m59.774s sys 17m14.930s |
| TF_FORCE_UNIFIED_MEMORY=1 | real 283m30.444s user 350m40.434s sys 136m7.659s | real 108m7.859s user 389m27.946s sys 17m33.175s |
NB: the time to look at is the real one.
the first run shows that the RTX 5000 does not have enough memory to handle this structure modelling, while the A100 has sufficient memory.
the second run shows that setting TF_FORCE_UNIFIED_MEMORY=1 allows the run to complete successfully. One can note that the impact on the A100 card is negligible.
multimer benchmarks#
we ran multimer prediction using the following sequences under the same conditions as the monomer benchmarks
max2.fasta (text)
>chain A
SKAWNRYRLPNTLKPDSYRVTLRPYLTPNDRGLYVFKGSSTVRFTCKEATDVIIIHSKKLNYTLSQGHRVVLRGVGGSQPPDIDKTELVEPTEYLVVHLKGSLVKDSQYEMDSEFEGELADDLAGFYRSEYMEGNVRKVVATTQMQAADARKSFPCFDEPAMKAEFNITLIHPKDLTALSNMLPKGPSTPLPEDPNWNVTEFHTTPKMSTYLLAFIVSEFDYVEKQASNGVLIRIWARPSAIAAGHGDYALNVTGPILNFFAGHYDTPYPLPKSDQIGLPDFNAGAMENWGLVTYRENSLLFDPLSSSSSNKERVVTVIAHELAHQWFGNLVTIEWWNDLWLNEGFASYVEYLGADYAEPTWNLKDLMVLNDVYRVMAVDALASSHPLSTPASEINTPAQISELFDAISYSKGASVLRMLSSFLSEDVFKQGLASYLHTFAYQNTIYLNLWDHLQEAVNNRSIQLPTTVRDIMNRWTLQMGFPVITVDTSTGTLSQEHFLLDPDSNVTRPSEFNYVWIVPITSIRDGRQQQDYWLIDVRAQNDLFSTSGNEWVLLNLNVTGYYRVNYDEENWRKIQTQLQRDHSAIPVINRAQIINDAFNLASAHKVPVTLALNNTLFLIEERQYMPWEAALSSLSYFKLMFDRSEVYGPMKNYLKKQVTPLFIHFRNNTNNWREIPENLMDQYSEVNAISTACSNGVPECEEMVSGLFKQWMENPNNNPIHPNLRSTVYCNAIAQGGEEEWDFAWEQFRNATLVNEADKLRAALACSKELWILNRYLSYTLNPDLIRKQDATSTIISITNNVIGQGLVWDFVQSNWKKLFNDYGGGSFSFSNLIQAVTRRFSTEYELQQLEQFKKDNEETGFGSGTRALEQALEKTKANIKWVKENKEVVLQWFTENSK
>chain B
KHMFIVLYVNFKLRSGVGRCYNCRPAVVNITLANFNETKGPLCVDTSHFTTQFVGAKFDRWSASINTGNCPFSFGKVNNFVKFGSVCFSLKDIPGGCAMPIMANLANLNSHTIGTLYVSWSDGDGITGVPQP
sequences were taken from: NATURE COMMUNICATIONS | 8: 1735 | DOI: 10.1038/s41467-017-01706-x
| | Tesla P100 | Tesla V100 | Tesla A100 | Quadro RTX 5000 (maestro-builder) |
|---|---|---|---|---|
| XLA_PYTHON_CLIENT_MEM_FRACTION=4.0 TF_FORCE_UNIFIED_MEMORY=1 | real 195m51.152s user 415m53.242s sys 29m30.718s | real 149m38.843s user 311m14.591s sys 19m35.380s | real 153m51.796s user 426m15.276s sys 13m49.630s | real 186m18.490s user 347m39.079s sys 39m4.976s |
the input, outdir and script used to run the alphafold multimer prediction are available on appa: /pasteur/appa/scratch/public/edeveaud/max2
How to view predicted structures#
once run, alphafold will provide you with N models that you may want to take a look at.
various options provide visualisation capabilities, for example:
- software that requires local installation
- pymol: https://pymol.org/2/ should run on any computer
- chimeraX: https://www.cgl.ucsf.edu/chimerax/ available for MacOS, Windows and various linux flavours
- VTX: https://vtx.drugdesign.fr/ available for Windows and various linux flavours
- online web based tools
- https://nglviewer.org/ngl/
- https://www.rcsb.org/3d-view
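for example, to open the best-ranked model with pymol on your workstation (this assumes the standard alphafold output layout, <output_dir>/<target_name>/ranked_0.pdb):
Code Block (bash)
pymol /pasteur/appa/scratch/public/edeveaud/1STU_out/1STU/ranked_0.pdb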
here is a view of the 5 models predicted for CASP14 T1031 using alphafold. images were generated using ChimeraX
| T1031_ranked_0 | T1031_ranked_1 | T1031_ranked_2 | T1031_ranked_3 | T1031_ranked_4 |
|---|---|---|---|---|
accuracy#
out of curiosity we ran an alphafold prediction on the Galactose-1-phosphate uridylyltransferase homodimer and compared the prediction with the known X-Ray structure available in the PDB.
both structures, the reference ones and the predictions, were loaded in ChimeraX and then "aligned" with MatchMaker.
here you will find MatchMaker's RMS for each alphafold ranked prediction vs pdb1hxp.ent and pdb1hxq.ent
| | pdb1hxp.ent | pdb1hxq.ent |
|---|---|---|
| ranked_0.pdb | Matchmaker pdb1hxp.ent, chain A (#1) with ranked_0.pdb, chain A (#2), sequence alignment score = 1806.7 RMSD between 331 pruned atom pairs is 0.302 angstroms; (across all 340 pairs: 1.458) | Matchmaker pdb1hxq.ent, chain B (#1) with ranked_0.pdb, chain A (#2), sequence alignment score = 1810.6 RMSD between 330 pruned atom pairs is 0.428 angstroms; (across all 332 pairs: 0.536) |
| ranked_1.pdb | Matchmaker pdb1hxp.ent, chain A (#1) with ranked_1.pdb, chain A (#3), sequence alignment score = 1794.7 RMSD between 331 pruned atom pairs is 0.288 angstroms; (across all 340 pairs: 1.448) | Matchmaker pdb1hxq.ent, chain A (#1) with ranked_1.pdb, chain A (#3), sequence alignment score = 1819.6 RMSD between 332 pruned atom pairs is 0.369 angstroms; (across all 340 pairs: 1.760) |
| ranked_2.pdb | Matchmaker pdb1hxp.ent, chain A (#1) with ranked_2.pdb, chain A (#4), sequence alignment score = 1806.7 RMSD between 331 pruned atom pairs is 0.313 angstroms; (across all 340 pairs: 1.458) | Matchmaker pdb1hxq.ent, chain A (#1) with ranked_2.pdb, chain B (#4), sequence alignment score = 1819.6 RMSD between 332 pruned atom pairs is 0.388 angstroms; (across all 340 pairs: 1.799) |
| ranked_3.pdb | Matchmaker pdb1hxp.ent, chain A (#1) with ranked_3.pdb, chain B (#5), sequence alignment score = 1803.1 RMSD between 331 pruned atom pairs is 0.305 angstroms; (across all 340 pairs: 1.474) | Matchmaker pdb1hxq.ent, chain A (#1) with ranked_3.pdb, chain A (#5), sequence alignment score = 1819.6 RMSD between 332 pruned atom pairs is 0.388 angstroms; (across all 340 pairs: 1.828) |
| ranked_4.pdb | Matchmaker pdb1hxp.ent, chain A (#1) with ranked_4.pdb, chain A (#6), sequence alignment score = 1806.7 RMSD between 331 pruned atom pairs is 0.281 angstroms; (across all 340 pairs: 1.430) | Matchmaker pdb1hxq.ent, chain B (#1) with ranked_4.pdb, chain A (#6), sequence alignment score = 1810.6 RMSD between 331 pruned atom pairs is 0.417 angstroms; (across all 332 pairs: 0.505) |
just for the pleasure of the eyes, below is an overview of the spinning MatchMaker result (click to view)
blue: alphafold ranked_4.pdb prediction
brown: pdb1hxq.ent X-Ray structure