ColabFold
Introduction#
ColabFold is available on the Maestro cluster under the module name ColabFold. You can find the source code here.
ColabFold is a protein folding prediction tool based on Google DeepMind's AlphaFold (https://github.com/google-deepmind/alphafold) that uses MMseqs2 (much faster than the jackhmmer used by AlphaFold) to compute multiple sequence alignments (MSAs), or retrieves them from a remote server.
Databases#
The databases described on https://colabfold.mmseqs.com/ are locally available at /opt/gensoft/data/ColabFold/<SOFTWARE_VERSION> and are referenced by the environment variable COLABFOLD_DB, which is set when the ColabFold module is loaded. These databases will be updated following the evolution of the online repositories. Please do not duplicate these DBs in your own work spaces.
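For instance, you can check where COLABFOLD_DB points after loading the module (the output path below is illustrative):
Code Block (text)
maestro-submit:~ > module load ColabFold
maestro-submit:~ > echo ${COLABFOLD_DB}
/opt/gensoft/data/ColabFold/<SOFTWARE_VERSION>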
ColabFold Workflow#
The ColabFold workflow is split into two parts:
- MSA generation on CPU / retrieval of MSAs from the remote server
- Folding prediction on GPU.
MSAs generation#
Generating MSAs locally with MMseqs2 is done via the colabfold_search tool, which takes 3 arguments.
Code Block (text)
$ colabfold_search query_file.fasta /path/to/databases msas_dir
where:
- query_file.fasta is a FASTA file containing the queries,
- /path/to/databases is the path to the ColabFold databases and indexes,
- msas_dir is the output directory for the generated MSAs.
colabfold_search, like MMseqs2, accepts files containing multiple amino acid sequences as input, e.g.:
Code Block (text)
>seq_1
CGDYHVGHDYQSVSHSGGHMWMAIQQYMCHCASPGLCFYA
>seq_2
WKCKFVAFWNRTWTLNEVPYPCVWIYGVSMWTWCTGPMQL
including complexes where proteins are separated with a colon (:).
Code Block (text)
>complex
ENLWTLRSGWIGPEFPWSLLKAVTIYHSQQFRQCEYISHH:IRDWTNKSICSKQHGPSHNYAYAEQKESWIWYHMDIKSFC
This step must run on CPU nodes, e.g.:
using srun
Code Block (text)
maestro-submit:~ > module load MMseqs2/14-7e284 Kalign/2.04 cuda/11.6 cudnn/11.x-v8.7.0.84 test/ColabFold
maestro-submit:~ > srun -p common -q fast --mem=128G -c 8 colabfold_search --threads 8 --db-load-mode 2 input.fasta ${COLABFOLD_DB} msas
NB: remember that, once loaded, the ColabFold module sets COLABFOLD_DB to the correct location.
NB: remember that when srun is called with -c n, you have to adapt the colabfold_search --threads value to the same value n.
using sbatch
Code Block (text)
#!/bin/bash
#SBATCH -c 8 # Requested cores
#SBATCH --partition=common # Partition to run in
#SBATCH --qos=fast # won't take more than 2 hours, choose fast QoS to start quickly
#SBATCH --mem=128G # Requested Memory
#SBATCH -o %j.out
#SBATCH -e %j.err
#---- load required modules
. /opt/gensoft/adm/etc/profile.d/modules.sh
module load MMseqs2/14-7e284 Kalign/2.04 cuda/11.6 cudnn/11.x-v8.7.0.84 test/ColabFold/1.5.2
#---- do the job
colabfold_search --threads ${SLURM_CPUS_PER_TASK} \
--db-load-mode 2 \
input.fasta ${COLABFOLD_DB} msas
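Save the script and submit it with sbatch (the script name colabfold_search.sh here is just an example):
Code Block (text)
maestro-submit:~ > sbatch colabfold_search.sh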
Folding predictions#
Now that the alignments have been generated and are available in the msas directory, it's time to fold. Since the folding is computed on GPUs, you have to submit your job to the gpu partition.
ColabFold does NOT support multiple GPUs. Please don't request more than one GPU per colabfold_batch invocation
The folding is done via the colabfold_batch tool that takes 2 arguments.
Code Block (text)
$ colabfold_batch msas_dir folding_prediction_dir
where
- msas_dir is the directory containing the generated MSAs,
- folding_prediction_dir is the output directory for the predictions.
Example using srun
Code Block (text)
maestro-submit:~ > module load MMseqs2/14-7e284 Kalign/2.04 cuda/11.6 cudnn/11.x-v8.7.0.84 test/ColabFold
maestro-submit:~ > srun -p gpu -q gpu --gres=gpu:1 --mem=128G colabfold_batch msas_dir folding_prediction_dir
Example using sbatch
Code Block (text)
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --qos=gpu
#SBATCH --gres=gpu:1 # remember 1 GPU
#SBATCH --mem=128G # Requested Memory
#SBATCH -o %j.out
#SBATCH -e %j.err
#---- load required modules
. /opt/gensoft/adm/etc/profile.d/modules.sh
module load MMseqs2/14-7e284 Kalign/2.04 cuda/11.6 cudnn/11.x-v8.7.0.84 test/ColabFold/1.5.2
#---- do the job
colabfold_batch msas_dir folding_prediction_dir
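Since the folding step depends on the MSA step, the two sbatch scripts can also be chained with a Slurm job dependency. A minimal sketch, assuming the scripts above were saved as colabfold_search.sh and colabfold_batch.sh (names are illustrative):
Code Block (text)
maestro-submit:~ > search_job=$(sbatch --parsable colabfold_search.sh)
maestro-submit:~ > sbatch --dependency=afterok:${search_job} colabfold_batch.sh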
Comparison with AlphaFold#
Tests were run on a local node with 64 cores / 512 GB of RAM and a Quadro RTX 5000 card (with 16 GB of memory), both tests using 64 threads.
AlphaFold was run using alphafold_runner.sh, and ColabFold through a simple batch script combining the 2 steps, which basically performs the following:
colabfold_search input DB msas_dir && colabfold_batch msas_dir folding_prediction_dir
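A minimal wrapper along those lines might look like the following sketch (this is a hypothetical script, not the exact runner used for the benchmark; the name, defaults, and output directories are assumptions):
Code Block (text)
#!/bin/bash
# colabfold_runner.sh -- hypothetical two-step wrapper: search, then fold
# usage: colabfold_runner.sh query.fasta
set -e
query=$1
base=$(basename "${query}" .fasta)
# step 1: build the MSAs locally with MMseqs2
colabfold_search --threads ${SLURM_CPUS_PER_TASK:-8} --db-load-mode 2 \
    "${query}" ${COLABFOLD_DB} "${base}_msas"
# step 2: run the folding prediction
colabfold_batch "${base}_msas" "${base}_fold"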
Code Block (text)
module load alphafold/2.3.1
time alphafold_runner.sh `pwd`/T1059 2>&1 | tee alphafold.log
alphafold_runner.sh `pwd`/T1059.fasta 2>&1 12665.19s user 2120.28s system 537% cpu 45:50.72 total
Code Block (text)
module load MMseqs2/14-7e284 Kalign/2.04 cuda/11.6 cudnn/11.x-v8.7.0.84 test/ColabFold/1.5.2
time colabfold_search `pwd`/T1059 $COLABFOLD_DB T1059_msas && colabfold_batch T1059_msas T1059_colabfold_fold
./colabfod_runner.sh T1059.fasta 32.80s user 118.87s system 66% cpu 3:47.16 total
You can see a significant difference in execution times, with MMseqs2 winning over jackhmmer/hhblits.
Moreover, ColabFold only predicts "unrelaxed" structures, while AlphaFold predicts both "relaxed" and "unrelaxed" structures (NB: this is no longer true with alphafold/2.3.2, where one can skip the relaxation step using the --models_to_relax option). Nevertheless, the predicted structures are really similar; see:
| colabfold | alphafold | colabfold vs alphafold |
|---|---|---|
| (structure image) | (structure image) | (superposition image) |