ColabFold
Introduction#
ColabFold is available on the Maestro cluster under the module name ColabFold. You can find the source code here.
ColabFold is a protein folding prediction tool based on Google DeepMind's AlphaFold (https://github.com/google-deepmind/alphafold) that uses MMseqs2 (much faster than the jackhmmer used by AlphaFold) to compute multiple sequence alignments (MSAs), or retrieves them from a remote server.
Databases#
The databases described on https://colabfold.mmseqs.com/ are locally available at /opt/gensoft/data/ColabFold/<SOFTWARE_VERSION> and are referenced by the environment variable COLABFOLD_DB, which is set when the ColabFold module is loaded. These databases will be updated following the evolution of the online repositories. Please do not duplicate these DBs in your own work spaces.
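For instance, you can check where COLABFOLD_DB points after loading the module (the output path below is illustrative):
Code Block (text)
maestro-submit:~ > module load ColabFold
maestro-submit:~ > echo ${COLABFOLD_DB}
/opt/gensoft/data/ColabFold/<SOFTWARE_VERSION>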
ColabFold Workflow#
The ColabFold workflow is split into two parts:
- MSA generation on CPU / retrieval of MSAs from the remote server
- Folding prediction on GPU.
MSAs generation#
Generating MSAs locally with MMseqs2 is done via the colabfold_search tool, which takes 3 arguments.
Code Block (text)
$ colabfold_search query_file.fasta /path/to/databases msas_dir
where:
- query_file.fasta is a FASTA file containing the queries,
- /path/to/databases is the path to the ColabFold databases and indexes,
- msas_dir is the output directory for the generated MSAs.
colabfold_search, like MMseqs2, accepts files containing multiple amino acid sequences as input, e.g.:
Code Block (text)
>seq_1
CGDYHVGHDYQSVSHSGGHMWMAIQQYMCHCASPGLCFYA
>seq_2
WKCKFVAFWNRTWTLNEVPYPCVWIYGVSMWTWCTGPMQL
including complexes where proteins are separated with a colon (:).
Code Block (text)
>complex
ENLWTLRSGWIGPEFPWSLLKAVTIYHSQQFRQCEYISHH:IRDWTNKSICSKQHGPSHNYAYAEQKESWIWYHMDIKSFC
This step must run on CPU nodes, e.g.:
using srun
Code Block (text)
maestro-submit:~ > module load MMseqs2/14-7e284 Kalign/2.04 cuda/11.6 cudnn/11.x-v8.7.0.84 test/ColabFold
maestro-submit:~ > srun -p common -q fast --mem=128G -c 8 colabfold_search --threads 8 --db-load-mode 2 input.fasta ${COLABFOLD_DB} msas
NB: remember that, once loaded, the ColabFold module sets COLABFOLD_DB to the correct location.
NB: remember that when srun is called with -c n, you have to adapt the colabfold_search --threads value to the same value n.
using sbatch
Code Block (text)
#!/bin/bash
#SBATCH -c 8 # Requested cores
#SBATCH --partition=common # Partition to run in
#SBATCH --qos=fast # won't take more than 2 hours, choose fast QoS to start quickly
#SBATCH --mem=128G # Requested Memory
#SBATCH -o %j.out
#SBATCH -e %j.err
#---- load required modules
. /opt/gensoft/adm/etc/profile.d/modules.sh
module load MMseqs2/14-7e284 Kalign/2.04 cuda/11.6 cudnn/11.x-v8.7.0.84 test/ColabFold/1.5.2
#---- do the job
colabfold_search --threads ${SLURM_CPUS_PER_TASK} \
--db-load-mode 2 \
input.fasta ${COLABFOLD_DB} msas
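Save the script and submit it with sbatch (the script name colabfold_search.sh here is just an example):
Code Block (text)
maestro-submit:~ > sbatch colabfold_search.sh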
Folding predictions#
Now that the alignments have been generated and are available in the msas directory, it's time to fold. Since the folding is computed on GPUs, you have to submit your job to the gpu partition.
ColabFold does NOT support multiple GPUs. Please don't request more than one GPU per colabfold_batch invocation
The folding is done via the colabfold_batch tool that takes 2 arguments.
Code Block (text)
$ colabfold_batch msas_dir folding_prediction_dir
where
- msas_dir is the directory containing the generated MSAs,
- folding_prediction_dir is the output directory for the predictions.
Example using srun
Code Block (text)
maestro-submit:~ > module load MMseqs2/14-7e284 Kalign/2.04 cuda/11.6 cudnn/11.x-v8.7.0.84 test/ColabFold
maestro-submit:~ > srun -p gpu -q gpu --gres=gpu:1 --mem=128G colabfold_batch msas_dir folding_prediction_dir
Example using sbatch
Code Block (text)
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --qos=gpu
#SBATCH --gres=gpu:1 # remember 1 GPU
#SBATCH --mem=128G # Requested Memory
#SBATCH -o %j.out
#SBATCH -e %j.err
#---- load required modules
. /opt/gensoft/adm/etc/profile.d/modules.sh
module load MMseqs2/14-7e284 Kalign/2.04 cuda/11.6 cudnn/11.x-v8.7.0.84 test/ColabFold/1.5.2
#---- do the job
colabfold_batch msas_dir folding_prediction_dir
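Since the folding step depends on the MSA step, the two sbatch scripts can also be chained with a Slurm job dependency. A minimal sketch, assuming the scripts above were saved as colabfold_search.sh and colabfold_batch.sh (names are illustrative):
Code Block (text)
maestro-submit:~ > search_job=$(sbatch --parsable colabfold_search.sh)
maestro-submit:~ > sbatch --dependency=afterok:${search_job} colabfold_batch.sh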
Comparison with AlphaFold#
Tests were run on a local node with 64 cores / 512 GB of RAM and a Quadro RTX 5000 card (with 16 GB of memory), both tests using 64 threads.
AlphaFold was run using alphafold_runner.sh, and ColabFold through a simple batch script combining the 2 steps, which basically performs the following:
colabfold_search input DB msas_dir && colabfold_batch msas_dir folding_prediction_dir
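A minimal wrapper along those lines might look like the following sketch (this is a hypothetical script, not the exact runner used for the benchmark; the name, defaults, and output directories are assumptions):
Code Block (text)
#!/bin/bash
# colabfold_runner.sh -- hypothetical two-step wrapper: search, then fold
# usage: colabfold_runner.sh query.fasta
set -e
query=$1
base=$(basename "${query}" .fasta)
# step 1: build the MSAs locally with MMseqs2
colabfold_search --threads ${SLURM_CPUS_PER_TASK:-8} --db-load-mode 2 \
    "${query}" ${COLABFOLD_DB} "${base}_msas"
# step 2: run the folding prediction
colabfold_batch "${base}_msas" "${base}_fold"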
Code Block (text)
module load alphafold/2.3.1
time alphafold_runner.sh `pwd`/T1059 2>&1 | tee alphafold.log
alphafold_runner.sh `pwd`/T1059.fasta 2>&1 12665.19s user 2120.28s system 537% cpu 45:50.72 total
Code Block (text)
module load MMseqs2/14-7e284 Kalign/2.04 cuda/11.6 cudnn/11.x-v8.7.0.84 test/ColabFold/1.5.2
time colabfold_search `pwd`/T1059 $COLABFOLD_DB T1059_msas && colabfold_batch T1059_msas T1059_colabfold_fold
./colabfod_runner.sh T1059.fasta 32.80s user 118.87s system 66% cpu 3:47.16 total
You can see a significant difference in execution times, with MMseqs2 winning over jackhmmer/hhblits.
Moreover, ColabFold only predicts "unrelaxed" structures, while AlphaFold predicts both "relaxed" and "unrelaxed" structures (NB: this is no longer true with alphafold/2.3.2, where one can skip the relaxation step using the --models_to_relax option). Nevertheless, the predicted structures are really similar; see:
| colabfold | alphafold | colabfold vs alphafold |
|---|---|---|
| (structure image) | (structure image) | (superposition image) |