RoseTTAFold
!!!!! This page is a work in progress. It will evolve in the next few days. !!!!!
!!!!! RoseTTAFold is currently available for testing purposes !!!!!
!!!!! ONLY monomer prediction has been tested so far !!!!!
!!!!! complex structure prediction may (or may not) work...... !!!!!
RoseTTAFold is available on the maestro cluster under the module name RoseTTAFold/1.0.0
we provide the reference data downloaded according to the RoseTTAFold README (point 5: Download sequence and structure databases)
environment description#
Once RoseTTAFold/1.0.0 is loaded you can access the RoseTTAFold data through the ROSETTAFOLD_DATA environment variable.
As always, see the module show RoseTTAFold/1.0.0 command output to display the environment changes made by loading the module.
Code Block (bash)
maestro-submit:~ > module show test/RoseTTAFold/1.0.0
-------------------------------------------------------------------
/opt/gensoft/modules/test/RoseTTAFold/1.0.0:
module-whatis {Set environnement for RoseTTAFold (1.0.0)}
prepend-path PATH /opt/gensoft/exe/RoseTTAFold/1.0.0/bin
setenv ROSETTAFOLD_DATA /opt/gensoft/data/RoseTTAFold/1.0.0
setenv HHTOOLS_MEMORY 64
-------------------------------------------------------------------
- ROSETTAFOLD_DATA: location of the RoseTTAFold reference data
- HHTOOLS_MEMORY: requested memory in GB for the hhsuite tools used by RoseTTAFold (hhblits, hhsearch)
You can change the HHTOOLS_MEMORY environment variable value to suit your needs, BUT keep in mind that on Maestro it should comply with the memory requested for your slurm allocation via the --mem slurm option (run_e2e_ver.sh, run_pyrosetta_ver.sh and the corresponding CPU versions use 64 GB as the default value).
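For instance, a minimal sketch of lowering the hhsuite memory to 32 GB and requesting a matching slurm allocation (input.fa and output_dir are just placeholders):
Code Block (bash)
module load test/RoseTTAFold/1.0.0
export HHTOOLS_MEMORY=32    # hhblits/hhsearch will now use 32 GB instead of the 64 GB default
# the --mem request matches the HHTOOLS_MEMORY value (in GB)
srun --cpus-per-task=8 --mem=${HHTOOLS_MEMORY}G run_pyrosetta_ver_cpu.sh input.fa output_dir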
On maestro we provide RoseTTAFold usable in GPU or CPU mode. The choice is made by the name of the tool: for example the '*_cpu.sh' tools run in CPU-only mode.
In order to use the standard GPU mode you must request an allocation on the GPU-capable queues/nodes. See below.
| GPU mode | CPU mode | description |
|---|---|---|
| run_e2e_ver.sh | run_e2e_ver_cpu.sh | monomer structure prediction with a single PDB output having estimated residue-wise CA-lddt at the B-factor column |
| run_pyrosetta_ver.sh | run_pyrosetta_ver_cpu.sh | monomer structure prediction with five final models having estimated CA rms error at the B-factor column |
provided tools description#
To list the tools provided by the RoseTTAFold/1.0.0 module, use the module help command.
Code Block (text)
maestro-submit:~ > module help test/RoseTTAFold/1.0.0
-------------------------------------------------------------------
Module Specific Help for /opt/gensoft/modules/test/RoseTTAFold/1.0.0:
This modulefile defines the requisite environement
needed to use package: RoseTTAFold version (1.0.0)
official implementation of RoseTTAFold - Accurate prediction of protein structures and interactions using a 3-track network.
URL: https://github.com/RosettaCommons/RoseTTAFold
package provides following commands:
hhfilter
RosettaTR.py
make_joint_MSA_bacterial.py
make_msa.sh
make_ss.sh
predict_complex.py
predict_e2e.py
predict_pyRosetta.py
run_e2e_ver.sh
run_e2e_ver_cpu.sh
run_pyrosetta_ver.sh
run_pyrosetta_ver_cpu.sh
-------------------------------------------------------------------
we provide both RoseTTAFold tools for convenience plus 2 in-house tools. The main tools to run are:
- run_pyrosetta_ver.sh and run_pyrosetta_ver_cpu.sh: monomer structure prediction with five final models having estimated CA rms error at the B-factor column (GPU and CPU versions respectively)
- run_e2e_ver.sh and run_e2e_ver_cpu.sh: monomer structure prediction with a single PDB output having estimated residue-wise CA-lddt at the B-factor column (GPU and CPU versions respectively)
- make_joint_MSA_bacterial.py: make paired alignments for complex protein predictions (see below)
- hhfilter: filter the paired alignment, used for complex protein predictions (see below)
- predict_complex.py: run complex structure prediction
NB: our installation runs through a singularity container.
By default the container will bind mount the following paths:
- /pasteur: giving access to the pasteur tree, e.g. projects, scratch, homes, etc.
- $HOME: your home directory
- /local/databases: giving access to the data banks
- /local/scratch: bind mounted to /tmp in the container, to respect the temporary files location policy used on maestro
- ${ROSETTAFOLD_DATA}: of course the RoseTTAFold data are bind mounted in the container
NB: you can bind mount other volumes on the container using the SINGULARITY_BINDPATH environment variable.
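For example, a minimal sketch binding an additional (hypothetical) directory inside the container before running a CPU prediction:
Code Block (bash)
# /my/extra/data is a hypothetical path, replace it with the directory you need inside the container
export SINGULARITY_BINDPATH="/my/extra/data"
srun --cpus-per-task=8 --mem=${HHTOOLS_MEMORY}G run_pyrosetta_ver_cpu.sh input.fa output_dir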
IMPORTANT NOTES
- RoseTTAFold uses only 1 GPU. So be sure to ask for only 1 (ONE) GPU in your job allocations.
- remember to request memory according to the HHTOOLS_MEMORY value (in GB), else your job may be killed by slurm, as the default allocation grants 40GB/CPU.
- RoseTTAFold runs some threaded jobs (hhblits, hhsearch) or runs jobs in parallel (8 threads / parallel jobs by default). On Maestro this number of threads/jobs is automatically set to SLURM_CPUS_PER_TASK (aka the number of CPUs requested in your allocation through the -c/--cpus-per-task slurm option).
- RoseTTAFold MUST be run through a slurm allocation: sbatch, salloc or srun.
Running monomer RoseTTAFold prediction via srun#
For this example we will run the RoseTTAFold monomer prediction providing 5 models. One can choose to run the monomer prediction providing 1 model by replacing run_pyrosetta_ver.sh with run_e2e_ver.sh in the following command lines.
running on GPU mode:#
Code Block (bash)
maestro-submit:~ > module load test/RoseTTAFold
maestro-submit:~ > srun -p gpu --qos=gpu --gres=gpu:A100:1 --cpus-per-task=8 --mem=${HHTOOLS_MEMORY}G run_pyrosetta_ver.sh input.fa output_dir
running on CPU mode#
Code Block (bash)
maestro-submit:~ > module load test/RoseTTAFold
maestro-submit:~ > srun --cpus-per-task=8 --mem=${HHTOOLS_MEMORY}G run_pyrosetta_ver.sh input.fa output_dir
Running monomer RoseTTAFold prediction via sbatch#
running on GPU mode:#
Code Block (bash)
#!/bin/bash
#SBATCH --partition=gpu --qos=gpu
#SBATCH --gres=gpu:1 # remember 1 GPU
#SBATCH --constraint='A100:1' # target gpu card, choose between A100 or V100
#SBATCH --cpus-per-task=8 # hhblits, hhsearch default requirement
#SBATCH --mem=64G # required memory in GB; must match HHTOOLS_MEMORY (64 GB by default), #SBATCH lines are not shell-expanded
#---- Job Name
#SBATCH -J 1STU_rf_job
INPUT_FASTA=/pasteur/appa/scratch/public/edeveaud/1STU.fasta
OUTPUT_DIR=/pasteur/appa/scratch/public/edeveaud/1STU_out
PREDICTION=run_pyrosetta_ver.sh # kind of prediction to run, choose between run_pyrosetta_ver.sh or run_e2e_ver.sh
#---- do the job
module load test/RoseTTAFold/1.0.0 # make the RoseTTAFold environment available inside the batch job
${PREDICTION} ${INPUT_FASTA} ${OUTPUT_DIR}
running on CPU mode:#
Code Block (bash)
#!/bin/bash
#SBATCH --cpus-per-task=8 # hhblits, hhsearch default requirement
#SBATCH --mem=64G # required memory in GB; must match HHTOOLS_MEMORY (64 GB by default), #SBATCH lines are not shell-expanded
#---- Job Name
#SBATCH -J 1STU_rf_job
INPUT_FASTA=/pasteur/appa/scratch/public/edeveaud/1STU.fasta
OUTPUT_DIR=/pasteur/appa/scratch/public/edeveaud/1STU_out
PREDICTION=run_pyrosetta_ver_cpu.sh # kind of prediction to run, choose between run_pyrosetta_ver_cpu.sh or run_e2e_ver_cpu.sh
#---- do the job
module load test/RoseTTAFold/1.0.0 # make the RoseTTAFold environment available inside the batch job
${PREDICTION} ${INPUT_FASTA} ${OUTPUT_DIR}
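Either batch script (GPU or CPU) is then submitted with sbatch; for example, assuming it was saved under the (arbitrary) name rosettafold_1STU.sbatch:
Code Block (bash)
maestro-submit:~ > sbatch rosettafold_1STU.sbatch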
running complex protein RoseTTAFold prediction.#
How to prepare inputs for complex modeling
- Generate multiple sequence alignments for each subunit
- Make paired alignments
- It is important to generate "good" paired sequence alignments to extract coevolution signal properly.
- For bacterial proteins, you may use the make_joint_MSA_bacterial.py script to generate paired alignments. It pairs the sequences having similar UniProt accession codes.
- For eukaryotes, there's no easy way to generate paired alignments. Good luck!
- NB: runs in CPU mode only, so no need to request a GPU node
- Filter the paired alignment using hhfilter (use module load hhsuite/3.3.0 to access the hhfilter command)
- hhfilter -i paired.a3m -o filtered.a3m -id 90 (or 95) -cov 75 (or 50)
- NB: runs in CPU mode only, so no need to request a GPU node
- If you want to incorporate complex template information, please make an npz file that contains the input template features for RoseTTAFold. It should contain the following keys ("xyz_t", "t1d", "t0d").
- xyz_t: N, CA, C coordinates of complex templates (# of templ, # of residue, 3 (for N, Ca, C), 3 (xyz coord)). For the unaligned region, it should be NaN.
- t1d: 1-D features from HHsearch results (score, SS, probab columns from the atab file) (T, L, 3). For the unaligned region, it should be zeros.
- t0d: 0-D features from HHsearch (Probability/100.0, Identities/100.0, Similarity from the hhr file) (T, 3)
- Run complex structure prediction
- predict_complex.py -i filtered.a3m -o complex -Ls 218 310
- Set the numeric parameters after the -Ls argument to the lengths of each individual subunit, in the order in which they were paired.
- In this example, the aa length of subunit1 was 218 and of subunit2 was 310.
- NB: this script can be run in CPU-only mode using the --cpu command line option (see the command sketch after this list)
- You may want to run Rosetta fastrelax w/ coordinate restraints to add sidechains.
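Putting steps 3 and 4 together, here is a minimal sketch of how they could be run through slurm on maestro in CPU mode; it assumes a paired alignment paired.a3m has already been produced (e.g. with make_joint_MSA_bacterial.py for bacterial proteins) and reuses the subunit lengths 218 and 310 from the example above:
Code Block (bash)
module load test/RoseTTAFold/1.0.0
# filter the paired alignment (90% identity, 75% coverage, as suggested above)
srun --cpus-per-task=8 --mem=${HHTOOLS_MEMORY}G hhfilter -i paired.a3m -o filtered.a3m -id 90 -cov 75
# run the complex structure prediction in CPU mode (--cpu), subunit lengths 218 and 310
srun --cpus-per-task=8 --mem=${HHTOOLS_MEMORY}G predict_complex.py -i filtered.a3m -o complex -Ls 218 310 --cpu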
some numerical values#
- tests were run on an A100-SXM4-40GB GPU card on a 96-core node
- using the Tsp1 Trichoderma virens protein sequence (138 residues)
- using the S-layer protein A, Sulfolobus acidocaldarius protein (1424 residues)
- playing with the number of threads/parallel jobs run by RoseTTAFold through SLURM_CPUS_PER_TASK
here are the timings of the different jobs.
run on Tsp1.fa#
| | run_pyrosetta GPU mode | run_pyrosetta CPU mode | run_e2e GPU mode | run_e2e CPU mode |
|---|---|---|---|---|
| HHTOOLS_MEMORY=64 SLURM_CPUS_PER_TASK=8 | real 80m40.917s user 322m47.901s sys 1m57.070s | real 78m35.863s user 355m1.903s sys 13m13.849s | real 28m22.496s user 161m23.120s sys 1m9.866s | real 34m11.120s user 509m5.386s sys 72m14.860s |
| HHTOOLS_MEMORY=64 SLURM_CPUS_PER_TASK=95 | real 80m10.435s user 1760m30.599s sys 8m33.578s | real 77m53.783s user 1662m48.939s sys 18m15.265s | real 36m51.639s user 1608m41.954s sys 7m18.744s | real 39m33.383s user 1920m18.416s sys 68m7.570s |
run on S-layer.fa#
| | run_pyrosetta GPU mode | run_pyrosetta CPU mode | run_e2e GPU mode | run_e2e CPU mode |
|---|---|---|---|---|
| HHTOOLS_MEMORY=64 SLURM_CPUS_PER_TASK=8 | real 856m20.154s user 4942m42.875s sys 15m32.150s | real 1164m27.392s user 25830m2.842s sys 5418m47.984s | CUDA out of memory Tried to allocate 8.89 GiB (GPU 0; 19.62 GiB total capacity; 10.48 GiB already allocated; 5.66 GiB free; 12.53 GiB reserved in total by PyTorch) | to run |
| HHTOOLS_MEMORY=64 SLURM_CPUS_PER_TASK=95 | real 514m36.122s user 6138m27.414s sys 25m24.294s | to run | CUDA out of memory Tried to allocate 8.89 GiB (GPU 0; 19.62 GiB total capacity; 10.48 GiB already allocated; 5.67 GiB free; 12.53 GiB reserved in total by PyTorch) | to run |