RoseTTAFold
!!!!! This page is a work in progress. It will evolve in the next few days. !!!!!
!!!!! RoseTTAFold is currently available for testing purposes !!!!!
!!!!! ONLY monomer prediction has been tested so far !!!!!
!!!!! complex structure prediction may (or may not) work...... !!!!!
RoseTTAFold is available on the maestro cluster under the module name RoseTTAFold/1.0.0
we provide the reference data downloaded according to the RoseTTAFold README (point 5: Download sequence and structure databases)
environment description#
Once RoseTTAFold/1.0.0 is loaded you can access the RoseTTAFold data through the ROSETTAFOLD_DATA environment variable.
As always, see the module show RoseTTAFold/1.0.0 command output to display the environment changes made by loading the module.
Code Block (bash)
maestro-submit:~ > module show test/RoseTTAFold/1.0.0
-------------------------------------------------------------------
/opt/gensoft/modules/test/RoseTTAFold/1.0.0:
module-whatis {Set environnement for RoseTTAFold (1.0.0)}
prepend-path PATH /opt/gensoft/exe/RoseTTAFold/1.0.0/bin
setenv ROSETTAFOLD_DATA /opt/gensoft/data/RoseTTAFold/1.0.0
setenv HHTOOLS_MEMORY 64
-------------------------------------------------------------------
- ROSETTAFOLD_DATA: location of the RoseTTAFold reference data
- HHTOOLS_MEMORY: requested memory in GB for the hhsuite tools used by RoseTTAFold (hhblits, hhsearch)
You can change the HHTOOLS_MEMORY environment variable value to suit your needs, BUT keep in mind that on Maestro it should comply with the memory requested for your slurm allocation via the --mem slurm option (run_e2e_ver.sh, run_pyrosetta_ver.sh and the corresponding CPU versions use 64 GB as the default value).
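For instance, a minimal sketch of lowering the hhsuite memory to 32 GB and requesting a matching slurm allocation (input.fa and output_dir are just placeholders):
Code Block (bash)
module load test/RoseTTAFold/1.0.0
export HHTOOLS_MEMORY=32    # hhblits/hhsearch will now use 32 GB instead of the 64 GB default
# the --mem request matches the HHTOOLS_MEMORY value (in GB)
srun --cpus-per-task=8 --mem=${HHTOOLS_MEMORY}G run_pyrosetta_ver_cpu.sh input.fa output_dir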
On maestro we provide RoseTTAFold usable in GPU or CPU mode. The choice is made by the name of the tool: for example the '*_cpu.sh' tools run in CPU-only mode.
In order to use the standard GPU mode you must request an allocation on the GPU-capable queues/nodes. See below.
| GPU mode | CPU mode | description |
|---|---|---|
| run_e2e_ver.sh | run_e2e_ver_cpu.sh | monomer structure prediction with a single PDB output having estimated residue-wise CA-lddt at the B-factor column |
| run_pyrosetta_ver.sh | run_pyrosetta_ver_cpu.sh | monomer structure prediction with five final models having estimated CA rms error at the B-factor column |
provided tools description#
To list the tools provided by the RoseTTAFold/1.0.0 module, use the module help command.
Code Block (text)
maestro-submit:~ > module help test/RoseTTAFold/1.0.0
-------------------------------------------------------------------
Module Specific Help for /opt/gensoft/modules/test/RoseTTAFold/1.0.0:
This modulefile defines the requisite environement
needed to use package: RoseTTAFold version (1.0.0)
official implementation of RoseTTAFold - Accurate prediction of protein structures and interactions using a 3-track network.
URL: https://github.com/RosettaCommons/RoseTTAFold
package provides following commands:
hhfilter
RosettaTR.py
make_joint_MSA_bacterial.py
make_msa.sh
make_ss.sh
predict_complex.py
predict_e2e.py
predict_pyRosetta.py
run_e2e_ver.sh
run_e2e_ver_cpu.sh
run_pyrosetta_ver.sh
run_pyrosetta_ver_cpu.sh
-------------------------------------------------------------------
we provide both RoseTTAFold tools for convenience plus 2 in-house tools. The main tools to run are:
- run_pyrosetta_ver.sh and run_pyrosetta_ver_cpu.sh: monomer structure prediction with five final models having estimated CA rms error at the B-factor column (GPU and CPU versions respectively)
- run_e2e_ver.sh and run_e2e_ver_cpu.sh: monomer structure prediction with a single PDB output having estimated residue-wise CA-lddt at the B-factor column (GPU and CPU versions respectively)
- make_joint_MSA_bacterial.py: make paired alignments for complex protein predictions (see below)
- hhfilter: filter the paired alignment, used for complex protein predictions (see below)
- predict_complex.py: run complex structure prediction
NB: our installation runs through a singularity container.
By default the container will bind mount the following paths:
- /pasteur: giving access to the pasteur tree, e.g. projects, scratch, homes, etc.
- $HOME: your home directory
- /local/databases: giving access to the data banks
- /local/scratch: bind mounted to /tmp in the container, to respect the temporary files location policy used on maestro
- ${ROSETTAFOLD_DATA}: of course the RoseTTAFold data are bind mounted in the container
NB: you can bind mount other volumes on the container using the SINGULARITY_BINDPATH environment variable.
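For example, a minimal sketch binding an additional (hypothetical) directory inside the container before running a CPU prediction:
Code Block (bash)
# /my/extra/data is a hypothetical path, replace it with the directory you need inside the container
export SINGULARITY_BINDPATH="/my/extra/data"
srun --cpus-per-task=8 --mem=${HHTOOLS_MEMORY}G run_pyrosetta_ver_cpu.sh input.fa output_dir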
IMPORTANT NOTES
- RoseTTAFold uses only 1 GPU. So be sure to ask for only 1 (ONE) GPU in your job allocations.
- remember to request memory according to the HHTOOLS_MEMORY value (in GB), else your job may be killed by slurm, as the default allocation grants 40GB/CPU.
- RoseTTAFold runs some threaded jobs (hhblits, hhsearch) or runs jobs in parallel (8 threads / parallel jobs by default). On Maestro this number of threads/jobs is automatically set to SLURM_CPUS_PER_TASK (aka the number of CPUs requested in your allocation through the -c/--cpus-per-task slurm option).
- RoseTTAFold MUST be run through a slurm allocation: sbatch, salloc or srun.
Running monomer RoseTTAFold prediction via srun#
For this example we will run the RoseTTAFold monomer prediction providing 5 models. One can choose to run the monomer prediction providing 1 model by replacing run_pyrosetta_ver.sh with run_e2e_ver.sh in the following command lines.
running on GPU mode:#
Code Block (bash)
maestro-submit:~ > module load test/RoseTTAFold
maestro-submit:~ > srun -p gpu --qos=gpu --gres=gpu:A100:1 --cpus-per-task=8 --mem=${HHTOOLS_MEMORY}G run_pyrosetta_ver.sh input.fa output_dir
running on CPU mode#
Code Block (bash)
maestro-submit:~ > module load test/RoseTTAFold
maestro-submit:~ > srun --cpus-per-task=8 --mem=${HHTOOLS_MEMORY}G run_pyrosetta_ver.sh input.fa output_dir
Running monomer RoseTTAFold prediction via sbatch#
running on GPU mode:#
Code Block (bash)
#!/bin/bash
#SBATCH --partition=gpu --qos=gpu
#SBATCH --gres=gpu:1 # remember 1 GPU
#SBATCH --constraint='A100:1' # target gpu card, choose between A100 or V100
#SBATCH --cpus-per-task=8 # hhblits, hhsearch default requirement
#SBATCH --mem=64G # required memory in GB; must match HHTOOLS_MEMORY (64 GB by default), #SBATCH lines are not shell-expanded
#---- Job Name
#SBATCH -J 1STU_rf_job
INPUT_FASTA=/pasteur/appa/scratch/public/edeveaud/1STU.fasta
OUTPUT_DIR=/pasteur/appa/scratch/public/edeveaud/1STU_out
PREDICTION=run_pyrosetta_ver.sh # kind of prediction to run, choose between run_pyrosetta_ver.sh or run_e2e_ver.sh
#---- do the job
module load test/RoseTTAFold/1.0.0 # make the RoseTTAFold environment available inside the batch job
${PREDICTION} ${INPUT_FASTA} ${OUTPUT_DIR}
running on CPU mode:#
Code Block (bash)
#!/bin/bash
#SBATCH --cpus-per-task=8 # hhblits, hhsearch default requirement
#SBATCH --mem=64G # required memory in GB; must match HHTOOLS_MEMORY (64 GB by default), #SBATCH lines are not shell-expanded
#---- Job Name
#SBATCH -J 1STU_rf_job
INPUT_FASTA=/pasteur/appa/scratch/public/edeveaud/1STU.fasta
OUTPUT_DIR=/pasteur/appa/scratch/public/edeveaud/1STU_out
PREDICTION=run_pyrosetta_ver_cpu.sh # kind of prediction to run, choose between run_pyrosetta_ver_cpu.sh or run_e2e_ver_cpu.sh
#---- do the job
module load test/RoseTTAFold/1.0.0 # make the RoseTTAFold environment available inside the batch job
${PREDICTION} ${INPUT_FASTA} ${OUTPUT_DIR}
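Either batch script (GPU or CPU) is then submitted with sbatch; for example, assuming it was saved under the (arbitrary) name rosettafold_1STU.sbatch:
Code Block (bash)
maestro-submit:~ > sbatch rosettafold_1STU.sbatch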
running complex protein RoseTTAFold prediction.#
How to prepare inputs for complex modeling
- Generate multiple sequence alignments for each subunit
- Make paired alignments
- It is important to generate "good" paired sequence alignments to extract coevolution signal properly.
- For bacterial proteins, you may use the make_joint_MSA_bacterial.py script to generate paired alignments. It pairs the sequences having similar UniProt accession codes.
- For eukaryotes, there's no easy way to generate paired alignments. Good luck!
- NB: runs in CPU mode only, so no need to request a GPU node
- Filter the paired alignment using hhfilter (use module load hhsuite/3.3.0 to access the hhfilter command)
- hhfilter -i paired.a3m -o filtered.a3m -id 90 (or 95) -cov 75 (or 50)
- NB: runs in CPU mode only, so no need to request a GPU node
- If you want to incorporate complex template information, please make an npz file that contains the input template features for RoseTTAFold. It should contain the following keys ("xyz_t", "t1d", "t0d").
- xyz_t: N, CA, C coordinates of complex templates (# of templ, # of residue, 3 (for N, Ca, C), 3 (xyz coord)). For the unaligned region, it should be NaN.
- t1d: 1-D features from HHsearch results (score, SS, probab columns from the atab file) (T, L, 3). For the unaligned region, it should be zeros.
- t0d: 0-D features from HHsearch (Probability/100.0, Identities/100.0, Similarity from the hhr file) (T, 3)
- Run complex structure prediction
- predict_complex.py -i filtered.a3m -o complex -Ls 218 310
- Set the numeric parameters after the -Ls argument to the lengths of each individual subunit, in the order in which they were paired.
- In this example, the aa length of subunit1 was 218 and of subunit2 was 310.
- NB: this script can be run in CPU-only mode using the --cpu command line option (see the command sketch after this list)
- You may want to run Rosetta fastrelax w/ coordinate restraints to add sidechains.
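Putting steps 3 and 4 together, here is a minimal sketch of how they could be run through slurm on maestro in CPU mode; it assumes a paired alignment paired.a3m has already been produced (e.g. with make_joint_MSA_bacterial.py for bacterial proteins) and reuses the subunit lengths 218 and 310 from the example above:
Code Block (bash)
module load test/RoseTTAFold/1.0.0
# filter the paired alignment (90% identity, 75% coverage, as suggested above)
srun --cpus-per-task=8 --mem=${HHTOOLS_MEMORY}G hhfilter -i paired.a3m -o filtered.a3m -id 90 -cov 75
# run the complex structure prediction in CPU mode (--cpu), subunit lengths 218 and 310
srun --cpus-per-task=8 --mem=${HHTOOLS_MEMORY}G predict_complex.py -i filtered.a3m -o complex -Ls 218 310 --cpu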
some numerical values#
- tests were run on an A100-SXM4-40GB GPU card on a 96-core node
- using the Tsp1 Trichoderma virens protein sequence (138 residues)
- using the S-layer protein A, Sulfolobus acidocaldarius protein (1424 residues)
- playing with the number of threads/parallel jobs run by RoseTTAFold through SLURM_CPUS_PER_TASK
here are the timings of the different jobs.
run on Tsp1.fa#
| | run_pyrosetta GPU mode | run_pyrosetta CPU mode | run_e2e GPU mode | run_e2e CPU mode |
|---|---|---|---|---|
| HHTOOLS_MEMORY=64 SLURM_CPUS_PER_TASK=8 | real 80m40.917s user 322m47.901s sys 1m57.070s | real 78m35.863s user 355m1.903s sys 13m13.849s | real 28m22.496s user 161m23.120s sys 1m9.866s | real 34m11.120s user 509m5.386s sys 72m14.860s |
| HHTOOLS_MEMORY=64 SLURM_CPUS_PER_TASK=95 | real 80m10.435s user 1760m30.599s sys 8m33.578s | real 77m53.783s user 1662m48.939s sys 18m15.265s | real 36m51.639s user 1608m41.954s sys 7m18.744s | real 39m33.383s user 1920m18.416s sys 68m7.570s |
run on S-layer.fa#
| | run_pyrosetta GPU mode | run_pyrosetta CPU mode | run_e2e GPU mode | run_e2e CPU mode |
|---|---|---|---|---|
| HHTOOLS_MEMORY=64 SLURM_CPUS_PER_TASK=8 | real 856m20.154s user 4942m42.875s sys 15m32.150s | real 1164m27.392s user 25830m2.842s sys 5418m47.984s | CUDA out of memory Tried to allocate 8.89 GiB (GPU 0; 19.62 GiB total capacity; 10.48 GiB already allocated; 5.66 GiB free; 12.53 GiB reserved in total by PyTorch) | to run |
| HHTOOLS_MEMORY=64 SLURM_CPUS_PER_TASK=95 | real 514m36.122s user 6138m27.414s sys 25m24.294s | to run | CUDA out of memory Tried to allocate 8.89 GiB (GPU 0; 19.62 GiB total capacity; 10.48 GiB already allocated; 5.67 GiB free; 12.53 GiB reserved in total by PyTorch) | to run |