Introduction#
Bracken/2.6.2 (a companion tool of Kraken) is installed and available on Maestro.
Bracken, Bayesian Reestimation of Abundance with KrakEN, is a highly accurate statistical method that computes the abundance of species in DNA sequences from a metagenomics sample.
This program requires some tweaks to be usable on Maestro, because it tries to generate its own DBs (database.kraken, database<N>mers.kmer_distrib and database<N>mers.kraken) in the same location as the Kraken databases.
On Maestro, Kraken databases are stored in a read-only directory whose location you can get with the following command:
Code Block (text)
maestro-submit:~ > module show kraken
-------------------------------------------------------------------
/opt/gensoft/modules/kraken/2.1.1:
module-whatis {Set environnement for kraken (2.1.1)}
module-whatis topic_0637
module-whatis topic_0091
module-whatis operation_3460
prepend-path PATH /opt/gensoft/exe/kraken/2.1.1/bin
prepend-path PATH /opt/gensoft/exe/kraken/2.1.1/scripts
setenv KRAKEN2_DB_PATH /local/databases/index/kraken
setenv KRAKEN2_DEFAULT_DB minikraken
-------------------------------------------------------------------
Furthermore, Bracken DBs are generated for specific kmer and read lengths. Because these requirements differ from one user to another, we cannot provide the Bracken DBs in a centralized manner.
Choosing the database#
Kraken on Maestro provides a homebrew tool, kraken_list_dbs, that displays all the usable Kraken DBs and their locations:
Code Block (text)
maestro-submit:~ > kraken_list_dbs
-------- /local/databases/index/kraken
greengenes
rdpii
minikraken
nt
kraken_standard
silva_ssu
We will use greengenes as an example.
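Before linking anything, it can help to confirm that the chosen database name resolves to a directory under KRAKEN2_DB_PATH. The following is a minimal sketch, assuming the kraken module is loaded so that the variable is set. The function name kraken_db_dir is a hypothetical helper, not something shipped with Kraken or provided on Maestro.

```shell
# Hypothetical helper (not part of Kraken): print the directory of a Kraken DB,
# or fail with a message if it cannot be found.
# Assumes KRAKEN2_DB_PATH has been set, e.g. by loading the kraken module.
kraken_db_dir() {
    local name="$1"
    if [ -z "${KRAKEN2_DB_PATH:-}" ]; then
        echo "KRAKEN2_DB_PATH is not set; load the kraken module first" >&2
        return 1
    fi
    if [ -d "${KRAKEN2_DB_PATH}/${name}" ]; then
        printf '%s\n' "${KRAKEN2_DB_PATH}/${name}"
    else
        echo "No such Kraken DB: ${KRAKEN2_DB_PATH}/${name}" >&2
        return 1
    fi
}
```

With the module settings shown earlier, `kraken_db_dir greengenes` would be expected to print a path under /local/databases/index/kraken.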
Generate Bracken DBs#
First, create symlinks to all the files of the Kraken DB you want to work with, in a location you have write access to. For example:
Code Block (text)
maestro-submit:~ > mkdir ${HOME}/greengenes
maestro-submit:~ > pushd ${HOME}/greengenes
maestro-submit:~/greengenes > ln -s ${KRAKEN2_DB_PATH}/greengenes/* .
maestro-submit:~/greengenes > popd
You can also do this in ${APPASCRATCH}/${USER} instead of ${HOME} if you are short of space in your home directory.
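The symlink steps can be wrapped in a small function so that the same procedure works for a home directory or a scratch directory. This is only a sketch: link_kraken_db is a hypothetical name, and it assumes the source directory is given as an absolute path (otherwise the links would not resolve).

```shell
# Hypothetical helper: mirror a read-only Kraken DB into a writable directory
# by symlinking every top-level entry.
# $1 = absolute path of the source Kraken DB directory
# $2 = destination directory (created if it does not exist)
link_kraken_db() {
    local src="$1" dest="$2"
    if [ ! -d "${src}" ]; then
        echo "Source DB directory not found: ${src}" >&2
        return 1
    fi
    mkdir -p "${dest}" || return 1
    # -s: create symbolic links; -f: replace links left over from an earlier run
    ln -sf "${src}"/* "${dest}"/
}
```

For example, `link_kraken_db "${KRAKEN2_DB_PATH}/greengenes" "${HOME}/greengenes"` reproduces the commands above, and replacing the second argument with "${APPASCRATCH}/${USER}/greengenes" targets the scratch space instead.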
Next, generate the required Bracken DBs with the bracken-build command, using the parameters that suit your needs, e.g. kmer length = 100 and read length = 100:
Code Block (text)
maestro-submit:~ > bracken-build -d ${HOME}/greengenes -k 100 -l 100
>> Selected Options:
kmer length = 100
read length = 100
database = /pasteur/appa/homes/XXXX/greengenes
threads = 1
>> Checking for Valid Options...
>> Creating database.kraken [if not found]
>> kraken2 --db /pasteur/appa/homes/XXXX/greengenes --threads 1 <( find -L /pasteur/appa/homes/XXXX/greengenes/library \( -name *.fna -o -name *.fa -o -name *.fasta \) -exec cat {} + ) > /pasteur/appa/homes/XXXX/greengenes/database.kraken
greengenes/database.kraken
Loading database information... done.
1262986 sequences (1769.52 Mbp) processed in 123.921s (611.5 Kseq/m, 856.76 Mbp/m).
1262986 sequences classified (100.00%)
0 sequences unclassified (0.00%)
Finished creating database.kraken [in DB folder]
>> Creating database100mers.kmer_distrib
>>STEP 0: PARSING COMMAND LINE ARGUMENTS
Taxonomy nodes file: /pasteur/appa/homes/XXXX/greengenes/taxonomy/nodes.dmp
Seqid file: /pasteur/appa/homes/XXXX/greengenes/seqid2taxid.map
Num Threads: 1
Kmer Length: 100
Read Length: 100
>>STEP 1: READING SEQID2TAXID MAP
1262986 total sequences read
>>STEP 2: READING NODES.DMP FILE
3094 total nodes read
>>STEP 3: CONVERTING KMER MAPPINGS INTO READ CLASSIFICATIONS:
100mers, with a database built using 100mers
1262988 sequences converted (finished: )484811)
Time Elaped: 0 minutes, 48 seconds, 0.00000 microseconds
=============================
PROGRAM START TIME: 06-02-2022 07:49:46
...2900 total genomes read from kraken output file
...creating kmer counts file -- lists the number of kmers of each classification per genome
...creating kmer distribution file -- lists genomes and kmer counts contributing to each genome
PROGRAM END TIME: 06-02-2022 07:49:59
Finished creating database100mers.kraken and database100mers.kmer_distrib [in DB folder]
*NOTE: to create read distribution files for multiple read lengths,
rerun this script specifying the same database but a different read length
Bracken build complete.
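Once the build has finished, a quick check that the expected files are present can save a failed analysis later. The sketch below assumes the file names reported in the log above (database.kraken, database100mers.kraken and database100mers.kmer_distrib for a read length of 100). The function name check_bracken_db is hypothetical.

```shell
# Hypothetical helper: verify that bracken-build left its output files in the DB folder.
# $1 = Bracken DB directory
# $2 = read length that was passed to bracken-build with -l
check_bracken_db() {
    local db="$1" readlen="$2" status=0 f
    for f in database.kraken \
             "database${readlen}mers.kraken" \
             "database${readlen}mers.kmer_distrib"; do
        # -s: the file exists and is not empty
        if [ ! -s "${db}/${f}" ]; then
            echo "Missing or empty: ${db}/${f}" >&2
            status=1
        fi
    done
    return "${status}"
}
```

For the build shown above this would be called as `check_bracken_db "${HOME}/greengenes" 100`. A non-zero return status means at least one file is missing or empty.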
Bracken analysis#
Now you are ready for the final step, where you run your analysis on a Kraken report file. Note that Kraken version 2 only generates this report when it is run with the --report option.
Code Block (text)
maestro-submit:~ > bracken -d ${HOME}/greengenes -i sample_output_bracken.report -o out.bracken
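To show how the report file and the Bracken call fit together, here is a hedged sketch that chains the two tools. The function name kraken_then_bracken, the DRY_RUN variable and the output file naming are assumptions made for illustration. The options used (--db, --report and --output for kraken2, and -d, -i, -o and -r for bracken) are the standard ones of each tool, with -r giving the read length for which the Bracken DB was built.

```shell
# Hypothetical wrapper: classify reads with Kraken 2, then re-estimate abundances with Bracken.
# $1 = Bracken-ready DB directory (the symlinked copy, not the read-only original)
# $2 = reads file
# $3 = prefix for the output files
# $4 = read length used with bracken-build (defaults to 100)
# Set DRY_RUN=1 to print the commands instead of running them.
kraken_then_bracken() {
    local db="$1" reads="$2" prefix="$3" readlen="${4:-100}"
    local run=""
    if [ "${DRY_RUN:-0}" = "1" ]; then
        run="echo"
    fi
    # --report writes the per-taxon report that Bracken takes as its input (-i)
    ${run} kraken2 --db "${db}" --report "${prefix}.report" \
        --output "${prefix}.kraken" "${reads}" || return 1
    # -r must match a read length for which bracken-build was run on this DB
    ${run} bracken -d "${db}" -i "${prefix}.report" -o "${prefix}.bracken" -r "${readlen}"
}
```

Running it first with DRY_RUN=1 prints both command lines, which is a convenient way to review them before submitting the real job.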