Introduction#

Bracken/2.6.2 (a companion tool of Kraken) is installed and available on Maestro.

Bracken, Bayesian Reestimation of Abundance with KrakEN, is a highly accurate statistical method that computes the abundance of species in DNA sequences from a metagenomics sample.

Bracken needs some setup before it can be used on Maestro, because it tries to generate its own DB files (database.kraken, database<N>mers.kmer_distrib and database<N>mers.kraken) in the same location as the Kraken databases.
On Maestro, Kraken databases are stored in a read-only directory. Its location is the value of KRAKEN2_DB_PATH in the output of the following command:

Code Block (text)

maestro-submit:~ > module show kraken
-------------------------------------------------------------------
/opt/gensoft/modules/kraken/2.1.1:

module-whatis   {Set environnement for kraken (2.1.1)}
module-whatis   topic_0637
module-whatis   topic_0091
module-whatis   operation_3460
prepend-path    PATH /opt/gensoft/exe/kraken/2.1.1/bin
prepend-path    PATH /opt/gensoft/exe/kraken/2.1.1/scripts
setenv          KRAKEN2_DB_PATH /local/databases/index/kraken
setenv          KRAKEN2_DEFAULT_DB minikraken
-------------------------------------------------------------------

Furthermore, Bracken DBs are specific to a kmer length and a read length, so we cannot provide them in a centralized manner: each user generates the ones that match their own requirements.

Choosing the database#

Kraken on Maestro provides a homebrew tool, kraken_list_dbs, that displays all the usable Kraken DBs and their locations:

Code Block (text)

maestro-submit:~ > kraken_list_dbs 
-------- /local/databases/index/kraken
  greengenes
  rdpii
  minikraken
  nt
  kraken_standard
  silva_ssu

We will use greengenes as an example.

Generate Bracken DBs#

First, create symlinks to all the files of the Kraken DB you want to work with, in a location you have write access to. For example:

Code Block (text)

maestro-submit:~ > mkdir ${HOME}/greengenes
maestro-submit:~ > pushd ${HOME}/greengenes
maestro-submit:~/greengenes > ln -s ${KRAKEN2_DB_PATH}/greengenes/* .
maestro-submit:~/greengenes > popd

You can also do this in ${APPASCRATCH}/${USER} instead of ${HOME} if you are short on space in your home directory.
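
The symlink step can be wrapped in a small helper so that it is repeatable for any database and any writable location. This is only a sketch: the function name link_kraken_db and its two arguments are hypothetical (it is not a tool installed on Maestro), and it assumes the source path is absolute, as ${KRAKEN2_DB_PATH} is once the kraken module is loaded.

```shell
# link_kraken_db SRC_DB DEST_DIR
# Hypothetical helper, not part of Kraken, Bracken or the Maestro modules.
# Creates DEST_DIR and symlinks every top-level entry of SRC_DB into it,
# so that Bracken can write its own files next to the links.
# SRC_DB should be an absolute path, otherwise the links will dangle.
link_kraken_db() {
    local src="$1" dest="$2" entry
    if [ ! -d "$src" ]; then
        echo "source database not found: $src" >&2
        return 1
    fi
    mkdir -p "$dest" || return 1
    for entry in "$src"/*; do
        # Skip the unexpanded pattern when SRC_DB is empty.
        [ -e "$entry" ] || continue
        # -f replaces a link left by a previous run;
        # -n keeps ln from descending into an already linked directory.
        ln -sfn "$entry" "$dest/$(basename "$entry")" || return 1
    done
}
```

With the kraken module loaded, a call such as `link_kraken_db "${KRAKEN2_DB_PATH}/greengenes" "${HOME}/greengenes"` would reproduce the manual steps above, and it can be rerun safely.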

Next, generate the required Bracken DBs with the bracken-build command, using the parameters that suit your needs, e.g. kmer length = 100 and read length = 100:

Code Block (text)

maestro-submit:~ > bracken-build -d ${HOME}/greengenes  -k 100 -l 100
 >> Selected Options:
       kmer length = 100
       read length = 100
       database    = /pasteur/appa/homes/XXXX/greengenes
       threads     = 1
 >> Checking for Valid Options...
 >> Creating database.kraken [if not found]
      >> kraken2 --db /pasteur/appa/homes/XXXX/greengenes --threads 1 <( find -L /pasteur/appa/homes/XXXX/greengenes/library \( -name *.fna -o -name *.fa -o -name *.fasta \) -exec cat {} + ) > /pasteur/appa/homes/XXXX/greengenes/database.kraken
greengenes/database.kraken
Loading database information... done.
1262986 sequences (1769.52 Mbp) processed in 123.921s (611.5 Kseq/m, 856.76 Mbp/m).
  1262986 sequences classified (100.00%)
  0 sequences unclassified (0.00%)
          Finished creating database.kraken [in DB folder]
 >> Creating database100mers.kmer_distrib 
    >>STEP 0: PARSING COMMAND LINE ARGUMENTS
        Taxonomy nodes file: /pasteur/appa/homes/XXXX/greengenes/taxonomy/nodes.dmp
        Seqid file:          /pasteur/appa/homes/XXXX/greengenes/seqid2taxid.map
        Num Threads:         1
        Kmer Length:         100
        Read Length:         100
    >>STEP 1: READING SEQID2TAXID MAP
        1262986 total sequences read
    >>STEP 2: READING NODES.DMP FILE
        3094 total nodes read
    >>STEP 3: CONVERTING KMER MAPPINGS INTO READ CLASSIFICATIONS:
        100mers, with a database built using 100mers
        1262988 sequences converted (finished: )484811)
    Time Elaped: 0 minutes, 48 seconds, 0.00000 microseconds
    =============================
PROGRAM START TIME: 06-02-2022 07:49:46
...2900 total genomes read from kraken output file
...creating kmer counts file -- lists the number of kmers of each classification per genome
...creating kmer distribution file -- lists genomes and kmer counts contributing to each genome
PROGRAM END TIME: 06-02-2022 07:49:59
          Finished creating database100mers.kraken and database100mers.kmer_distrib [in DB folder]
          *NOTE: to create read distribution files for multiple read lengths, 
                 rerun this script specifying the same database but a different read length

Bracken build complete.
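
Before moving on, it can be worth checking that the build really left its files in your directory. The sketch below tests for the three file names reported in the log above (database.kraken, database<N>mers.kraken and database<N>mers.kmer_distrib). The function name bracken_db_ready is hypothetical, and the check assumes these are the only files needed.

```shell
# bracken_db_ready DB_DIR READ_LEN
# Hypothetical helper: returns 0 when the files named in the bracken-build
# log exist and are non-empty in DB_DIR for the given read length.
bracken_db_ready() {
    local dir="$1" read_len="$2" f missing=0
    for f in "database.kraken" \
             "database${read_len}mers.kraken" \
             "database${read_len}mers.kmer_distrib"; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For the example above, the call would be `bracken_db_ready "${HOME}/greengenes" 100`. As the note at the end of the log says, other read lengths are obtained by rerunning bracken-build on the same database with a different -l value, and each of them can be checked the same way.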

Bracken analysis#

Now you are ready for the final step, where you run your analysis on a Kraken report file. Note that Kraken version 2 only generates this report when it is run with the --report option.

Code Block (text)

maestro-submit:~ > bracken -d ${HOME}/greengenes -i sample_output_bracken.report -o out.bracken
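
The two stages can be chained as sketched below: Kraken 2 first writes the report, then Bracken reads it. This is an illustration under assumptions, not a prescribed workflow. The wrapper name run_kraken_bracken, the DRY_RUN switch and the output file names are all hypothetical. The read length passed with -r is hard-coded to 100 to match the Bracken DB built above.

```shell
# run_kraken_bracken DB READS PREFIX
# Hypothetical wrapper around the two commands discussed on this page.
# Set DRY_RUN=1 to print the commands instead of executing them.
run_kraken_bracken() {
    local db="$1" reads="$2" prefix="$3" run=""
    if [ "${DRY_RUN:-0}" = "1" ]; then
        run="echo"
    fi
    # Kraken 2 only writes the report Bracken needs when --report is given.
    $run kraken2 --db "$db" --report "${prefix}.kreport" \
        --output "${prefix}.kraken" "$reads" || return 1
    # -r must be a read length for which bracken-build was run (100 here).
    $run bracken -d "$db" -i "${prefix}.kreport" -o "${prefix}.bracken" -r 100
}
```

Running `DRY_RUN=1 run_kraken_bracken "${HOME}/greengenes" reads.fastq sample` prints both commands so they can be reviewed before a real run. Here reads.fastq and sample are placeholder names.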