ATAC-Seq Pipeline
The ENCODE ATAC-seq pipeline (https://github.com/ENCODE-DCC/atac-seq-pipeline) is a Cromwell/WDL pipeline designed for automated, end-to-end quality control and processing of ATAC-seq and DNase-seq data.
Due to technical and structural limitations, it is not possible to run it directly on maestro.
To make this pipeline usable, we provide an atac-seq-pipeline module that ships a wrapper script to launch and run the analysis, plus already-downloaded reference genome data.
The pipeline takes as input a JSON-formatted file describing the data to work with (reference genome and reads); see the input file format description:
- https://github.com/ENCODE-DCC/atac-seq-pipeline/blob/master/docs/input_short.md
- https://github.com/ENCODE-DCC/atac-seq-pipeline/blob/master/docs/input.md
However, compute nodes do not have access to the internet, so the commonly used URL schemes (s3://, gs:// and http(s)://) cannot be used to reference genomes or reads. Only absolute paths are allowed.
Prerequisites#
You must first run caper init <platform>, where platform is one of the two values allowed on maestro (a minimal example follows this list):
- local: all ATAC-seq pipeline steps are run on the same compute node
- slurm: ATAC-seq pipeline steps are submitted through Slurm inside your allocation, possibly using multiple compute nodes
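For example, a one-time initialization for the Slurm backend looks like this (caper writes its configuration to ~/.caper/default.conf, which you can adjust afterwards):
Code Block (bash)
# run once; pick the backend that matches how you want to execute the pipeline
caper init slurm    # or: caper init local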
The atac_runner wrapper solution (easy way)#
- load the atac-seq-pipeline module and its dependencies (caper, singularity, graalvm)
- set up your JSON input file (see the example below)
- inside an allocation, run atac_runner -i input.json (a batch-script sketch follows this list)
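The batch script below is a minimal sketch of those three steps. The resource requests are placeholders to size to your dataset, and the exact module name/version may differ from what module avail reports on maestro.
Code Block (bash)
#!/bin/bash
#SBATCH --job-name=atac_example    # illustrative job name
#SBATCH --cpus-per-task=8          # placeholder resources; adapt to your data
#SBATCH --mem=32G
#SBATCH --time=24:00:00

# prerequisite: caper init <platform> has been run once (see above)

# loading the pipeline module is expected to also bring in caper, singularity and graalvm
module load atac-seq-pipeline

# launch the wrapper on the JSON input file described below
atac_runner -i input.json

Submit it with sbatch, or run the same module load / atac_runner commands interactively inside an salloc or srun allocation.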
The hand-made solution#
- download the atac.wdl Cromwell/WDL workflow definition file
- download the Singularity image to a local location (it must be available on all compute nodes)
- edit atac.wdl to replace the Singularity image URL with its location on disk (use the full path)
- set up your JSON input file (see the example below)
- run caper run atac.wdl -i input.json --singularity (the whole procedure is sketched below)
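The sketch below walks through those steps under explicit assumptions: the v2.1.0 release tag and the encodedcc/atac-seq-pipeline Docker Hub image are examples to check against the release you actually use, the downloads must be done from a machine with internet access (compute nodes have none), and the shared image path is hypothetical.
Code Block (bash)
# 1. get the workflow definition (release tag is an assumption; pick the one you need)
wget https://raw.githubusercontent.com/ENCODE-DCC/atac-seq-pipeline/v2.1.0/atac.wdl

# 2. build a local Singularity image from the pipeline's Docker image; store the .sif
#    on a filesystem visible to all compute nodes (path below is hypothetical)
singularity pull /pasteur/project/ATAC-seq/images/atac-seq-pipeline_v2.1.0.sif \
    docker://encodedcc/atac-seq-pipeline:v2.1.0

# 3. edit atac.wdl so that its default Singularity image is the full path of that .sif

# 4. from inside your allocation, run the workflow with the Singularity backend
caper run atac.wdl -i input.json --singularity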
Why --singularity instead of --conda?
We recommend running the pipeline through the Singularity image rather than through Conda on maestro: the image is self-contained, so nothing has to be built or downloaded on compute nodes that lack internet access, and every run uses exactly the same environment.
input.json example#
Let's say we want to run the ENCSR356KRQ_subsampled.json example.
Reference genomes location#
Reference genomes (v4) are already downloaded and available in a public location on maestro.
You will find the following reference genomes:
- human genomes: hg19, hg38
- human subsampled test genome (chr19 + chrM only): hg38_chr19_chrM
- mouse genomes: mm9, mm10
- mouse subsampled test genome (chr19 + chrM only): mm10_chr19_chrM
All these reference genomes are stored under /opt/gensoft/data/atac-seq-pipeline/2.1.0.
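To find the atac.genome_tsv value for your input.json, you can browse that directory; assuming each assembly has its own subdirectory containing an <assembly>.tsv file (as in the hg38 path used in the example below), a quick check looks like this:
Code Block (bash)
# list the bundled reference genome assemblies
ls /opt/gensoft/data/atac-seq-pipeline/2.1.0

# each assembly directory should contain the TSV to reference via "atac.genome_tsv"
ls /opt/gensoft/data/atac-seq-pipeline/2.1.0/hg38/hg38.tsv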
Let's say you have saved the reads you want to process in a convenient location that is available on all compute nodes, e.g. /pasteur/project/ATAC-seq/ENCSR356KRQ/fastq_subsampled/rep1 and /pasteur/project/ATAC-seq/ENCSR356KRQ/fastq_subsampled/rep2.
You now have to edit your input.json file so that it points to the reads in these local folders (remember: no URLs allowed).
Our input.json file will then look like this:
Code Block (json)
{
"atac.pipeline_type" : "atac",
"atac.genome_tsv" : "/opt/gensoft/data/atac-seq-pipeline/2.1.0/hg38/hg38.tsv"
,
"atac.fastqs_rep1_R1" : [
"/pasteur/project/ATAC-seq/ENCSR356KRQ/fastq_subsampled/rep1/pair1/ENCFF341MYG.subsampled.400.fastq.gz",
"/pasteur/project/ATAC-seq/ENCSR356KRQ/fastq_subsampled/rep1/pair1/ENCFF106QGY.subsampled.400.fastq.gz"
],
"atac.fastqs_rep1_R2" : [
"/pasteur/project/ATAC-seq/ENCSR356KRQ/fastq_subsampled/rep1/pair2/ENCFF248EJF.subsampled.400.fastq.gz",
"/pasteur/project/ATAC-seq/ENCSR356KRQ/fastq_subsampled/rep1/pair2/ENCFF368TYI.subsampled.400.fastq.gz"
],
"atac.fastqs_rep2_R1" : [
"/pasteur/project/ATAC-seq/ENCSR356KRQ/fastq_subsampled/rep2/pair1/ENCFF641SFZ.subsampled.400.fastq.gz",
"/pasteur/project/ATAC-seq/ENCSR356KRQ/fastq_subsampled/rep2/pair1/ENCFF751XTV.subsampled.400.fastq.gz",
"/pasteur/project/ATAC-seq/ENCSR356KRQ/fastq_subsampled/rep2/pair1/ENCFF927LSG.subsampled.400.fastq.gz",
"/pasteur/project/ATAC-seq/ENCSR356KRQ/fastq_subsampled/rep2/pair1/ENCFF859BDM.subsampled.400.fastq.gz",
"/pasteur/project/ATAC-seq/ENCSR356KRQ/fastq_subsampled/rep2/pair1/ENCFF193RRC.subsampled.400.fastq.gz",
"/pasteur/project/ATAC-seq/ENCSR356KRQ/fastq_subsampled/rep2/pair1/ENCFF366DFI.subsampled.400.fastq.gz"
],
"atac.fastqs_rep2_R2" : [
"/pasteur/project/ATAC-seq/ENCSR356KRQ/fastq_subsampled/rep2/pair2/ENCFF031ARQ.subsampled.400.fastq.gz",
"/pasteur/project/ATAC-seq/ENCSR356KRQ/fastq_subsampled/rep2/pair2/ENCFF590SYZ.subsampled.400.fastq.gz",
"/pasteur/project/ATAC-seq/ENCSR356KRQ/fastq_subsampled/rep2/pair2/ENCFF734PEQ.subsampled.400.fastq.gz",
"/pasteur/project/ATAC-seq/ENCSR356KRQ/fastq_subsampled/rep2/pair2/ENCFF007USV.subsampled.400.fastq.gz",
"/pasteur/project/ATAC-seq/ENCSR356KRQ/fastq_subsampled/rep2/pair2/ENCFF886FSC.subsampled.400.fastq.gz",
"/pasteur/project/ATAC-seq/ENCSR356KRQ/fastq_subsampled/rep2/pair2/ENCFF573UXK.subsampled.400.fastq.gz"
],
"atac.paired_end" : true,
"atac.auto_detect_adapter" : true,
"atac.enable_xcor" : true,
"atac.title" : "ENCSR356KRQ (subsampled 1/400)",
"atac.description" : "ATAC-seq on primary keratinocytes in day 0.0 of differentiation"
}
You can now run the ATAC-seq pipeline using one of the methods described above.