ATAC-Seq Pipeline
The ENCODE ATAC-seq pipeline (https://github.com/ENCODE-DCC/atac-seq-pipeline) is a Cromwell/WDL pipeline designed for automated, end-to-end quality control and processing of ATAC-seq and DNase-seq data.
Due to technical and structural limitations, it is not possible to run it directly on maestro.
To make this pipeline usable, we provide an atac-seq-pipeline module that ships a wrapper script to launch and run the analysis, plus already-downloaded reference genome data.
The pipeline takes as input a JSON-formatted file describing the data to work with (reference genome and reads); see the input file format description:
- https://github.com/ENCODE-DCC/atac-seq-pipeline/blob/master/docs/input_short.md
- https://github.com/ENCODE-DCC/atac-seq-pipeline/blob/master/docs/input.md
However, compute nodes do not have access to the internet, so the commonly used URL schemes (s3://, gs:// and http(s)://) cannot be used to reference genomes or reads. Only absolute paths are allowed.
Prerequisites#
You must first run caper init <platform>, where platform is one of the two values allowed on maestro (a minimal example follows this list):
- local: all ATAC-seq pipeline steps are run on the same compute node
- slurm: ATAC-seq pipeline steps are submitted through Slurm inside your allocation, possibly using multiple compute nodes
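For example, a one-time initialization for the Slurm backend looks like this (caper writes its configuration to ~/.caper/default.conf, which you can adjust afterwards):
Code Block (bash)
# run once; pick the backend that matches how you want to execute the pipeline
caper init slurm    # or: caper init local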
The atac_runner wrapper solution (easy way)#
- load the atac-seq-pipeline module and its dependencies (caper, singularity, graalvm)
- set up your JSON input file (see the example below)
- inside an allocation, run atac_runner -i input.json (a batch-script sketch follows this list)
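The batch script below is a minimal sketch of those three steps. The resource requests are placeholders to size to your dataset, and the exact module name/version may differ from what module avail reports on maestro.
Code Block (bash)
#!/bin/bash
#SBATCH --job-name=atac_example    # illustrative job name
#SBATCH --cpus-per-task=8          # placeholder resources; adapt to your data
#SBATCH --mem=32G
#SBATCH --time=24:00:00

# prerequisite: caper init <platform> has been run once (see above)

# loading the pipeline module is expected to also bring in caper, singularity and graalvm
module load atac-seq-pipeline

# launch the wrapper on the JSON input file described below
atac_runner -i input.json

Submit it with sbatch, or run the same module load / atac_runner commands interactively inside an salloc or srun allocation.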
The hand-made solution#
- download the atac.wdl Cromwell/WDL workflow definition file
- download the Singularity image to a local location (it must be available on all compute nodes)
- edit atac.wdl to replace the Singularity image URL with its location on disk (use the full path)
- set up your JSON input file (see the example below)
- run caper run atac.wdl -i input.json --singularity (the whole procedure is sketched below)
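The sketch below walks through those steps under explicit assumptions: the v2.1.0 release tag and the encodedcc/atac-seq-pipeline Docker Hub image are examples to check against the release you actually use, the downloads must be done from a machine with internet access (compute nodes have none), and the shared image path is hypothetical.
Code Block (bash)
# 1. get the workflow definition (release tag is an assumption; pick the one you need)
wget https://raw.githubusercontent.com/ENCODE-DCC/atac-seq-pipeline/v2.1.0/atac.wdl

# 2. build a local Singularity image from the pipeline's Docker image; store the .sif
#    on a filesystem visible to all compute nodes (path below is hypothetical)
singularity pull /pasteur/project/ATAC-seq/images/atac-seq-pipeline_v2.1.0.sif \
    docker://encodedcc/atac-seq-pipeline:v2.1.0

# 3. edit atac.wdl so that its default Singularity image is the full path of that .sif

# 4. from inside your allocation, run the workflow with the Singularity backend
caper run atac.wdl -i input.json --singularity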
Why --singularity instead of --conda?
We recommend running the pipeline through the Singularity image rather than through Conda on maestro: the image is self-contained, so nothing has to be built or downloaded on compute nodes that lack internet access, and every run uses exactly the same environment.
input.json example#
Let's say we want to run the ENCSR356KRQ_subsampled.json example.
Reference genomes location#
Reference genomes (v4) are already downloaded and available in a public location on maestro.
You will find the following reference genomes:
- human genomes: hg19, hg38
- human subsampled test genome (chr19 + chrM only): hg38_chr19_chrM
- mouse genomes: mm9, mm10
- mouse subsampled test genome (chr19 + chrM only): mm10_chr19_chrM
All these reference genomes are stored under /opt/gensoft/data/atac-seq-pipeline/2.1.0.
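To find the atac.genome_tsv value for your input.json, you can browse that directory; assuming each assembly has its own subdirectory containing an <assembly>.tsv file (as in the hg38 path used in the example below), a quick check looks like this:
Code Block (bash)
# list the bundled reference genome assemblies
ls /opt/gensoft/data/atac-seq-pipeline/2.1.0

# each assembly directory should contain the TSV to reference via "atac.genome_tsv"
ls /opt/gensoft/data/atac-seq-pipeline/2.1.0/hg38/hg38.tsv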
Let's say you have saved the reads you want to process in a convenient location that is available on all compute nodes, e.g. /pasteur/project/ATAC-seq/ENCSR356KRQ/fastq_subsampled/rep1 and /pasteur/project/ATAC-seq/ENCSR356KRQ/fastq_subsampled/rep2.
You now have to edit your input.json file so that it points to the reads in these local folders (remember: no URLs allowed).
Our input.json file will then look like this:
Code Block (json)
{
"atac.pipeline_type" : "atac",
"atac.genome_tsv" : "/opt/gensoft/data/atac-seq-pipeline/2.1.0/hg38/hg38.tsv"
,
"atac.fastqs_rep1_R1" : [
"/pasteur/project/ATAC-seq/ENCSR356KRQ/fastq_subsampled/rep1/pair1/ENCFF341MYG.subsampled.400.fastq.gz",
"/pasteur/project/ATAC-seq/ENCSR356KRQ/fastq_subsampled/rep1/pair1/ENCFF106QGY.subsampled.400.fastq.gz"
],
"atac.fastqs_rep1_R2" : [
"/pasteur/project/ATAC-seq/ENCSR356KRQ/fastq_subsampled/rep1/pair2/ENCFF248EJF.subsampled.400.fastq.gz",
"/pasteur/project/ATAC-seq/ENCSR356KRQ/fastq_subsampled/rep1/pair2/ENCFF368TYI.subsampled.400.fastq.gz"
],
"atac.fastqs_rep2_R1" : [
"/pasteur/project/ATAC-seq/ENCSR356KRQ/fastq_subsampled/rep2/pair1/ENCFF641SFZ.subsampled.400.fastq.gz",
"/pasteur/project/ATAC-seq/ENCSR356KRQ/fastq_subsampled/rep2/pair1/ENCFF751XTV.subsampled.400.fastq.gz",
"/pasteur/project/ATAC-seq/ENCSR356KRQ/fastq_subsampled/rep2/pair1/ENCFF927LSG.subsampled.400.fastq.gz",
"/pasteur/project/ATAC-seq/ENCSR356KRQ/fastq_subsampled/rep2/pair1/ENCFF859BDM.subsampled.400.fastq.gz",
"/pasteur/project/ATAC-seq/ENCSR356KRQ/fastq_subsampled/rep2/pair1/ENCFF193RRC.subsampled.400.fastq.gz",
"/pasteur/project/ATAC-seq/ENCSR356KRQ/fastq_subsampled/rep2/pair1/ENCFF366DFI.subsampled.400.fastq.gz"
],
"atac.fastqs_rep2_R2" : [
"/pasteur/project/ATAC-seq/ENCSR356KRQ/fastq_subsampled/rep2/pair2/ENCFF031ARQ.subsampled.400.fastq.gz",
"/pasteur/project/ATAC-seq/ENCSR356KRQ/fastq_subsampled/rep2/pair2/ENCFF590SYZ.subsampled.400.fastq.gz",
"/pasteur/project/ATAC-seq/ENCSR356KRQ/fastq_subsampled/rep2/pair2/ENCFF734PEQ.subsampled.400.fastq.gz",
"/pasteur/project/ATAC-seq/ENCSR356KRQ/fastq_subsampled/rep2/pair2/ENCFF007USV.subsampled.400.fastq.gz",
"/pasteur/project/ATAC-seq/ENCSR356KRQ/fastq_subsampled/rep2/pair2/ENCFF886FSC.subsampled.400.fastq.gz",
"/pasteur/project/ATAC-seq/ENCSR356KRQ/fastq_subsampled/rep2/pair2/ENCFF573UXK.subsampled.400.fastq.gz"
],
"atac.paired_end" : true,
"atac.auto_detect_adapter" : true,
"atac.enable_xcor" : true,
"atac.title" : "ENCSR356KRQ (subsampled 1/400)",
"atac.description" : "ATAC-seq on primary keratinocytes in day 0.0 of differentiation"
}
You can now run the ATAC-seq pipeline using one of the methods described above.