A pipeline to process Nanopore reads and transfer the results to the end users.
git clone git@github.com:maxplanck-ie/nanoporeReads_dataTransfer.git
cd nanoporeReads_dataTransfer
pip install .
Note that the workflow requires conda to function, as some rules run in their own conda environments.
The key functionality is implemented as snakemake workflows. Since version 2.0.0, two snakemake rule sets are supported, each centered on a different basecaller:
- rules_dorado: a dorado-based workflow.
A wrapper Python script (ont.py) implements
- the continuous screening of the source directory,
- the generation of a flowcell-specific configuration file, and
- the communication with the end user (emails etc.).
The main configuration file (config.yaml) specifies:
- the path of the rule set to be used (rulesPath: rules or rules_dorado),
- the overall directory structure (see below),
- organism-specific paths (e.g. genome and transcriptome locations),
- communication settings (email, Parkour LIMS, sambahost),
- generic parameters (basecalling, mapping).
Note that the generic configuration defined by this file is extended with project-specific entries for each incoming flowcell.
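The way flowcell-specific entries extend the generic configuration can be pictured as a recursive dictionary merge. The following is a minimal sketch under stated assumptions, not the code used in ont.py: the function name merge_config, the nesting under "paths", and the "flowcell" entries are hypothetical; only rulesPath, offloadDir, outputDir and groupDir are names taken from this README.

```python
import copy


def merge_config(generic, flowcell_specific):
    """Return a new dict: generic settings overlaid with flowcell-specific ones.

    Nested dictionaries are merged key by key; any other value in
    flowcell_specific replaces the generic value. Inputs are not modified.
    """
    merged = copy.deepcopy(generic)
    for key, value in flowcell_specific.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Generic settings (the "paths" nesting and the path values are assumptions).
generic = {
    "rulesPath": "rules_dorado",
    "paths": {"offloadDir": "/data/offload", "outputDir": "/data/work"},
}
# Entries added for one incoming flowcell (hypothetical keys and values).
flowcell = {
    "paths": {"groupDir": "/data/groups"},
    "flowcell": {"name": "PAS33554", "organism": "mouse"},
}

pipeline_config = merge_config(generic, flowcell)
```

Returning a fresh dictionary keeps the generic configuration untouched, so the same base settings can be reused for the next flowcell.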
Additional configuration files are:
- env.yaml (for conda installation of all dependencies)
- multiqc_config.yaml (to customize multiqc output)
ont -c config.yaml
The workflow connects and relies on three main data locations:
- A source directory (offloadDir) is screened for the arrival of new and unprocessed flowcells.
- A work directory (outputDir) is used for the various processing steps (merging, basecalling, demultiplexing, alignment, quality controls).
- The target directory (groupDir) receives the analysis results in a project-wise manner.
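The screening step can be illustrated with a short sketch. It assumes that flowcell directories sit directly under offloadDir, that each one is processed in a same-named directory under outputDir, and that a flowcell counts as processed once the analysis.done flag exists there. The function name find_unprocessed is hypothetical, and the real layout and logic in ont.py may differ.

```python
from pathlib import Path


def find_unprocessed(offload_dir, output_dir, flag="analysis.done"):
    """List flowcell directories in offload_dir without a done flag in output_dir.

    Assumption: one sub-directory per flowcell, same name in both locations.
    """
    offload_dir, output_dir = Path(offload_dir), Path(output_dir)
    pending = []
    for flowcell in sorted(p for p in offload_dir.iterdir() if p.is_dir()):
        # A flowcell is pending if its work directory lacks the flag file.
        if not (output_dir / flowcell.name / flag).exists():
            pending.append(flowcell.name)
    return pending
```

Calling such a function in a loop with a pause between iterations would give the continuous screening behaviour described for the wrapper script.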
The details are rule-set dependent. Annotated examples for rules_dorado are given below.
The source directory (offloadDir) is generated by the sequencing machine, and its layout may change in response to technological developments.
../path/to/flowcell/
.
├── bam_pass # from fast basecalling
├── barcode_alignment_PAS33554_6b0029ab_a0fbcf5b.tsv
├── fastq_pass # from fast basecalling
├── final_summary_PAS33554_6b0029ab_a0fbcf5b.txt
├── other_reports
├── pod5_pass # pod5 format
├── pore_activity_PAS33554_6b0029ab_a0fbcf5b.csv
├── report_PAS33554_20230928_1016_6b0029ab.html
├── report_PAS33554_20230928_1016_6b0029ab.json
├── report_PAS33554_20230928_1016_6b0029ab.md
├── SampleSheet.csv # sample sheet information
├── sample_sheet_PAS33554_20230928_1016_6b0029ab.csv
├── sequencing_summary_PAS33554_6b0029ab_a0fbcf5b.txt
└── throughput_PAS33554_6b0029ab_a0fbcf5b.csv
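The report file names in this listing carry run metadata. As a rough sketch, the name can be split with a regular expression. The interpretation of the fields (flowcell ID, date, time, short run ID) is inferred from the single example above and is an assumption, as are the names REPORT_PATTERN and parse_report_name; the sequencing software may change this naming scheme.

```python
import re
from datetime import datetime

# Assumed pattern: report_<flowcell>_<YYYYMMDD>_<HHMM>_<run id>.<html|json|md>
REPORT_PATTERN = re.compile(
    r"^report_(?P<flowcell>[A-Za-z0-9]+)_(?P<date>\d{8})_(?P<time>\d{4})"
    r"_(?P<run>[0-9a-f]+)\.(?:html|json|md)$"
)


def parse_report_name(filename):
    """Split a report file name into flowcell ID, timestamp and run ID.

    Returns None if the name does not follow the assumed pattern.
    """
    match = REPORT_PATTERN.match(filename)
    if match is None:
        return None
    timestamp = datetime.strptime(match["date"] + match["time"], "%Y%m%d%H%M")
    return {
        "flowcell": match["flowcell"],
        "timestamp": timestamp,
        "run": match["run"],
    }


info = parse_report_name("report_PAS33554_20230928_1016_6b0029ab.html")
```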
../path/to/flowcell
.
├── analysis.done # flag to signal that this flowcell has been fully processed
├── bam # output from basecalling in bam format (including modification calls)
├── bam_demux # demultiplexed samples (empty if no barcoding)
├── benchmarks # benchmarks for each rule
├── benchmarks_combined.tsv # combined benchmark file
├── flags # directory with flags from snakemake rules
├── log # log files (rule-specific)
├── pipeline_config.yaml # configfile (snakemake & more)
├── pod5 # directory with merged pod5 file (from offloadDir)
├── reports # directory with reports and SampleSheet.csv (from offloadDir)
├── summary # summary files (DAG, disk status)
└── transfer # analysis output that will be transferred
transfer/
└── Project_projectID_User_Group
    ├── Analysis_mouse_dna # analysis directory (exists only if genome is known)
    │   ├── 23L000329_WT_rep1.align.bam # alignment
    │   ├── 23L000329_WT_rep1.align.bam.bai # index
    │   └── 23L000329_WT_rep1.align.bed.gz # modification calls
    ├── Data
    │   ├── 23L000329_WT_rep1.bam # basecalled sequences
    │   ├── 23L000329_WT_rep1.fastq.gz # basecalled sequences (fastq - deprecated)
    │   ├── 23L000329_WT_rep1_porechop.fastq.gz # adaptors and barcodes trimmed
    │   └── 23L000329_WT_rep1.seqsum # sequencing summaries (for pycoQC etc.)
    └── QC
        ├── multiqc
        │   ├── multiqc_data
        │   └── multiqc_report.html # multiqc report
        ├── sample_names.tsv # dictionary sampleID-sampleName
        └── Samples # sample-wise quality controls
            ├── 23L000329_WT_rep1.align.flagstat
            ├── 23L000329_WT_rep1.align_pycoqc.html
            ├── 23L000329_WT_rep1.align_pycoqc.json
            ├── 23L000329_WT_rep1_fastqc.html
            ├── 23L000329_WT_rep1_fastqc.zip
            ├── 23L000329_WT_rep1_kraken.report
            ├── 23L000329_WT_rep1_porechop.info
            ├── 23L000329_WT_rep1_pycoqc.html
            ├── 23L000329_WT_rep1_pycoqc.json
            ├── all_porechop.best_end
            ├── all_porechop.best_start
            └── all_porechop.trimmed
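The naming scheme in the transfer listing can be expressed as a small path builder. This is only an illustration: it assumes that file stems are formed as sampleID_sampleName (as sample_names.tsv suggests) and that the project directory follows the pattern Project_projectID_User_Group. The function name transfer_paths and the project values in the example are hypothetical.

```python
from pathlib import PurePosixPath


def transfer_paths(project_id, user, group, sample_id, sample_name):
    """Build a few of the per-sample paths expected under transfer/."""
    project = PurePosixPath("transfer") / f"Project_{project_id}_{user}_{group}"
    stem = f"{sample_id}_{sample_name}"  # assumed: sampleID_sampleName
    return {
        "bam": project / "Data" / f"{stem}.bam",
        "seqsum": project / "Data" / f"{stem}.seqsum",
        "pycoqc": project / "QC" / "Samples" / f"{stem}_pycoqc.html",
    }


# Hypothetical project values; the sample ID and name come from the listing above.
paths = transfer_paths("1234", "Doe", "GroupA", "23L000329", "WT_rep1")
```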
../user_path/to/flowcell/ (identical to outputDir/transfer)
.
├── metadata.yaml
└── Project_projectID_User_Group
    ├── Analysis_mouse_dna
    ├── Data
    └── QC