nanoporeReads_dataTransfer

A pipeline to process Nanopore reads and transfer the results to the end users.

Installation

git clone git@github.com:maxplanck-ie/nanoporeReads_dataTransfer.git
cd nanoporeReads_dataTransfer
pip install .

Note that the workflow requires conda to function, as some rules run in their own conda environments.

Implementation

The key functionality is implemented as snakemake workflows. From version 2.0.0, two snakemake rule sets (rules and rules_dorado) are supported, each centered on a different basecaller:

  • rules_dorado: a dorado-based workflow.

A wrapper Python script (ont.py) implements:

  • the continuous screening of the source directory,
  • the generation of a flowcell-specific configuration file, and
  • the communication with the end user (emails etc.).

Configurations

The main configuration file (config.yaml) specifies:

  • the path of the rule set to be used (rulesPath: rules or rules_dorado),
  • the overall directory structure (see below)
  • organism-specific paths (e.g. genome and transcriptome locations)
  • communication settings (email, Parkour LIMS, sambahost)
  • generic parameters (basecalling, mapping)

Note that the generic configuration defined by this file is extended with project-specific entries for each incoming flowcell.
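The way flowcell-specific entries extend the generic configuration can be pictured as a recursive dictionary merge. The following is a minimal sketch under stated assumptions, not the code used by ont.py: the function name `expand_config`, the nesting under `paths`, the `flowcell` key, and all values are hypothetical. Only the key names rulesPath, offloadDir, outputDir and groupDir appear in this README.

```python
from copy import deepcopy


def expand_config(generic, flowcell_entries):
    """Return a copy of `generic` with `flowcell_entries` merged in recursively."""
    merged = deepcopy(generic)
    for key, value in flowcell_entries.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides hold a mapping: merge them key by key.
            merged[key] = expand_config(merged[key], value)
        else:
            # Otherwise the flowcell-specific entry is added or takes precedence.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical layout and values, for illustration only.
generic = {
    "rulesPath": "rules_dorado",
    "paths": {"offloadDir": "/path/to/offload", "outputDir": "/path/to/output"},
}
flowcell = {"paths": {"groupDir": "/path/to/group"}, "flowcell": "PAS33554"}

config = expand_config(generic, flowcell)
print(config["paths"])
```

The generic dictionary is left unmodified, so the same base configuration can be reused for every incoming flowcell.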

Additional configuration files are:

  • env.yaml (for conda installation of all dependencies)
  • multiqc_config.yaml (to customize multiqc output)

Usage

ont -c config.yaml

Directory structures

The workflow connects and relies on three main data locations:

  1. A source directory (offloadDir) is screened for the arrival of new and unprocessed flowcells
  2. A work directory (outputDir) is used for various processing steps (merging, basecalling, demultiplexing, alignment, quality controls)
  3. The target directory (groupDir) receives the analysis results in a project-wise manner.
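As an illustration of how screening the source directory could work, the sketch below lists flowcells that have not been fully processed yet. It assumes that each flowcell is a direct subdirectory of offloadDir, that its work directory in outputDir carries the same name, and that processing is complete once the analysis.done flag (see the outputDir example further down) exists. The function name `find_unprocessed_flowcells` is hypothetical, and the actual logic in ont.py may differ.

```python
from pathlib import Path


def find_unprocessed_flowcells(offload_dir, output_dir, flag="analysis.done"):
    """Return names of flowcell directories in offload_dir without a done-flag in output_dir."""
    pending = []
    for flowcell in sorted(Path(offload_dir).iterdir()):
        if not flowcell.is_dir():
            continue  # ignore stray files in the source directory
        done_flag = Path(output_dir) / flowcell.name / flag
        if not done_flag.exists():
            pending.append(flowcell.name)
    return pending
```

A continuous screening loop could call such a function at regular intervals and start the workflow for each returned flowcell.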

The details are rule-set dependent. Annotated examples for rules_dorado are given below.

Example input path (offloadDir)

This directory is generated by the sequencing machine and may change in response to technological developments.

../path/to/flowcell/
.
├── bam_pass            # from fast basecalling
├── barcode_alignment_PAS33554_6b0029ab_a0fbcf5b.tsv
├── fastq_pass          # from fast basecalling
├── final_summary_PAS33554_6b0029ab_a0fbcf5b.txt
├── other_reports
├── pod5_pass           # pod5 format
├── pore_activity_PAS33554_6b0029ab_a0fbcf5b.csv
├── report_PAS33554_20230928_1016_6b0029ab.html
├── report_PAS33554_20230928_1016_6b0029ab.json
├── report_PAS33554_20230928_1016_6b0029ab.md
├── SampleSheet.csv     # sample sheet information
├── sample_sheet_PAS33554_20230928_1016_6b0029ab.csv
├── sequencing_summary_PAS33554_6b0029ab_a0fbcf5b.txt
└── throughput_PAS33554_6b0029ab_a0fbcf5b.csv

Example output path during processing (outputDir)

../path/to/flowcell
.
├── analysis.done            # flag to signal that this flowcell has been fully processed
├── bam                      # output from basecalling in bam format (including modification calls)
├── bam_demux                # demultiplexed samples (empty if no barcoding)
├── benchmarks               # benchmarks for each rule
├── benchmarks_combined.tsv  # combined benchmark file
├── flags                    # directory with flags from snakemake rules
├── log                      # log files (rule-specific)
├── pipeline_config.yaml     # configfile (snakemake & more)
├── pod5                     # directory with merged pod5 file (from offloadDir)
├── reports                  # directory with reports and SampleSheet.csv (from offloadDir)
├── summary                  # summary files (DAG, disk status)
└── transfer                 # analysis output that will be transferred

transfer/
└── Project_projectID_User_Group
    ├── Analysis_mouse_dna                    # analysis directory (exists only if genome is known)
    │   ├── 23L000329_WT_rep1.align.bam       # alignment
    │   ├── 23L000329_WT_rep1.align.bam.bai   # index
    │   └── 23L000329_WT_rep1.align.bed.gz    # modification calls
    ├── Data
    │   ├── 23L000329_WT_rep1.bam             # basecalled sequences
    │   ├── 23L000329_WT_rep1.fastq.gz        # basecalled sequences (fastq - deprecated)
    │   ├── 23L000329_WT_rep1_porechop.fastq.gz # adaptors, barcodes trimmed
    │   └── 23L000329_WT_rep1.seqsum            # sequencing summaries (for pycoQC etc.)
    └── QC
        ├── multiqc
        │   ├── multiqc_data
        │   └── multiqc_report.html            # multiqc report
        ├── sample_names.tsv                   # dictionary sampleID-sampleName
        └── Samples                            # sample-wise quality controls
            ├── 23L000329_WT_rep1.align.flagstat
            ├── 23L000329_WT_rep1.align_pycoqc.html
            ├── 23L000329_WT_rep1.align_pycoqc.json
            ├── 23L000329_WT_rep1_fastqc.html
            ├── 23L000329_WT_rep1_fastqc.zip
            ├── 23L000329_WT_rep1_kraken.report
            ├── 23L000329_WT_rep1_porechop.info
            ├── 23L000329_WT_rep1_pycoqc.html
            ├── 23L000329_WT_rep1_pycoqc.json
            ├── all_porechop.best_end
            ├── all_porechop.best_start
            └── all_porechop.trimmed

Example output path for an end user (groupDir)

../user_path/to/flowcell/  (identical to outputDir/transfer)
.
├── metadata.yaml
└── Project_projectID_User_Group
    ├── Analysis_mouse_dna
    ├── Data
    └── QC
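An end user who wants to check that a delivered project folder follows the layout above could use a small helper like the one below. This is a sketch under the assumption that the layout matches the rules_dorado example shown here; the function name `summarize_project` is hypothetical and not part of the pipeline.

```python
from pathlib import Path


def summarize_project(project_dir):
    """Report which of the documented subdirectories exist in a project folder."""
    project = Path(project_dir)
    return {
        # Analysis_* exists only if the genome is known, so this list may be empty.
        "analysis": sorted(p.name for p in project.glob("Analysis_*") if p.is_dir()),
        "has_data": (project / "Data").is_dir(),
        "has_qc": (project / "QC").is_dir(),
    }
```

For example, `summarize_project("Project_projectID_User_Group")` returns a dictionary with the names of the analysis directories and two booleans for Data and QC.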
