Skip to content

Conversation

@bentsherman
Copy link
Member

@bentsherman bentsherman commented Mar 27, 2023

Redo of #3463 . Includes the following changes:

  • TaskDAG class which tracks the task DAG as the pipeline is executed
  • dag.type config option which allows the user to render the task DAG instead of the process DAG (currently only supported for Mermaid format)
  • .command.meta.json file which is written on task completion and contains the task hash, input files, and output files

To test the DAG rendering, launch a pipeline with the following extra config:

dag.enabled = true
dag.type = 'task'
dag.file = 'pipeline.mmd'
dag.overwrite = true

To test the task metadata file, simply launch a pipeline normally and inspect the task directories for .command.meta.json files.

Here is an example meta file from rnaseq-nf:

{
    "hash": "4fb5c35e186526f64ee5a5cc2720e824",
    "inputs": [
        {
            "name": "ggal_gut",
            "path": "/work/ab/bc682dd08c72ded3c25640d3cb05ef/ggal_gut",
            "predecessor": "abbc682dd08c72ded3c25640d3cb05ef"
        },
        {
            "name": "fastqc_ggal_gut_logs",
            "path": "/work/c0/6030412e5508bc9c2f2d7b053eb882/fastqc_ggal_gut_logs",
            "predecessor": "c06030412e5508bc9c2f2d7b053eb882"
        },
        {
            "name": "multiqc",
            "path": "/.nextflow/assets/nextflow-io/rnaseq-nf/multiqc",
            "predecessor": null
        }
    ],
    "outputs": [
        {
            "name": "multiqc_report.html",
            "path": "/work/4f/b5c35e186526f64ee5a5cc2720e824/multiqc_report.html",
            "size": 1204127,
            "checksum": "a41d49c28135c51b7c92fb70cac70a66"
        }
    ]
}

@bentsherman bentsherman requested a review from pditommaso March 27, 2023 19:10
@bentsherman
Copy link
Member Author

bentsherman commented Mar 27, 2023

I wrote a little Python script to scrape the metadata files and render a task DAG with output files. Here's what the DAG looks like for rnaseq-nf:

flowchart TD
    t1["[88/25db6c] RNASEQ:FASTQC (FASTQC on ggal_gut)"]
    i1(( )) -->|ggal_gut_1.fq| t1
    i2(( )) -->|ggal_gut_2.fq| t1
    t3["[17/b42485] MULTIQC"]
    t2 -->|ggal_gut| t3
    t1 -->|fastqc_ggal_gut_logs| t3
    i3(( )) -->|multiqc| t3
    t2["[b7/c4c160] RNASEQ:QUANT (ggal_gut)"]
    t0 -->|index| t2
    i4(( )) -->|ggal_gut_1.fq| t2
    i5(( )) -->|ggal_gut_2.fq| t2
    t0["[33/4091b1] RNASEQ:INDEX (ggal_1_48850000_49020000)"]
    i6(( )) -->|ggal_1_48850000_49020000.Ggal71.500bpflank.fa| t0
    t3 -->|multiqc_report.html| o1(( ))
Loading

I think I will extend the task DAG renderer in this PR to also include the output files like this.

EDIT: added input files

@robsyme
Copy link
Collaborator

robsyme commented Mar 28, 2023

What happens when the input is a value? Does the DAG json print some serialized form of the value? Do we use the Kryo serialization when available?

@bentsherman
Copy link
Member Author

Currently it only tracks files. I haven't tried to track things like value and env because they can't be saved as workflow outputs. I guess I'm not clear what the argument is for tracking them -- annotating the task DAG? storing them in the metadata json?

@bentsherman
Copy link
Member Author

Cloud executor tests are failing because some of the file operations I'm doing aren't supported, I'll have to think more carefully about how to handle remote files.

@bentsherman
Copy link
Member Author

Thinking about how the task graph intersects with the automatic cleanup. If we save the task inputs and outputs to the .nextflow cache, so that we can resume the task even if the outputs are deleted, maybe then we don't need to save a JSON file for each task. Tower could retrieve this information from the .nextflow cache, it could even use Nextflow's CacheDB class to do it natively, no need to run a nextflow command.

I'm also thinking this because the automatic cleanup could delete the entire task directory, not just the output files, in that case the JSON file is pointless because it will just be deleted.

@netlify
Copy link

netlify bot commented Jul 7, 2023

Deploy Preview for nextflow-docs-staging canceled.

Name Link
🔨 Latest commit 501bce7
🔍 Latest deploy log https://app.netlify.com/sites/nextflow-docs-staging/deploys/651daafed6783d000899ad5e

@pditommaso pditommaso force-pushed the master branch 2 times, most recently from 81f7cb7 to 8a43489 Compare August 20, 2023 20:13
@bentsherman bentsherman changed the title Add task graph and cache task inputs/outputs metadata Task provenance Aug 22, 2023
@bentsherman bentsherman marked this pull request as draft September 13, 2023 20:15
@marcodelapierre
Copy link
Contributor

For reference, this discussion: #3447

@bentsherman
Copy link
Member Author

The CID store is far enough along now that I think it covers this effort. As for the ETags, I mentioned in #4729 that I don't think they will provide a complete solution. The task DAG can already be rendered by nf-prov.

This PR added the task provenance to the cache db, whereas the CID store currently adds it to a separate data store. That will give us more flexibility as we develop, and in a future iteration we can swap in the CID store as an alternative cache.

Closing in favor of #5715

@bentsherman bentsherman deleted the ben-task-graph branch March 13, 2025 13:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants