
Conversation

@bentsherman
Member

First pass at producing a task graph. Whenever a task is submitted (or discovered from the cache), it is added as a node to the task graph with its hash, name, and list of predecessors. Each task has a list of input files, and each input file path is traced back to its originating task.

Currently, you can produce the task graph by setting dag.type = 'task' in the config and using the Mermaid renderer:

nextflow run rnaseq-nf -with-docker -with-dag rnaseq-nf.mmd

Some lingering questions:

  1. This task graph only tracks processes and files, it does not track operators or Groovy objects. My reasoning is that a pipeline can only output files, and operators are typically only used to organize files rather than edit their contents. Also, tracking operators and Groovy objects would be more complicated (see Full provenance of tasks executions #3447), so we should only do it if we actually need it.

Since operators don't have working directories or hashes, I think it would be better to say "you should only create/edit files within processes if you want to have a complete provenance graph".

  2. How to expose the task graph? Currently we're thinking of exposing it through the TraceObserver events (so that plugins can access it), and also in the .command.trace file (see also TraceRecord) produced by each task, which would become a JSON file to better handle structured data like input/output files.

@pditommaso
Member

pditommaso commented Dec 5, 2022

This is a nice start, but as mentioned already I think we should go beyond tracking via file path hashes.

Also, I think it would be desirable to keep this graph independent of the current process DAG. The former is usually resolved ahead of the execution, while the task graph determines the execution provenance.

Regarding point 2, there could be two choices:

a) each task reports the upstream tasks in the TraceRecord
b) each task creates a new .command meta JSON file that lists all inputs and outputs (files and values) and, for each input, the corresponding task id

@pditommaso
Member

A possible JSON meta file could look like this:

{
    "inputs": [
      {
        "name": "some.bam",
        "size": 1000,
        "checksum": "766878799",
        "upstreamTask": "task-1324234"
      },
      ..
    ],
    "outputs": [
      {
        "name": "some.bam",
        "size": 1000,
        "checksum": "766878799"
      },
      ..
    ]
}
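Assuming each task writes a meta file in roughly the shape sketched above (the field names, such as upstreamTask, come from the sketch and are not a finalized spec), a downstream tool could stitch the per-task files into a provenance graph along these lines:

```python
def build_provenance_graph(meta_records):
    """Link each task to its upstream tasks via input-file metadata.

    meta_records: dict mapping task id -> parsed meta JSON,
    following the proposed schema above (hypothetical field names).
    Returns a dict mapping task id -> set of upstream task ids.
    """
    graph = {}
    for task_id, meta in meta_records.items():
        graph[task_id] = {
            inp["upstreamTask"]
            for inp in meta.get("inputs", [])
            if "upstreamTask" in inp
        }
    return graph

# Hypothetical meta records for a two-step pipeline
records = {
    "task-1": {"inputs": [], "outputs": [{"name": "some.bam"}]},
    "task-2": {"inputs": [{"name": "some.bam", "upstreamTask": "task-1"}],
               "outputs": [{"name": "some.vcf"}]},
}
graph = build_provenance_graph(records)
```

This is the "pull" side of option (b): the graph is reconstructed after the fact from files on disk, without any live coordination with the workflow engine.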

@bentsherman
Member Author

Okay, I added the input/output files to the trace record with some basic metadata. I didn't need to do anything with the .command.trace file because the TaskHandler already has this information when it creates the trace record. Since the tower plugin just sends the entire trace record as a JSON object, it will also send the new inputs/outputs metadata.

@achristofferson-bbi

Will this feature be compatible with the -resume option?

For example, suppose:
step 1) runs Mutect2 sharded by chromosome
step 2) merges the sharded per-chromosome calls, then deletes the sharded files
step 3) filters the merged Mutect2 calls

Now let's say we update the filter in step 3) but don't want to start over from step 1) or step 2), because nothing in those tasks has changed. Will it be able to pick up right at step 3)?

@bentsherman
Member Author

The task graph should work just fine with resume. It simply receives tasks from the task processor and it doesn't care whether they are new or cached.
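A minimal sketch of why resume is a non-issue (the class and method names here are invented for illustration, not Nextflow's API): the graph simply records each task as it arrives from the processor, and a cached task is recorded exactly like a freshly executed one:

```python
class TaskGraph:
    """Toy task graph: nodes keyed by task hash, edges to predecessors."""

    def __init__(self):
        self.nodes = {}

    def on_task(self, task_hash, name, predecessors, cached):
        # Cached and freshly executed tasks are recorded identically;
        # `cached` is kept only as node metadata.
        self.nodes[task_hash] = {
            "name": name,
            "predecessors": list(predecessors),
            "cached": cached,
        }

# Resuming the scenario above: steps 1 and 2 are restored from cache,
# only the updated filter step actually re-runs.
graph = TaskGraph()
graph.on_task("aa11", "MUTECT2 (chr1)", [], cached=True)
graph.on_task("bb22", "MERGE_CALLS", ["aa11"], cached=True)
graph.on_task("cc33", "FILTER_CALLS", ["bb22"], cached=False)
```

Either way, the resulting graph has the same shape; the cached flag only distinguishes how each node's result was obtained.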

@pditommaso pditommaso force-pushed the master branch 2 times, most recently from cefb067 to e523afd Compare December 22, 2022 20:43
@pditommaso pditommaso force-pushed the master branch 2 times, most recently from 0d59b4c to b93634e Compare March 11, 2023 11:20
@bentsherman bentsherman changed the title Add initial task graph Add initial task graph (push model) Mar 27, 2023
@bentsherman
Member Author

Closing this PR because it uses a push model (Nextflow pushes task metadata to Tower during execution) whereas we want to use a pull model (Tower pulls task metadata from work directory after pipeline execution). I will create a new PR for the new approach.

@bentsherman bentsherman mentioned this pull request Mar 27, 2023