
Conversation

@bentsherman
Member

First pass at producing a task graph. Whenever a task is submitted (or discovered from the cache), it is added as a node to the task graph with its hash, name, and list of predecessors. Each task has a list of input files, and each input file path is traced back to its originating task.

Currently, you can produce the task graph by setting dag.type = 'task' in the config and using the Mermaid renderer:

nextflow run rnaseq-nf -with-docker -with-dag rnaseq-nf.mmd

Some lingering questions:

  1. This task graph only tracks processes and files, it does not track operators or Groovy objects. My reasoning is that a pipeline can only output files, and operators are typically only used to organize files rather than edit their contents. Also, tracking operators and Groovy objects would be more complicated (see Full provenance of tasks executions #3447), so we should only do it if we actually need it.

Since operators don't have working directories or hashes, I think it would be better to say "you should only create/edit files within processes if you want to have a complete provenance graph".

  2. How to expose the task graph? Currently we're thinking of exposing it through the TraceObserver events (so that plugins can access it), and also in the .command.trace file (see also TraceRecord) produced by each task, which would become a JSON file to better handle structured data like input/output files.

@pditommaso
Member

pditommaso commented Dec 5, 2022

This is a nice start, but as mentioned already I think we should go beyond tracking via file path hashes.

Also, I think it would be desirable to keep this graph independent of the current process DAG. The former is usually resolved ahead of the execution, while the task graph determines the execution provenance.

Regarding point 2, there could be two choices:

a) each task reports the upstream tasks in the TraceRecord
b) each task creates a new .command meta JSON file that lists all inputs and outputs (files and values) and, for each input, the corresponding task id

@pditommaso
Member

A possible JSON meta file could look like this:

{
    "inputs": [
      {
        "name": "some.bam",
        "size": 1000,
        "checksum": "766878799",
        "upstreamTask": "task-1324234"
      },
      ..
    ],
    "outputs": [
      {
        "name": "some.bam",
        "size": 1000,
        "checksum": "766878799"
      },
      ..
    ]
}
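Assuming each task writes a meta file in roughly the shape sketched above (the field names, such as upstreamTask, come from the sketch and are not a finalized spec), a downstream tool could stitch the per-task files into a provenance graph along these lines:

```python
def build_provenance_graph(meta_records):
    """Link each task to its upstream tasks via input-file metadata.

    meta_records: dict mapping task id -> parsed meta JSON,
    following the proposed schema above (hypothetical field names).
    Returns a dict mapping task id -> set of upstream task ids.
    """
    graph = {}
    for task_id, meta in meta_records.items():
        graph[task_id] = {
            inp["upstreamTask"]
            for inp in meta.get("inputs", [])
            if "upstreamTask" in inp
        }
    return graph

# Hypothetical meta records for a two-step pipeline
records = {
    "task-1": {"inputs": [], "outputs": [{"name": "some.bam"}]},
    "task-2": {"inputs": [{"name": "some.bam", "upstreamTask": "task-1"}],
               "outputs": [{"name": "some.vcf"}]},
}
graph = build_provenance_graph(records)
```

This is the "pull" side of option (b): the graph is reconstructed after the fact from files on disk, without any live coordination with the workflow engine.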

@bentsherman
Member Author

Okay, I added the input/output files to the trace record with some basic metadata. I didn't need to do anything with the .command.trace file because the TaskHandler already has this information when it creates the trace record. Since the tower plugin just sends the entire trace record as a JSON object, it will also send the new inputs/outputs metadata.

@achristofferson-bbi

Will this feature be compatible with the -resume option?

For example, suppose:
step 1) runs Mutect2 sharded by chromosome
step 2) merges the sharded per-chromosome calls, then deletes the sharded files
step 3) filters the merged Mutect2 calls

Now let's say we update the filter in step 3) but don't want to start over from step 1) or step 2), because nothing in those tasks has changed. Will it be able to pick up right at step 3)?

@bentsherman
Member Author

The task graph should work just fine with resume. It simply receives tasks from the task processor and it doesn't care whether they are new or cached.
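A minimal sketch of why resume is a non-issue (the class and method names here are invented for illustration, not Nextflow's API): the graph simply records each task as it arrives from the processor, and a cached task is recorded exactly like a freshly executed one:

```python
class TaskGraph:
    """Toy task graph: nodes keyed by task hash, edges to predecessors."""

    def __init__(self):
        self.nodes = {}

    def on_task(self, task_hash, name, predecessors, cached):
        # Cached and freshly executed tasks are recorded identically;
        # `cached` is kept only as node metadata.
        self.nodes[task_hash] = {
            "name": name,
            "predecessors": list(predecessors),
            "cached": cached,
        }

# Resuming the scenario above: steps 1 and 2 are restored from cache,
# only the updated filter step actually re-runs.
graph = TaskGraph()
graph.on_task("aa11", "MUTECT2 (chr1)", [], cached=True)
graph.on_task("bb22", "MERGE_CALLS", ["aa11"], cached=True)
graph.on_task("cc33", "FILTER_CALLS", ["bb22"], cached=False)
```

Either way, the resulting graph has the same shape; the cached flag only distinguishes how each node's result was obtained.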

@pditommaso pditommaso force-pushed the master branch 2 times, most recently from cefb067 to e523afd Compare December 22, 2022 20:43
@pditommaso pditommaso force-pushed the master branch 2 times, most recently from 0d59b4c to b93634e Compare March 11, 2023 11:20
@bentsherman bentsherman changed the title Add initial task graph Add initial task graph (push model) Mar 27, 2023
@bentsherman
Member Author

Closing this PR because it uses a push model (Nextflow pushes task metadata to Tower during execution) whereas we want to use a pull model (Tower pulls task metadata from work directory after pipeline execution). I will create a new PR for the new approach.

@bentsherman bentsherman mentioned this pull request Mar 27, 2023