Task provenance #3802
Conversation
I wrote a little Python script to scrape the metadata files and render a task DAG with output files. Here's what the DAG looks like for rnaseq-nf:

```mermaid
flowchart TD
    t1["[88/25db6c] RNASEQ:FASTQC (FASTQC on ggal_gut)"]
    i1(( )) -->|ggal_gut_1.fq| t1
    i2(( )) -->|ggal_gut_2.fq| t1
    t3["[17/b42485] MULTIQC"]
    t2 -->|ggal_gut| t3
    t1 -->|fastqc_ggal_gut_logs| t3
    i3(( )) -->|multiqc| t3
    t2["[b7/c4c160] RNASEQ:QUANT (ggal_gut)"]
    t0 -->|index| t2
    i4(( )) -->|ggal_gut_1.fq| t2
    i5(( )) -->|ggal_gut_2.fq| t2
    t0["[33/4091b1] RNASEQ:INDEX (ggal_1_48850000_49020000)"]
    i6(( )) -->|ggal_1_48850000_49020000.Ggal71.500bpflank.fa| t0
    t3 -->|multiqc_report.html| o1(( ))
```

I think I will extend the task DAG renderer in this PR to also include the output files like this.

EDIT: added the input files as well.
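The scraping script itself isn't included in the comment, but a minimal sketch along these lines, assuming only the `.command.meta.json` fields shown in the PR description (`hash`, `inputs[].name`, `inputs[].predecessor`), would be enough to reproduce a DAG like the one above. It labels each node with a truncated task hash rather than the process name, since the example meta file doesn't record process names.

```python
#!/usr/bin/env python3
# Hypothetical sketch of the scraper described above: walk a Nextflow work
# directory, load every .command.meta.json file, and print a Mermaid
# "flowchart TD" of the task DAG. Only the field names shown in the example
# meta file are assumed; everything else is illustrative.
import json
import sys
from pathlib import Path

def main(work_dir):
    # Load every task meta file, keyed by the task hash
    tasks = {}
    for meta_path in Path(work_dir).rglob('.command.meta.json'):
        meta = json.loads(meta_path.read_text())
        tasks[meta['hash']] = meta

    node_ids = {h: f't{i}' for i, h in enumerate(tasks)}
    lines = ['flowchart TD']
    next_source = 0

    for task_hash, meta in tasks.items():
        tid = node_ids[task_hash]
        lines.append(f'    {tid}["{task_hash[:8]}"]')
        for inp in meta.get('inputs', []):
            pred = inp.get('predecessor')
            if pred in node_ids:
                # Input produced by an upstream task: draw an edge from it
                lines.append(f'    {node_ids[pred]} -->|{inp["name"]}| {tid}')
            else:
                # External input (e.g. pipeline assets): anonymous source node
                lines.append(f'    i{next_source}(( )) -->|{inp["name"]}| {tid}')
                next_source += 1

    print('\n'.join(lines))

if __name__ == '__main__':
    main(sys.argv[1] if len(sys.argv) > 1 else 'work')
```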
What happens when the input is a value? Does the DAG json print some serialized form of the value? Do we use the Kryo serialization when available?
Currently it only tracks files. I haven't tried to track things like value inputs yet.
Cloud executor tests are failing because some of the file operations I'm doing aren't supported; I'll have to think more carefully about how to handle remote files.
Thinking about how the task graph intersects with the automatic cleanup. If we save the task inputs and outputs to the cache db, the separate JSON file may not be needed. I'm also thinking this because the automatic cleanup could delete the entire task directory, not just the output files; in that case the JSON file is pointless because it will just be deleted.
For reference, this discussion: #3447
The CID store is far enough along now that I think it covers this effort. As for the ETags, I mentioned in #4729 that I don't think they will provide a complete solution. The task DAG can already be rendered by nf-prov.

This PR added the task provenance to the cache db, whereas the CID store currently adds it to a separate data store. That will give us more flexibility as we develop, and in a future iteration we can swap in the CID store as an alternative cache.

Closing in favor of #5715.
Redo of #3463. Includes the following changes:

- `TaskDAG` class which tracks the task DAG as the pipeline is executed
- `dag.type` config option which allows the user to render the task DAG instead of the process DAG (currently only supported for Mermaid format)
- `.command.meta.json` file which is written on task completion and contains the task hash, input files, and output files

To test the DAG rendering, launch a pipeline with the following extra config:
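The original config snippet didn't survive in this thread; a minimal sketch, assuming the new `dag.type` option takes a `task` value alongside the existing `dag.enabled` and `dag.file` options, might look like this:

```groovy
// Hypothetical example config: dag.enabled and dag.file are existing Nextflow
// options; the 'task' value for dag.type is assumed from the PR description above.
dag {
    enabled = true
    type    = 'task'
    file    = 'dag.mmd'
}
```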
To test the task metadata file, simply launch a pipeline normally and inspect the task directories for `.command.meta.json` files.

Here is an example meta file from rnaseq-nf:

```json
{
  "hash": "4fb5c35e186526f64ee5a5cc2720e824",
  "inputs": [
    {
      "name": "ggal_gut",
      "path": "/work/ab/bc682dd08c72ded3c25640d3cb05ef/ggal_gut",
      "predecessor": "abbc682dd08c72ded3c25640d3cb05ef"
    },
    {
      "name": "fastqc_ggal_gut_logs",
      "path": "/work/c0/6030412e5508bc9c2f2d7b053eb882/fastqc_ggal_gut_logs",
      "predecessor": "c06030412e5508bc9c2f2d7b053eb882"
    },
    {
      "name": "multiqc",
      "path": "/.nextflow/assets/nextflow-io/rnaseq-nf/multiqc",
      "predecessor": null
    }
  ],
  "outputs": [
    {
      "name": "multiqc_report.html",
      "path": "/work/4f/b5c35e186526f64ee5a5cc2720e824/multiqc_report.html",
      "size": 1204127,
      "checksum": "a41d49c28135c51b7c92fb70cac70a66"
    }
  ]
}
```