Data lineage tracking (aka CID store) #5715
Signed-off-by: Paolo Di Tommaso <[email protected]>
@jorgee apologies, can the latest changes be made as a PR against this branch? That would make it much simpler for me to understand what's new.
Signed-off-by: jorgee <[email protected]>
Signed-off-by: Ben Sherman <[email protected]>
modules/nextflow/src/main/groovy/nextflow/cli/CmdLineage.groovy
After playing around with the lineage command, I am skeptical about how much we are overloading this lid pseudo-filesystem. I thought it was just a nice add-on that we could experiment with, but now I think it's just getting in the way.
Currently there are three main uses for lid paths:
1. lid://<hash>[#props]: returns a metadata record or sub-path. This has no practical utility in a Nextflow script, not even for workflow outputs. Now that #outputs is a list, I can't access an output by name (e.g. #outputs.samples), which means I can't use channel.fromPath() to access an LID output in the same way as a samplesheet. So the LID output is no longer a drop-in replacement for samplesheets.
   On the command line, it would be simpler to just provide the hash and use jq:
       # before
       # oops, forgot to escape the #...
       nextflow li describe lid://<hash>#params
       # after
       nextflow li describe <hash> | jq .params
   In a web interface like the platform, you'll use a graphical interface to navigate this metadata, so the fragment syntax is not needed there.
2. lid:///?<name>=<value>&...: used by the find command to retrieve a collection of metadata records. This also has no utility in a Nextflow script, because it is unrelated to domain-specific data like #outputs. It is only used by the find command, so the URI syntax is just getting in the way:
       # before
       # oops, forgot to escape the & ...
       nextflow li find lid:///?type=DataOutput&workflowRun=lid://2265a814fd1c205ecc5b629070d759e2
       # after
       nextflow li find type=DataOutput workflowRun=2265a814fd1c205ecc5b629070d759e2
3. lid://<hash>/<path>: returns a content-addressed file. This is the original use case and the only one that still makes sense as far as I can tell. I think this works perfectly both on the command line and in the Nextflow script/runtime.
Based on this analysis, I think we should ditch (1) and (2) entirely and use lid:// only to refer to files.
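To illustrate the point about (2): the query URI can be derived mechanically from plain key=value arguments, so the user never has to type (or escape) it. The following is only a sketch under assumptions; `lid_query` is a hypothetical helper written for this illustration, not part of the Nextflow CLI, and it does not decide whether values such as workflowRun still need a lid:// prefix.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nextflow): join key=value arguments
# into the lid:///?name=value&... form used by the "before" example.
lid_query() {
    query=
    for kv in "$@"; do
        case $kv in
            *=*) ;;  # looks like key=value, accept it
            *) echo "expected key=value, got: $kv" >&2; return 1 ;;
        esac
        # Prepend "<previous>&" only when a previous pair exists
        query=${query:+"$query&"}$kv
    done
    printf 'lid:///?%s\n' "$query"
}

# No quoting or escaping is needed on the calling side
lid_query type=DataOutput workflowRun=2265a814fd1c205ecc5b629070d759e2
```

Because each pair is its own shell word, the `&` separator never reaches the shell, which is the escaping problem the "before" example runs into.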
Maybe we could use the fragment to refer to a specific output, e.g. lid://<hash>#samples. That would at least restore the original use case of passing a workflow output as input to a downstream pipeline.
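As a rough sketch of what would remain if lid:// only referred to files, plus an optional fragment naming an output, the URI splits into three parts: hash, path and fragment. `lid_split` is a hypothetical helper for illustration only, and the path `results/samples.csv` is a made-up example; none of this reflects how Nextflow actually parses these URIs.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nextflow): split lid://<hash>[/<path>][#<fragment>]
# and print the parts on one line. Missing parts are printed as empty values.
lid_split() {
    uri=$1
    case $uri in
        lid://*) rest=${uri#lid://} ;;
        *) echo "not a lid URI: $uri" >&2; return 1 ;;
    esac
    # Fragment: everything after the first '#'
    case $rest in
        *"#"*) fragment=${rest#*"#"}; rest=${rest%%"#"*} ;;
        *) fragment= ;;
    esac
    # Path: everything after the first '/' that follows the hash
    case $rest in
        */*) path=${rest#*/}; hash=${rest%%/*} ;;
        *) path=; hash=$rest ;;
    esac
    printf 'hash=%s path=%s fragment=%s\n' "$hash" "$path" "$fragment"
}

# Quote the URI so the shell passes '#' through untouched
lid_split 'lid://2265a814fd1c205ecc5b629070d759e2/results/samples.csv'
lid_split 'lid://2265a814fd1c205ecc5b629070d759e2#samples'
```

With only these two shapes to support, the query form `lid:///?...` never has to be parsed at all.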
The |
Based on previous comments, I have pushed some minor changes:
TODOs from our discussion:
This PR:
Separate PR(s):
Ok, I've moved the h2 stuff to the corresponding repo: https://github.com/nextflow-io/nf-lineage-h2
This is already changed in the current PR.
Let's move ahead, then.
    private static String HISTORY_FILE_NAME = ".history"
    private static final String METADATA_FILE = '.data.json'
    private static final String METADATA_PATH = '.meta'
@jorgee what is the rationale for the .meta subfolder? It looks like the only thing that is created in the lineage folder. Do you intend to store other things under lineage as well?
@bentsherman, the first implementation stored both the output data and the metadata in the lineage folder, but currently this sub-folder is not needed.
Okay, not urgent but something to consider in the final cleanup before 25.04
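To make the cleanup being discussed concrete, here is a small sketch of how a record path might look with and without the .meta sub-folder. The layout `<root>/.meta/<key>/.data.json` is an assumption made for this illustration: the PR only shows the constants `.meta` and `.data.json`, not how they are combined. `lineage_record_path` and the root name `lineage` are likewise hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nextflow). ASSUMED layout:
#   with sub-folder:    <root>/<meta_dir>/<key>/.data.json
#   without sub-folder: <root>/<key>/.data.json
# The third argument defaults to ".meta"; pass '' to drop the sub-folder.
lineage_record_path() {
    root=$1
    key=$2
    meta_dir=${3-.meta}
    if [ -n "$meta_dir" ]; then
        printf '%s/%s/%s/.data.json\n' "$root" "$meta_dir" "$key"
    else
        printf '%s/%s/.data.json\n' "$root" "$key"
    fi
}

lineage_record_path lineage 2265a814fd1c205ecc5b629070d759e2
lineage_record_path lineage 2265a814fd1c205ecc5b629070d759e2 ''
```

If nothing else is ever stored next to .meta, the second form is one directory level shallower with no loss of information.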

Tentative implementation for addressable data store (very basic POC so far).
Update on 1 Mar 2025 from #5787 by @jorgee
M1 Implementation of CID store for provenance
Changes:
Known Limitations: