
mishmosh
Contributor

@mishmosh mishmosh commented Apr 3, 2025

Currently, CIDs can be generated with a variety of settings and optimizations for chunking, DAG width, and more. This means the same file can yield multiple, different CIDs depending on which tools and settings are used, and it is not possible to reliably reproduce or verify the CID.

This proposal introduces profiles for IPFS CIDs. Profiles explicitly define CID version, hash algorithm, chunk size, DAG width, layout, and other parameters. They can be used to verify data across implementations, provide recommended settings depending on retrieval performance goals, and more.
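To make the problem concrete, here is a minimal sketch of a profile as a plain record, plus a toy Merkle root that depends on it. Everything here is an assumption for illustration: the `Profile` field names and the two sets of values are hypothetical, and `toy_root` is not UnixFS or dag-pb encoding and does not produce a CID. It only shows why the same bytes give different roots when chunk size or DAG width changes.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    # Hypothetical field names; not taken from the IPIP text.
    cid_version: int
    hash_algorithm: str
    chunk_size: int   # bytes per leaf chunk
    max_links: int    # DAG width: children per parent node
    layout: str       # recorded only; the toy below always builds a balanced tree


def toy_root(data: bytes, profile: Profile) -> str:
    """Toy Merkle root. NOT real UnixFS/dag-pb encoding and NOT a CID."""
    if profile.chunk_size < 1 or profile.max_links < 2:
        raise ValueError("chunk_size must be >= 1 and max_links >= 2")

    def digest(block: bytes) -> bytes:
        return hashlib.new(profile.hash_algorithm, block).digest()

    # Split into fixed-size chunks and hash each one as a leaf.
    level = [digest(data[i:i + profile.chunk_size])
             for i in range(0, len(data), profile.chunk_size)] or [digest(b"")]
    # Group up to max_links hashes under one parent until a single root remains.
    while len(level) > 1:
        level = [digest(b"".join(level[i:i + profile.max_links]))
                 for i in range(0, len(level), profile.max_links)]
    return level[0].hex()


data = bytes(1_000_000)  # one megabyte of zeros
a = Profile(1, "sha256", 262_144, 174, "balanced")     # illustrative values
b = Profile(1, "sha256", 1_048_576, 1024, "balanced")  # illustrative values
print(toy_root(data, a) != toy_root(data, b))  # same bytes, different roots
```

With a named profile, two tools only need to agree on the profile name to arrive at the same root for the same input.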

@mishmosh mishmosh requested a review from a team as a code owner April 3, 2025 14:03
@mishmosh mishmosh changed the title from "Create ipip-0000.md: CID profiles" to "IPIP 0499: CID Profiles" Apr 3, 2025
lidel added a commit to ipfs/kubo that referenced this pull request Apr 15, 2025
lets make the fanout match the max links from files
and rename profile to `-wide`

this will make it easier to discuss in ipfs/specs#499
lidel and others added 2 commits April 15, 2025 23:41
Import.* config params for controlling DAG width were added in:
ipfs/kubo#10774
@lidel
Member

lidel commented Apr 15, 2025

Thank you for kicking this off and filling in the initial state.

I've incorporated specific "dag width" settings for File, Directory and HAMTDirectory nodes,
and updated the table to reflect the state from ipfs/kubo#10774
and the profiles that exist in the Kubo master branch: legacy-cid-v0, test-cid-v1 and test-cid-v1-wide.

Next:

  • agree what "cid-2025" profile should look like
    • this will be new default in "Kubo v1.0"
    • we have test-cid-v1 and test-cid-v1-wide in Kubo as potential candidates
  • switch to PR from local branch (so we have build preview)
  • figure out how to render the information (currently the table is not supported by https://github.com/ipfs/spec-generator)

@SethDocherty

This comment was marked as off-topic.

@2color
Member

2color commented Aug 12, 2025

I pushed a bunch of edits to move the conversation forward. This is sorely needed in the ecosystem, and the hope is that by building consensus we can improve developer experience when working with UnixFS and the overall health of the UnixFS ecosystem.

Feedback is always appreciated.

1. UnixFS DAG layout (e.g. balanced, trickle)
1. UnixFS DAG width (max number of links per `File` node)
1. `HAMTDirectory` fanout (must be a power of 2)
1. `HAMTDirectory` threshold (max `Directory` size before switching to `HAMTDirectory`): based on an estimate of the block size by counting the size of PBNode.Links
Contributor

If this number is dynamic based on the lengths of the actual link entries in the dag, we will need to specify what algorithm that estimation follows. I would put such things in a special "ipfs legacy" profile to be honest, along with cidv0, non-raw leaves etc. We probably should heavily discourage coming up with profiles that do weird things, like dynamically setting params or not using raw-leaves for things.
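For discussion, one possible estimation could be sketched as below. This is an assumption for illustration, not a statement of what any implementation does: it sums the UTF-8 length of each link name and the byte length of each link's binary CID, and switches to a HAMT when the sum exceeds the threshold. The function names, the strict `>` comparison, and the 256 KiB threshold are all hypothetical, and the CID bytes are placeholders.

```python
def estimated_links_size(links: list[tuple[str, bytes]]) -> int:
    """Estimate directory block size from (name, binary CID) pairs.

    Assumed rule: sum of name length in UTF-8 bytes plus CID length in bytes.
    """
    return sum(len(name.encode("utf-8")) + len(cid_bytes)
               for name, cid_bytes in links)


def should_use_hamt(links: list[tuple[str, bytes]], threshold_bytes: int) -> bool:
    # Whether the comparison is > or >= is exactly the kind of detail
    # a profile would have to pin down; > is an assumption here.
    return estimated_links_size(links) > threshold_bytes


# Placeholder CIDs: 36 zero bytes stand in for a binary CIDv1 with a sha2-256 hash.
links = [("a.txt", bytes(36)), ("b.txt", bytes(36))]
print(estimated_links_size(links))            # 82
print(should_use_hamt(links, 256 * 1024))     # False
```

Because the result depends on the actual entry names, two directories with the same number of entries can land on different sides of the threshold.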

Contributor

So, each layout would have its own set of layout-params:

  • balanced:
    • max-links: N
  • trickle:
    • max-leaves-per-level: N
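As a rough sketch of that shape, each layout could carry its own typed parameter record. The class and function names below are hypothetical, and `balanced_level_sizes` models only an idealised balanced tree (every parent filled up to `max_links` before a new one is started), which may differ from what a real builder emits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BalancedParams:
    max_links: int  # maximum children per File node


@dataclass(frozen=True)
class TrickleParams:
    max_leaves_per_level: int  # carried for completeness; not modelled below


def balanced_level_sizes(num_chunks: int, params: BalancedParams) -> list[int]:
    """Node count per level, leaves first, for an idealised balanced tree."""
    if num_chunks < 1 or params.max_links < 2:
        raise ValueError("need at least one chunk and max_links >= 2")
    sizes = [num_chunks]
    while sizes[-1] > 1:
        # Ceiling division: parents needed to hold the previous level.
        sizes.append(-(-sizes[-1] // params.max_links))
    return sizes


print(balanced_level_sizes(1000, BalancedParams(max_links=174)))   # [1000, 6, 1]
print(balanced_level_sizes(1000, BalancedParams(max_links=1024)))  # [1000, 1]
```

The same 1000 chunks produce a three-level DAG at width 174 and a two-level DAG at width 1024, so the layout parameters alone change the root.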

Member

> We probably should heavily discourage coming up with profiles that do weird things, like dynamically setting params or not using raw-leaves for things.

Yeah, that's exactly what we're doing by defining this profile.

Comment on lines 57 to 58
1. Whether empty directories are included in the DAG
- Some implementations apply filtering before merkleizing filesystem entries in the DAG.
Contributor

This is weird, because then we need to consider empty files, hidden files, unreadable files, symlinks and symlink follows, so probably need to mention all those as part of the profile too?

Member

@2color 2color Aug 20, 2025

This is motivated by Git's default behaviour which ignores empty directories.

But we can mention here the rest.


### Compatibility

UnixFS Data encoded with the profiles defined in this IPIP is fully compatible with existing implementations, as it is fully compliant with the UnixFS specification.
Contributor

It cannot be compliant with details that are not specified as of today.

Member

Contingent on #331

1. UnixFS chunk size
1. UnixFS DAG layout (e.g. balanced, trickle)
1. UnixFS DAG width (max number of links per `File` node)
1. `HAMTDirectory` fanout (must be a power of 2)
Member

this can alternatively be called "bitwidth" and you just use the number of bits for this, it's what we're doing in all the other hamts we have. So the default bitwidth is 8 = 256 leaves, bitwidth of 5 would be 32, etc.
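The relationship is just a power of two, so a profile could store either form. A small sketch (the function names are hypothetical):

```python
def fanout_from_bitwidth(bitwidth: int) -> int:
    """Fanout is 2 raised to the bitwidth."""
    if bitwidth < 1:
        raise ValueError("bitwidth must be at least 1")
    return 1 << bitwidth


def bitwidth_from_fanout(fanout: int) -> int:
    """Inverse mapping; rejects any fanout that is not a power of 2."""
    if fanout < 2 or fanout & (fanout - 1) != 0:
        raise ValueError("fanout must be a power of 2")
    return fanout.bit_length() - 1


print(fanout_from_bitwidth(8))    # 256
print(fanout_from_bitwidth(5))    # 32
print(bitwidth_from_fanout(256))  # 8
```

Storing the bitwidth makes the "must be a power of 2" constraint impossible to violate, since every bitwidth maps to a valid fanout.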

1. Leaf Envelope: either `dag-pb` or `raw`
1. Whether empty directories are included in the DAG
- Some implementations apply filtering before merkleizing filesystem entries in the DAG.

Member

couple of other things to consider?

  • Directory wrapping at the top level (for just files, kubo has an option to wrap in a directory so you get file metadata)
  • Presence and accurate setting of Tsize - at one point we were going to deprecate this field for some cases, although I think all our encoders now do it properly, you could just mandate this in the spec though -- all valid profiles must properly encode Tsize.

Member

I've added this as a parameter.

According to the latest version of https://github.com/ipfs/specs/pull/331/files, the calculation is done as follows:

> To compute the Tsize of a child DAG, sum the length of the dag-pb outside message binary length and the blocksizes of all nodes in the child DAG.

If Tsize is calculated this way, does that make it accurate?
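One reading of that rule, as a sketch only: the Tsize recorded on a link is the encoded size of the child block plus the Tsize of everything below it. The `Node` type is a hypothetical stand-in for an already-encoded block, and this interpretation is an assumption; in particular, a block reachable through two paths is counted twice here, which may or may not be what the spec intends.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    block_size: int  # length in bytes of this node's encoded block
    children: list["Node"] = field(default_factory=list)


def tsize(node: Node) -> int:
    """Assumed rule: own encoded size plus the Tsize of every linked child."""
    return node.block_size + sum(tsize(child) for child in node.children)


leaf_a = Node(block_size=100)
leaf_b = Node(block_size=200)
mid = Node(block_size=50, children=[leaf_a, leaf_b])
root = Node(block_size=60, children=[mid])

print(tsize(mid))   # 350: the value a link from root to mid would carry
print(tsize(root))  # 410
```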

Member

sounds about right, I remember there being some nuance in exactly what's included in the size calculation, making it not super stable if you get it slightly wrong (as we did for some variants in go-unixfsnode for a while)

8 participants