Skip to content

Conversation

@rok
Copy link
Owner

@rok rok commented Aug 18, 2025

This proposes adding type annotation to pyarrow by adopting pyarrow-stubs into pyarrow. To do so we copy a subset of pyarrow-subs's stubfiles into pyarrow tree. We then add annotation checks for some stubsfiles and some test files and make sure checks pass. Annotation checks should be expanded until all (or most) project files are covered in future work.

PR introduces:

  1. adds pyarrow-stubs into arrow/python/pyarrow/
  2. fixes pyarrow-stubs to pass mypy and pyright check
  3. adds mypy and pyright check to CI (crudely)
  4. adds a tool (update_stub_docstrings.py) to keep annotation docstrings in sync with source docstrings
  5. adds docstring sync check to CI (crudely)

@rok rok force-pushed the gradual_pyarrow_stubs branch 2 times, most recently from 3548797 to 674f0cd Compare August 19, 2025 20:25
@rok rok force-pushed the gradual_pyarrow_stubs branch 2 times, most recently from db51538 to 67b4698 Compare August 21, 2025 17:55
Repository owner deleted a comment from github-actions bot Aug 21, 2025
@rok rok force-pushed the gradual_pyarrow_stubs branch 3 times, most recently from 3e78327 to ebb90a2 Compare August 21, 2025 19:29
@rok rok changed the title Gradual pyarrow stubs [Python] Gradually add pyarrow-stubs to Arrow Aug 21, 2025
@rok rok changed the title [Python] Gradually add pyarrow-stubs to Arrow [Python] Gradually add type checks to Arrow, initial step Aug 21, 2025
Repository owner deleted a comment from github-actions bot Aug 21, 2025
Repository owner deleted a comment from github-actions bot Aug 21, 2025
@rok rok marked this pull request as ready for review August 21, 2025 20:21
@rok rok requested a review from wjones127 as a code owner August 21, 2025 20:21
memory_pool: MemoryPool | None = None,
) -> BooleanArray: ...
@overload
def array(
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this approach scalable? I notice this doesn't seem to include some extension types such as tensor arrays?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you asking about annotating constructors per type? Overload approach is nice because it expresses the relationship between the argument and return types. We can use a single parameter type union and a single return union, but we then lose the relationship.

FixedShapeTensorArray is on line 4578, I'm not sure about other extension types. If we use type checkers to validate our tests we should get errors if our symbols are not annotated or if annotations disagree with tests. Our test coverage is pretty good and we can assume every symbol is tested at least once.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overloads are now removed and docstring sync is automated.
We should also add an automated alert for when symbols don't have annotations but should.

from .array import StringArray, StringViewArray

class StringBuilder(_Weakrefable):
"""
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are docstrings duplicated in the stubs files? Do they need to be kept in sync? Why (and how)?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my reply here. I am not yet sure we want to include docstrings here, but if we do we should definitely script their extraction. stubgen --include-docstrings does this but I suspect it doesn't read cython-only symbols.

@pitrou
Copy link

pitrou commented Aug 25, 2025

Is the __lib_pxi/xxx.pyi naming scheme documented anywhere?

@rok
Copy link
Owner Author

rok commented Sep 2, 2025

Is the __lib_pxi/xxx.pyi naming scheme documented anywhere?

It was adopted from pyarrow-stubs. Not really documented and we might want to change it.

@rok
Copy link
Owner Author

rok commented Sep 12, 2025

@pitrou

Is the __lib_pxi/xxx.pyi naming scheme documented anywhere?

Would __lib be a good naming scheme? Something else perhaps?

@pitrou
Copy link

pitrou commented Sep 15, 2025

Is the __lib_pxi/xxx.pyi naming scheme documented anywhere?

Would __lib be a good naming scheme? Something else perhaps?

Since we already have lib.pyi, why not also io.pyi etc.?

OTOH, we could put everything in a subdirectory (_typestubs or something). Would it be compatible with how type checkers work?

@pitrou
Copy link

pitrou commented Sep 15, 2025

As for docstrings: do they need to be in the repo, or can they be generated when building a wheel file?

@rok
Copy link
Owner Author

rok commented Sep 15, 2025

Is the __lib_pxi/xxx.pyi naming scheme documented anywhere?

Would __lib be a good naming scheme? Something else perhaps?

Since we already have lib.pyi, why not also io.pyi etc.?

OTOH, we could put everything in a subdirectory (_typestubs or something). Would it be compatible with how type checkers work?

I'll give these a shot, starting with a new folder approach.

As for docstrings: do they need to be in the repo, or can they be generated when building a wheel file?

I think that's doable and would make for nicer diffs. Tradeof would be docstring generation at build time would make wheel building less robust. Docsting generation is not terribly complex so I'd say it's worth it.

@rok rok force-pushed the gradual_pyarrow_stubs branch from 1870692 to a0ce53c Compare September 20, 2025 19:40
@rok
Copy link
Owner Author

rok commented Sep 20, 2025

Closing this in favor of apache#47609

@rok rok closed this Sep 20, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants