Skip to content

GH-46546: [CI][Dev][Python] Use pre-commit for numpydoc #46595

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

kou
Copy link
Member

@kou kou commented May 27, 2025

Rationale for this change

We want to migrate to pre-commit from archery lint.

What changes are included in this PR?

Use pre-commit for numpydoc but this doesn't support Cython files.

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

@kou kou requested review from raulcd, AlenkaF and rok as code owners May 27, 2025 03:06
Copy link

⚠️ GitHub issue #46546 has been automatically assigned in GitHub to PR creator.

@kou
Copy link
Member Author

kou commented May 27, 2025

archery lint has our own Cython integration code:

class NumpyDoc:
IGNORE_VALIDATION_ERRORS_FOR_TYPE = {
# Enum function signatures should never be documented
EnumMeta: ["PR01"]
}
def __init__(self, symbols=None):
if not have_numpydoc:
raise RuntimeError(
'Numpydoc is not available, install with command: '
'pip install numpydoc==1.1.0'
)
self.symbols = set(symbols or {'pyarrow'})
def traverse(self, fn, obj, from_package):
"""Apply a function on publicly exposed API components.
Recursively iterates over the members of the passed object. It omits
any '_' prefixed and thirdparty (non pyarrow) symbols.
Parameters
----------
fn : callable
A function to apply on all traversed objects.
obj : Any
The object to start from.
from_package : string
Predicate to only consider objects from this package.
"""
todo = [obj]
seen = set()
while todo:
obj = todo.pop()
if obj in seen:
continue
else:
seen.add(obj)
fn(obj)
for name in dir(obj):
if name.startswith('_'):
continue
member = getattr(obj, name)
module = _get_module(member)
if module is None or not module.startswith(from_package):
continue
# Is it a Cython-generated method? If so, try to detect
# whether it only has a implicitly-generated docstring,
# and no user-defined docstring following it.
# The generated docstring would lack description of method
# parameters and therefore fail Numpydoc validation.
if hasattr(member, '__objclass__'):
doc = getattr(member, '__doc__', None)
# The Cython-generated docstring would be a one-liner,
# such as "ReadOptions.equals(self, ReadOptions other)".
if (doc and '\n' not in doc and f'.{name}(' in doc):
continue
todo.append(member)
@contextmanager
def _apply_patches(self):
"""
Patch Docstring class to bypass loading already loaded python objects.
"""
orig_load_obj = Docstring._load_obj
orig_signature = inspect.signature
@staticmethod
def _load_obj(obj):
# By default it expects a qualname and import the object, but we
# have already loaded object after the API traversal.
if isinstance(obj, str):
return orig_load_obj(obj)
else:
return obj
def signature(obj):
# inspect.signature tries to parse __text_signature__ if other
# properties like __signature__ doesn't exists, but cython
# doesn't set that property despite that embedsignature cython
# directive is set. The only way to inspect a cython compiled
# callable's signature to parse it from __doc__ while
# embedsignature directive is set during the build phase.
# So path inspect.signature function to attempt to parse the first
# line of callable.__doc__ as a signature.
try:
return orig_signature(obj)
except Exception as orig_error:
try:
return inspect_signature(obj)
except Exception:
raise orig_error
try:
Docstring._load_obj = _load_obj
inspect.signature = signature
yield
finally:
Docstring._load_obj = orig_load_obj
inspect.signature = orig_signature
def validate(self, from_package='', allow_rules=None,
disallow_rules=None):
results = []
def callback(obj):
try:
result = validate(obj)
except OSError as e:
symbol = f"{_get_module(obj, default='')}.{obj.__name__}"
logger.warning(f"Unable to validate `{symbol}` due to `{e}`")
return
errors = []
for errcode, errmsg in result.get('errors', []):
if allow_rules and errcode not in allow_rules:
continue
if disallow_rules and errcode in disallow_rules:
continue
if any(isinstance(obj, obj_type) and errcode in errcode_list
for obj_type, errcode_list
in NumpyDoc.IGNORE_VALIDATION_ERRORS_FOR_TYPE.items()):
continue
errors.append((errcode, errmsg))
if len(errors):
result['errors'] = errors
results.append((obj, result))
with self._apply_patches():
for symbol in self.symbols:
try:
obj = Docstring._load_obj(symbol)
except (ImportError, AttributeError):
print(f'{symbol} is not available for import')
else:
self.traverse(callback, obj, from_package=from_package)
return results

But we can't use it with numpydoc lint.

Do we need the Cython integration?

^python/pyarrow/
exclude: >-
(
?^python/pyarrow/interchange/from_dataframe\.py|
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JFYI: other files contains $ character like ?^python/pyarrow/interchange/from_dataframe\.py$

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!
I've added $.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels May 27, 2025
@kou
Copy link
Member Author

kou commented May 28, 2025

@pitrou Is Cython docstring validation by numpydoc important for us?

It was implemented by #12698 only for us. If it's important, can we implement it in upstream instead of implementing it in Archery?

@pitrou
Copy link
Member

pitrou commented May 28, 2025

@pitrou Is Cython docstring validation by numpydoc important for us?

We want to validate docstrings regardless of how the functions are generated, yes.

It was implemented by #12698 only for us. If it's important, can we implement it in upstream instead of implementing it in Archery?

We can probably do that. However, we may have to wait for cython/cython#6904 to be released first.

cc @raulcd

@kou
Copy link
Member Author

kou commented May 30, 2025

OK. I've opened #46641 for it.

I'll merge this if nobody objects it. Because we can remove Archery based lint from our lint CI job by merging this. It will solve #46528.

@pitrou
Copy link
Member

pitrou commented May 30, 2025

The problem is we don't know if numpydoc will accept a contribution for it.

We can still use Archery-based lint for numpydoc. I don't think there's any urgency.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants