Skip to content

Conversation

@DimitriPapadopoulos
Copy link
Contributor

Copy link
Member

@rcap107 rcap107 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, thanks for the PR.

Overall, the missing asserts are useful (really wondering why they're missing in the first place). Other changes do not look like they improve the code in a meaningful way (B007, B004, B010); we can stil implement the changes, however.

What I am really concerned with is actually enforcing the bugbear rules. Looking at the docs, a lot of them look fairly esoteric and definitely not something most people (especially new contributors) would be familiar with. Enforcing those rules would make contributing far more complicated than it already is.

As a result, those rules should not be added to the default linting. I'll add them to my own ruleset, but it should not be in the default pyproject.


__all__ = ["ApplyToCols", "SingleColumnTransformer", "RejectColumn"]

_SELECTORS = selectors.all()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

selctors.all() isn't selecting all columns, it's selecting the all() selector, which selects all columns.

Suggested change
_SELECTORS = selectors.all()
# By default, select all columns
_SELECT_ALL_COLUMNS = selectors.all()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same change in all the other files


__all__ = ["ApplyToCols", "SingleColumnTransformer", "RejectColumn"]

_SELECTORS = selectors.all()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same change in all the other files

from ._subsampling import SubsamplePreviews, env_with_subsampling
from ._utils import KFOLD_5, NULL, attribute_error

_SELECTORS = selectors.all()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
_SELECTORS = selectors.all()
_SELECT_ALL_COLUMNS = selectors.all()

@rcap107
Copy link
Member

rcap107 commented Nov 13, 2025

As a more general comment for this and all the other PRs that you opened, formatting rules should be chosen by the maintainers of the project, rather than added in a dozen of separate PRs that need to be reviewed one by one.

In this PR, a large part of the diff is "whatever", the _SELECTORS change was made while misunderstanding the code, and while reviewing it I ended up misunderstanding the code myself, meaning that if it had been a more critical change it would have introduced a bug in the code.

The suggested rules are way too restrictive for new contributors, so that part of the PR is also not going to be merged, and should definitely have been discussed in an issue rather than added here.

And this is only one of 20 very similar PRs that have been opened, and that I have to spend time on reviewing, making sure that none of the small changes that have been made across a bunch of files can lead to issues of some kind.

In the future, please consider the time and effort that goes into reviewing all of this. I will still try to review the PRs that have already been opened, and track the suggested changes in an issue for discussion and possible implementation.

The proper way of contributing to the repo would be opening a meta-issue that collects a series of sub-issues, so that it is possible to decide whether specific changes are warranted. This would take less time than reviewing PRs, and would not risk introducing bugs in the code.

If it does not look like future PRs take this comment into consideration, I don't exclude closing any new PR like this without even looking at what's inside.

@rcap107 rcap107 marked this pull request as draft November 13, 2025 15:21
@DimitriPapadopoulos
Copy link
Contributor Author

DimitriPapadopoulos commented Nov 15, 2025

  • I would recommend keeping B004, it does result in more correct code, although this should seldom be an issue in practice:

    Using hasattr is an unreliable mechanism for testing if an object is callable. If obj implements a custom __getattr__, or if its __call__ is itself not callable, you may get misleading results.

  • B007 is indeed debatable.
  • B010 improves code consistency and readability, but I can understand you'd rather ignore it to avoid frustrating new contributors.

@DimitriPapadopoulos
Copy link
Contributor Author

Other changes do not look like they improve the code in a meaningful way (B007, B004, B010); we can stil implement the changes, however.

I have kept the changes in this PR, but these rules will be ignored in the future.

@DimitriPapadopoulos
Copy link
Contributor Author

Still need to fix this new error in a test:

FAILED skrub/tests/test_multi_agg_joiner.py::test_default_cols[pandas-nullable-dtypes] - AssertionError: assert [['rating', '...', 'movieId']] == [['rating', '...ng', 'genre']]
  
  At index 1 diff: ['rating', 'genre', 'movieId'] != ['movieId', 'rating', 'genre']

Columns should be sorted in the same order as in the initial main_table, shouldn't they?

@pytest.fixture
def main_table():
df = pd.DataFrame(
{
"userId": [1, 1, 1, 2, 2, 2],
"movieId": [1, 3, 6, 318, 6, 1704],
"rating": [4.0, 4.0, 4.0, 3.0, 2.0, 4.0],
"genre": ["drama", "drama", "comedy", "sf", "comedy", "sf"],
}
)
return df

@DimitriPapadopoulos
Copy link
Contributor Author

DimitriPapadopoulos commented Nov 25, 2025

Note that Ruff can automatically apply some rules, and pre-commit will enforce that without any additional work from contributors. Therefore, rules with automatic fixes are not a barrier, they result in consistent and readable code which can only help maintenance and new contributions.

Ruff provides automatic fixes for many B rules, including B004, B007, B010.

B004 Using `hasattr(x, "__call__")` to test if x is callable is unreliable.
     Use `callable(x)` for consistent results.
B005 Using `.strip()` with multi-character strings is misleading
B006 Do not use mutable data structures for argument defaults
B007 Loop control variable not used within loop body
B008 Do not perform function call `selectors.all` in argument defaults;
     instead, perform the call within the function, or read the default
     from a module-level singleton variable
B010 Do not call `setattr` with a constant attribute value.
     It is not any safer than normal property access.
B015 Pointless comparison.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants