[SPARK-35510][PYTHON] Fix and reenable test_stats_on_non_numeric_columns_should_be_discarded_if_numeric_only_is_true

### What changes were proposed in this pull request?

This PR proposes to fix and reenable `test_stats_on_non_numeric_columns_should_be_discarded_if_numeric_only_is_true`, which was disabled when we upgraded to Python 3.9 in CI in apache#32657.

The failure appears to be caused by a behaviour change in the latest NumPy; see also `https://github.com/numpy/numpy/pull/16273#discussion_r641264085`.

pandas inherits this behaviour, but it doesn't make sense when `numeric_only` is set to `True`. I will track and follow the status of the issue between pandas and NumPy.

For the time being, I propose to exclude only the boolean case from the percentile/quantile test case.
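
As a rough illustration of the pandas side of the comparison the test makes, here is a minimal sketch assuming pandas is installed. The helper `quantile_lengths` is hypothetical and not part of the patch. It only reproduces the two `len(...)` checks on the frame without the boolean column, then probes the excluded boolean case without asserting on it, since that outcome depends on the installed pandas and NumPy versions.

```python
import pandas as pd


def quantile_lengths(pdf: pd.DataFrame) -> tuple:
    """Return the lengths of the scalar-q and list-q quantile results."""
    # Scalar q: a Series with one entry per numeric column.
    single = pdf.quantile(q=0.5, numeric_only=True)
    # List q: a DataFrame with one row per requested quantile.
    multi = pdf.quantile(q=[0.25, 0.5, 0.75], numeric_only=True)
    return len(single), len(multi)


# Frame used by the reenabled test: the boolean column "b" is left out.
pdf = pd.DataFrame({"i": [0, 1, 2], "s": ["x", "y", "z"]})
print(quantile_lengths(pdf))

# Probe the excluded boolean case. The except clause is deliberately broad
# because the result (a value or an error) varies across pandas/NumPy versions.
pdf_bool = pd.DataFrame({"i": [0, 1, 2], "b": [False, False, True], "s": ["x", "y", "z"]})
try:
    print(quantile_lengths(pdf_bool))
except Exception as exc:
    print("boolean column raised:", type(exc).__name__)
```

The test in this PR compares these lengths against the same calls on the pandas-on-Spark frame (`ps.from_pandas(pdf)`), which this sketch leaves out to avoid needing a Spark session.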

### Why are the changes needed?

To keep the test coverage.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

I tested this roughly locally, and it should pass in CI.

Closes apache#32690 from HyukjinKwon/SPARK-35510.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
HyukjinKwon committed May 28, 2021
1 parent 2de19e4 commit 7eb7448
Showing 1 changed file with 15 additions and 9 deletions.
24 changes: 15 additions & 9 deletions python/pyspark/pandas/tests/test_stats.py
@@ -375,15 +375,21 @@ def test_stats_on_non_numeric_columns_should_be_discarded_if_numeric_only_is_tru
         self.assert_eq(len(psdf.kurtosis(numeric_only=True)), len(pdf.kurtosis(numeric_only=True)))
         self.assert_eq(len(psdf.skew(numeric_only=True)), len(pdf.skew(numeric_only=True)))
 
-        # TODO(SPARK-35510): This fails with Python 3.9. We should fix and reenable it.
-        # self.assert_eq(
-        #     len(psdf.quantile(q=0.5, numeric_only=True)),
-        #     len(pdf.quantile(q=0.5, numeric_only=True)),
-        # )
-        # self.assert_eq(
-        #     len(psdf.quantile(q=[0.25, 0.5, 0.75], numeric_only=True)),
-        #     len(pdf.quantile(q=[0.25, 0.5, 0.75], numeric_only=True)),
-        # )
+        # Boolean was excluded because of a behavior change in NumPy
+        # https://github.com/numpy/numpy/pull/16273#discussion_r641264085 which pandas inherits
+        # but this behavior is inconsistent in pandas context.
+        # Boolean column in quantile tests are excluded for now.
+        # TODO(SPARK-35555): track and match the behavior of quantile to pandas'
+        pdf = pd.DataFrame({"i": [0, 1, 2], "s": ["x", "y", "z"]})
+        psdf = ps.from_pandas(pdf)
+        self.assert_eq(
+            len(psdf.quantile(q=0.5, numeric_only=True)),
+            len(pdf.quantile(q=0.5, numeric_only=True)),
+        )
+        self.assert_eq(
+            len(psdf.quantile(q=[0.25, 0.5, 0.75], numeric_only=True)),
+            len(pdf.quantile(q=[0.25, 0.5, 0.75], numeric_only=True)),
+        )
 
     def test_numeric_only_unsupported(self):
         pdf = pd.DataFrame({"i": [0, 1, 2], "b": [False, False, True], "s": ["x", "y", "z"]})