Feature Request: Add example using the comparison operator to assign a new boolean column #1

Open
skilfullycurled opened this issue Dec 26, 2017 · 3 comments

Comments

@skilfullycurled

Hello,

First, thank you so much for these great tutorials. There have been a number of warnings about using "just the indexing operator" for quite a while, and the explanations of .loc and .iloc were tremendously helpful.

I'm writing to recommend that you add an example of assigning a new column from a boolean selection that returns a boolean series in the article on assignment. Take for example, the following:

criteria = df['some_col'] > some_number
criteria.head()

0     True
1    False
2     True
4     True
6    False

Using just the assignment operator...

df['new_col'] = df['some_col'] > some_number

...works but yields the warning:

Try using .loc[row_indexer,col_indexer] = value instead

The closest example I've found in your article is this one:

last_name = pd.Series(data=['Smith', 'Jones', 'Williams', 'Green', 'Brown', 'Simpson', 'Peters'],
                      index=['Tom', 'Niko', 'Penelope', 'Aria', 'Sofia', 'Dean', 'Zach'])
last_name
df['last_name'] = last_name

However, at least in pandas 0.19.2, this still yields the same warning. After searching around a bit, I found this Stack Overflow discussion, which states that as of pandas 0.16.0 the best way to do this is to use the assign function in the following manner:

criteria = df['some_col'] > some_number
df.assign(new_col_name=criteria)  # note: no quotes on new_col_name

Which seems to work well for me.
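
For anyone following along, here is a minimal sketch of the assign approach. The DataFrame contents, the threshold, and the name with_flag are made up for illustration. As far as I can tell, assign returns a new DataFrame and leaves the original alone, so the result has to be kept in a variable:

```python
import pandas as pd

# Hypothetical data and threshold, only for illustration
df = pd.DataFrame({'some_col': [1, 5, 3]})
some_number = 2

# Boolean Series built from a comparison
criteria = df['some_col'] > some_number

# assign returns a new DataFrame; df itself is not modified
with_flag = df.assign(new_col_name=criteria)
print(with_flag)
```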

Alternatively, I suppose you can simply add which version the tutorial was written under.

Thanks again for this wonderful guide!

@tdpetrou
Owner

Hey @skilfullycurled thanks for the response. I really appreciate any comments that can make this tutorial better.

Assigning a new column directly from the same dataframe should not produce any warning. Let's take a concrete example.

>>> df = pd.DataFrame({'a': [1,2,3], 'b': [10, 4, 6]})
>>> df
   a   b
0  1  10
1  2   4
2  3   6

Creating a new boolean column directly from column a should not yield a warning.

>>> df['new_col'] = df['a'] > 2
>>> df
   a   b  new_col
0  1  10    False
1  2   4    False
2  3   6     True

I noticed that your example criteria has an index missing the labels 3 and 5, so maybe this DataFrame was already a subset of another?

Can you provide all of the code that produces the warning? And yes, assign is nice and will be covered in one of the next tutorials.
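
For reference, a rough sketch of how a subset could lead to the warning. The extra row and the names subset and clean are assumptions for illustration, not your actual code. On older pandas versions I would expect the column assignment on subset to emit SettingWithCopyWarning; newer releases with Copy-on-Write may not warn at all, so the snippet only records whatever warnings appear:

```python
import warnings
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [10, 4, 6, 8]})

# Hypothetical subset: a boolean selection whose index skips label 1
subset = df[df['b'] > 5]

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Adding a column to the subset is the step that may trigger the warning
    subset['new_col'] = subset['a'] > 2

# Names of any warnings that were raised (version dependent)
print([type(w.message).__name__ for w in caught])

# Taking an explicit copy first makes the intent unambiguous
clean = df[df['b'] > 5].copy()
clean['new_col'] = clean['a'] > 2
print(clean)
```

Either way the new column ends up on the subset and not on the original df.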

@skilfullycurled
Author

Ah. I think I see the problem now. After a day of still getting warnings under various circumstances, I'm finally getting to wrap my mind around it.

It's the issue of assigning to a view vs. a copy. Exactly as you wrote:

...When pandas selects a single column from a DataFrame, pandas creates a view and not a copy. A view just means that no new object has been created. df['score'] references the score column in the original DataFrame.

This is analogous to the list example where we assigned an entire list to a new variable. No new object is created, just a new reference to the one already in existence.

Since no new data has been created, the assignment will modify the original DataFrame.

Given my experience of finally coming to understand this difference, I'll reframe my suggestion: depending on who your audience is, you might consider adding an example that triggers this warning at the beginning of the post, before getting into the details. I think it's sometimes helpful to begin with a practical example. Right now the order is roughly:

  1. What is chained indexing?
  2. Why is chained indexing bad?
  3. SettingWithCopy warning on assignment
  4. Remember the examples of lists and pandas?
  5. This is what causes the warning
  6. That's why it's bad

A different logic that might help contextualize chained indexing:

  1. Do you ever run into this problem?
  2. Example from pandas that throws the SettingWithCopy warning
  3. To understand why this happens we need to understand chained indexing
  4. What is chained indexing?
  5. Now let's return to the SettingWithCopy warning
  6. Remember the examples of lists and pandas?
  7. This is what causes the warning
  8. That's why it's bad

Again, this is a great tutorial. Understanding the selection of data in such a deep but accessible way really clarified the issue. As you noted at the end, the warning is really poorly written.

@skilfullycurled
Author

Oh. What I quoted above brings up one more thing.

At the top you wrote:

Since no new data has been created, the assignment will modify the original DataFrame.

Then later in the example you wrote:

The assignment completed correctly for the intermediate DataFrame but not for our original.

When I was researching the issue, I read that in some circumstances the change will not propagate backwards to the original, and in other circumstances it will not propagate forwards to the intermediate.

See firelynx's and Raphanns' responses. Raphanns' example is not about selecting subsets, so perhaps it does not apply.
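
To make the "does not propagate backwards" case concrete, here is a small sketch under my own assumptions (the data and the names mask, after_chained, and after_loc are made up). It compares a chained assignment through a boolean selection with a single .loc call. Warnings are silenced only so the snippet runs quietly across pandas versions:

```python
import warnings
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 4, 6]})
mask = df['a'] > 1

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    # Chained indexing: df[mask] builds an intermediate object,
    # and the assignment lands on that intermediate, not on df
    df[mask]['b'] = 0
after_chained = df['b'].tolist()

# A single .loc call selects rows and column in one step on df itself
df.loc[mask, 'b'] = 0
after_loc = df['b'].tolist()

print(after_chained, after_loc)
```

The boolean selection produces a separate object, so the first assignment never reaches df, while the .loc version does.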
