Avoiding SettingWithCopyWarning from pandas when adding a column as a function of other columns

Adding a new column as a function of other columns can be slow and, more seriously, can be unreliable if implemented incorrectly.

For simple cases, a vectorized boolean filter can efficiently set the values of the new column via .loc[]:

df.loc[df['eri_white']==1,'race_label'] = 'White'
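
As a minimal runnable sketch of the line above: the sample data here is made up for illustration, and only the column names 'eri_white' and 'race_label' come from the example. Rows that do not match the filter are left as NaN in the new column.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the example above
df = pd.DataFrame({'eri_white': [1, 0, 1]})

# Rows where the boolean mask is True get 'White';
# the new column is created on the fly and other rows stay NaN
df.loc[df['eri_white'] == 1, 'race_label'] = 'White'

labels = df['race_label'].tolist()
print(df)
```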

However, more complicated cases can require a function that checks each cell value.

Example:

def get_label(text):
  if is_css(text):
    return 'css'
  if is_json(text):
    return 'json'
  return 'other'
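
The helpers is_css() and is_json() are not defined in this note. To make the example runnable, here is a rough sketch with hypothetical stand-ins: is_json() simply tries to parse the text, and is_css() is a naive regex heuristic, not a real CSS parser.

```python
import json
import re

# Hypothetical stand-in: naive check for "selector { property: value }"
_CSS_RULE = re.compile(r'\s*[A-Za-z.#*][^{}]*\{[^{}]*:[^{}]*\}')

def is_css(text):
    return _CSS_RULE.match(text) is not None

# Hypothetical stand-in: the text counts as JSON if json.loads() accepts it
def is_json(text):
    try:
        json.loads(text)
        return True
    except (ValueError, TypeError):
        return False

def get_label(text):
    if is_css(text):
        return 'css'
    if is_json(text):
        return 'json'
    return 'other'

samples = ['body { color: red; }', '{"a": 1}', 'hello world']
labels = [get_label(s) for s in samples]
```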

Applying this function via apply() and then assigning the result to a new column can trigger the SettingWithCopyWarning from pandas, typically when the DataFrame was itself created by slicing or filtering another DataFrame:

dfSourceLanguages['label'] = dfSourceLanguages.apply(lambda r: get_label(r['text']), axis=1)

pandas emits SettingWithCopyWarning in this and other cases where the assignment may land on a temporary copy rather than on the DataFrame you intended, so the result might not be reliable:

SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dfSourceLanguages['label'] = dfSourceLanguages.apply(lambda r: get_label(r['text']), axis=1)


This warning also occurs when using map():

dfSourceLanguages['label'] = dfSourceLanguages['text'].map(get_label)
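
One common way to avoid the warning, sketched here under the assumption that dfSourceLanguages was produced by filtering a larger frame: take an explicit .copy() of the filtered result before adding the column. The names dfAll and 'size', and the simplified get_label(), are hypothetical and only serve to make the sketch runnable.

```python
import pandas as pd

def get_label(text):
    # Simplified stand-in for the get_label() shown earlier
    return 'json' if text.lstrip().startswith('{') else 'other'

# Hypothetical parent frame
dfAll = pd.DataFrame({
    'text': ['{"a": 1}', 'plain text', ''],
    'size': [8, 10, 0],
})

# Filtering derives a new object from dfAll; the explicit .copy()
# makes it an independent DataFrame that is safe to modify
dfSourceLanguages = dfAll[dfAll['size'] > 0].copy()

dfSourceLanguages['label'] = dfSourceLanguages['text'].map(get_label)
print(dfSourceLanguages)
```

The parent frame dfAll is not modified; the new column exists only on the copy.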

---

Another approach: use df.assign()


# Assigns a new column 'label' via the lambda function
# - the lambda receives the whole DataFrame (not a single row)
# - applies get_label() to each value of the 'text' Series
df = df.assign(
    label = lambda d: d['text'].apply(get_label)
)
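
A small self-contained sketch of the assign() approach, to show that it builds and returns a new DataFrame instead of modifying the one it is called on. The frame dfAll, the 'size' column, and the simplified get_label() are hypothetical and used only for illustration.

```python
import pandas as pd

def get_label(text):
    # Simplified stand-in for the get_label() shown earlier
    return 'json' if text.lstrip().startswith('{') else 'other'

# Hypothetical parent frame and a filtered subset of it
dfAll = pd.DataFrame({
    'text': ['{"a": 1}', 'plain text', ''],
    'size': [8, 10, 0],
})
subset = dfAll[dfAll['size'] > 0]

# assign() returns a new DataFrame with the extra column;
# the callable is passed the DataFrame that assign() was called on
labelled = subset.assign(label=lambda d: d['text'].map(get_label))
print(labelled)
```

Because the result is a new object, it has to be bound to a name (here labelled, or back to df as in the snippet above); subset itself has no 'label' column afterwards.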
