Adding a new column computed from other columns can be slow and, more seriously, can be unreliable if implemented incorrectly.
For simple cases, a vectorized boolean filter can efficiently set the values of the new column via the .loc[] indexer.
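A minimal sketch of the vectorized case. The sample data and the rule used here (text starting with '{' is labeled 'json') are hypothetical and only illustrate the shape of the .loc[] assignment:

```python
import pandas as pd

# Hypothetical sample data; the column name 'text' follows the examples in this post.
df = pd.DataFrame({"text": ['{"a": 1}', "body { color: red; }", "hello"]})

# Start with a default label for every row.
df["label"] = "other"

# Boolean mask computed over the whole column at once (no per-row Python call).
is_json_mask = df["text"].str.startswith("{")

# .loc[row_mask, column] assigns only to the rows where the mask is True.
df.loc[is_json_mask, "label"] = "json"

print(df)
```

The mask is evaluated once for the whole column, which is usually much faster than calling a Python function per row.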
However, more complicated cases can require a function that checks each cell value.
Example:
def get_label(text):
    if is_css(text):
        return 'css'
    if is_json(text):
        return 'json'
    return 'other'
Applying this function via apply() and assigning the result to a new column can trigger pandas' SettingWithCopyWarning, typically when the DataFrame is itself a slice or filtered subset of another DataFrame:
dfSourceLanguages['label'] = dfSourceLanguages.apply(lambda r: get_label(r['text']), axis=1)
pandas emits SettingWithCopyWarning for this and other cases where an assignment may land on a copy of the data instead of the intended DataFrame, which makes the code unreliable and can be inefficient:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dfSourceLanguages['label'] = dfSourceLanguages.apply(lambda r: get_label(r['text']), axis=1)
The same warning can occur when using map():
dfSourceLanguages['label'] = dfSourceLanguages['text'].map(get_label)
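A minimal sketch of one common cause and fix, assuming dfSourceLanguages was produced by filtering a larger DataFrame. The parent frame dfAll, the filter, and the simplified get_label() are hypothetical. Taking an explicit .copy() of the subset makes it an independent DataFrame, so the column assignment is unambiguous. Behavior may differ between pandas versions, particularly those that use Copy-on-Write.

```python
import pandas as pd

def get_label(text):
    # Hypothetical, simplified stand-in for the get_label() shown earlier.
    if text.lstrip().startswith("{"):
        return "json"
    return "other"

# Hypothetical parent DataFrame.
dfAll = pd.DataFrame({"text": ['{"a": 1}', "hello", ""]})

# Filtering produces a subset of dfAll. Assigning a new column directly to
# such a subset is the pattern that can raise SettingWithCopyWarning.
# The explicit .copy() makes dfSourceLanguages independent of dfAll.
dfSourceLanguages = dfAll[dfAll["text"] != ""].copy()

dfSourceLanguages["label"] = dfSourceLanguages["text"].map(get_label)

print(dfSourceLanguages)
```

The parent dfAll is left untouched; only the copied subset gains the 'label' column.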
---
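Another approach: use df.assign(), which returns a new DataFrame with the added column rather than modifying the existing one in place.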
# Assigns a new column 'label' via the lambda function.
# - assign() passes the whole DataFrame to the lambda (not a single row)
# - get_label() is applied to each value of the 'text' Series
df = df.assign(
    label=lambda frame: frame['text'].apply(get_label)
)
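For a self-contained run of the assign() approach, here is a sketch under stated assumptions: is_css() and is_json() are not defined in this post, so the versions below are hypothetical, deliberately rough stand-ins, and the sample data is made up.

```python
import json
import pandas as pd

def is_json(text):
    # Hypothetical helper: treat text as JSON if it parses to an object or array.
    try:
        return isinstance(json.loads(text), (dict, list))
    except ValueError:
        return False

def is_css(text):
    # Hypothetical helper: very rough check for a 'selector { ... }' rule.
    stripped = text.strip()
    return (
        not stripped.startswith(("{", "["))
        and "{" in stripped
        and stripped.endswith("}")
    )

def get_label(text):
    if is_css(text):
        return "css"
    if is_json(text):
        return "json"
    return "other"

# Made-up sample data.
original = pd.DataFrame({"text": ["body { color: red; }", '{"a": 1}', "hello"]})

# assign() returns a new DataFrame; 'original' itself is not modified.
labeled = original.assign(label=lambda frame: frame["text"].apply(get_label))

print(labeled)
```

Because assign() builds a new DataFrame, there is no assignment into a possibly shared slice, which is why this form avoids the warning.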