Using the Pandas apply function to add columns to DataFrames
Running Python functions on every row in a DataFrame
Pandas is a wonderful library for manipulating tabular data with Python. Out of the box, Pandas offers many ways of adding, removing, and updating columns and rows, but sometimes you need a bit more power.
In this article we’ll explore the apply function and show how it can be used to run an operation against every row (or column) in your DataFrame, and why you might want to do that.
Why would you need Pandas apply?
In Pandas it’s fairly easy to add a new column to a DataFrame or update an existing one:
# Add a release_month column calculated from the existing release_date column
df['release_month'] = pd.DatetimeIndex(df['release_date']).month
Just by using the indexer on the DataFrame we can add or update a column to have a new value for every row in the DataFrame.
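To make the snippet above concrete, here is a minimal, self-contained sketch. The sample DataFrame, its title column, and its dates are hypothetical values made up for illustration; only release_date and release_month come from the example above.

```python
import pandas as pd

# Hypothetical sample data, invented purely for illustration
df = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "release_date": ["2019-01-15", "2020-07-04", "2021-12-25"],
})

# Add a new column: the month number parsed from release_date
df["release_month"] = pd.DatetimeIndex(df["release_date"]).month

# Update an existing column the same way, using the indexer
df["title"] = df["title"].str.upper()

print(df)
```

Both assignments work on whole columns at once, which is why no explicit loop over the rows is needed.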
However, in doing this we are limited to expressions that are simple enough to easily express on the right of the assignment operator.
Thankfully, the apply function exists on Pandas DataFrames and lets us run custom functions for every row.
Applying Python Functions to DataFrame Rows using Apply
We can use the Pandas apply function to apply a single function to every row (or column) in a DataFrame. This allows us to run complex calculations and use those calculations to set column values.
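As a rough sketch of the difference between working on columns and working on rows, the snippet below calls apply both ways. The budget and revenue columns and their values are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical numeric data, invented for illustration
df = pd.DataFrame({
    "budget": [100, 200, 300],
    "revenue": [150, 180, 900],
})

# Default (axis=0): the function receives each column as a Series
column_max = df.apply(lambda col: col.max())

# axis=1: the function receives each row as a Series
profit = df.apply(lambda row: row["revenue"] - row["budget"], axis=1)

print(column_max)
print(profit)
```

In the column case the result has one entry per column; in the row case it has one entry per row.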
For example, let’s say we had a DataFrame with a keyword_json column containing some JSON representing tags. We might want to parse this JSON and generate a comma-separated list of keywords. This list of keywords could then be set into a keywords column.
First, we declare an extract_keywords function that can be called for every row:
import json

def extract_keywords(row):
    """
    Take a row, parse the JSON in its keyword_json column, and store a
    comma-separated list of keyword names in a new keywords column.
    """
    # Grab our JSON for the keywords we want to process
    data = row['keyword_json']

    # additional JSON cleaning logic omitted for brevity

    # Parse the JSON into a list of keyword objects
    loaded_keywords = json.loads(data)

    # Join the keyword names with commas (avoids a trailing comma)
    keywords = ','.join(item['name'] for item in loaded_keywords)

    # Add the keywords column with the final calculated string
    # If keywords already existed, its value would be replaced
    row['keywords'] = keywords

    # Return the modified row
    return row
Next, we call apply on our Pandas DataFrame to invoke that function once per row.
Important Note: By default apply will operate on each column instead of each row, so we specify axis=1 to work with rows instead.
df = df.apply(extract_keywords, axis=1)
This calls the function once per row and replaces the row with the returned value.
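Putting the pieces together, the following is a minimal end-to-end sketch. The sample rows and the JSON shape (a list of objects with a name field) are assumptions made for illustration, and the function is condensed to join the names with commas.

```python
import json
import pandas as pd

def extract_keywords(row):
    # Parse the JSON string stored in this row's keyword_json column
    loaded_keywords = json.loads(row["keyword_json"])
    # Store the keyword names as one comma-separated string
    row["keywords"] = ",".join(item["name"] for item in loaded_keywords)
    return row

# Hypothetical sample data, invented for illustration
df = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "keyword_json": [
        '[{"id": 1, "name": "space"}, {"id": 2, "name": "robots"}]',
        '[{"id": 3, "name": "heist"}]',
    ],
})

# axis=1 passes each row to extract_keywords as a Series
df = df.apply(extract_keywords, axis=1)

print(df[["title", "keywords"]])
```

Because each returned row carries the extra keywords entry, the resulting DataFrame gains a keywords column.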
Like almost everything else in Pandas DataFrames, the apply function does not modify the original DataFrame, but returns a new one instead, which is why we assign the result back to df.
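A small sketch can illustrate this behavior. The DataFrame, the name column, and the add_upper helper are all hypothetical names chosen for this example.

```python
import pandas as pd

# Hypothetical data and helper, invented for illustration
original = pd.DataFrame({"name": ["ada", "grace"]})

def add_upper(row):
    # Add a new value to the row that apply passed in
    row["name_upper"] = row["name"].upper()
    return row

# Keep the result in a separate variable so both can be compared
result = original.apply(add_upper, axis=1)

print(list(original.columns))  # the original still has only its own column
print(list(result.columns))    # the returned DataFrame has the new column
```

If the return value is not captured, the new column is simply lost.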
Closing Thoughts
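The apply function can be slow on large DataFrames, because with axis=1 it calls a Python function once per row instead of using Pandas' vectorized operations, but it has a lot of power to allow you to do complex operations on your dataset.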
Additionally, storing complex logic in functions instead of trying to do everything inline can improve the readability of your code. Improving readability usually improves maintainability, so this can be a very good thing.
While I always try to avoid apply when a built-in Pandas operation will do the job, the apply function can solve a large number of problems for you as you perform feature engineering and data wrangling in Python code using Pandas.
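One possible way to lean less on row-wise apply, sketched here under assumptions, is to run the function against only the column it needs by using apply on that Series. The names_from_json helper and the sample data are hypothetical, and whether this is meaningfully faster depends on your data, so it is worth measuring before relying on it.

```python
import json
import pandas as pd

def names_from_json(raw):
    # Hypothetical helper: turn one JSON string into comma-separated names
    return ",".join(item["name"] for item in json.loads(raw))

# Hypothetical sample data, invented for illustration
df = pd.DataFrame({
    "keyword_json": [
        '[{"name": "space"}, {"name": "robots"}]',
        '[{"name": "heist"}]',
    ],
})

# Series.apply passes each cell value to the function, not a whole row
df["keywords"] = df["keyword_json"].apply(names_from_json)

print(df["keywords"])
```

This keeps the custom logic in a named function while avoiding the need to build and return a full row for every call.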