Pandas is a wonderful library for manipulating tabular data with Python. Out of the box Pandas offers many ways of adding, removing, and updating columns and rows, but sometimes you need a bit more power.
In this article we’ll explore the apply function and show how it can be used to run an operation against every row (or column) in your DataFrame, and why you might want to do that.
Why would you need Pandas apply?
In Pandas it’s fairly easy to add a new column to a DataFrame or update an existing one:
# Add a release_month column calculated from the existing release_date column
df['release_month'] = pd.DatetimeIndex(df['release_date']).month
Just by using the indexer on the DataFrame we can add or update a column to have a new value for every row in the DataFrame.
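As a minimal, self-contained sketch of that pattern, the snippet below builds a tiny DataFrame and derives the month column from it. The sample titles and dates are made up for illustration and are not from the article's dataset.

```python
import pandas as pd

# Hypothetical sample data; the article's real dataset is not shown
df = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "release_date": ["2020-03-15", "2021-11-02"],
})

# Assigning through the indexer creates (or overwrites) the column for every row
df["release_month"] = pd.DatetimeIndex(df["release_date"]).month

print(df)
```

The whole column is computed in one expression, which is why this style only works while the logic fits comfortably on one line.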
However, in doing this we are limited to expressions that are simple enough to easily express on the right of the assignment operator.
The apply function exists on Pandas DataFrames and lets us run custom functions for every row.
Applying Python Functions to DataFrame Rows using Apply
We can use the Pandas apply function to apply a single function to every row (or column) in a DataFrame. This allows us to run complex calculations and use those calculations to set column values.
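The "row (or column)" choice is controlled by the axis argument of apply. Here is a small sketch of the difference, using a made-up scores DataFrame that exists only for this illustration:

```python
import pandas as pd

# Hypothetical numeric data for illustration
scores = pd.DataFrame({"math": [90, 70], "art": [60, 95]})

# Default (axis=0): the function receives each column as a Series
column_total = scores.apply(lambda col: col.sum())

# axis=1: the function receives each row as a Series
row_total = scores.apply(lambda row: row.sum(), axis=1)

print(column_total)  # one value per column, indexed by column name
print(row_total)     # one value per row, indexed by row label
```

In both cases the function gets a Series; only the direction in which the DataFrame is sliced changes.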
For example, let’s say we had a DataFrame with a keyword_json column containing some JSON representing tags. We might want to parse this JSON and generate a comma-separated list of keywords. This list of keywords could then be set into a new keywords column.
First, we declare an extract_keywords function that can be called for every row:
import json

def extract_keywords(row):
    """
    This function takes in a row, gets some JSON representing keywords
    out of its keyword_json column, and then builds a comma-separated
    list of values that gets set into a new keywords column.
    """
    # Grab our JSON for the keywords we want to process
    data = row['keyword_json']

    # additional JSON cleaning logic omitted for brevity

    # Parse the JSON and join the keyword names with commas
    # (str.join avoids the trailing comma that repeated concatenation leaves)
    loaded_keywords = json.loads(data)
    keywords = ','.join(item['name'] for item in loaded_keywords)

    # Add the keywords column with the final calculated string
    # If keywords already existed, its value would be replaced
    row['keywords'] = keywords

    # Return the modified row
    return row
Next, we call apply on our Pandas DataFrame to invoke that function once per row.
Important Note: By default apply will operate on each column instead of each row, so we specify axis=1 to work with rows instead.
df = df.apply(extract_keywords, axis=1)
This calls the function once per row and replaces the row with the returned value.
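To see the whole flow end to end, here is a runnable sketch. It uses a condensed variant of extract_keywords that joins the names without a trailing comma and skips the omitted cleaning step, and the two sample rows are invented for illustration.

```python
import json
import pandas as pd

def extract_keywords(row):
    # Condensed variant of the article's function (no cleaning logic)
    loaded_keywords = json.loads(row["keyword_json"])
    row["keywords"] = ",".join(item["name"] for item in loaded_keywords)
    return row

# Hypothetical sample data
df = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "keyword_json": [
        '[{"name": "space"}, {"name": "robot"}]',
        '[{"name": "heist"}]',
    ],
})

# axis=1 passes each row to the function; the returned rows form the new DataFrame
df = df.apply(extract_keywords, axis=1)

print(df[["title", "keywords"]])
```

Because every returned row carries the extra keywords entry, the resulting DataFrame gains a keywords column.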
Like almost everything else in Pandas DataFrames, the apply function does not modify the original DataFrame, but returns a new one instead.
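One consequence is that the result of apply has to be assigned somewhere to have any effect. The sketch below illustrates this with a hypothetical full_name helper that returns a single value per row, which is one possible alternative to returning a modified row; the names and data are assumptions made for the example.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"first": ["Ada", "Alan"], "last": ["Lovelace", "Turing"]})

def full_name(row):
    # Return a single value instead of a modified row
    return f"{row['first']} {row['last']}"

# apply returns a new Series here; df itself still has only two columns
names = df.apply(full_name, axis=1)
columns_before = list(df.columns)

# Nothing changes in df until we assign the result ourselves
df["full_name"] = names

print(columns_before)
print(df)
```

When the function returns a scalar, apply produces a Series with one value per row, which can be assigned to a single column without rebuilding the whole DataFrame.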
The apply function is fairly slow because it calls a Python function once per row instead of using Pandas' vectorized operations, but it gives you a lot of power to perform complex operations on your dataset.
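To make that trade-off concrete, here is a small sketch that computes the same column twice: once row by row with apply, and once with an ordinary column expression. The price and quantity columns are hypothetical, and no timings are claimed; the gap only becomes noticeable on larger datasets.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "quantity": [1, 2, 3]})

# Row-wise: the lambda is called once per row in Python
total_apply = df.apply(lambda row: row["price"] * row["quantity"], axis=1)

# Vectorized: one column-level operation carried out over whole arrays
total_vectorized = df["price"] * df["quantity"]

print(total_apply.tolist())
print(total_vectorized.tolist())
```

When a calculation can be written as a column expression like this, that form is usually the better first choice; apply is for logic that does not fit that shape.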
Additionally, storing complex logic in functions instead of trying to do everything inline can improve the readability of your code. Improving readability usually improves maintainability, so this can be a very good thing.
While I always try to avoid apply if I can, the apply function can solve a large number of problems for you as you perform feature engineering and data wrangling in Python code using Pandas.