Using the Pandas apply function to add columns to DataFrames

Running Python functions on every row in a DataFrame

Pandas is a wonderful library for manipulating tabular data with Python. Out of the box, Pandas offers many ways of adding, removing, and updating columns and rows, but sometimes you need a bit more power.

In this article we’ll explore the apply function and show how it can be used to run an operation against every row (or column) in your DataFrame - and why you might want to do that.

Why would you need Pandas apply?

In Pandas it’s fairly easy to add a new column to a DataFrame or update an existing one:

# Add a release_month column calculated from the existing release_date column
df['release_month'] = pd.DatetimeIndex(df['release_date']).month

Just by using the indexer on the DataFrame we can add or update a column to have a new value for every row in the DataFrame.

However, this approach limits us to logic that is simple enough to write as a single expression on the right-hand side of the assignment.

Thankfully, the apply function exists on Pandas DataFrames and lets us run custom functions for every row.
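As a rough sketch of the kind of logic that is awkward to write inline, the snippet below labels each row using branching across two columns. The sample data, the budget, revenue, and outcome column names, and the label_outcome function are all hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical sample data; these column names are illustrative only
df = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'budget': [100, 0, 50],
    'revenue': [250, 40, 20],
})

def label_outcome(row):
    """Return a text label using branching logic across two columns."""
    if row['budget'] == 0:
        return 'unknown'
    if row['revenue'] >= row['budget']:
        return 'profit'
    return 'loss'

# axis=1 passes each row to the function as a Series
df['outcome'] = df.apply(label_outcome, axis=1)
```

Here the function returns a single value per row, so apply produces a Series that can be assigned straight into a new column.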

Applying Python Functions to DataFrame Rows using Apply

We can use the Pandas apply function to apply a single function to every row (or column) in a DataFrame. This allows us to run complex calculations and use those calculations to set column values.

For example, let’s say we had a DataFrame with a keyword_json column containing some JSON representing tags. We might want to parse this JSON and generate a comma separated value list of keywords. This list of keywords could then be set into a keywords column.

First, we declare an extract_keywords function that can be called for every row:

import json

def extract_keywords(row):
    """
    Takes a row, parses the JSON in its keyword_json column, and stores a
    comma-separated list of keyword names in a keywords column.
    """

    # Grab our JSON for the keywords we want to process
    data = row['keyword_json']
    # additional JSON cleaning logic omitted for brevity

    # Parse the JSON and collect the name of each keyword
    loaded_keywords = json.loads(data)
    names = [item['name'] for item in loaded_keywords]

    # Join the names with commas (this avoids a trailing comma)
    # If a keywords column already existed, its value would be replaced
    row['keywords'] = ','.join(names)

    # Return the modified row
    return row

Next, we call apply on our Pandas DataFrame to invoke that function once per row.

Important Note: By default apply will operate on each column instead of each row, so we specify axis=1 to work with rows instead.
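To make the difference between the two axes concrete, here is a small sketch on a made-up two-column DataFrame. The column names a and b are assumptions chosen only for the demonstration.

```python
import pandas as pd

# Made-up numeric data for illustration
df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# Default (axis=0): the function receives each column as a Series
column_totals = df.apply(lambda col: col.sum())

# axis=1: the function receives each row as a Series
row_totals = df.apply(lambda row: row.sum(), axis=1)
```

With the default axis the result has one entry per column (a and b); with axis=1 it has one entry per row.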

df = df.apply(extract_keywords, axis=1)

This calls the function once per row and replaces the row with the returned value.
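When the calculation only needs one column, a possible alternative is to call apply on that column (a Series) rather than on the whole DataFrame. The sketch below is not the approach used above; the sample data and the keywords_from_json helper are hypothetical.

```python
import json
import pandas as pd

# Hypothetical sample data shaped like the keyword_json column described above
df = pd.DataFrame({
    'title': ['First', 'Second'],
    'keyword_json': [
        '[{"name": "space"}, {"name": "robot"}]',
        '[{"name": "heist"}]',
    ],
})

def keywords_from_json(data):
    """Parse one JSON string and return its keyword names joined by commas."""
    return ','.join(item['name'] for item in json.loads(data))

# Series.apply calls the function once per value, so no axis argument is needed
df['keywords'] = df['keyword_json'].apply(keywords_from_json)
```

This version passes only the JSON string to the function instead of the full row, which keeps the function independent of the DataFrame's column names.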

Like most Pandas DataFrame operations, apply does not modify the original DataFrame. It returns a new one instead, which is why we assign the result back to df.
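A quick way to see this is to keep the result in a separate variable and compare the columns afterwards. This is a minimal sketch with made-up data; the add_total function and the total column are hypothetical, and the function copies the row so that it clearly never touches its input.

```python
import pandas as pd

# Made-up data for illustration
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

def add_total(row):
    # Work on a copy so the function itself never changes the input row
    row = row.copy()
    row['total'] = row['a'] + row['b']
    return row

# Store the result separately instead of assigning it back to df
result = df.apply(add_total, axis=1)

# result has the extra total column; df still has only a and b
result_columns = list(result.columns)
original_columns = list(df.columns)
```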

Closing Thoughts

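With axis=1, apply calls your Python function once per row, so it is usually much slower than Pandas' built-in vectorized column operations. In exchange, it gives you a lot of power to perform complex operations on your dataset.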
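When the logic can be expressed with whole-column operations, the vectorized form gives the same result without a per-row function call. The sketch below uses made-up price and quantity columns to compare the two forms; it does not measure timing.

```python
import pandas as pd

# Made-up data for illustration
df = pd.DataFrame({'price': [10.0, 20.0, 30.0], 'quantity': [1, 2, 3]})

# Row by row: the Python function runs once per row
df['total_apply'] = df.apply(lambda row: row['price'] * row['quantity'], axis=1)

# Vectorized: a single operation over whole columns
df['total_vectorized'] = df['price'] * df['quantity']
```

Both columns hold the same values, so a reasonable habit is to reach for apply only when no vectorized expression fits.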

Additionally, storing complex logic in functions instead of trying to do everything inline can improve the readability of your code. Improving readability usually improves maintainability, so this can be a very good thing.

While I always try to avoid apply if I can, the apply function can solve a large number of problems for you as you perform feature engineering and data wrangling in Python code using Pandas.