Extracting git repository data with PyDriller

Generating CSV datasets from git repositories for data analysis

In early 2022 I debuted a talk called “Visualizing Code” that used data visualization to explore patterns in open source projects. This article is the first in a new series that walks you through how to discover and analyze patterns found in your own repositories.

Specifically, this article will cover how to take any public repository on GitHub and extract a CSV file full of commit history information. Once you have a data file available, the process for analyzing this dataset is fairly flexible and can be done in a variety of ways including Python code, Tableau, Power BI, or even Excel.

The code presented in this article assumes that you have pydriller and pandas installed. PyDriller is required for the core of this walkthrough, while Pandas simply helps preview the loaded data and export it to a CSV file. See installing PyDriller for more information on getting started, but typically pip install pydriller is all you need.
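For example, installing both libraries from a terminal would typically look like this (the exact invocation may vary with your Python environment):

pip install pydriller pandas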

What is PyDriller?

PyDriller is an open-source Python library that allows you to “drill into” git repositories.

According to its GitHub repository, “PyDriller is a Python framework that helps developers in analyzing Git repositories. With PyDriller you can easily extract information about commits, developers, modified files, diffs, and source code.”

Using PyDriller we will be able to extract information from any public GitHub repository including:

  • Individual commits
  • Commit authors
  • Commit dates, times, and time zones
  • Files modified by each commit
  • The number of lines added and removed
  • Related commits
  • Code complexity metrics

Let’s take a look at how it works.

Connecting to the Repository

In order to grab information from a repository, we must first create a Repository object from a given GitHub URL.

The code for this is fairly simple:

# We need PyDriller to pull git repository information
from pydriller import Repository

# Replace this path with your own repository of interest
path = 'https://github.com/dotnet/machinelearning'
repo = Repository(path)

This code doesn’t actually analyze the repository, but it gives us an object we can use to traverse the commits in the git repository.

We actually inspect these commits by calling traverse_commits() on our Repository object and looping over the results.

Important Note: looping over repository commits takes a long time for large repositories. It took 52 minutes to analyze the ML.NET repository used in this article’s examples, which had 2,681 commits at the time of analysis on February 25th, 2023.
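If you only need part of a repository’s history, one way to cut this time down is to filter the commit range up front. Below is a minimal sketch using the since and to parameters that PyDriller’s Repository constructor accepts; the date range shown is only an example:

# Only traverse commits from 2022 (example date range)
from datetime import datetime
from pydriller import Repository

repo = Repository('https://github.com/dotnet/machinelearning',
                  since=datetime(2022, 1, 1),
                  to=datetime(2022, 12, 31))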

The code below will loop over all commits and for each commit:

  • Build a list of files that are modified by that commit
  • Extract basic commit information
  • Calculate code metrics using PyDriller’s Open Source Delta Maintainability Model (OS-DMM)

As each commit is read, it is added to a list of commits that serves as the final output of the loading process.

The code listing follows:

# Loop over each PyDriller commit to transform it to a commit usable for analysis later
# NOTE: This can take a LONG time if there are many commits

commits = []
for commit in repo.traverse_commits():

    hash = commit.hash

    # Gather a list of files modified in the commit
    files = []
    try:
        for f in commit.modified_files:
            if f.new_path is not None:
                files.append(f.new_path) 
    except Exception:
        print('Could not read files for commit ' + hash)
        continue

    # Capture information about the commit in object format so I can reference it later
    record = {
        'hash': hash,
        'message': commit.msg,
        'author_name': commit.author.name,
        'author_email': commit.author.email,
        'author_date': commit.author_date,
        'author_tz': commit.author_timezone,
        'committer_name': commit.committer.name,
        'committer_email': commit.committer.email,
        'committer_date': commit.committer_date,
        'committer_tz': commit.committer_timezone,
        'in_main': commit.in_main_branch,
        'is_merge': commit.merge,
        'num_deletes': commit.deletions,
        'num_inserts': commit.insertions,
        'net_lines': commit.insertions - commit.deletions,
        'num_files': commit.files,
        'branches': ', '.join(commit.branches), # Comma separated list of branches the commit is found in
        'files': ', '.join(files), # Comma separated list of files the commit modifies
        'parents': ', '.join(commit.parents), # Comma separated list of parents
        # PyDriller Open Source Delta Maintainability Model (OS-DMM) stat. See https://pydriller.readthedocs.io/en/latest/deltamaintainability.html for metric definitions
        'dmm_unit_size': commit.dmm_unit_size,
        'dmm_unit_complexity': commit.dmm_unit_complexity,
        'dmm_unit_interfacing': commit.dmm_unit_interfacing,
    }
    # Omitted: modified_files (list), project_path, project_name
    commits.append(record)

You’ll note that the file loop in the code above is wrapped in a try / except block. This is because GitHub responded unexpectedly to some requests PyDriller made for commit details. Knowing this to be a valid repository, I felt the best strategy was to log that an error occurred, along with the commit hash, and exclude those commits from the final result set.
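If you also want a record of which commits were skipped rather than only printing them, a small variation on the loop (my own tweak, not part of the original listing) is to collect the failed hashes as you go:

failed_hashes = []  # hashes of commits whose details could not be read

for commit in repo.traverse_commits():
    hash = commit.hash
    files = []
    try:
        for f in commit.modified_files:
            if f.new_path is not None:
                files.append(f.new_path)
    except Exception:
        print('Could not read files for commit ' + hash)
        failed_hashes.append(hash)  # remember the problem commit for later review
        continue
    # ... build and append the record dictionary exactly as before ...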

Validating the Load Process

Once the data is loaded (which could take some time), it’s time to ensure it appears valid.

I chose to do this by using the popular Pandas library for tabular data analysis tasks.

While Pandas is typically used to analyze, sift, clean, and otherwise manipulate tabular data sources, our use in this phase of the project is fairly basic: load data into a tabular DataFrame, display a small preview of it, and then save it to disk.

The code to load and preview the dataset is as follows:

import pandas as pd

# Translate this list of commits to a Pandas data frame
df_commits = pd.DataFrame(commits)

# Display the first 5 rows of the DataFrame
df_commits.head()

The final line’s df_commits.head() call will display something like the following result if run in a Jupyter Notebook:

Pandas DataFrame

Important Note: displaying the first 5 rows of the DataFrame via the .head() call will only work if this code is executed as part of a Jupyter notebook and that line is the last line in a code cell. This step is optional, however, as its only purpose is to allow you a peek at the output.
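If you are running this code as a plain Python script instead of a notebook, you can wrap the call in print() to get a similar text-only preview:

# Works outside of Jupyter as well
print(df_commits.head())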

Exporting the Data to a CSV File

Finally, you can save the contents of the data frame in a CSV file with the following code:

df_commits.to_csv('Commits.csv')

This will save the file to disk in the current directory under the name Commits.csv.

Once the file has been written to disk, you can import it into Excel, Tableau, Power BI, or another data analysis tool.

Alternatively, you could load it up again with Pandas and visualize it with Python code as we’ll explore in a future article.
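As a quick sanity check, or as a starting point for that later analysis, the exported file can be read back into Pandas. Here is a minimal sketch; index_col=0 accounts for the index column that to_csv wrote out, and the date conversion is only needed if you plan to work with the timestamps:

import pandas as pd

# Reload the exported commit history
df = pd.read_csv('Commits.csv', index_col=0)

# Dates come back as strings, so convert them before doing any time-based analysis
df['author_date'] = pd.to_datetime(df['author_date'], utc=True)

print(df.shape)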

Breaking Down by Individual File

If you wanted to have a separate breakdown of git commits by file, you could do so with a slight modification of the earlier code:

commits = []

for commit in repo.traverse_commits():
    hash = commit.hash
    try:
        for f in commit.modified_files:
            record = {
                'hash': hash,
                'message': commit.msg,
                'author_name': commit.author.name,
                'author_email': commit.author.email,
                'author_date': commit.author_date,
                'author_tz': commit.author_timezone,
                'committer_name': commit.committer.name,
                'committer_email': commit.committer.email,
                'committer_date': commit.committer_date,
                'committer_tz': commit.committer_timezone,
                'in_main': commit.in_main_branch,
                'is_merge': commit.merge,
                'num_deletes': commit.deletions,
                'num_inserts': commit.insertions,
                'net_lines': commit.insertions - commit.deletions,
                'num_files': commit.files,
                'branches': ', '.join(commit.branches),
                'filename': f.filename,
                'old_path': f.old_path,
                'new_path': f.new_path,
                'project_name': commit.project_name,
                'project_path': commit.project_path, 
                'parents': ', '.join(commit.parents),
            }
            # Unlike the earlier record, this one includes per-file path and project details
            commits.append(record)
    except Exception:
        print('Problem reading commit ' + hash)
        continue        

# Save it to FileCommits.csv
df_file_commits = pd.DataFrame(commits)
df_file_commits.to_csv('FileCommits.csv')

This is remarkably similar to the earlier steps but allows you to perform slightly different data analysis tasks because each row now represents a single file changed by a commit instead of a comma-separated list of files.

It is also worth noting that the code above is even slower than the earlier implementation.
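Once the per-file dataset is loaded, though, it opens up questions the combined dataset can’t easily answer. For example, a one-liner like the following (my own example, not part of the original walkthrough) lists the ten most frequently modified files:

# Count how many commit records touch each file and show the top 10
print(df_file_commits['filename'].value_counts().head(10))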

Limitations and Next Steps

The code I’ve provided here is useful for generating a CSV file of commits from a public repository on GitHub.

This will be helpful if you want to visualize trends in your commit history by doing further manual data analysis or plugging the data into a data visualization tool.

However, this process does have a few limitations:

First, as written this only works with public repositories hosted on GitHub; to analyze a private repository or one hosted elsewhere, you would need to clone it locally and point PyDriller at that local path instead (a sketch of this follows these notes).

Second, I observed that 10 of the 2,681 commits I attempted to interpret consistently produced errors when retrieving their data from GitHub via PyDriller. I’ve only seen this issue on one repository, but you may encounter it in your own repositories.

Third, this process takes a significant amount of time to work through repository history. In my experiments, processing took on the order of a second per commit, which means that most repositories will take a non-trivial amount of time to process.

Finally, this process only tracks high-level information about each commit and does not include the individual line or code changes it contained, though the list of modified files is included.
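As a partial workaround for the first limitation, PyDriller can also analyze a repository that already exists on disk, so a private repository or one hosted outside GitHub can be cloned first and then analyzed by its local path. A minimal sketch, with a placeholder path:

from pydriller import Repository

# Point PyDriller at a local clone instead of a GitHub URL (hypothetical path)
repo = Repository('C:/source/my-private-repo')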


All told, PyDriller is an excellent utility for getting git repository data prepared and ready for further analysis.

Stay tuned for future articles showing how to work with the data this captures from your git repository.