Identifying Code Churn With AskGit SQL

by Patrick DeVivo

AskGit is a tool we’ve been building that makes it possible to run SQL queries against data in git repositories. Recently, we added support for a stats table, which tracks the lines of code added to and removed from each file, for every commit (in the current history). Think git log --stat, but as a table that can be queried with SQL.

[Image: AskGit commit stats table]

We can use this table to find areas of “code churn” in a repository.

Nicolas Carlo of Understand Legacy Code has a great article that advises focusing on hotspots in a codebase in order to proactively address areas of technical debt. In particular, he looks at comparing code complexity with code churn to find which areas (files) are worth refactoring or re-examining.

He uses the following shell command to identify the top 50 “churning” files in the past year of a codebase:

git log --format=format: --name-only --since=12.month \
| egrep -v '^$' \
| sort \
| uniq -c \
| sort -nr \
| head -50

This is exactly the type of “query” that AskGit hopes to make expressible in SQL. Here’s what it might look like using AskGit:

Both commands assume that “code churn” means “the files modified by the greatest number of commits in a given time period (the past year).”
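As a rough sketch of that definition, the snippet below runs a commit-count churn query through Python's built-in sqlite3 module against a tiny in-memory stand-in for the git data. The table and column names (commits.id, commits.author_when, stats.commit_id, stats.file_path, additions, deletions) and the sample rows are assumptions made for illustration; they may not match AskGit's actual schema, so check the tool's documentation before reusing the SQL.

```python
import sqlite3

# In-memory stand-in for the git data. The schema is an assumption for
# illustration, not necessarily the one AskGit exposes.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE commits (id TEXT PRIMARY KEY, author_when TEXT);
    CREATE TABLE stats (commit_id TEXT, file_path TEXT,
                        additions INTEGER, deletions INTEGER);

    -- Hypothetical sample history; dates are relative to "now" so the
    -- one-year window behaves the same whenever this is run.
    INSERT INTO commits VALUES
        ('c1', datetime('now', '-1 month')),
        ('c2', datetime('now', '-2 month')),
        ('c3', datetime('now', '-3 month')),
        ('c4', datetime('now', '-18 month'));
    INSERT INTO stats VALUES
        ('c1', 'main.go',   10,  2),
        ('c2', 'main.go',    3,  1),
        ('c2', 'README.md',  1,  0),
        ('c3', 'main.go',   40, 12),
        ('c4', 'README.md',  5,  5);
""")

# Churn as "number of commits touching each file in the past 12 months",
# mirroring the sort | uniq -c | sort -nr | head -50 pipeline.
CHURN_SQL = """
    SELECT stats.file_path, COUNT(*) AS commit_count
    FROM commits
    JOIN stats ON commits.id = stats.commit_id
    WHERE commits.author_when > datetime('now', '-12 month')
    GROUP BY stats.file_path
    ORDER BY commit_count DESC, stats.file_path
    LIMIT 50
"""

churn = conn.execute(CHURN_SQL).fetchall()
for file_path, commit_count in churn:
    print(f"{commit_count:4d} {file_path}")
```

With these sample rows, main.go ranks first with three commits and README.md follows with one, because the 18-month-old commit falls outside the window.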

This definition is very practical for the use case in the article, but some questions come to mind when thinking about identifying code churn more generally:

  1. Why does commit count alone matter? Should lines of code added or removed per commit be incorporated? Maybe a commit should only be counted if it has a minimum number of changes.
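One way to explore that question is to weight each file by the lines changed and to ignore commits below a minimum size. The sketch below again uses Python's sqlite3 with an assumed stats table (commit_id, file_path, additions, deletions); the schema, the sample rows, and the MIN_LINES_CHANGED threshold are all hypothetical, and the date filter is left out to keep the example short.

```python
import sqlite3

# In-memory stand-in for a per-commit, per-file stats table. The schema
# and rows are assumptions for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stats (commit_id TEXT, file_path TEXT,
                        additions INTEGER, deletions INTEGER);
    INSERT INTO stats VALUES
        ('c1', 'main.go',   10,  2),
        ('c2', 'main.go',    1,  1),
        ('c2', 'README.md',  1,  0),
        ('c3', 'main.go',   40, 12),
        ('c3', 'util.go',    6,  0);
""")

# Hypothetical threshold: a commit only counts toward a file's churn if
# it added or removed at least this many lines in that file.
MIN_LINES_CHANGED = 5

WEIGHTED_SQL = """
    SELECT file_path,
           COUNT(*) AS commit_count,
           SUM(additions + deletions) AS lines_changed
    FROM stats
    WHERE additions + deletions >= ?
    GROUP BY file_path
    ORDER BY lines_changed DESC, file_path
"""

weighted = conn.execute(WEIGHTED_SQL, (MIN_LINES_CHANGED,)).fetchall()
for file_path, commit_count, lines_changed in weighted:
    print(f"{file_path}: {commit_count} commits, {lines_changed} lines changed")
```

Here the two-line edit to main.go and the one-line edit to README.md are filtered out, so README.md drops from the ranking entirely and main.go is scored on its two substantial commits.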

No matter your definition of “code churn,” or what specifically you’re looking to get out of mining your git data, AskGit hopes to make it possible to ask questions of git history in a very flexible way. I think this is where the tool really shines, and I hope to show in a follow-up article how the SQL above can be augmented to include some of these considerations.

Until then, take it for a spin!
