Identifying Code Churn With AskGit SQL
AskGit is a tool we’ve been building that makes it possible to run SQL queries against data in git repositories. Recently, we added support for a stats
table, which tracks lines of code added and removed to a file, for every commit (in the current history). Think git log --stat
, but as a table that can be queried with SQL.
We can use this table to find areas of “code churn” in a repository.
Nicolas Carlo of Understand Legacy Code has a great article that advises focusing on hotspots in a codebase in order to proactively address areas of technical debt. In particular, he looks at comparing code complexity with code churn to find which areas (files) are worth refactoring or re-examining.
He uses the following shell command to identify the top 50 “churning” files in the past year of a codebase:
git log --format=format: --name-only --since=12.month \
| egrep -v '^$' \
| sort \
| uniq -c \
| sort -nr \
| head -50
This is exactly the type of “query” that AskGit hopes to make expressible in SQL. Here’s what it might look like using AskGit:
Both commands assume that “code churn” means “the files modified by the most number of commits in a given time period (the past year).”
This definition is very practical for the use case in the article, but some questions come to mind when thinking about identifying code churn more generally:
- Why does commit count alone matter? Should lines of code added or removed per commit be incorporated? Maybe a commit should only be counted if it has a minimum number of changes.
- Why do I care about files that may have been churning 8 months ago but have since stabilized? Shouldn’t I prioritize more recent churning? Can I discount older commits in my final sorting?
- What about files that have been removed or renamed? The output above includes them in the final list, even though they’re no longer in the latest source tree, adding some noise.
- What if individual files is too granular a view? Maybe I care more about certain directory levels if that’s how my features are organized. Maybe I want to ignore certain directories and file extensions altogether (or only look at certain ones).
- Can I incorporate commit authors into my churn analysis? Maybe I only want to look at churn for certain authors or groups of authors (such as a team). Churn across a large monorepo may be valuable generally, but what if I only care about the touch points my team has had with the code?
No matter your definition of “code churn,” or what specifically you’re looking to get out of mining your git data, AskGit hopes to make it possible to ask questions of git history in a very flexible way. I think this is where the tool really shines, and I hope to show in a follow up article how the above SQL can be augmented to include some of the above considerations.
Until then, take it for a spin!