Many of Kubernetes’ 2k+ TODO Comments seem to be Forgotten
by Patrick DeVivo
Kubernetes is a big project, not only because it’s a big deal but also in terms of its source code. At the time of writing, there are 86k+ commits, 2k+ contributors, 2k+ open issues, 1k+ open PRs, and 61k+ stars. These figures are available from the project’s GitHub page.
scc counts 4.3M+ lines of Go source code (5.2M+ total lines): 3M+ lines of “actual” code vs. 700k+ lines of comments, across 16k+ files in total. This includes the
We decided to point our little TODO finder at the Kubernetes source code to see what would turn up. Here are some of the results.
We ran tickgit against source code from commit 9bf52c2. The CSV output was then imported into SQLite so we could run queries against it. Note that the tool only finds TODOs in the tree of the checked-out commit; it does not account for TODOs that were added and subsequently removed. The numbers therefore reflect only the TODOs still “live” in the code at that commit.
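The import step described above could look roughly like the following. This is a minimal sketch, not the exact process we used: the column names (file, line, author, date, text), the load_todos helper, and the sample rows are all assumptions made for illustration, and the real tickgit CSV header may differ.

```python
import csv
import io
import sqlite3

# Hypothetical column names; the real tickgit CSV header may differ.
COLUMNS = ["file", "line", "author", "date", "text"]


def load_todos(csv_text: str) -> sqlite3.Connection:
    """Load TODO rows from CSV text into an in-memory SQLite table named `todos`."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE todos "
        "(file TEXT, line INTEGER, author TEXT, date TEXT, text TEXT)"
    )
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [tuple(row[c] for c in COLUMNS) for row in reader]
    conn.executemany("INSERT INTO todos VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


# Made-up sample data, only to show the shape of the input.
SAMPLE = """file,line,author,date,text
pkg/a.go,10,alice,2014-06-06,TODO: clean this up
pkg/a.go,42,bob,2019-12-09,TODO (alice) handle errors
pkg/b.go,7,alice,2018-03-01,TODO: add tests
"""

conn = load_todos(SAMPLE)

# One query gives the headline counts: TODOs, files, and distinct authors.
total, files, authors = conn.execute(
    "SELECT COUNT(*), COUNT(DISTINCT file), COUNT(DISTINCT author) FROM todos"
).fetchone()
print(total, files, authors)
```

Once the rows are in a table, each of the totals is a short SQL query, which is the main reason to go through SQLite rather than working on the CSV directly.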
Totals (for 9bf52c2)
- 2,380 TODOs across 1,230 files from 363 distinct authors
- 460 TODOs with an assignee, e.g.
// TODO (patrickdevivo) Fix the ...
- 489 TODOs were added in 2019 so far
- 860 days (or 2.3 years) is the average age of a TODO
- The oldest TODO is from Jun 6, 2014 (from “First commit”)
- The most recent TODO is from Dec 9, 2019
- This file has the most TODOs at 33
- deads2k has added the most (current) TODOs (git blame) at 147
- This commit added the most TODOs (that are still in the source) at 64
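To make the totals above more concrete, here is a small sketch of how numbers like these can be derived from a list of TODOs. Everything in it is an assumption for illustration: the (author, added_on, comment) tuple shape, the assignee and summarize helpers, the sample data, and the regular expression, which is one plausible way to detect an assignee and not necessarily the rule tickgit applies.

```python
import re
from collections import Counter
from datetime import date

# Matches "TODO(name)" or "TODO (name)". This pattern is an assumption;
# the post does not state how assignees are detected.
ASSIGNEE_RE = re.compile(r"TODO\s*\(([^)]+)\)")


def assignee(comment: str):
    """Return the name inside the parentheses after TODO, or None."""
    match = ASSIGNEE_RE.search(comment)
    return match.group(1).strip() if match else None


def summarize(todos, as_of: date) -> dict:
    """Summarize a non-empty list of (author, added_on, comment) tuples."""
    ages = [(as_of - added).days for _, added, _ in todos]
    top_author, top_count = Counter(a for a, _, _ in todos).most_common(1)[0]
    return {
        "total": len(todos),
        "with_assignee": sum(1 for _, _, c in todos if assignee(c)),
        "mean_age_days": sum(ages) / len(ages),
        "oldest": min(added for _, added, _ in todos),
        "top_author": (top_author, top_count),
    }


# Made-up sample data; authors and dates are illustrative only.
todos = [
    ("alice", date(2014, 6, 6), "// TODO: clean this up"),
    ("bob", date(2019, 12, 9), "// TODO (alice) handle errors"),
    ("alice", date(2018, 3, 1), "// TODO(bob): add tests"),
]
stats = summarize(todos, as_of=date(2019, 12, 10))
print(stats)
```

Note that the author here means whoever git blame credits with the line, which is not always the person who originally wrote the TODO.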
Conclusions and Questions
These results are from a fairly off-the-cuff look at what TODO comments in the Kubernetes source code look like. We get a sense of the top TODO creators, which tracks more or less with the top contributors to the project.
We also see that for “large” source code, developer behavior around TODO comments doesn’t seem to be out of the norm; there’s just more of it.
An important observation is that there are more TODO comments than there are open GitHub issues. This is interesting, in that it indicates a significant amount of latent “work”, or to-do items, that is not easily accessible unless you spend time in the source code itself.
Core contributors likely have a good idea of their area of the codebase and strong intuitions about their own TODOs and “latent work.” This is fairly opaque to outside observers, though. GitHub issues (or other public ticket trackers) are more easily accessible to those not “in the weeds” of the project.
As most developers understand, software projects “live and breathe.” There’s frequent change, continuous improvement, constant imperfection and lots of discussion. Workflow and process are very important because good code requires continual reflection. We see a part of this in action through the use of TODO comments in the Kubernetes source. Even without a benchmark, though, an average TODO age of 2.3 years does seem quite high. Those closer to the code will be much better able to pass judgment; it would be interesting to see how this source code compares to that of other big open source projects.
A more in-depth analysis of a codebase’s TODOs might involve a look at all of the TODOs in the history, not just the ones currently in the source code.
- What’s the rate at which TODOs are closed over time?
- What’s the average lifetime of a TODO comment?
- How do popular codebases compare to one another?
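The first two questions could be answered with something like the sketch below, given a record of when each TODO was added and, if applicable, removed. We have not built this; the (added_on, removed_on) pair format, the lifetime_stats helper, and the sample history are assumptions for illustration, and extracting such pairs from git history is the hard part that this sketch leaves out.

```python
from datetime import date


def lifetime_stats(history, as_of: date) -> dict:
    """Summarize (added_on, removed_on or None) pairs, one per TODO ever seen."""
    # Lifetimes of TODOs that were eventually removed from the code.
    closed = [(removed - added).days for added, removed in history if removed]
    # Ages of TODOs that are still in the code as of the given date.
    open_ages = [(as_of - added).days for added, removed in history if not removed]
    return {
        "closed_fraction": len(closed) / len(history) if history else 0.0,
        "mean_closed_lifetime_days": sum(closed) / len(closed) if closed else None,
        "mean_open_age_days": sum(open_ages) / len(open_ages) if open_ages else None,
    }


# Made-up history; dates are illustrative only.
history = [
    (date(2018, 1, 1), date(2018, 1, 31)),  # removed after 30 days
    (date(2018, 1, 1), date(2018, 4, 11)),  # removed after 100 days
    (date(2019, 1, 1), None),               # still in the code
    (date(2019, 12, 1), None),              # still in the code
]
stats = lifetime_stats(history, as_of=date(2019, 12, 11))
print(stats)
```

Keeping closed lifetimes separate from the ages of still-open TODOs matters: the 2.3-year figure in this post is an average age of surviving TODOs only, so it says nothing about how quickly the removed ones were dealt with.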
Does it Matter?
TODO comments typically cover the type of work that might be too small for a ticket, but important enough to note and describe in a code comment (though plenty of TODOs will reference issues/tickets). Since they are part of the code, they are often “closer” to the work that needs to get done. They are easy to add, but, it seems, just as easy to lose (there are 1.8k+ TODOs added prior to 2019 still in the Kubernetes source).
We hope that by creating a tool that surfaces metadata about code, we can make it easier for software developers to get work done, in projects of any size. Surfacing TODOs is just one piece of that.