Comments on style and best practices make up at least 20% of code review comments. However, is this the optimal proportion of review time to spend on this area?
But code review can be ineffective.
Your ability to spot defects (bugs) drops after you have reviewed 200 lines of code in one hour.
This previous paper also makes this obvious: if you’re reviewing 50 lines of code you will find many issues; if you’re reviewing 1M lines of code this is you:
Code reviews can also be ineffective when the comments are overly concerned with style and best practices.
After many conversations with engineers, I have a feeling that we are spending too much time focusing on style, format and best practices.
So this is our question:
What percentage of comments are about code style, formatting issues and best practices?
To find interesting data points, I followed this process:
- Download a month’s worth of GitHub’s open source activity
- Extract all pull request comments
- Identify some patterns on a smaller sample
- Try to count those patterns in the whole data set
So in essence, we’re searching for patterns in GitHub’s activity.
GitHub kindly makes its data available for analysis through githubarchive.org.
I downloaded the activity for March and extracted all the pull request comments made on open source projects. You can find the 273,416 pull request comments here.
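The extraction step could look roughly like the sketch below. It assumes the archive files are gzipped, with one JSON event per line, and that the comments of interest are events of type `PullRequestReviewCommentEvent` carrying their text in `payload.comment.body`. The function names are hypothetical and this is not necessarily the script used for the analysis.

```python
import gzip
import json


def pull_request_comments(lines):
    """Yield comment bodies from an iterable of JSON-encoded events."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        # Assumption: pull request comments are stored as this event type.
        if event.get("type") != "PullRequestReviewCommentEvent":
            continue
        # Tolerate events whose payload or comment is missing or null.
        payload = event.get("payload") or {}
        comment = payload.get("comment") or {}
        body = comment.get("body")
        if body:
            yield body


def comments_from_archive(path):
    """Read one gzipped archive file and return its comment bodies."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return list(pull_request_comments(handle))
```

Calling `comments_from_archive` on each downloaded file and concatenating the results would give the full list of comments for the month.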
Then, I tried to find the most frequently recurring phrases in a subset of shorter comments. I filtered the data set by comment length and compared every string against every other one using the Jaccard index.
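For reference, the Jaccard index of two sets is the size of their intersection divided by the size of their union. The sketch below is one possible way to apply it to comments; treating each comment as a set of lowercase words and using a threshold of 0.5 are assumptions made for illustration, not details taken from the analysis.

```python
def jaccard_index(a, b):
    """Jaccard index of two comments, each treated as a set of lowercase words."""
    words_a = set(a.lower().split())
    words_b = set(b.lower().split())
    if not words_a and not words_b:
        return 1.0  # two empty comments are considered identical
    return len(words_a & words_b) / len(words_a | words_b)


def similar_pairs(comments, threshold=0.5):
    """Return index pairs (i, j), with i < j, whose similarity reaches the threshold."""
    pairs = []
    for i in range(len(comments)):
        for j in range(i + 1, len(comments)):
            if jaccard_index(comments[i], comments[j]) >= threshold:
                pairs.append((i, j))
    return pairs
```

Grouping the pairs returned by `similar_pairs` is what lets near-identical comments be counted together as one pattern.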
Here are the emerging patterns from the selection:
List of comments and the number of times similar comments appeared in the sampled data set
This is interesting because it shows formatting issues at the top.
This is a very limited view, because I filtered the data set heavily to get results quickly. The comparison is O(N²) and I didn’t want to wait for 74B comparisons, so the data set was filtered to include only strings between 20 and 30 characters long.
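The arithmetic behind that figure is easy to check. With 273,416 comments, comparing every string against every other one gives roughly 74.8 billion comparisons, or about half that if each unordered pair is compared only once. The helper names below are hypothetical, and treating the 20 and 30 character bounds as inclusive is an assumption.

```python
def filter_by_length(comments, min_len=20, max_len=30):
    """Keep only comments whose length falls within the (inclusive) bounds."""
    return [c for c in comments if min_len <= len(c) <= max_len]


def all_against_all(n):
    """Number of comparisons when each of n strings is compared with all n."""
    return n * n


def unordered_pairs(n):
    """Number of comparisons when each unordered pair is compared once."""
    return n * (n - 1) // 2


total_comments = 273416
print(all_against_all(total_comments))   # 74756309056
print(unordered_pairs(total_comments))
```

Either way the cost grows quadratically, which is why shrinking the data set before comparing makes such a difference.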
Smaller comments may also be more targeted towards styling and formatting issues.
So, while tempting, we shouldn’t draw any conclusions from this table.
However, this table gave us an interesting enough picture to proceed.
We see that these comments use certain words that reveal the nature of the comment.
By counting the words that signal a reference to style, format or best practices, we can get better insight into how many of these comments exist.
And so we selected expressions from the previous findings and counted the number of comments that contained them.
We see that the number of comments with matches amounts to 20% of the total number of comments.
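A minimal sketch of this kind of count is shown below. The expressions in `STYLE_EXPRESSIONS` are hypothetical placeholders, not the list used in the analysis, and the matching is a simple case-insensitive substring search. Each comment is counted at most once, even if it contains several expressions or the same expression several times.

```python
import re

# Hypothetical expressions for illustration only; not the list used in the analysis.
STYLE_EXPRESSIONS = ["indent", "whitespace", "newline", "typo", "naming", "style"]


def count_matching(comments, expressions=STYLE_EXPRESSIONS):
    """Count the comments that contain at least one of the expressions."""
    pattern = re.compile("|".join(re.escape(e) for e in expressions), re.IGNORECASE)
    return sum(1 for comment in comments if pattern.search(comment))


def matching_share(comments, expressions=STYLE_EXPRESSIONS):
    """Percentage of comments that contain at least one of the expressions."""
    if not comments:
        return 0.0
    return 100.0 * count_matching(comments, expressions) / len(comments)
```

Running `matching_share` over the full list of pull request comments yields a single percentage comparable to the figure reported above, although the result depends heavily on which expressions are chosen.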
There are limitations to this analysis.
Any word can appear more than once in a comment. Given the nature of the words analyzed, I think this could be a good enough approximation.
I stopped my word counts after reaching the round number of 20 percent, but the real figure could in fact be much higher. There are many best practices comments that cannot be identified by the keywords they contain.
I wanted to study GitHub pull request comments and find out how many of them are related to styling issues.
Finding that 20% of these comments relate to styling and best practices is good evidence that we’re concerned about the way our code looks.
In my opinion, we should move towards complete automation and reduce the time invested in enforcing these rules.