Code review comments: should 20% be about style and best practices?

Comments on style and best practices account for at least 20% of the time developers spend reviewing code [1]. But is that the right share of review time to spend on this area?

If you are not currently dedicating time to code review at all, you should be: code reviews have become an integral part of the modern development workflow [2], [3].

But code review can be ineffective.
Your ability to spot defects (bugs) drops once you review more than 200 lines of code per hour [4].
The same paper makes the point clear: if you’re reviewing 50 lines of code you will find many issues; if you’re reviewing 1M lines of code, you will likely miss most of them.

Code reviews can also be ineffective when the comments are overly concerned with style and best practices.

After many conversations with engineers, I suspect we are spending too much time on style, formatting and best practices.

So this is our question:

What percentage of comments are about code style, formatting issues and best practices?

Process

To find interesting data points, I followed this process:

  1. Download a month’s worth of GitHub’s open source activity
  2. Extract all pull request comments
  3. Identify some patterns on a smaller sample
  4. Try to count those patterns in the whole data set

So in essence, we’re searching for patterns in GitHub’s activity.

GitHub’s public activity data is available for analysis through githubarchive.org.

I downloaded the activity for March and extracted all the pull request comments made on open source projects. The resulting data set contains 273416 pull request comments.
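The extraction step could look roughly like the sketch below. It rests on several assumptions that the article does not confirm: that each archive file is gzip-compressed with one JSON event per line, that pull request comments appear as events of type `PullRequestReviewCommentEvent`, and that the comment text sits under `payload.comment.body`. The function names and the file path argument are hypothetical.

```python
import gzip
import json
from typing import Iterable, Iterator


def pull_request_comments(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text of every pull request review comment in a stream of events.

    Assumes one JSON event per line, with the comment text stored under
    payload.comment.body (an assumption about the archive layout).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines rather than abort the whole run
        if not isinstance(event, dict):
            continue
        if event.get("type") != "PullRequestReviewCommentEvent":
            continue
        payload = event.get("payload") or {}
        comment = payload.get("comment") or {}
        body = comment.get("body")
        if body:
            yield body


def comments_from_archive(path: str) -> Iterator[str]:
    """Read one gzip-compressed archive file (hypothetical local path)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from pull_request_comments(handle)
```

Keeping the parsing separate from the file handling means the filter can be tried on a handful of in-memory lines before running it over a whole month of files.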

Then I looked for the most frequently recurring phrases in a subset of shorter comments. I filtered the data set by comment length and compared every string against every other string using the Jaccard index.
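As an illustration of this comparison step, here is a minimal sketch of a pairwise Jaccard comparison. The article does not say how comments were tokenised or what similarity threshold was used, so the lowercase word split and the `threshold` value of 0.8 are assumptions, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import List, Set


def tokens(comment: str) -> Set[str]:
    """Lowercased whitespace-separated words (assumed tokenisation)."""
    return set(comment.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index: size of the intersection divided by size of the union."""
    union = a | b
    if not union:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(union)


def similar_counts(comments: List[str], threshold: float = 0.8) -> Counter:
    """For each comment text, count the other comments at least `threshold` similar.

    Every pair is compared once, so the cost grows quadratically
    with the number of comments.
    """
    token_sets = [tokens(c) for c in comments]
    counts: Counter = Counter()
    for i, j in combinations(range(len(comments)), 2):
        if jaccard(token_sets[i], token_sets[j]) >= threshold:
            counts[comments[i]] += 1
            counts[comments[j]] += 1
    return counts
```

Calling `similar_counts(comments).most_common(10)` would then list the comment texts with the most near-duplicates, which is one way to surface recurring phrases.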

Data

Here are the emerging patterns from the selection:

Table: list of comments and the number of times similar comments appeared in the sampled data set

This is interesting because it shows formatting issues at the top.

This is a very limited view, because I filtered the data set heavily to get results quickly. Comparing every comment against every other is O(N²), and I didn’t want to wait for roughly 74 billion comparisons on the full data set. So the data set was filtered to keep only strings between 20 and 30 characters long.
Shorter comments may also skew towards styling and formatting issues.
So, while tempting, we shouldn’t draw any conclusions from this table.
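To check the scale of the full comparison, the arithmetic can be written out in a few lines. The 74B figure matches N² for the 273416 comments in the data set. Comparing each distinct pair only once would halve that, which is still far too many for a quick experiment. The function names are illustrative.

```python
def ordered_pairs(n: int) -> int:
    """Comparisons if every string is checked against every string: n * n."""
    return n * n


def unordered_pairs(n: int) -> int:
    """Comparisons if each distinct pair is checked once: n * (n - 1) / 2."""
    return n * (n - 1) // 2


total_comments = 273416  # size of the full data set quoted earlier

print(ordered_pairs(total_comments))    # 74756309056, about 74.8 billion
print(unordered_pairs(total_comments))  # 37378017820, about 37.4 billion
```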

However, this table gave us an interesting enough picture to proceed.

Table targeting smaller comments

We see that these comments use certain words that reveal the nature of the comment.

By counting the words that signal a reference to style, formatting or best practices, we can get a better estimate of how many such comments exist.

And so we selected expressions from the previous findings and counted the number of comments that contained them.

We see that the number of comments with matches amounts to 20% of the total number of comments.
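A count of this kind might be sketched as follows. The expressions in `STYLE_EXPRESSIONS` are hypothetical placeholders, since the article's actual list is not reproduced in the text, and the case-insensitive substring matching is likewise an assumption. This version counts each comment at most once, even if several expressions match or one appears repeatedly.

```python
import re
from typing import Iterable, List


# Hypothetical expressions: the article's actual list is not reproduced here.
STYLE_EXPRESSIONS: List[str] = ["indent", "whitespace", "formatting", "naming", "typo"]


def share_with_match(comments: Iterable[str], expressions: List[str]) -> float:
    """Fraction of comments containing at least one expression (case-insensitive).

    Each comment counts at most once, however many expressions it contains.
    """
    if not expressions:
        return 0.0
    pattern = re.compile("|".join(re.escape(e) for e in expressions), re.IGNORECASE)
    total = 0
    matched = 0
    for comment in comments:
        total += 1
        if pattern.search(comment):
            matched += 1
    return matched / total if total else 0.0
```

Multiplying the result by 100 gives a percentage comparable to the 20% figure, although the exact value depends entirely on which expressions are chosen.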

Limitations

There are limitations to this analysis.

Any word can appear more than once in a comment, which can distort the counts. Given the nature of the words analyzed, I think the result is still a good enough approximation.

I stopped counting words once I reached the round number of 20 percent, but the real share could in fact be much higher. Many best-practice comments cannot be identified by the keywords they contain.

Conclusion

I wanted to study GitHub pull request comments and find out how many of them are related to styling issues.

Finding that 20% of these comments relate to styling and best practices is good evidence that we care about the way our code looks.

It is my opinion that we should move towards complete automation and reduce the time invested in enforcing these rules.



References

1: http://www.quora.com/How-much-per-day-or-week-do-engineers-spend-doing-code-review-at-companies-such-as-Google-Facebook-GitHub-Twitter-Foursquare-etc
2: http://blog.codinghorror.com/code-reviews-just-do-it/
3: http://blogs.atlassian.com/2014/03/every-team-needs-kick-ass-code-reviews/
4: http://www.pitt.edu/~ckemerer/PSP_Data.pdf