Code review comments: should 20% be about style and best practices?


Reviewing code is estimated to take up at least 20% of software development time [1]. How much of that time should go into comments on style and best practices?

If you are not currently dedicating time to code review, you should be: code reviews have become an integral part of the modern development workflow [2], [3].

But code review can be ineffective.
Your ability to spot defects (bugs) drops once you review more than 200 lines of code in an hour [4].
The same paper makes the point plain: if you’re reviewing 50 lines of code you will find many issues; if you’re reviewing 1M lines of code, you will not.

Code reviews can also be ineffective when the comments are overly concerned with style and best practices.

After many conversations with engineers, I have a feeling that we are spending too much time on style, formatting and best practices.

So this is our question:

What percentage of comments are about code style, formatting issues and best practices?

Process

To find interesting data points, I followed this process:

  1. Download a month’s worth of GitHub’s open source activity
  2. Extract all pull request comments
  3. Identify some patterns on a smaller sample
  4. Try to count those patterns in the whole data set

So in essence, we’re searching for patterns in GitHub’s activity.

GitHub’s public activity data is available for analysis through githubarchive.org.

I downloaded the activity for March and extracted all the pull request comments made on open source projects, 273416 comments in total.
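
As a rough illustration of the extraction step, here is a minimal Python sketch. It assumes each archive file is gzipped, newline-delimited JSON, that review comments arrive as events of type `PullRequestReviewCommentEvent`, and that the comment text sits under `payload.comment.body`. The function names are hypothetical, and this is not necessarily the script used for this article.

```python
import gzip
import json
from typing import Iterable, Iterator


def pull_request_comments(lines: Iterable[str]) -> Iterator[str]:
    """Yield the body of every pull request review comment in a stream of events."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        # Assumption: review comments use this event type.
        if event.get("type") != "PullRequestReviewCommentEvent":
            continue
        # Assumption: the comment text lives under payload.comment.body.
        comment = (event.get("payload") or {}).get("comment") or {}
        body = comment.get("body")
        if body:
            yield body


def comments_from_archive(path: str) -> Iterator[str]:
    """Read one gzipped, newline-delimited JSON archive file from disk."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from pull_request_comments(handle)
```

Calling `comments_from_archive` on each downloaded file and chaining the results would give one flat stream of comment bodies to analyze.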

Then, I tried to find the most recurring phrases in a subset of smaller comments. I filtered the data set by comment length and compared every string against every other string using the Jaccard index.
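
A minimal sketch of this kind of pairwise comparison is shown below. It assumes comments are compared as sets of lowercase words, which is only one possible choice, and the similarity threshold and function names are hypothetical. The default length bounds follow the 20 to 30 character filter described later in this article.

```python
from itertools import combinations


def tokens(comment: str) -> frozenset:
    """Split a comment into a set of lowercase words (an assumed tokenization)."""
    return frozenset(comment.lower().split())


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard index: size of the intersection divided by size of the union."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def similar_pairs(comments, min_len=20, max_len=30, threshold=0.5):
    """Compare every short comment against every other one: O(N^2) comparisons."""
    short = [c for c in comments if min_len <= len(c) <= max_len]
    sets = [tokens(c) for c in short]
    pairs = []
    for i, j in combinations(range(len(short)), 2):
        score = jaccard(sets[i], sets[j])
        if score >= threshold:  # hypothetical cut-off for "similar"
            pairs.append((short[i], short[j], score))
    return pairs
```

Grouping the pairs returned by `similar_pairs` by their most frequent member is one way to surface recurring phrases.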

Data

Here are the emerging patterns from the selection:

List of comments and the number of times they appeared similarly on the sampled data set

This is interesting because it shows our formatting issues at the top.

This is a very limited view, because I filtered the data set heavily to get results quickly. The comparison is O(N²) and I didn’t want to wait for 74B comparisons, so the data set was filtered to keep only strings between 20 and 30 characters long.
Smaller comments may also be more targeted towards styling and formatting issues.
So, while tempting, we shouldn’t draw any conclusions from this table.
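
For a sense of scale, the comparison count can be checked with a few lines of Python. This is only back-of-the-envelope arithmetic on the 273416 extracted comments; the function name is made up for illustration.

```python
def comparison_counts(n: int) -> tuple:
    """Return (every string against every string, each unordered pair once)."""
    ordered = n * n
    unordered = n * (n - 1) // 2
    return ordered, unordered


ordered, unordered = comparison_counts(273_416)
print(f"{ordered:,} ordered comparisons, {unordered:,} unique pairs")
```

Comparing every string against every string comes to roughly 74.8 billion comparisons, or about 37.4 billion if each unordered pair is compared only once.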

However, this table gave us an interesting enough picture to proceed.

Table targeting smaller comments

We can see that these comments use certain words that reveal the nature of the comment.

By counting the words that signal a comment about style, formatting or best practices, we can get a better idea of how many such comments exist.

And so we selected expressions from the previous findings and counted the number of comments that contained them.

We see that the comments with matches amount to 20% of the total number of comments.
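
The counting step could look something like the following sketch. The keyword list is purely hypothetical, since the expressions actually used are not reproduced in the text, and the function name is made up as well.

```python
import re

# Hypothetical expressions, chosen only to illustrate the idea.
STYLE_KEYWORDS = ["indent", "whitespace", "trailing", "newline", "naming", "typo", "style"]


def count_style_comments(comments, keywords=STYLE_KEYWORDS):
    """Count comments containing at least one keyword, plus their share of the total."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    # Each comment is counted once, however many keywords it contains.
    matched = sum(1 for comment in comments if pattern.search(comment))
    total = len(comments)
    share = matched / total if total else 0.0
    return matched, total, share
```

Counting each comment at most once keeps the share interpretable as a fraction of comments rather than a fraction of keyword hits.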

Limitations

There are limitations to this analysis.

Any word can appear more than once in a comment. Given the nature of the words analyzed, I think this could be a good enough approximation.

I stopped counting words once I reached the round number of 20 percent, but the real figure could in fact be much higher. Many comments about best practices contain none of the keywords I counted.

Conclusion

I wanted to study GitHub pull request comments and find out how many of them are related to styling issues.

Finding that 20% of these comments relate to styling and best practices is good evidence that we’re concerned about the way our code looks.

It is my opinion that we should move towards complete automation and reduce the time invested in enforcing these rules.




References

1: http://www.quora.com/How-much-per-day-or-week-do-engineers-spend-doing-code-review-at-companies-such-as-Google-Facebook-GitHub-Twitter-Foursquare-etc
2: http://blog.codinghorror.com/code-reviews-just-do-it/
3: http://blogs.atlassian.com/2014/03/every-team-needs-kick-ass-code-reviews/
4: http://www.pitt.edu/~ckemerer/PSP_Data.pdf

