Regular Expressions: Best Practices

Codacy Platform

30/03/2016

Codacy

Pretty much every major programming language supports regular expressions, and many static analysis tools have patterns (rules) that target them.

So before you look, tell us: would you expect these patterns to be the same from language to language? Totally different?

Here’s what I found:

Ruby

Starting with Ruby, we can easily find two patterns from two different tools:

Ambiguous Regex Literal (Rubocop)
Prohibits unsafe regexes (Brakeman)

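The first one flags a regex literal passed as the first argument of a method call without parentheses, where the leading / could also be read as the division operator; wrapping the argument in parentheses removes the ambiguity.

The second one flags a ReDOS (regular expression denial of service) vulnerability that arises when user-controlled input is used in a regular expression.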

So far, so good; they both make sense.
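
The second rule is easier to picture with an example. Below is a minimal sketch in JavaScript (chosen only because it is the language of the snippets later in this post; Brakeman itself analyzes Ruby code). The helper names escapeRegExp and buildSearchPattern are hypothetical, not part of any tool mentioned here. The idea is to escape every metacharacter so that user input can only ever match itself literally:

```javascript
// Hypothetical helper: escapes regex metacharacters so the text is
// treated as a literal string rather than as a pattern.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: builds a case-insensitive search pattern from
// untrusted input without letting that input introduce regex syntax.
function buildSearchPattern(userInput) {
  return new RegExp(escapeRegExp(userInput), "i");
}

// A backtracking-prone pattern, supplied as if it came from a user.
const hostileInput = "(a+)+$";
const pattern = buildSearchPattern(hostileInput);

// The input is now matched character by character, not interpreted.
console.log(pattern.test("price: (a+)+$")); // true
console.log(pattern.test("aaaa"));          // false
```

Escaping only helps when the user is supposed to provide plain text. If users really need to supply their own patterns, other safeguards (input length limits, timeouts, or a non-backtracking engine) would be needed instead.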

(you can find more about Ruby Static Analysis tools in this post)

JavaScript

In a completely different environment, we can also find regular expression rules in ESLint:

Prohibit Control Characters in Regular Expressions
Prohibit malformed Regular Expressions
Prohibit Invalid Regular Expressions
Prohibit Spaces in Regular Expressions
Prohibit Regex like division
Require Regex Literals to be Wrapped

So what are they?

The first three refer to possible typos; the fourth one suggests the usage of

var re = /foo {3}bar/;

instead of:

var re = /foo   bar/;

(seriously, can you immediately tell how many spaces there are?)
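
As a quick check, the sketch below shows that the two literals accept the same input, and that a malformed pattern built at runtime fails with a SyntaxError. The helper isValidPattern is a hypothetical name used only for this illustration:

```javascript
const compact = /foo {3}bar/;    // the quantifier states the count
const spelledOut = /foo   bar/;  // three literal spaces

const sample = "foo   bar";
console.log(compact.test(sample), spelledOut.test(sample)); // true true
console.log(compact.test("foo  bar")); // false: only two spaces

// Hypothetical helper: reports whether a pattern string compiles.
function isValidPattern(source) {
  try {
    new RegExp(source);
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err; // anything else is unexpected, so let it surface
  }
}

console.log(isValidPattern("[a-z"));  // false: unterminated class
console.log(isValidPattern("[a-z]")); // true
```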

The last two remove the ambiguity of the slash character in certain cases (in /=foo/, is /= the beginning of a regular expression or the division assignment operator?)
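
A small sketch of what these two rules ask for, assuming their usual fixes: put the = inside a character class so the literal can no longer be mistaken for /=, and wrap a regex literal in parentheses before calling a method on it. The variable names are illustrative only:

```javascript
// Reads like the start of a "/=" operator at first glance.
const ambiguous = /=foo/;

// Same meaning, but the character class makes the intent obvious.
const explicit = /[=]foo/;

console.log(ambiguous.test("a=foo"), explicit.test("a=foo")); // true true

// Wrapping the literal makes it clear that the slashes delimit a
// regular expression and are not division operators.
const wrapped = (/foo/).test("foobar");
console.log(wrapped); // true
```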

So we’re basically looking for typos, removing ambiguity and making a regular expression easier to read (hopefully).

Java

Let’s look at the Java patterns:

Invalid syntax for regular expression (FindBugs)
File.separator used for regular expression (FindBugs)
“.” or “|” used for regular expression (FindBugs)
Regex DOS (ReDOS) (Find Security Bugs)

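Here we find invalid syntax; a problem with using File.separator in a regular expression (on Windows the separator is a backslash, which a regular expression treats as an escape character); the possible typo of passing . or | as the regular expression to a string function such as split (both are metacharacters, so they do not match a literal dot or pipe); and, again, a ReDOS vulnerability.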
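
These pitfalls are not unique to Java. The sketch below reproduces their JavaScript analogs, which is an assumption of this example rather than what FindBugs checks: windowsSeparator stands in for File.separator on Windows, and escapeRegExp and compiles are hypothetical helpers.

```javascript
// Stand-in for Java's File.separator on Windows: a single backslash.
const windowsSeparator = "\\";

// Hypothetical helper: escapes regex metacharacters.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: reports whether a pattern string compiles.
function compiles(source) {
  try {
    new RegExp(source);
    return true;
  } catch (err) {
    return false;
  }
}

// A lone backslash is an incomplete escape, so it is not a valid pattern.
console.log(compiles(windowsSeparator));               // false
console.log(compiles(escapeRegExp(windowsSeparator))); // true

const path = "C:\\Users\\me";
const parts = path.split(new RegExp(escapeRegExp(windowsSeparator)));
console.log(parts); // [ 'C:', 'Users', 'me' ]

// An unescaped dot matches every character, so nothing useful is left.
const wrongPieces = "a.b.c".split(/./);
const rightPieces = "a.b.c".split(/\./);
console.log(wrongPieces.every((piece) => piece === "")); // true
console.log(rightPieces); // [ 'a', 'b', 'c' ]
```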

(you can find more about Java Static Analysis tools in this post)

Perl

It’s hard to talk about regular expressions without mentioning Perl.

Here are some regular expression patterns from Perl::Critic:

Right away, you see there are more patterns here than for the other languages in this post (in fact, there are more rules here than in the other languages combined).

While the first of these rules concerns a possible problem in the code, the vast majority of them are, in fact, related to readability, with some of them also touching on performance issues (single char alternation and unused captures, for instance).

It’s also interesting to note that the last three refer to the usage of the /s, /x and /m modifiers, also for the sake of clarity. You can read more about these modifiers in Perl’s documentation, but for brevity:

/s makes . also match \n
/x allows whitespace (and comments) inside a regular expression, so it can be laid out for readability
/m, with a multiline string, makes ^ and $ match the start and end of each line, and not just the string’s
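
To see two of these modifiers in action, here is a short JavaScript sketch. JavaScript's s and m flags behave like Perl's /s and /m for these cases; standard JavaScript has no counterpart to /x, so that one is not shown. The variable names are illustrative only:

```javascript
const text = "first\nsecond";

// Without s, the dot refuses to cross the line break.
const dotDefault = /first.second/.test(text);
const dotAll = /first.second/s.test(text);
console.log(dotDefault, dotAll); // false true

// Without m, ^ and $ only anchor to the whole string.
const anchorsDefault = /^second$/.test(text);
const anchorsPerLine = /^second$/m.test(text);
console.log(anchorsDefault, anchorsPerLine); // false true
```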

If you’re looking to see what a badly concocted regular expression might look like (as if you’ve never seen one before), here’s an example that breaks 7 of these rules (we’re looking for two slashes followed by a word character, that is, a letter, digit or underscore, and then a digit 1 or 2):

m#//([A-Za-z0-9_])(1|2)#;

This expression uses the # character as a delimiter, captures groups that are never used (trust us, we didn’t use them), alternates two chars instead of putting them in a class, and ignores the \w character class in favor of a verbose equivalent; all that, of course, combined with the absence of the three recommended modifiers.

A much cleaner version of this expression would be:

m{ //\w[12] }sxm;

All the problems have now been solved, and, let’s face it, the expression is much easier on the eyes.
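
The same cleanup can be sketched in JavaScript to confirm that the short form accepts exactly the same strings as the verbose one. This is only an illustration: the sample strings are made up, and JavaScript needs the slashes escaped because / is its delimiter.

```javascript
// Verbose form: unused captures, enumerated class, single-char alternation.
const verbose = /\/\/([A-Za-z0-9_])(1|2)/;

// Cleaner form: \w replaces the enumerated class, [12] the alternation.
const concise = /\/\/\w[12]/;

const samples = ["http://a1", "//_2", "//a3", "/a1", "//-1"];

for (const sample of samples) {
  console.log(sample, verbose.test(sample), concise.test(sample));
}

// Both patterns agree on every sample.
const agree = samples.every((s) => verbose.test(s) === concise.test(s));
console.log(agree); // true
```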

If you are just starting out with regular expressions in Perl, this cheat sheet can help a great deal. It lists the character classes, special characters, and modifiers used in regular expressions, with explanations.

Best Practices for Regular Expressions

The vast majority of these rules are connected to four different things:

readability
performance
possible typos
ReDOS vulnerabilities

Learning about these rules will help you write better regular expressions.

Oddly, while some patterns exist for several languages, many don’t. In some cases that makes sense, as they refer to intricacies of a particular language; in others, there may simply not be enough demand for them, or it may just be a matter of time until someone implements them.

In any case, regardless of the language(s) you’re using, we highly recommend running a static analysis tool on your code to catch these and other problems; better still, use a tool that combines the advantages of different tools without disrupting your workflow, such as Codacy.

Edit: We just published an ebook: “The Ultimate Guide to Code Review” based on a survey of 680+ developers. Enjoy!
