Readable Regular Expressions

My main point of focus at work lately has been promoting maintainable code. One of the key tenets is readable code. The single responsibility principle and a low cyclomatic complexity are important, but if you are still using cryptic, prefixed, acronymed, and highly abbreviated identifiers, it is still going to be a chore for the reader to decipher. My slogan: "let's take the code out of source code".

I was just listening to Roy Osherove talk about regular expressions on .NET Rocks. A recurring theme brought up was how hard regular expressions are to deal with. Not necessarily creating them - you can do a lot by just knowing the basics - but dealing with them after they've been written. As they mentioned on the show, your source code ends up looking like a cartoon character swearing, which is the likely response you'll get from the poor maintenance developer that has to deal with it. Regular expressions are often referred to as a "write-only" language.

It got me thinking that this was a problem worth solving. Regular expressions are too powerful to ignore. For a certain set of problems, a regular expression can eliminate a LOT of potentially error-prone code. I cannot justify advocating avoiding regular expressions, no matter how much I value source readability. So what if we could make regular expressions readable?

Inspired by Ayende's Rhino.Mocks syntax, I created a library that provides a better way to define regular expressions in your source code. The easiest way to describe it is to show it in action. Suppose we want to check for social security numbers. You might write code like this:

    Regex socialSecurityNumberCheck = new Regex(@"^\d{3}-?\d{2}-?\d{4}$");
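To make concrete what that terse pattern accepts, here is a minimal sketch that uses only the standard System.Text.RegularExpressions.Regex class. The `SsnCheckDemo` class, the `LooksLikeSsn` helper, and the sample inputs are made up for illustration; they are not part of any library.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SsnCheckDemo
{
    // Same pattern as above: 3 digits, optional hyphen, 2 digits, optional hyphen, 4 digits.
    private static readonly Regex SocialSecurityNumberCheck =
        new Regex(@"^\d{3}-?\d{2}-?\d{4}$");

    // Hypothetical helper name, used only for this illustration.
    public static bool LooksLikeSsn(string candidate)
    {
        return SocialSecurityNumberCheck.IsMatch(candidate);
    }

    public static void Main()
    {
        // Made-up sample inputs.
        string[] candidates =
        {
            "123-45-6789", "123456789", "123-456789", "12-345-6789", "123-45-678"
        };

        foreach (string candidate in candidates)
        {
            Console.WriteLine("{0,-12} {1}", candidate, LooksLikeSsn(candidate));
        }
    }
}
```

The first three inputs match and the last two do not. Note that each hyphen is optional on its own, so a mixed form such as "123-456789" is accepted, a detail that is easy to miss when skimming the compact pattern.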

Using ReadableRex (not settled on the name yet...), it would look like:

    Regex socialSecurityNumberCheck = new Regex(Pattern.With.AtBeginning
        .Digit.Repeat.Exactly(3)
        .Literal("-").Repeat.Optional
        .Digit.Repeat.Exactly(2)
        .Literal("-").Repeat.Optional
        .Digit.Repeat.Exactly(4)
        .AtEnd);

You could argue that the second example is actually harder to read, because the reader is bogged down with the details of how a social security number check is performed. It may be a bad example, because the algorithm for detecting an SSN is both well-known (in the US, at least) and unlikely to change. Consider a situation where the expected match is not well-known and is very likely to change: screen scraping HTML. In that case, being able to read through the algorithm and easily identify which parts need to change becomes very important. To illustrate, I dug up some old code that was used to scrape basketball scores from espn.com. It's a good example of an ugly pattern that had to be maintainable, since the HTML layout could change at any time.

    const string findGamesPattern = @"<div\s*class=""game""\s*id=""(?<gameID>\d+)-game""(?<content>.*?)<!--gameStatus\s*=\s*(?<gameState>\d+)-->";

Using ReadableRex:

    Pattern findGamesPattern = Pattern.With.Literal(@"<div")
        .WhiteSpace.Repeat.ZeroOrMore
        .Literal(@"class=""game""").WhiteSpace.Repeat.ZeroOrMore.Literal(@"id=""")
        .NamedGroup("gameID", Pattern.With.Digit.Repeat.OneOrMore)
        .Literal(@"-game""")
        .NamedGroup("content", Pattern.With.Anything.Repeat.Lazy.ZeroOrMore)
        .Literal(@"<!--gameStatus")
        .WhiteSpace.Repeat.ZeroOrMore.Literal("=").WhiteSpace.Repeat.ZeroOrMore
        .NamedGroup("gameState", Pattern.With.Digit.Repeat.OneOrMore)
        .Literal("-->");

I think this would be much easier to maintain. Note that this library doesn't actually perform any regular expression operations - it simply provides another way to define regular expression patterns. You still need to use the System.Text.RegularExpressions.Regex object with the pattern you create. Since the Pattern type has an implicit conversion to System.String, you can easily pass it to the methods/constructors on Regex.
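For readers unfamiliar with implicit conversions, the mechanism can be sketched roughly as follows. This is not the library's actual code: `SketchPattern`, `Start`, `Literal`, and `Digits` are hypothetical names invented for this illustration, and only the Regex calls are real .NET APIs.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical stand-in for a fluent pattern builder (an assumption, not ReadableRex itself).
public sealed class SketchPattern
{
    private readonly StringBuilder _text = new StringBuilder();

    public static SketchPattern Start()
    {
        return new SketchPattern();
    }

    // Appends text that should be matched literally, escaping regex metacharacters.
    public SketchPattern Literal(string text)
    {
        _text.Append(Regex.Escape(text));
        return this;
    }

    // Appends "one or more decimal digits".
    public SketchPattern Digits()
    {
        _text.Append(@"\d+");
        return this;
    }

    public override string ToString()
    {
        return _text.ToString();
    }

    // Lets a SketchPattern be used wherever a string pattern is expected.
    public static implicit operator string(SketchPattern pattern)
    {
        return pattern.ToString();
    }
}

public static class ImplicitConversionDemo
{
    public static void Main()
    {
        SketchPattern orderNumber = SketchPattern.Start().Literal("order-").Digits();

        // Both calls take a string pattern; the compiler inserts the conversion.
        Regex regex = new Regex(orderNumber);
        Console.WriteLine(regex.IsMatch("order-1234"));          // True
        Console.WriteLine(Regex.IsMatch("order-", orderNumber)); // False
    }
}
```

The builder only assembles pattern text; all matching is still done by Regex, which is the division of labor described above.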

What do you think? Download the code or just the assembly DLL, give it a try, and tell me what you think. None of the method/property names are set in stone, so the syntax may change, but the approach will remain the same.

Saturday, October 07, 2006 12:15:27 PM (Central Standard Time, UTC-06:00)

 

Saturday, October 07, 2006 2:06:26 PM (Central Standard Time, UTC-06:00)
Great work. I'm definitely going to try this out!
Friday, October 13, 2006 1:47:15 PM (Central Standard Time, UTC-06:00)
Great Job! How about calling the project ReadEx, for READable regular EXpression? I like shorter names :D
Tuesday, October 17, 2006 8:30:52 PM (Central Standard Time, UTC-06:00)
Whoa... I never would have thought of trying that. But I'll certainly be trying your library out.

I think you're on to something!
Monday, October 23, 2006 8:28:37 AM (Central Standard Time, UTC-06:00)
Dammit, I was working on this exact same blog post! :)

Someone needs to hire you away from Hell.com and put the rest of your creativity to effective work on a day-to-day basis.
Monday, October 23, 2006 9:05:18 AM (Central Standard Time, UTC-06:00)
I found this announcement of a readable regular expressions library...
Monday, October 23, 2006 10:45:10 AM (Central Standard Time, UTC-06:00)
I like it, even if I prefer the more terse regex "language."

One suggestion:

Instead of

{Digit|WhiteSpace}.Repeat.ZeroOrMore ==> {Digit|WhiteSpace}.Optional
{Digit|WhiteSpace}.Repeat.OneOrMore ==> {Digit|WhiteSpace}.Required

Or at least, instead of ZeroOrMore or OneOrMore use Optional or Required respectively.
Monday, October 23, 2006 11:42:12 AM (Central Standard Time, UTC-06:00)
Great idea...
But how about those really hard to craft/understand expressions?
For example, in my project we use the following expressions:

^((?!my string).)*$
\A((?!my string).)*$\Z

Any idea what they do?

*SPOILER*
They're actually NOT operators, matching text that does not contain the "my string" phrase. The first one is for single line search, and the second is for multiline.
*END SPOILER*

I'd sure love to see a fluent expression that describes those expressions.
Wednesday, October 25, 2006 4:04:54 AM (Central Standard Time, UTC-06:00)
I think you're wrong personally. Before I read anything about your article, I tried to read your easy to read expression, it was easier to read, but not easier to understand.

I think people just have to find a nice guide to learning regular expressions. They're very powerful and useful and if you put the effort in, it pays off. You *could* learn to do it this way, but I think in the end it would be just as hard trying to remember what does what as it is to remember what a character does.

Nice idea though!
Wednesday, October 25, 2006 4:47:41 AM (Central Standard Time, UTC-06:00)
A nice idea in principle, but from the example in this post it strikes me that a user of this library needs to learn a fairly complex syntax which is almost as far from "plain english" as regex, when they could simply learn how to do regex.

'Literal("-").Repeat.Optional' does not automatically imply the same in my mind as '-?', but instead something more like '-*' or '-+'.

In my opinion a better system would be a regex compiler whereby you can communicate what you want in a far more English way, something along the lines of <a href="http://blogs.msdn.com/ericgu/archive/2003/07/07/52362.aspx">Regular Expression Workbench</a> but perhaps even simpler.
Wednesday, October 25, 2006 4:49:32 AM (Central Standard Time, UTC-06:00)
May I suggest "Some html is allowed" be elaborated on? :( Never mind, I expect those reading the comment can extract the URL anyhow... maybe they can construct a regex to do it for them ;)
Wednesday, October 25, 2006 10:18:31 PM (Central Standard Time, UTC-06:00)
Thanks for the feedback, everyone.

Omer: Yes, there are some edge cases that my proof-of-concept code does not cover. That doesn't mean they can't be done, it just means I didn't want to spend the time to cover every case before asking for feedback. Thanks for providing a real-world case that I can use for testing when I get to that. Any suggestions on how you think it should look?

John & Chris: I know there are other ways that would help make "creating" regular expressions easier (learning the syntax, or using a workbench tool). My attempt was to make "reading" regular expressions easier. Of course I understand that is completely subjective. If you think the one character symbols and punctuation are easier to understand, who am I to disagree?
I completely expected disagreements about the method names that I chose - that's why I said none of the names are set in stone and was asking for feedback. It sounds like the choice of "Repeat.Optional" for '0 or 1' was not a good choice, as a couple people have mentioned it. Any alternative suggestions? Would you prefer a "wrapping" appearance for optional, like this:
Pattern.With.Digit.Optional(Pattern.With.Literal("-")); // renders as: \d-?

I originally did all the repetition using that wrapping syntax, but felt it didn't read as smoothly.

Thanks for the reminder Chris - I need to fix DasBlog so it doesn't show "Some html is allowed" when I have all HTML disabled. It normally lists the symbols you allow, but since I don't allow anything, it doesn't list anything...
Monday, October 30, 2006 12:38:36 AM (Central Standard Time, UTC-06:00)
Hi Joshua,

Very cool! You might also be interested in this approach, which is a way to get a similar result with a more concise syntax: http://dotnet.agilekiwi.com/blog/2006/10/shorthand-interfaces.html
Monday, October 30, 2006 2:41:22 PM (Central Standard Time, UTC-06:00)
Very cool!

I wrote about the need for a better RegEx syntax last year, but didn't have any ideas on how to implement it. This is a really cool solution to the "write only" RegEx syntax.

http://weblogs.asp.net/jgalloway/archive/2005/11/02/429218.aspx
Sunday, February 18, 2007 12:00:34 AM (Central Standard Time, UTC-06:00)
Sweet.

Maybe we could put a collection of common expression together using the lib?

Schneider
Wednesday, October 31, 2007 12:46:26 PM (Central Standard Time, UTC-06:00)
Absolutely ridiculous. If you can't read/write regex, stop coding. You turned 1 line of code into 10. Maintainability is also related to the number of lines of code. Working with .NET must have clouded your thinking. Objects are not the be all end all of the programming world. Just sit down and learn Perl, then this stuff won't be so hard for you.
Wednesday, October 31, 2007 5:16:43 PM (Central Standard Time, UTC-06:00)
Steve - if you can't read/write your code using 1s and 0s, please stop coding. Working with higher level languages like perl must have clouded your thinking.
Friday, November 02, 2007 7:42:09 AM (Central Standard Time, UTC-06:00)
The only thing that absolutely stinks about fluent interfaces is their debuggability. The compiler treats this as one line of code, so it's impossible to step into individual calls.

The only way I could help this was to create a debug visualizer, you might look into that. You could hover over the variable, and the visualizer could show you the resultant regex. Just a thought...
Sunday, November 04, 2007 2:59:00 PM (Central Standard Time, UTC-06:00)
Josh: I *can* read and write in hex. Well, a workable Z80 subset. However, you ignored the main point: ones and zeroes are in no way shorter than higher-level code.

I think there is a misconception of readability here: The problem is not that you need to know what the syntactical elements of the regexp mean. Rather you need to understand what the whole expression actually does, and the verbose form doesn't help here at all.

Also I think the fluency suffers from being forced into the chained methods. There is no syntactical indication that 'Repeats' modifies the meaning of the previous entity while 'Digits' doesn't.

Jimmy: There is nothing to debug here anyway: The object incantation just produces a regular regexp.
Sunday, November 04, 2007 8:36:49 PM (Central Standard Time, UTC-06:00)
Andreas: good point on the 1s and 0s - they definitely would involve more typing. My attempt at humor failed miserably.
And I completely agree about the quantifiers and grouping being ambiguous in this version of the API. It was something I wanted to improve (and you can see in the comments above that I was playing with different ideas), but never followed up on.
Friday, November 16, 2007 11:45:29 AM (Central Standard Time, UTC-06:00)
Josh, what does an email address validator look like using Readable Regex?

p.s. thanks for this library, I've used it as a tool many times to generate regular expressions, especially in helping out in the CodeProject forums.
Saturday, May 10, 2008 9:17:26 AM (Central Standard Time, UTC-06:00)
For Java programmers, something similar has been in Hamcrest for a while.
Nat
Saturday, May 10, 2008 10:03:26 AM (Central Standard Time, UTC-06:00)
To the users who think this is crap - Well, that's why you'll remain in the worker class, or should I say, last layer. Can't you see the creativity? Regex is too good to be less frequently used, which I see in many projects. In today's RAD world, time is money, and if I can get my developers rolling fast, well.....figure it out if you are good at Regex. If Mr. Flanagan was on my team, he would be considered management material. Great work!
Saturday, May 10, 2008 11:06:11 AM (Central Standard Time, UTC-06:00)
Great work! and great idea!
Saturday, May 17, 2008 5:40:54 AM (Central Standard Time, UTC-06:00)
I think the idea is sound, but the syntax seems lacking; in particular I agree with the comment regarding ambiguity about whether any given operation modifies the preceding one or not.

Perhaps there should be more focus on building sub-expressions in the Regex (as with regular Expression<Func<T>>), rather than continuous chains which are little more than longhand for the existing syntax? Perhaps also more extensions which encapsulate common patterns, eg .Digit(3); .Digit(3.OrMore). Something highly composable.

Dunno.. difficult one to solve.
Saturday, May 17, 2008 8:22:47 AM (Central Standard Time, UTC-06:00)
Thanks for the feedback Keith. C# 3.0 wasn't released when I first wrote this - it may be worth revisiting to try and take advantage of the new syntactical conveniences.
All content © 2008, Joshua Flanagan
Joshua Flanagan
I have been developing software professionally for 10 years; focusing on .NET since its release. I use this site to interact with, and contribute to, the .NET software development community.
Microsoft Certified Application Developer


The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.
