Readable Regular Expressions

My main point of focus at work lately has been promoting maintainable code. One of the key tenets is readable code. The single responsibility principle and a low cyclomatic complexity are important, but if you are still using cryptic, prefixed, acronymed, and highly abbreviated identifiers, it is still going to be a chore for the reader to decipher. My slogan: "let's take the code out of source code".

I was just listening to Roy Osherove talk about regular expressions on .NET Rocks. A recurring theme brought up was how hard regular expressions are to deal with. Not necessarily creating them - you can do a lot by just knowing the basics - but dealing with them after they've been written. As they mentioned on the show, your source code ends up looking like a cartoon character swearing, which is the likely response you'll get from the poor maintenance developer that has to deal with it. Regular expressions are often referred to as a "write-only" language.

It got me thinking that this was a problem worth solving. Regular expressions are too powerful to ignore. For a certain set of problems, a regular expression can eliminate a LOT of potentially error-prone code. I cannot justify advocating avoiding regular expressions, no matter how much I value source readability. So what if we could make regular expressions readable?

Inspired by Ayende's Rhino.Mocks syntax, I created a library that provides a better way to define regular expressions in your source code. The easiest way to describe it is to show it in action. Suppose we want to check for social security numbers. You might write code like this:

    Regex socialSecurityNumberCheck = new Regex(@"^\d{3}-?\d{2}-?\d{4}$");

Using ReadableRex (not settled on the name yet...), it would look like:

    Regex socialSecurityNumberCheck = new Regex(Pattern.With.AtBeginning
        .Digit.Repeat.Exactly(3)
        .Literal("-").Repeat.Optional
        .Digit.Repeat.Exactly(2)
        .Literal("-").Repeat.Optional
        .Digit.Repeat.Exactly(4)
        .AtEnd);
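To make the behavior of the check concrete, here is a minimal sketch that runs the hand-written pattern against a few inputs. The class name `SsnCheckDemo`, the helper `LooksLikeSsn`, and the sample inputs are invented for illustration; only the pattern itself comes from the example above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SsnCheckDemo
{
    // Same pattern as the hand-written version above.
    private static readonly Regex SocialSecurityNumberCheck =
        new Regex(@"^\d{3}-?\d{2}-?\d{4}$");

    // Hypothetical helper: true when the whole input has the 3-2-4 digit shape.
    public static bool LooksLikeSsn(string input)
    {
        return SocialSecurityNumberCheck.IsMatch(input);
    }

    public static void Main()
    {
        // Sample inputs are invented for illustration.
        Console.WriteLine(LooksLikeSsn("123-45-6789")); // True
        Console.WriteLine(LooksLikeSsn("123456789"));   // True: both hyphens are optional
        Console.WriteLine(LooksLikeSsn("123-456789"));  // True: each hyphen is optional on its own
        Console.WriteLine(LooksLikeSsn("12-345-6789")); // False: wrong digit grouping
    }
}
```

The third case is the kind of detail that is easy to miss in the compact form: each `-?` is optional independently, so mixed formats are accepted.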

You could argue that the second example is actually harder to read, because the reader is bogged down with the details of how a social security number check is performed. It may be a bad example, because the algorithm for detecting an SSN is both well-known (in the US, at least) and unlikely to change. Consider a situation where the expected match is not well-known and is very likely to change: screen scraping HTML. In that case, being able to read through the algorithm and easily identify which parts need to change becomes very important. To illustrate, I dug up some old code that was used to scrape basketball scores from espn.com. It's a good example of an ugly pattern that had to be maintainable, since the HTML layout could change at any time.

    const string findGamesPattern = @"<div\s*class=""game""\s*id=""(?<gameID>\d+)-game""(?<content>.*?)<!--gameStatus\s*=\s*(?<gameState>\d+)-->";
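For context, a pattern like this is typically consumed through its named groups. The sketch below is an assumption about how that could look, not the original scraping code: the class `GameScraperDemo`, the helper `FindGames`, and the sample markup are invented, and `RegexOptions.Singleline` is added here so that `.` can span line breaks inside the `content` group.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class GameScraperDemo
{
    // The pattern from the post, unchanged.
    const string FindGamesPattern =
        @"<div\s*class=""game""\s*id=""(?<gameID>\d+)-game""(?<content>.*?)<!--gameStatus\s*=\s*(?<gameState>\d+)-->";

    // Hypothetical helper: returns "gameID:gameState" for each game block found.
    public static List<string> FindGames(string html)
    {
        var results = new List<string>();
        // Singleline (an assumption, not from the post) lets '.' match newlines.
        foreach (Match match in Regex.Matches(html, FindGamesPattern, RegexOptions.Singleline))
        {
            results.Add(match.Groups["gameID"].Value + ":" + match.Groups["gameState"].Value);
        }
        return results;
    }

    public static void Main()
    {
        // Invented markup shaped to fit the pattern; not real page content.
        string html =
            "<div class=\"game\" id=\"1001-game\">\n  <span>Home 98, Away 91</span>\n<!--gameStatus = 3-->\n" +
            "<div class=\"game\" id=\"1002-game\">\n  <span>Home 0, Away 0</span>\n<!--gameStatus=1-->";

        foreach (string game in FindGames(html))
        {
            Console.WriteLine(game); // prints 1001:3 then 1002:1
        }
    }
}
```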

Using ReadableRex:

    Pattern findGamesPattern = Pattern.With.Literal(@"<div")
        .WhiteSpace.Repeat.ZeroOrMore
        .Literal(@"class=""game""")
        .WhiteSpace.Repeat.ZeroOrMore
        .Literal(@"id=""")
        .NamedGroup("gameID", Pattern.With.Digit.Repeat.OneOrMore)
        .Literal(@"-game""")
        .NamedGroup("content", Pattern.With.Anything.Repeat.Lazy.ZeroOrMore)
        .Literal(@"<!--gameStatus")
        .WhiteSpace.Repeat.ZeroOrMore.Literal("=").WhiteSpace.Repeat.ZeroOrMore
        .NamedGroup("gameState", Pattern.With.Digit.Repeat.OneOrMore)
        .Literal("-->");

I think this would be much easier to maintain. Note that this library doesn't actually perform any regular expression operations; it simply provides another way to define regular expression patterns. You still need to use the System.Text.RegularExpressions.Regex object with the pattern you create. The Pattern type has an implicit conversion to System.String, so you can easily pass it to the methods/constructors on Regex.
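The implicit-conversion idea can be sketched with a tiny stand-in builder. This is not the ReadableRex source: `SketchPattern` and `SketchPatternDemo` are hypothetical names, only a handful of members are shown, and the `.Repeat` step is flattened into direct `Exactly` and `Optional` members to keep the sketch short.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical stand-in for illustration; NOT the ReadableRex implementation.
public sealed class SketchPattern
{
    private readonly StringBuilder _regex = new StringBuilder();

    // Entry point: each chain starts from a fresh, empty pattern.
    public static SketchPattern With
    {
        get { return new SketchPattern(); }
    }

    public SketchPattern AtBeginning { get { return Append("^"); } }
    public SketchPattern AtEnd { get { return Append("$"); } }
    public SketchPattern Digit { get { return Append(@"\d"); } }
    public SketchPattern Optional { get { return Append("?"); } }

    public SketchPattern Exactly(int count)
    {
        return Append("{" + count + "}");
    }

    // Escape the text and wrap it in a non-capturing group so that a
    // following quantifier applies to the whole literal.
    public SketchPattern Literal(string text)
    {
        return Append("(?:" + Regex.Escape(text) + ")");
    }

    private SketchPattern Append(string fragment)
    {
        _regex.Append(fragment);
        return this;
    }

    public override string ToString()
    {
        return _regex.ToString();
    }

    // This conversion is what lets a pattern be passed wherever Regex expects a string.
    public static implicit operator string(SketchPattern pattern)
    {
        return pattern.ToString();
    }
}

public static class SketchPatternDemo
{
    public static Regex BuildSsnCheck()
    {
        // The Regex(string) constructor accepts the builder via the implicit conversion.
        return new Regex(SketchPattern.With.AtBeginning
            .Digit.Exactly(3)
            .Literal("-").Optional
            .Digit.Exactly(2)
            .Literal("-").Optional
            .Digit.Exactly(4)
            .AtEnd);
    }

    public static void Main()
    {
        string threeDigits = SketchPattern.With.Digit.Exactly(3);
        Console.WriteLine(threeDigits);                            // \d{3}
        Console.WriteLine(BuildSsnCheck().IsMatch("123-45-6789")); // True
    }
}
```

Because the builder only produces a string, all matching behavior and options still come from `Regex` itself.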

Download the code or just the assembly DLL, give it a try, and tell me what you think. None of the method/property names are set in stone, so the syntax may change, but the approach will remain the same.

Saturday, October 07, 2006 1:15:27 PM (Central Daylight Time, UTC-05:00)

 

Akismet support in DasBlog

DasBlog 1.9 was recently released and contained many new features worth the upgrade. As a member of the development team, I've been running the nightly builds anyway, so I was surprised to read how much functionality people were missing out on by still running the previous public release.

One feature that I've never been satisfied with is the anti-spam support. We've had CAPTCHA support for some time, but it has been finicky, and as a frequent commenter myself, I find it an annoying solution. I've become spoiled by Thunderbird and its automatic spam filtering of my email, so I expect the same for my blog. When I became aware of Akismet, I was sold immediately. Akismet provides automatic spam filtering for blogs: you run comments to your site through their service and they tell you if they're junk. They provide a simple REST web service API to make it easy to integrate with your blog software of choice. Phil Haack made it even sweeter by wrapping the Akismet API in a nice C# object model for use in Subtext. By the glory of open source software, I was able to pluck the code out of the Subtext Subversion repository and drop it in the DasBlog repository. A little glue and UI polishing later, and DasBlog now supports intelligent, adaptive, and automatic comment moderation.

To be clear, Akismet support was added after the 1.9 release. If you download the official release, the Akismet options will not be available. I expect we will put out a 1.9 point release sometime in the next couple of months, but I can make no guarantees. If you want this feature as much as I did, you can install one of the nightly builds from http://dasblog.info/dbftp/ (get 1.9.6276 or later). There have been very few changes since the official release, so I don't think you need be concerned about stability. Go check it out and let me know what you think!

Thursday, October 05, 2006 8:18:01 PM (Central Daylight Time, UTC-05:00)

 

All content © 2010, josh
Joshua Flanagan
I am a software developer focused on continuous improvement in the .NET community

The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.