HTML and Regex

So one of our interns came up to me this morning.

Him: I’m parsing HTML using a SAX parser and it’s blowing up.
Me: Really? No surprise there.
Him: I want to use Regex to fix it.
Stupid Me: Hmm, prolly not the best idea, but let’s try.

Thirty minutes later, after gouging out my eyes, I looked for other options.  First off, HTML is NOT XML, and no matter how hard you try to make it XML, it just won’t happen.  Secondly, if you expect ANY kind of consistency in your HTML, you’ll be tearing your hair out within a week.
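To make the "HTML is NOT XML" point concrete, here is a minimal Python sketch (not the intern’s actual code, which I’m not reproducing here).  It feeds the same everyday markup to a strict SAX parser and to Python’s forgiving html.parser.  The sample string and the helper names sax_accepts and html_tags are made up for illustration.

```python
import xml.sax
from html.parser import HTMLParser

# Everyday HTML: an unquoted attribute value and a <br> that is never closed.
SAMPLE = "<p class=intro>Hello<br>world</p>"


def sax_accepts(markup: str) -> bool:
    """Return True if a strict XML SAX parser can read the markup."""
    try:
        xml.sax.parseString(markup.encode("utf-8"), xml.sax.ContentHandler())
        return True
    except xml.sax.SAXParseException:
        # The parser gives up at the first well-formedness error.
        return False


class TagCollector(HTMLParser):
    """Records every start tag the tolerant HTML parser sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def html_tags(markup: str) -> list:
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


if __name__ == "__main__":
    print("SAX accepts sample:", sax_accepts(SAMPLE))
    print("html.parser saw tags:", html_tags(SAMPLE))
```

The SAX parser only cooperates once the markup is rewritten as well-formed XML (quoted attributes, self-closed br), which is exactly the consistency real-world HTML never promises.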

This is painful because HTML is not a syntactically strict language.  It’s a mish-mash markup format that is supposed to “just work”.  As soon as you’re happy with your regular expression for one task, you realize that the next task will undo the last.  It. Is. Maddening.  Turns out there are amazing tools for fixing just this issue!
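Here is roughly what that whack-a-mole looks like, as a small Python sketch.  The pattern HREF_RE, the VARIANTS list, and the helper names are all invented for this example; the point is only that a regex written for one spelling of a link misses the other spellings a browser happily accepts, while a real HTML parser shrugs them off.

```python
import re
from html.parser import HTMLParser

# A regex that looks perfectly reasonable for the first link you meet.
HREF_RE = re.compile(r'<a href="([^"]*)">')

# Four spellings of a link that browsers all treat as valid.
VARIANTS = [
    '<a href="/one">one</a>',              # the case the regex was written for
    "<a href='/two'>two</a>",              # single quotes
    "<A HREF=/three>three</A>",            # upper case, unquoted value
    '<a class="x" href="/four">four</a>',  # another attribute comes first
]


def links_by_regex(markup: str) -> list:
    return HREF_RE.findall(markup)


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags, however they are written."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def links_by_parser(markup: str) -> list:
    collector = LinkCollector()
    collector.feed(markup)
    collector.close()
    return collector.links


if __name__ == "__main__":
    page = "\n".join(VARIANTS)
    print("regex :", links_by_regex(page))
    print("parser:", links_by_parser(page))
```

You can keep patching the regex for each variant, but every patch makes it harder to read and easier to break, which is the maddening part.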

Enter Tidy HTML!

Ah, such a glorious little package that even has a great PHP extension, making the lives of parser-writers that much better.  But there’s a problem: if Tidy isn’t installed and sysops won’t let you install it, you’re SOL.  Tidy is a third-party package that might not be trusted by some ops teams, and for good reason.  Asking for it on a highly tuned production environment sounds like asking to customize 40 production servers so that one tiny little line of code will work.  If you install a new package every time you need a new tool, you walk down a long road towards bloat and feature creep.  So let’s go back to regex, right?

Enter StackOverflow (and others) for the win!

StackOverflow is full of amazing, hilarious people.  The above link is one of my favorites: the user attempts to explain that using regex to parse HTML might not be the best idea.  These people have already tried to solve the problem with regular expressions, and they came up short.  Their pain is a lesson to us all.

Enter HTMLLawed and HTMLPurifier!

Ah, here we finally have the memory-chomping, slow and bloated libraries we’d expect from an HTML parsing machine.  Are they perfect? No.  But they work when you can’t edit your system.  They’re based on a mixture of character-based parsing and regex callbacks that iterate through markup in phases.  They are fully testable; in fact, HTMLPurifier has a well-rounded test suite.  They aren’t nearly as fast as C-based parsers, but they get the job done when you need it.
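For a feel of what a "regex callback" phase means, here is a toy sketch in Python.  To be clear, this is NOT the actual code of HTMLLawed or HTMLPurifier, and it is not a safe sanitizer; the ALLOWED set, TAG_RE, and strip_disallowed are names I made up.  It shows a single pass: a regex finds each tag, and a callback decides what to emit in its place.

```python
import re

# Hypothetical whitelist for this sketch only.
ALLOWED = {"p", "b", "i", "em", "strong"}

# Matches an opening or closing tag: optional slash, tag name, then the rest.
TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)([^>]*)>")


def _filter_tag(match: re.Match) -> str:
    """Callback run once per tag: normalize allowed tags, drop the others."""
    closing, name, _attributes = match.groups()
    name = name.lower()
    if name not in ALLOWED:
        return ""  # drop the tag itself; any text inside it is left alone
    # Toy simplification: every attribute is thrown away.
    return f"<{closing}{name}>"


def strip_disallowed(markup: str) -> str:
    """One 'phase': rewrite every tag in the markup through the callback."""
    return TAG_RE.sub(_filter_tag, markup)


if __name__ == "__main__":
    dirty = '<P class="x">hi <script>x</script><b>there</b></P>'
    print(strip_disallowed(dirty))
```

A real library stacks many phases like this (balancing tags, checking attributes, fixing entities) and pairs them with character-level parsing, which is where the memory and CPU go.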

Although I didn’t comprehensively solve the intern’s problem, it was a fun journey fraught with excellent wordage.  Many thanks to those smarter than I for the words of wisdom and alternatives to the madness!