HTML and Regex

So one of our interns came up to me this morning.

Him: I’m parsing HTML using a SAX parser and it’s blowing up.
Me: Really? No surprise there.
Him: I want to use Regex to fix it.
Stupid Me: Hmm, prolly not the best idea, but let’s try.

Thirty minutes later, after gouging out my eyes, I looked for other options.  First off, HTML is NOT XML, and no matter how hard you try to make it XML, it just won’t happen.  Secondly, if you expect ANY kind of consistency in your HTML, you’re prone to kill yourself within a week.

This is painful because HTML is not a syntactically strict language.  It’s a mish-mash markup format that is supposed to “just work”.  As soon as you’re happy with your regular expression for one task, you realize that the next task will undo the last.  It. Is. Maddening.  Turns out there are amazing tools for fixing just this issue!
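To make the pain concrete, here is a minimal sketch in Python (not the PHP we were actually using; the sample markup and the names inner_div_regex, DivTextCollector, and outer_div_text are made up for illustration).  A non-greedy regex stops at the first closing tag it finds, so nested elements trip it up, while an event-based parser from the standard library can keep track of depth.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample markup: a div nested inside another div.
SAMPLE = '<div class="outer">Hello <div class="inner">nested</div> world</div>'


def inner_div_regex(html):
    """Naive approach: grab whatever sits between <div ...> and the next </div>."""
    match = re.search(r"<div[^>]*>(.*?)</div>", html, re.S)
    return match.group(1) if match else None


class DivTextCollector(HTMLParser):
    """Collects the text found inside any <div>, tracking nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def outer_div_text(html):
    """Feed the markup through the parser and join the collected text."""
    parser = DivTextCollector()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)
```

On this sample the regex hands back a fragment that still contains the opening tag of the inner div, because it stopped at the first closing tag.  The parser-based version returns the text of the whole outer div.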

Enter Tidy HTML!

Ah, such a glorious little package that even has a great PHP extension, making the lives of parser-writers that much better.  But there’s a problem: if Tidy isn’t installed and sysops won’t let you install it, you’re SOL.  Tidy is a third-party package that might not be trusted by some ops teams, and for good reason.  Asking for it on a highly tuned production environment sounds like you want to customize 40 production servers so that one tiny little line of code will work.  If you install a new package every time you need a new tool, you walk down a long road towards bloat and feature creep.  So let’s go back to Regex, right?

Enter StackOverflow (and others) for the win!

StackOverflow is full of amazing, hilarious people.  The above link is one of my favorites as the user attempts to explain that using regex to parse HTML might not be the best idea.  These people have already tried to solve the problem with regular expressions, and they came up lacking.  Their pain is a lesson to us all.

Enter htmLawed and HTMLPurifier!

Ah, here we finally have the memory-chomping, slow and bloated libraries we’d expect from an HTML parsing machine.  Are they perfect? No.  But they get the job done when you can’t edit your system.  They’re based on a mixture of character-based parsing and regex callbacks that iterate through markup in phases.  They are fully testable; in fact, HTMLPurifier has a well-rounded test suite.  They aren’t nearly as fast as C-based parsers, but they work when you need them.
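For a feel of what a “regex callback” phase looks like, here is a toy sketch in Python.  It is not how either library is actually implemented, and it is nowhere near a safe sanitizer; the whitelist and the names ALLOWED_TAGS, TAG_PATTERN, filter_tag, and strip_disallowed_tags are assumptions made up for this example.  The idea is that the regex only finds tag-shaped tokens, and a callback decides what to do with each one.

```python
import re

# Hypothetical whitelist; real sanitizers ship far more elaborate rules.
ALLOWED_TAGS = {"p", "b", "i", "em", "strong"}

# Matches an opening or closing tag and captures the slash and the tag name.
TAG_PATTERN = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def filter_tag(match):
    """Callback: keep whitelisted tags (minus attributes), drop the rest."""
    slash, name = match.group(1), match.group(2).lower()
    if name in ALLOWED_TAGS:
        return "<{}{}>".format(slash, name)
    return ""


def strip_disallowed_tags(html):
    """One 'phase': run the callback over every tag-shaped token."""
    return TAG_PATTERN.sub(filter_tag, html)
```

Note that this single pass removes a script tag but leaves the text that was inside it, which hints at why real libraries need several phases and a lot more code.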

Although I didn’t comprehensively solve the intern’s problem, it was a fun journey fraught with excellent wordage.  Many thanks to those smarter than I for the words of wisdom and alternatives to the madness!

PHP and curl_multi_select()

I’m documenting this in hope it helps someone one day.

When using curl_multi_select($handle, $timeout) to wait for activity on any of the cURL handles in a multi handle, the second parameter should be a float value in seconds.  When it is omitted, PHP’s documentation says it defaults to 1.0 seconds.  However, if you get adventurous and think that you can set the value higher for sessions that you think will take a long time, think again.  In my experience, values higher than 4.0 seconds have adverse effects.

I was writing an application that is expected to have long-running cURL connections.  I thought setting the timeout to 10 seconds would be acceptable.  After all, it’s a timeout, not a polling interval… right?  WRONG.  It seems that if the curl_multi_select() call has a timeout that is much longer than the request times, then you run the risk of the cURL handle being garbage collected before you even get at it with curl_multi_info_read().  This means that when you eventually get something, it will act as though every connection has timed out, but it will not have an error code because technically it didn’t time out.

If you are a user of RollingCurl, a Google Code-hosted project that has been forked many times on GitHub, you’ll see a RollingCurl->timeout instance variable.  This is the timeout to use for curl_multi_select().  Be careful when setting it!  I’ve heavily edited my RollingCurl code to account for this and for other problems, such as a bad memory leak on error in the callback, and to add a per-request callback feature.

There is a huge difference in multi-threaded applications between polling interval and timeout.  It seems that for this usage, the word “timeout” was a bad choice.  A timeout should be a maximum time to wait before returning, not a minimum time.
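Here is a small sketch of the “maximum time to wait” behaviour I expected, using Python’s select() on a local socket pair rather than PHP’s cURL functions.  It is only an analogy, not a reproduction of the problem above, and the names wait_for_readable and demo are made up for this example.

```python
import select
import socket
import time


def wait_for_readable(sock, timeout):
    """Block until sock is readable or `timeout` seconds pass.

    Returns (is_ready, seconds_waited).
    """
    start = time.monotonic()
    readable, _, _ = select.select([sock], [], [], timeout)
    return bool(readable), time.monotonic() - start


def demo():
    # socketpair() gives two connected sockets without touching the network.
    left, right = socket.socketpair()
    try:
        # Nothing to read yet: select() waits out the full (short) timeout.
        idle_ready, idle_wait = wait_for_readable(left, 0.2)

        # Data is already waiting: select() returns right away even though
        # the timeout is a generous 10 seconds.
        right.sendall(b"done")
        busy_ready, busy_wait = wait_for_readable(left, 10.0)
        return idle_ready, idle_wait, busy_ready, busy_wait
    finally:
        left.close()
        right.close()
```

With a true timeout, the second call comes back as soon as the data is there, no matter how large the number is.  A polling interval would make you sit through the whole wait every time.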

I’ve learned my lesson — my curl_multi_select() calls are now running with the default of 1.0 seconds and running fine.  Sad it took so long to find this problem.

Startups in San Diego

The startup community in San Diego is hard. I don’t mean this as a woeful cry, but as an admission of an uphill battle.

San Diego is no Philadelphia, Boston, Boulder, or San Francisco.  It almost seems like the people in San Diego made their fortune, and now they’re out surfing.  Don’t get me wrong: I think enjoying life is fantastic, and I intend to do the same.  But at some point, the entrepreneurs in other startup cultures came back and started reinvesting in the local community in time and money.

Having a great idea is cool, but any idea needs nurturing: a culture of feedback and commonality of purpose.  Here’s to working in San Diego, you entrepreneurs.  Here’s to you.  Let’s make San Diego a better place!

Live CSS Editing

If you code CSS like I do, you’re spending way too much time in Chrome’s Developer Tools pushing the up-and-down arrow keys.  The problem is that when you’ve changed a thousand styles, and you’re just starting to feel good about what you’ve accomplished today, your pesky fingers kick the Cmd+R combo and voila! your changes are gone.  Congratulations!

My first instinct was to try out a web service.  I quickly found sites like ScratchPad, CSSizer, and JSFiddle.  These guys are awesome, no doubt, but I’m lazy.  I don’t want to have to copy-pasta my code everywhere!

So I turned to proprietary apps, like the $60 Stylizer.  They’ve been taking the Windows side by storm, but I wasn’t impressed by the Mac side.

Then, this fantastic blog told me about a magical place in Google Chrome that saves your editing history for stylesheets.  Chrome’s Help pages even outlined it.  Sadly, that feature is either from a particular build (that blog post was from 2 years ago) or has since been broken.

So… Coda it is!  Coda provides a way to preview CSS changes in real time.  My only gripe so far has been that if you edit a CSS document that contains an @import statement, the @import is ignored!  Sad day, but easy enough to work around for now!