Category Archives: Code

HTML and Regex

So one of our interns came up to me this morning.

Him: I’m parsing HTML using a SAX parser and it’s blowing up.
Me: Really? No surprise there.
Him: I want to use Regex to fix it.
Stupid Me: Hmm, prolly not the best idea, but let’s try.

Thirty minutes later, after gouging out my eyes, I looked for other options.  First off, HTML is NOT XML, and no matter how hard you try to make it XML, it just won’t happen.  Secondly, if you expect ANY kind of consistency in your HTML, you’re prone to kill yourself within a week.

The reason this is painful is that HTML is not a syntactically strict language.  It’s a mish-mash markup format that is supposed to “just work”.  As soon as you’re happy with your regular expression for one task, you realize that the next task will undo the last.  It. Is. Maddening.  Turns out there are amazing tools for fixing just this issue!
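A toy illustration of the problem: a tag-stripping regex that looks fine on tidy markup quietly fails the moment real-world HTML shows up. (The pattern and sample strings here are my own, not from the intern’s code.)

```php
<?php
// A naive "strip the tags" regex: works on well-behaved markup...
$simple = preg_replace('/<\/?\w+>/', '', '<b>bold</b> and <i>italic</i>');
echo $simple, "\n"; // bold and italic

// ...and silently fails on realistic HTML: attributes and a ">" inside
// an attribute value defeat it, so the opening <b> and <img> survive.
$messy = preg_replace(
    '/<\/?\w+>/',
    '',
    '<b class="x">bold</b> <img src="a.png" alt="y > x">'
);
echo $messy, "\n";
```

Patch the pattern to allow attributes and you’ll break on comments, CDATA, unquoted values… and so on, forever.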

Enter HTML Tidy!

Ah, such a glorious little package that even has a great PHP extension, making the lives of parser-writers that much better.  But there’s a problem: if Tidy isn’t installed and sysops won’t let you install it, you’re SOL.  The problem here is that Tidy is a third-party package that might not be trusted by some ops teams, and for good reason.  Installing it on a highly-tuned production environment means customizing 40 production servers just so one tiny little line of code will work.  If you install a new package every time you need a new tool, you walk down a long road toward bloat and feature-creep.  So let’s go back to Regex, right?
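For the lucky folks whose ops teams do allow it, a minimal sketch of what the tidy extension buys you, guarded so it degrades gracefully where the extension isn’t installed (the whole point of the paragraph above). The sample input and config option are my own choices:

```php
<?php
// Repair malformed markup with the tidy extension, if it's available.
$repaired = null;
if (extension_loaded('tidy')) {
    $repaired = tidy_repair_string(
        '<b>unclosed <i>tags',
        ['show-body-only' => true]  // emit just the fragment, not a full page
    );
    // Tidy balances the tags for us: unclosed <b> and <i> get closed.
    echo $repaired, "\n";
}
```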

Enter StackOverflow (and others) for the win!

StackOverflow is full of amazing, hilarious people.  The above link is one of my favorites as the user attempts to explain that using regex to parse HTML might not be the best idea.  These people have already tried to solve the problem with regular expressions, and they came up lacking.  Their pain is a lesson to us all.

Enter htmLawed and HTML Purifier!

Ah, here we finally have the memory-chomping, slow, bloated libraries we’d expect from an HTML-parsing machine.  Are they perfect? No.  But they get the job done when you can’t modify your system.  They’re based on a mixture of character-based parsing and regex callbacks that iterate through the markup in phases.  They’re fully testable; in fact, HTML Purifier ships with a well-rounded test suite.  They aren’t nearly as fast as C-based parsers, but they work when you need them.
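For reference, HTML Purifier’s basic usage is only a few lines. This sketch assumes the htmlpurifier package is on the include path (e.g. via its bundled autoloader); if it isn’t, the example simply skips itself:

```php
<?php
// Minimal HTML Purifier usage: fix malformed markup AND strip unsafe bits.
$clean = null;
if (@include_once 'HTMLPurifier.auto.php') {
    $config = HTMLPurifier_Config::createDefault();
    $purifier = new HTMLPurifier($config);
    // Unbalanced, script-bearing input comes out as balanced, safe markup.
    $clean = $purifier->purify('<b>bold<script>evil()</script>');
    echo $clean, "\n";
}
```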

Although I didn’t comprehensively solve the intern’s problem, it was a fun journey fraught with excellent wordage.  Many thanks to those smarter than I for the words of wisdom and alternatives to the madness!

PHP and curl_multi_select()

I’m documenting this in hope it helps someone one day.

When using curl_multi_select($multi_handle, $timeout) to wait for the next active/completed cURL handle, the second parameter should be a float value.  When omitted, PHP’s documentation says it defaults to 1.0 seconds.  However, if you get adventurous and think you can set the value higher for sessions you expect to take a long time, think again.  Values higher than 4.0 seconds will have adverse effects.

I was writing an application that is expected to have long-running cURL connections, so I thought setting the timeout to 10 seconds would be acceptable.  After all, it’s a timeout, not a polling interval… right?  WRONG.  It seems that if the curl_multi_select() call has a timeout much longer than the request times, you run the risk of the cURL handle being garbage collected before you even get at it with curl_multi_info_read().  This means that when you eventually get something, it will act as though every connection has timed out, but there will be no error code because technically it didn’t time out.
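The safe pattern is to keep the select timeout short and drain completed handles with curl_multi_info_read() on every pass of the loop. Here is a minimal sketch of that loop; to keep it self-contained it fetches file:// URLs pointing at temp files rather than real HTTP endpoints (that substitution is mine):

```php
<?php
// Build two trivially fetchable file:// URLs so the example needs no network.
$paths = [];
foreach (['alpha', 'beta'] as $body) {
    $p = tempnam(sys_get_temp_dir(), 'cm');
    file_put_contents($p, $body);
    $paths[] = $p;
}

$mh = curl_multi_init();
foreach ($paths as $p) {
    $ch = curl_init('file://' . $p);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
}

$results = [];
do {
    $status = curl_multi_exec($mh, $active);
    if ($active) {
        // Keep the select timeout at the 1.0s default: it behaves more like
        // a polling interval than a per-request timeout.
        curl_multi_select($mh, 1.0);
    }
    // Drain completed transfers promptly, before their handles can be reaped.
    while ($info = curl_multi_info_read($mh)) {
        $results[] = curl_multi_getcontent($info['handle']);
        curl_multi_remove_handle($mh, $info['handle']);
        curl_close($info['handle']);
    }
} while ($active && $status == CURLM_OK);
curl_multi_close($mh);

sort($results);
echo implode(',', $results), "\n";
```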

If you are a user of RollingCurl, a Google Code-hosted project that has been forked many times on GitHub, you’ll see a RollingCurl->timeout instance variable.  This is the timeout it uses for curl_multi_select().  Be careful when setting it!  I’ve heavily edited my RollingCurl code to account for this and other problems, like a bad memory leak when a callback errors, and to add a per-request callback feature.

There is a huge difference in multi-threaded applications between polling interval and timeout.  It seems that for this usage, the word “timeout” was a bad choice.  A timeout should be a maximum time to wait before returning, not a minimum time.

I’ve learned my lesson — my curl_multi_select() calls are back to the default of 1.0 seconds and running fine.  Sad it took so long to find this problem.