Recently one of our customers encountered a strange problem with their VoiceXML application. Their application had three pages: a start page, a dynamic page that the start page attempted to fetch, and an end page that would be fetched if the dynamic page took too long to process. Ideally, the following should have occurred whenever the dynamic page failed to return:
- The Plum IVR platform fetches the start page and sees the VoiceXML directive instructing it to fetch the dynamic page.
- The IVR platform tries to fetch the dynamic page but, after a short timeout, gives up.
- The platform then fetches the end page and plays the message telling the caller that the service is currently unavailable.
Instead, once the platform failed to fetch the dynamic page, it also failed to fetch the end page.
At first, we weren’t certain what the source of the problem was. We wrote a similar application on our own servers and failed to replicate the behavior. Fortunately, our customer’s setup did fail reliably, which gave us the number one tool for troubleshooting: a way to trigger the bug over and over.
After trying some pretty weak tools to debug the issue (including typing in “netstat -an” over and over), we broke out a packet sniffer. Specifically, we broke out tcpdump. With tcpdump, we were able to trace the HTTP sessions between the IVR and our customer’s web server. And what did we find? To fetch the end page, our platform was attempting to reuse the socket that was still hung on the dynamic script. This, of course, wouldn’t work. It’s all well and good to reuse a socket for another request if the previous request has completed. Otherwise, the new request falls on deaf ears at the customer’s web server, which is still busy with the old one.
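To make the reuse rule concrete, here is a minimal Python sketch of a connection pool that hands out an existing socket only when it is idle. This is an illustration of the rule, not how libcurl or our platform implements it; `Connection`, `ConnectionPool`, and the `in_flight` flag are hypothetical names invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Connection:
    """Hypothetical stand-in for one pooled socket to a host."""
    host: str
    in_flight: bool = False  # True while a request still awaits its response


@dataclass
class ConnectionPool:
    """Hypothetical pool; real HTTP libraries track far more state."""
    connections: list = field(default_factory=list)

    def acquire(self, host: str) -> Connection:
        # Reuse a socket only if it goes to the same host AND is idle.
        for conn in self.connections:
            if conn.host == host and not conn.in_flight:
                conn.in_flight = True
                return conn
        # Otherwise open a fresh connection instead of queuing behind a hung one.
        conn = Connection(host=host, in_flight=True)
        self.connections.append(conn)
        return conn

    def release(self, conn: Connection) -> None:
        # Called once the response has been fully read.
        conn.in_flight = False
```

With this rule, a request for the end page issued while the dynamic page is still hung gets a second socket, and the first socket becomes reusable only after `release` is called on it.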
Having thus isolated the problem to unexpected reuse of a busy socket, we knew the problem actually came from one of the libraries against which our platform is compiled: libcurl.
Now before I continue, it should be said that libcurl is awesome. We used to use libwww and found it to be an unmanageable mess. libcurl is simple, fast, full-featured, and well-documented. But sometimes even the best software has a bug or two. Well, in this case just one.
We decided to write a small test application in PHP that would, using the curl interface embedded in PHP, attempt to fetch those three pages in succession. This is the number two tool for troubleshooting: a simplified analogue of the problem that can reliably reproduce the bug. Sure enough, our test application reproduced the issue. And because it was written in a scripting language like PHP, we could insert all manner of debugging information into the test script and immediately retest.
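For readers who want to build a similar harness, the shape of such a test can be sketched in Python using only the standard library. This is not our original PHP script, and it will not reproduce the libcurl bug, because `urllib` opens a new connection for every request. The paths `/start`, `/dynamic`, and `/end`, the local stub server, and the timing constants are all assumptions made for the example.

```python
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

SLOW_SECONDS = 1.0   # how long the fake dynamic page stalls (assumed value)
FETCH_TIMEOUT = 0.2  # client-side fetch timeout (assumed value)

# Ignore any proxy settings so requests go straight to the local stub server.
OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/dynamic":
            time.sleep(SLOW_SECONDS)  # simulate the script that takes too long
        body = self.path.encode()
        try:
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        except OSError:
            pass  # the client already gave up and closed the socket

    def log_message(self, *args):
        pass  # keep the output quiet


def fetch(url):
    """Return the HTTP status, or the exception name if the fetch fails."""
    try:
        with OPENER.open(url, timeout=FETCH_TIMEOUT) as resp:
            resp.read()
            return resp.status
    except OSError as exc:  # covers URLError and socket timeouts
        return type(exc).__name__


def run_harness():
    """Fetch the three pages in succession and report how each one went."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = f"http://127.0.0.1:{server.server_address[1]}"
    try:
        return {page: fetch(base + page) for page in ("/start", "/dynamic", "/end")}
    finally:
        server.shutdown()
        server.server_close()
```

Calling `run_harness()` returns a dictionary keyed by page. In a healthy client, the start and end pages report 200 and only the dynamic page reports a timeout, which is the behavior our customer should have seen.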
With debugging turned on, we discovered that libcurl was quite intentionally reusing the socket. Since libcurl itself was printing the debugging messages, we searched the libcurl source code (because it’s Open Source) for where this debugging message came from. And what did we discover? A slight flaw in the libcurl logic where they should’ve typed “!=” (i.e. not equal to) instead of “==”. We changed one character and “voila” we fixed the bug.
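To show how much damage one flipped operator can do, here is a purely hypothetical Python sketch. It is not the libcurl source, which we are not reproducing here; the `State` enum and both function names are invented. It only illustrates how typing “==” where “!=” belongs turns a "safe to reuse" check into its opposite.

```python
from enum import Enum


class State(Enum):
    """Hypothetical socket states, invented for this illustration."""
    IDLE = "idle"            # previous response fully received
    IN_FLIGHT = "in flight"  # still waiting on a response


def can_reuse_buggy(state: State) -> bool:
    # Inverted test: approves exactly the sockets that are still busy.
    return state == State.IN_FLIGHT


def can_reuse_fixed(state: State) -> bool:
    # One character changed: a socket is reusable unless a request is pending.
    return state != State.IN_FLIGHT
```

The buggy version happily hands back the hung socket and rejects the idle ones, so the mistake only shows up when a request is still outstanding.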
Mind you, after the “voila” come a couple of days of testing, building a patch kit, scheduling system maintenance across our infrastructure, and deploying the fix to production. But we’re discussing debugging today, not operations.
The marketing moral of this story? We own the platform source code in-house, which allows us to apply patches within days of identifying a bug. We also take advantage of open source libraries, which allows us to fix bugs in the third-party software to which our platform links. The benefits of being able to quickly and directly modify our IVR platform code aren’t confined to bug fixes either. We have, in the past, added VoiceXML extensions based on customer requests. When our marketing team says that we own our IVR platform, now you know why it’s important.