This is a story about a problem that can happen if your web application has operations that will require a long time (let’s say more than 30 seconds) to generate a response for the browser.

Figuring out what is happening, could not always be so easy. Especially if you didn’t wrote the application and the logging system does not help you. Here’s my story.

The symptoms.
Some days ago I’ve been entrusted to solve a problem on an application deployment. The problem was that, during an operation involving a complex database transaction during several minutes, the application was crashing giving an Oracle exception in the logs: ORA-00054: resource busy and acquire with NOWAIT specified.
The problem was given only on a specific deployment, while in many other environments the operations was completing successfully.

Initially I thought there were some external process that was locking the table (like, some user with an open TOAD on the table), but with the help of a DBA, we removed all the locks and we ran the process while nobody else was connected. Monitoring the database while running, we found more strange things: there were more connections open doing stuff on the database.
Also from the application log (very very verbose) we found that the operations were made in wrong order. The thread number was not included in the log, so I was thinking about wrong order of operations in the code. And this was scary because I was unable to understand how this could possibly work in other environments.
After a careful code review, I concluded that the code was doing the operations in the correct order. Code don’t lie, even when it’s bad code.
Looking at the logs, I noticed a strange things: similar log lines were repeated at distance of almost exactly 30 seconds.
Summarizing, I needed an hypothesis that explains all the following weirdness:

  • Problem was happening only on a specific deployment, while not happening in other environments.
  • The code was implementing database operations in the right order, but from the log I was getting things executed in a different order.
  • The number of database connections for that transaction was higher (3x) than the number I was expecting to be.
  • Looking carefully at the logs, there were similar messages distanced almost exactly by 30 seconds.
  • Locks on database resources that should not occur.

The solution of the mystery
What actually is happening is that the browser sends a single request, but the application server receives many requests. The application was not checking that the operation was already running, so many requests started many operations that was requiring exclusive locks on the database, causing the application fail with that error.

But, how is it possible that a single browser request can result in many requests on the server? I supposed that there were an intermediary server that was doing that. And there were: Apache.

This is what it was happening:

  1. The user’s browser sends the request for the operation.
  2. Apache receives the request and forwards the same request to the application server.
  3. Apache waits for the application server to respond.
  4. After 30 seconds Apache doesn’t get the response from the application server, so Apache thinks that the application server has not received the request or that the server has gone down.
  5. Some fail-over mechanism implemented in Apache makes it to reissue the request to the application server. So in the log messages you can guess that a second request is received, after almost exactly 30 seconds.
  6. steps 4 and 5 repeats again, so a third request thread start on the application server.
  7. Those parallel threads then make the application server to try to lock database resources that were already locked by previous requests.
  8. After the 3rd request goes in timeout the Apache decides that no applications server are alive to respond, and returns an error to the browser.
  9. The browser display the error returned by Apache, but the application server still continue to run the started threads, of course failing.

So, the solution can be, to remove the Apache server in the middle that is causing this behavior, or reconfigure it to have a longer timeout to allow the first thread to complete, or disabling this behavior at all.

Also, the application should not allow multiple threads to run exclusive locking operations. It should implement a sort of synchronization to give the user an error message saying that the operation is already running and cannot be started again until it is finished. But this does not solve the timeout problem on the middle server, as also the first user could get the warning message.

I noticed previously that also Weblogic module for Apache implements a similar fail-over mechanism. So, be aware of that, and consider the case.

Now, it just remains to convince the deployment guys that we are not responsible of what they put between the application and the browser.


5 Responses to “One request, multiple requests. When failover makes your application fail…”  

  1. 1 Armond

    Good catch! Thanks for sharing…

  2. 2 Bruno Bossola

    You’re definitely a problem solver! Very good catch, nothing else to say but “great!”

  3. 3 sandeep

    Nice stuff.
    Thanks for sharing the experiencing.

  4. 4 coldwar

    Greatttttttttttttttttttt :)

  5. 5 rbrettjuergens

    I’ve found that the browser does make multiple requests. Below is the results from
    Google Chrome. No matter what your requesting it makes 2 request on the initial connection.
    Adding latency to the server increased the requests up to 4 (chrome). At least that is what
    my test find.
    The output is from simple threaded c https server code. It prints when a request is made,
    when the headers are parsed, and what the request was. The more latency I added to the
    header parsing the more requests the browser made. Firefox limited its request to three.
    IE made a two request each time it received a Connection Closed response without content.
    The numbers in parens are the threads id number.

    ——-start first request——————–
    ———————————————
    Connection from: 127.0.0.1 (45533)
    Creating Thread
    ———————————————
    Connection from: 127.0.0.1 (45535)
    Creating Thread
    ———————————————
    ———————————————
    parsed headers (45533)
    cert request
    ———————————————
    ———————————————
    parsed headers (45535)
    cert request
    ——-end first request———————
    ———————————————
    Connection from: 127.0.0.1 (45537)
    Creating Thread
    ———————————————
    ———————————————
    parsed headers (45537)
    page request

Leave a Reply

XHTML: You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>



Calendar

September 2007
M T W T F S S
« Aug   Oct »
 12
3456789
10111213141516
17181920212223
24252627282930

Follow me

twitter flickr LinkedIn feed

Subscribe by email

Enter your email address:

Archives


Categories

Tag Cloud


Listening