One request, multiple requests. When failover makes your application fail…
5 Comments Published September 3rd, 2007 in Java, Software, Web Tags: .This is a story about a problem that can happen if your web application has operations that will require a long time (let’s say more than 30 seconds) to generate a response for the browser.
Figuring out what is happening, could not always be so easy. Especially if you didn’t wrote the application and the logging system does not help you. Here’s my story.
The symptoms.
Some days ago I’ve been entrusted to solve a problem on an application deployment. The problem was that, during an operation involving a complex database transaction during several minutes, the application was crashing giving an Oracle exception in the logs: ORA-00054: resource busy and acquire with NOWAIT specified.
The problem was given only on a specific deployment, while in many other environments the operations was completing successfully.
Initially I thought there were some external process that was locking the table (like, some user with an open TOAD on the table), but with the help of a DBA, we removed all the locks and we ran the process while nobody else was connected. Monitoring the database while running, we found more strange things: there were more connections open doing stuff on the database.
Also from the application log (very very verbose) we found that the operations were made in wrong order. The thread number was not included in the log, so I was thinking about wrong order of operations in the code. And this was scary because I was unable to understand how this could possibly work in other environments.
After a careful code review, I concluded that the code was doing the operations in the correct order. Code don’t lie, even when it’s bad code.
Looking at the logs, I noticed a strange things: similar log lines were repeated at distance of almost exactly 30 seconds.
Summarizing, I needed an hypothesis that explains all the following weirdness:
- Problem was happening only on a specific deployment, while not happening in other environments.
- The code was implementing database operations in the right order, but from the log I was getting things executed in a different order.
- The number of database connections for that transaction was higher (3x) than the number I was expecting to be.
- Looking carefully at the logs, there were similar messages distanced almost exactly by 30 seconds.
- Locks on database resources that should not occur.
The solution of the mystery
What actually is happening is that the browser sends a single request, but the application server receives many requests. The application was not checking that the operation was already running, so many requests started many operations that was requiring exclusive locks on the database, causing the application fail with that error.
But, how is it possible that a single browser request can result in many requests on the server? I supposed that there were an intermediary server that was doing that. And there were: Apache.
This is what it was happening:
- The user’s browser sends the request for the operation.
- Apache receives the request and forwards the same request to the application server.
- Apache waits for the application server to respond.
- After 30 seconds Apache doesn’t get the response from the application server, so Apache thinks that the application server has not received the request or that the server has gone down.
- Some fail-over mechanism implemented in Apache makes it to reissue the request to the application server. So in the log messages you can guess that a second request is received, after almost exactly 30 seconds.
- steps 4 and 5 repeats again, so a third request thread start on the application server.
- Those parallel threads then make the application server to try to lock database resources that were already locked by previous requests.
- After the 3rd request goes in timeout the Apache decides that no applications server are alive to respond, and returns an error to the browser.
- The browser display the error returned by Apache, but the application server still continue to run the started threads, of course failing.
So, the solution can be, to remove the Apache server in the middle that is causing this behavior, or reconfigure it to have a longer timeout to allow the first thread to complete, or disabling this behavior at all.
Also, the application should not allow multiple threads to run exclusive locking operations. It should implement a sort of synchronization to give the user an error message saying that the operation is already running and cannot be started again until it is finished. But this does not solve the timeout problem on the middle server, as also the first user could get the warning message.
I noticed previously that also Weblogic module for Apache implements a similar fail-over mechanism. So, be aware of that, and consider the case.
Now, it just remains to convince the deployment guys that we are not responsible of what they put between the application and the browser.
Search
Calendar
| M | T | W | T | F | S | S |
|---|---|---|---|---|---|---|
| « Aug | Oct » | |||||
| 1 | 2 | |||||
| 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| 10 | 11 | 12 | 13 | 14 | 15 | 16 |
| 17 | 18 | 19 | 20 | 21 | 22 | 23 |
| 24 | 25 | 26 | 27 | 28 | 29 | 30 |
Archives
Categories
- Android (3)
- Apple (27)
- Books (7)
- Eclipse (14)
- Errors (4)
- Firefox (7)
- Git (2)
- Hardware (17)
- Horror Code (8)
- Internet (21)
- Java (102)
- JavaScript (9)
- Life, universe and everything (46)
- Lifehacks (25)
- Linux (51)
- Opinions (26)
- OSX (6)
- Python (1)
- Software (31)
- Speeches and Conferences (8)
- Unix (4)
- Web (23)
- Windows (19)
Tag Cloud
Android apple architecture Bash configuration CSS Development Düsseldorf Eclipse Git Google Hardware hdr How-To Java JAXB job junit Karmic Linux lion MacBook music Open Source Opinion oracle OSX patterns Pitfalls Practices Resume Security Software Suspend TDD Testing tip tonemapped Tricks Ubuntu unix video Web Workaround XML
WP Cumulus Flash tag cloud by Roy Tanck and Luke Morton requires Flash Player 9 or better.
Blog License
Blogs I like
Books on the desk
Friends' Blogs
- Antonio Terreno & Valter Bernardini
- Bruno Bossola
- Daniele Galluccio
- Domenico Ventura
- Ed Schepis
- Fabrizio Gianneschi
- Luca Grulla
- Luigi Zanderighi
- Marcello Teodori
- Mida Boghetich
- Muralidharan Chandrasekaran
- Piero Ricca
- Renzo Borgatti
- Simone Bordet
- Simone Bruno
- Uberto Barbini
- Valvolog
- Webtide blogs (Greg Wilkins & Jan Bartel)
Links




















Good catch! Thanks for sharing…
You’re definitely a problem solver! Very good catch, nothing else to say but “great!”
Nice stuff.
Thanks for sharing the experiencing.
Greatttttttttttttttttttt :)
I’ve found that the browser does make multiple requests. Below is the results from
Google Chrome. No matter what your requesting it makes 2 request on the initial connection.
Adding latency to the server increased the requests up to 4 (chrome). At least that is what
my test find.
The output is from simple threaded c https server code. It prints when a request is made,
when the headers are parsed, and what the request was. The more latency I added to the
header parsing the more requests the browser made. Firefox limited its request to three.
IE made a two request each time it received a Connection Closed response without content.
The numbers in parens are the threads id number.
——-start first request——————–
———————————————
Connection from: 127.0.0.1 (45533)
Creating Thread
———————————————
Connection from: 127.0.0.1 (45535)
Creating Thread
———————————————
———————————————
parsed headers (45533)
cert request
———————————————
———————————————
parsed headers (45535)
cert request
——-end first request———————
———————————————
Connection from: 127.0.0.1 (45537)
Creating Thread
———————————————
———————————————
parsed headers (45537)
page request