This is a story about a problem that can happen if your web application has operations that will require a long time (let’s say more than 30 seconds) to generate a response for the browser.
Figuring out what is happening, could not always be so easy. Especially if you didn’t wrote the application and the logging system does not help you. Here’s my story.
Some days ago I’ve been entrusted to solve a problem on an application deployment. The problem was that, during an operation involving a complex database transaction during several minutes, the application was crashing giving an Oracle exception in the logs: ORA-00054: resource busy and acquire with NOWAIT specified.
The problem was given only on a specific deployment, while in many other environments the operations was completing successfully.
Initially I thought there were some external process that was locking the table (like, some user with an open TOAD on the table), but with the help of a DBA, we removed all the locks and we ran the process while nobody else was connected. Monitoring the database while running, we found more strange things: there were more connections open doing stuff on the database.
Also from the application log (very very verbose) we found that the operations were made in wrong order. The thread number was not included in the log, so I was thinking about wrong order of operations in the code. And this was scary because I was unable to understand how this could possibly work in other environments.
After a careful code review, I concluded that the code was doing the operations in the correct order. Code don’t lie, even when it’s bad code.
Looking at the logs, I noticed a strange things: similar log lines were repeated at distance of almost exactly 30 seconds.
Summarizing, I needed an hypothesis that explains all the following weirdness:
- Problem was happening only on a specific deployment, while not happening in other environments.
- The code was implementing database operations in the right order, but from the log I was getting things executed in a different order.
- The number of database connections for that transaction was higher (3x) than the number I was expecting to be.
- Looking carefully at the logs, there were similar messages distanced almost exactly by 30 seconds.
- Locks on database resources that should not occur.
The solution of the mystery
What actually is happening is that the browser sends a single request, but the application server receives many requests. The application was not checking that the operation was already running, so many requests started many operations that was requiring exclusive locks on the database, causing the application fail with that error.
But, how is it possible that a single browser request can result in many requests on the server? I supposed that there were an intermediary server that was doing that. And there were: Apache.
This is what it was happening:
- The user’s browser sends the request for the operation.
- Apache receives the request and forwards the same request to the application server.
- Apache waits for the application server to respond.
- After 30 seconds Apache doesn’t get the response from the application server, so Apache thinks that the application server has not received the request or that the server has gone down.
- Some fail-over mechanism implemented in Apache makes it to reissue the request to the application server. So in the log messages you can guess that a second request is received, after almost exactly 30 seconds.
- steps 4 and 5 repeats again, so a third request thread start on the application server.
- Those parallel threads then make the application server to try to lock database resources that were already locked by previous requests.
- After the 3rd request goes in timeout the Apache decides that no applications server are alive to respond, and returns an error to the browser.
- The browser display the error returned by Apache, but the application server still continue to run the started threads, of course failing.
So, the solution can be, to remove the Apache server in the middle that is causing this behavior, or reconfigure it to have a longer timeout to allow the first thread to complete, or disabling this behavior at all.
Also, the application should not allow multiple threads to run exclusive locking operations. It should implement a sort of synchronization to give the user an error message saying that the operation is already running and cannot be started again until it is finished. But this does not solve the timeout problem on the middle server, as also the first user could get the warning message.
I noticed previously that also Weblogic module for Apache implements a similar fail-over mechanism. So, be aware of that, and consider the case.
Now, it just remains to convince the deployment guys that we are not responsible of what they put between the application and the browser.
|« Aug||Oct »|
- Android (3)
- Apple (30)
- Books (7)
- Eclipse (14)
- Errors (5)
- Firefox (7)
- Git (3)
- Hardware (18)
- Horror Code (8)
- Internet (21)
- Java (106)
- Life, universe and everything (45)
- Lifehacks (26)
- Linux (53)
- Opinions (26)
- OSX (11)
- OWNER API (3)
- Python (1)
- Software (35)
- Speeches and Conferences (8)
- Unix (5)
- Web (23)
- Windows (19)
Android apple architecture Bash bsd configuration CSS Development Düsseldorf framework free Git Google Hardware hdr How-To howto Java Karmic library Linux lion MacBook maven opensource Open Source Opinion os x OSX owner owner api patterns Pitfalls Practices properties Software TDD Testing tip tonemapped Tricks Ubuntu unix video Web