Tuesday, July 8, 2008

Using various tools to solve a problem

I was recently called in to take over a performance problem. I quickly determined that only a portion of the database was affected, but it was the most visible part, affecting the business's ability to offer a certain feature to millions of our customers. Queries that should take less than 5 seconds were suddenly taking minutes to complete.

Of course the App Vendor claimed that nothing had changed around that particular query in years, so it had to be a database problem.

Looking at OEM's Average Active Sessions graph on the Performance tab, I could clearly see a huge red stripe (red means the wait is Application related). Drilling down, I found waits on an event called "enq: UL - contention" to be the problem. A Metalink search returned no hits, and a Google search returned hundreds of forum threads with DBAs asking "What is enq: UL - contention?", most of them without any replies. The only reference to this event in the manual was in Oracle® Database Reference 10g Release 2 (10.2) (B14237-01), Appendix E: Oracle Enqueue Names. It simply stated "UL, User-defined Locks". To me that meant the app could be using DBMS_LOCK, but the Vendor said they didn't, and app support told me that these queries do not write to the database.

I queried V$SESSION for sessions waiting on EVENT = 'enq: UL - contention' to find the SID of the BLOCKING_SESSION, but the BLOCKING_SESSION kept changing. A single, consistent blocking session would have confirmed an app issue, and the short-term solution would have been to kill that process.
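A query along these lines shows who is waiting and who is blocking (a sketch; the column names are as in the 10g V$SESSION view):

```sql
-- Find sessions stuck on the UL enqueue and their current blocker.
-- BLOCKING_SESSION is populated when Oracle knows who holds the lock.
SELECT sid,
       serial#,
       blocking_session,
       seconds_in_wait
  FROM v$session
 WHERE event = 'enq: UL - contention';
```

Re-running this every few seconds showed a different BLOCKING_SESSION each time, which ruled out a single runaway locker.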

Then I logged a priority 1 SR and was asked to provide system state dumps. Eventually BUG 4367986 was identified as a possibility. The patch was applied, but it made no difference!

OK, I clearly had to try something else, so I decided to trace one session and go through the TKPROF output. I found that the app-related SQL statements all executed in less than 5 seconds, but one statement stood out at 60 seconds of elapsed time. That SQL statement was:

begin ctxsys.drvdml.com_sync_index(:idxname, :idxmem, :partname); end;

The CTXSYS schema means Oracle Text was in use, and the Vendor then confirmed that Oracle Text indexes had been built on an app-related audit table almost two months earlier. The TKPROF output also showed INSERTs into the app audit table, so we were finally on the right track.
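For reference, capturing such a trace can be sketched with DBMS_MONITOR, the 10g interface for session-level tracing (the SID and SERIAL# below are placeholders for the target session):

```sql
-- Enable extended SQL trace, including wait events, for one session.
-- 1234 and 56789 stand in for the real SID and SERIAL# from V$SESSION.
BEGIN
  DBMS_MONITOR.session_trace_enable(
    session_id => 1234,
    serial_num => 56789,
    waits      => TRUE,
    binds      => FALSE);
END;
/
```

The resulting trace file lands in user_dump_dest and can be summarised with something like `tkprof <tracefile>.trc report.txt sort=exeela`, which sorts statements by elapsed execution time and makes a 60-second outlier obvious. DBMS_MONITOR.session_trace_disable turns tracing off again.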

This information was uploaded to the SR, and I was asked to check the SYNC intervals via a select from dr$index, which confirmed that the two audit indexes had SYNC ON COMMIT. We changed the SYNC interval to MANUAL, just to prove whether the problem was related to SYNC ON COMMIT. Once we did, the wait event disappeared.

exec CTX_DDL.REPLACE_INDEX_METADATA('XX','REPLACE METADATA SYNC (MANUAL)');
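With SYNC (MANUAL) in place, new rows in the audit table are no longer indexed automatically on commit, so the index has to be synchronised explicitly, for example (a sketch; 'XX' stands in for the real index name as above, and '2M' is just an example memory setting):

```sql
-- Manually synchronise the Text index, processing any pending rows.
EXEC CTX_DDL.SYNC_INDEX('XX', '2M');
```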

The SR engineer then supplied the following script to change the SYNC to every 5 minutes.

exec CTX_DDL.REPLACE_INDEX_METADATA('XX','REPLACE METADATA SYNC (every "SYSDATE+5/1440")');
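SYNC (EVERY ...) schedules the synchronisation as a database job, so one way to confirm the job was created is to look in DBA_JOBS (a sketch; the exact text of the WHAT column varies by version):

```sql
-- Look for the background job Oracle Text created for the periodic sync.
SELECT job, what, interval
  FROM dba_jobs
 WHERE LOWER(what) LIKE '%sync%';
```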

So the "moral of the story" is that as a DBA you have more than one tool for solving a problem, and you should use as many of them as you need until the root cause of the problem is clear.