Saturday, September 29, 2018

Connection Pooling for Scalability and Performance

Got a call from our web support team, seems that a critical database transaction with a 2 second SLA is sometimes taking 6 seconds to complete. They have been investigating the problem for a day and have concluded that it is a Database problem. They thought there must be some type of short spike in the database that affects their transactions.

First step was to look at the health of the database, the load profile and wait events of ACTIVE sessions, nothing jumped out as a problem. I even did a session trace of one of the web service sessions in question and did not find any SQL query taking longer than a few milliseconds.

An AWR report showed an average of 50 logons per second over a 20 minute period. That set off the alarm bells as database connections are expensive (just think of the overhead of establishing a network connection and then also initializing a database connection which requires user authentication, authentication protocols, establishing transactional contexts etc.).

The web development team assured me that they are using connection pooling and the number of sessions was far below the processes parameter, but it was when I queried 2 sources: the DBA_AUDIT_SESSION view and the listener.ora log file that my suspicions where confirmed. Hundreds of LOGONS and LOGOFFS per minute for one specific database username connecting from 4 userhosts/clients.

select os_username, username, userhost, action_name, count(*)
from DBA_AUDIT_SESSION
where timestamp > sysdate - 1/24
having count(*) > 100
group by  os_username, username, userhost, action_name
order by count(*);

As I said nothing stood out in my initial investigation and count from V$SESSION, but now that I have the offending username and userhost I could look more closely at V$SESSION. Although the general count of sessions for this spesific username stayed constant at about 50, the LOGON_TIME was never more than 2 seconds older than the current time! The database sessions had a very short life: logon, executes one or a few queries, logoff. Everytime I queries V$SESSION I was looking at 50 different sessions than a few seconds before.

select to_char(sysdate, 'yyyy-mm-dd hh24:mi:ss') right_now,
to_char(logon_time, 'yyyy-mm-dd hh24:mi:ss') logon_time
from V$SESSION
where username = ‘XX’
and userhost in (‘HOST1’,’HOST2’,’HOST3’,’HOST4’);

It turned out that we have multiple apps and web teams using this database and a totally separate team added a transaction without using connection pooling! This affected all new database connections. The original team who brought the issue to my attention where using connection pooling, but every now and again even they had to create new connections, and it is while establishing these new connections on a already overloaded database connection function that slowed their transactions down. After alerting the teams of my findings, they made the desired changes and within a few hours the original team gave the thumbs up and wanted to know what I did to make the database faster.

24 hours later I used this query from DBA_HIST_SYSMETRIC_SUMMARY to create a graph to show how the Logons per Second metric came down, the timing matched a graph from the app team that showed when all web transactions stabilized to less than 1 second again.

select to_char(s.end_interval_time,'mm-dd_hh24:mi') snap_time, sy.average
from DBA_HIST_SNAPSHOT s, DBA_HIST_SYSMETRIC_SUMMARY sy
where s.dbid = sy.dbid
and s.instance_number = sy.instance_number
and s.snap_id = sy.snap_id
and sy.metric_name = 'Logons Per Sec'
and s.end_interval_time > trunc(sysdate-2)
order by snap_time
/

No comments: