Got a call from our web support team, seems
that a critical database transaction with a 2 second SLA is sometimes taking 6 seconds to
complete. They have been investigating the problem for a day and have concluded
that it is a Database problem. They thought there must be some type of short spike in the database that affects their transactions.
First step was to look at the health of the database, the load profile and wait events of ACTIVE sessions, nothing jumped out as a problem. I even did a session trace of one of the web service sessions in question and did not find any SQL query taking longer than a few milliseconds.
First step was to look at the health of the database, the load profile and wait events of ACTIVE sessions, nothing jumped out as a problem. I even did a session trace of one of the web service sessions in question and did not find any SQL query taking longer than a few milliseconds.
An AWR report showed an average of 50
logons per second over a 20 minute period. That set off the alarm bells as database
connections are expensive (just think of the overhead of establishing a network
connection and then also initializing a database connection which requires user
authentication, authentication protocols, establishing transactional contexts etc.).
The web development team assured me that
they are using connection pooling and the number of sessions was far below the
processes parameter, but it was when I queried 2 sources: the DBA_AUDIT_SESSION
view and the listener.ora log file that my suspicions where confirmed. Hundreds
of LOGONS and LOGOFFS per minute for one specific database username connecting
from 4 userhosts/clients.
select os_username, username, userhost,
action_name, count(*)
from DBA_AUDIT_SESSION
where timestamp > sysdate - 1/24
having count(*) > 100
group by os_username, username, userhost,
action_name
order by count(*);
As I said nothing stood out in my initial
investigation and count from V$SESSION, but now that I have the offending
username and userhost I could look more closely at V$SESSION. Although the
general count of sessions for this spesific username stayed constant at about
50, the LOGON_TIME was never more than 2 seconds older than the current time! The
database sessions had a very short life: logon, executes one or a few queries, logoff.
Everytime I queries V$SESSION I was looking at 50 different sessions than a few
seconds before.
select to_char(sysdate, 'yyyy-mm-dd hh24:mi:ss')
right_now,
to_char(logon_time, 'yyyy-mm-dd hh24:mi:ss')
logon_time
from V$SESSION
where username = ‘XX’
and userhost in (‘HOST1’,’HOST2’,’HOST3’,’HOST4’);
It turned out that we have multiple apps
and web teams using this database and a totally separate team added a
transaction without using connection pooling! This affected all new database connections. The original team who brought the issue to my attention where using connection pooling, but every now and again even they had to create new connections, and it is while establishing these new connections on a already overloaded database connection function that slowed their transactions down. After alerting the teams of my findings,
they made the desired changes and within
a few hours the original team gave the thumbs up and wanted to know
what I did to make the database faster.
24 hours later I used this query from DBA_HIST_SYSMETRIC_SUMMARY to create a graph to show how the Logons per Second metric came
down, the timing matched a graph from the app team that showed when all web
transactions stabilized to less than 1 second again.
select
to_char(s.end_interval_time,'mm-dd_hh24:mi') snap_time, sy.average
from
DBA_HIST_SNAPSHOT s, DBA_HIST_SYSMETRIC_SUMMARY sy
where
s.dbid = sy.dbid
and
s.instance_number = sy.instance_number
and
s.snap_id = sy.snap_id
and
sy.metric_name = 'Logons Per Sec'
and
s.end_interval_time > trunc(sysdate-2)
order
by snap_time
/
No comments:
Post a Comment