Saturday, September 29, 2018

Connection Pooling for Scalability and Performance

Got a call from our web support team; it seems that a critical database transaction with a 2 second SLA is sometimes taking 6 seconds to complete. They had been investigating the problem for a day and concluded that it is a database problem: there must be some type of short spike in the database that affects their transactions.

The first step was to look at the health of the database, the load profile and the wait events of ACTIVE sessions; nothing jumped out as a problem. I even did a session trace of one of the web service sessions in question and did not find any SQL query taking longer than a few milliseconds.

An AWR report showed an average of 50 logons per second over a 20 minute period. That set off the alarm bells, as database connections are expensive (just think of the overhead of establishing a network connection and then initializing a database session, which requires user authentication, authentication protocol negotiation, establishing a transactional context, etc.).
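For a quick live check before pulling an AWR report, a query like this against V$SYSMETRIC shows the current logon rate (just a sketch; it returns one row per metric interval):

select begin_time, end_time, round(value, 1) logons_per_sec
from V$SYSMETRIC
where metric_name = 'Logons Per Sec'
order by begin_time;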

The web development team assured me that they were using connection pooling and that the number of sessions was far below the PROCESSES parameter, but when I queried 2 sources, the DBA_AUDIT_SESSION view and the listener log file, my suspicions were confirmed: hundreds of LOGONs and LOGOFFs per minute for one specific database username connecting from 4 userhosts/clients.

select os_username, username, userhost, action_name, count(*)
from DBA_AUDIT_SESSION
where timestamp > sysdate - 1/24
group by os_username, username, userhost, action_name
having count(*) > 100
order by count(*);

As I said, nothing stood out in my initial investigation and count from V$SESSION, but now that I had the offending username and userhost I could look more closely at V$SESSION. Although the overall count of sessions for this specific username stayed constant at about 50, the LOGON_TIME was never more than 2 seconds older than the current time! The database sessions had a very short life: logon, execute one or a few queries, logoff. Every time I queried V$SESSION I was looking at 50 different sessions from the ones a few seconds before.

select to_char(sysdate, 'yyyy-mm-dd hh24:mi:ss') right_now,
to_char(logon_time, 'yyyy-mm-dd hh24:mi:ss') logon_time
from V$SESSION
where username = 'XX'
and userhost in ('HOST1','HOST2','HOST3','HOST4');

It turned out that we have multiple apps and web teams using this database, and a totally separate team had added a transaction without using connection pooling! This affected all new database connections. The original team who brought the issue to my attention were using connection pooling, but every now and again even they had to create new connections, and it was while establishing these new connections on an already overloaded connection layer that their transactions slowed down. After alerting the teams of my findings, they made the required changes, and within a few hours the original team gave the thumbs up and wanted to know what I did to make the database faster.
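A quick way to confirm pooling is really in place is to re-check session ages for that username; with a healthy pool the oldest session should be minutes or hours old, not seconds (a sketch, reusing the placeholder username from the earlier query):

select username, count(*) sessions,
round((sysdate - min(logon_time)) * 24 * 60) oldest_session_minutes
from V$SESSION
where username = 'XX'
group by username;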

24 hours later I used this query against DBA_HIST_SYSMETRIC_SUMMARY to create a graph showing how the Logons Per Sec metric came down; the timing matched a graph from the app team that showed when all web transactions stabilized to less than 1 second again.

select to_char(s.end_interval_time,'mm-dd_hh24:mi') snap_time, sy.average
from DBA_HIST_SNAPSHOT s, DBA_HIST_SYSMETRIC_SUMMARY sy
where s.dbid = sy.dbid
and s.instance_number = sy.instance_number
and s.snap_id = sy.snap_id
and sy.metric_name = 'Logons Per Sec'
and s.end_interval_time > trunc(sysdate-2)
order by snap_time
/

Friday, September 21, 2018

Cloning a RAC DB on the same cluster

We are using OMF in ASM, so I will include the following 3 lines in the RMAN DUPLICATE TARGET command:
SET DB_FILE_NAME_CONVERT '+DG_DATA','+DG_DATA'
SET LOG_FILE_NAME_CONVERT '+DG_DATA','+DG_DATA'
SET CONTROL_FILES='+DG_DATA','+DG_DATA'

Do NOT use NOFILENAMECHECK when cloning to the same server; any error in your script and RMAN might overwrite the target DB's datafiles!

I also need to SET CLUSTER_DATABASE='FALSE' in the RMAN DUPLICATE TARGET command because I will clone the RAC to a single instance and then convert the single instance back to a RAC.

Prep includes creating a password file and an init.ora file with these 3 lines:
db_name=TMTEST
db_unique_name=TMTEST
compatible=12.1.0.2.0

I added the 3rd line after getting “RMAN-06136: ORACLE error from auxiliary database: ORA-00201: control file version 12.1.0.2.0 incompatible with ORACLE version 12.0.0.0.0” on my first clone attempt.

Set ORACLE_SID to the new DB (TMTEST) and startup nomount. When launching RMAN, connect to the TARGET via a TNS entry.
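In shell terms that prep looks roughly like this (a sketch; it assumes the 3-line pfile above was saved to the default location $ORACLE_HOME/dbs/initTMTEST.ora, so a plain STARTUP NOMOUNT will find it):

export ORACLE_SID=TMTEST
sqlplus / as sysdba
SQL> startup nomount
SQL> exit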

rman target sys/XX@DLIOWCC auxiliary sys/XX
RMAN> DUPLICATE TARGET DATABASE TO TMTEST FROM ACTIVE DATABASE
SPFILE
SET DB_FILE_NAME_CONVERT '+DG_DATA','+DG_DATA'
SET LOG_FILE_NAME_CONVERT '+DG_DATA','+DG_DATA'
SET CONTROL_FILES='+DG_DATA','+DG_DATA'
SET CLUSTER_DATABASE='FALSE';

10 minutes later I had a cloned DB with ORACLE_SID=TMTEST.

Some post-cloning tasks in order to make it a RAC:

SQL> show parameter spfile
NAME    TYPE    VALUE
------- ------- ---------------------------------------------------
spfile  string  /u01/oracle/product/12.1.0/db_1/dbs/spfileTMTEST.ora
SQL> alter system set cluster_database=TRUE scope=spfile sid='*';
SQL> create pfile='initTMTESTcloned.ora' from spfile;
vi initTMTESTcloned.ora

Found a few entries I had to change from:
DLIOWCC2.instance_number=2
DLIOWCC1.instance_number=1
DLIOWCC2.thread=2
DLIOWCC1.thread=1
DLIOWCC1.undo_tablespace='UNDOTBS1'
DLIOWCC2.undo_tablespace='UNDOTBS2'
To
TMTEST2.instance_number=2
TMTEST1.instance_number=1
TMTEST2.thread=2
TMTEST1.thread=1
TMTEST1.undo_tablespace='UNDOTBS1'
TMTEST2.undo_tablespace='UNDOTBS2'

SQL> create spfile='+DG_DATA/TMTEST/spfileTMTEST.ora' from pfile='initTMTESTcloned.ora';
vi initTMTEST1.ora
add line spfile='+DG_DATA/TMTEST/spfileTMTEST.ora'
copy the init.ora and orapw file to node 2
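Roughly like this (a sketch; it assumes both nodes share the same $ORACLE_HOME path and the default $ORACLE_HOME/dbs locations, and that each instance looks for init<ORACLE_SID>.ora and orapw<ORACLE_SID>, so the node 2 copies are named for its own SID):

cd $ORACLE_HOME/dbs
scp initTMTEST1.ora node2za:$ORACLE_HOME/dbs/initTMTEST2.ora
scp orapwTMTEST node2za:$ORACLE_HOME/dbs/orapwTMTEST2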

Register the database and modify the configuration:
srvctl add database -db TMTEST -oraclehome /u01/app/oracle/product/12.1.0/dbhome_1 -spfile +DG_DATA/TMTEST/spfileTMTEST.ora
srvctl add instance -db TMTEST -instance TMTEST1 -node node1za
srvctl add instance -db TMTEST -instance TMTEST2 -node node2za
srvctl config database -db TMTEST

Shut down the instance (remember ORACLE_SID=TMTEST) and start the DB from srvctl:
srvctl start database -db TMTEST

Remember to change ORACLE_SID from TMTEST to TMTEST1 if you want to connect via sqlplus on DB node node1za.
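To verify, something like this shows both instances up (a sketch):

export ORACLE_SID=TMTEST1
sqlplus / as sysdba
SQL> select inst_id, instance_name, status from gv$instance;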

Monday, September 17, 2018

ORA-16047 - DGID mismatch between destination setting and target database

A few days after a planned DG switchover I got a call from the DBAs. OEM and V$DATAGUARD_STATS showed no lag, but archives had not shipped from the Primary DB to the Physical Standby DB since the switchover.


After confirming their suspicions by querying V$ARCHIVED_LOG and V$DATAGUARD_STATS (the checks are sketched below), I went to the 3 primary sources for DG troubleshooting.
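The checks were along these lines (a sketch; the first query runs on the primary, the other two on the standby):

-- highest archived sequence generated per thread (primary)
select thread#, max(sequence#) last_generated
from V$ARCHIVED_LOG
group by thread#;

-- highest sequence applied per thread (standby)
select thread#, max(sequence#) last_applied
from V$ARCHIVED_LOG
where applied = 'YES'
group by thread#;

-- lag metrics (standby)
select name, value
from V$DATAGUARD_STATS
where name in ('transport lag', 'apply lag');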

1.) DGMGRL

the command "show configuration" indicated that the Standby database was disabled.

2.) Primary and Standby Alert.log files

The Primary DB had this peculiar error in its Alert.log file:
Fri Sep 14 13:55:31 2018
ORA-16047: DGID mismatch between destination setting and target database

Using the oerr utility I found the full description of the error:

Error:  ORA-16047 DGID mismatch between destination setting and target database 
---------------------------------------------------------------------------
Cause:  The DB_UNIQUE_NAME specified for the destination did not match the 
 DB_UNIQUE_NAME at the target database. 
Action: Make sure the DB_UNIQUE_NAME specified in the LOG_ARCHIVE_DEST_n 
 parameter matches the DB_UNIQUE_NAME parameter defined at the 
 destination. 

But the DB_UNIQUE_NAME was correct in the Primary's LOG_ARCHIVE_DEST_2 parameter.

Next I looked at the Standby DB's Alert.log file, especially around the timestamp of the ORA-16047 error in the Primary DB's Alert.log. It seems that the Broker was not happy and decided to deconfigure the LOG_ARCHIVE_CONFIG parameter!

Fri Sep 14 13:54:23 2018
ALTER SYSTEM SET log_archive_config='nodg_config' SCOPE=BOTH;
...
Completed: Data Guard Broker shutdown
Fri Sep 14 13:54:59 2018
ALTER SYSTEM SET dg_broker_start=FALSE SCOPE=BOTH;
ALTER SYSTEM SET dg_broker_start=TRUE SCOPE=BOTH;


3.) Primary and Standby DG Alert log files

Let's look at the DG Alert.log files around the same timestamp.
The Standby DG Alert.log had hundreds of repeating ORA-16501 error messages, until the timestamp in question when we got:

...
Failed to send message to site pc3dfmw2tc. Error code is ORA-16501.
...
09/14/2018 13:54:06
REMOVE CONFIGURATION [PRESERVE DESTINATIONS]

So I was correct: the Broker was not happy (as can be seen by the repeated ORA-16501 errors) and eventually deconfigured the LOG_ARCHIVE_CONFIG parameter!

Error:  ORA-16501 the Data Guard broker operation failed
---------------------------------------------------------------------------
Cause:  the Data Guard broker operation failed
Action: See accompanying messages for details.

Not a lot of info for the ORA-16501 error; it seems to be a generic error. Investigating further back in the Standby's DG Alert.log file I found that the switchover had completed with errors. The DBA was satisfied that the switchover worked, but did not do due diligence in investigating the accompanying errors, and a few hours later DG stopped shipping archivelogs altogether.

The solution was to correct the LOG_ARCHIVE_CONFIG setting to what it was before the switchover, then to re-enable the Standby database in the DG Broker config. That fixed the lag: archives are now transported and applied to the Standby DB. That also corrected the Apply Lag reported in V$DATAGUARD_STATS. The Standby DB is now in sync and is usable in case of a DR, but further troubleshooting is necessary to determine why the Broker configuration was not 100% happy in the first place.
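In commands, the fix looked roughly like this (a sketch with hypothetical DB_UNIQUE_NAMEs PRIMDB and STBYDB; substitute the real names from your Broker configuration):

-- on the primary: restore the DG_CONFIG that the Broker removed
SQL> alter system set log_archive_config='DG_CONFIG=(PRIMDB,STBYDB)' scope=both sid='*';

-- then re-enable the standby in the Broker and re-check
DGMGRL> enable database 'STBYDB';
DGMGRL> show configuration;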