Tuesday, October 30, 2018

OS Logfiles via SQL

Due to security restrictions it has become cumbersome to login to a DB Server or to even login as SYS to the DB.
I've had to quickly adapt and found the following views and queries beneficial when querying log files that I normally accessed from the OS:

To view the rman log for the last rman session:

select output
from v$rman_output
where session_recid = (select max(session_recid) from v$rman_status)
order by recid ;

The v$rman_output view is memory-based so it will not persist thru a database restart, the view provides the exact output from rman as it would appear on the screen or a logfile.

To query the alert.log:

select * from
(select record_id,
        to_char(originating_timestamp,'DD.MM.YYYY HH24:MI:SS'),
        message_text
 from x$dbgalertext
 order by RECORD_ID desc)
where rownum <= 20 order by RECORD_ID asc;

but will fail with "ORA-00942: table or view does not exist" if you are not logged in as SYS, and you can't even grant select access to the X$ table to another user, the grant will fail with "ORA-02030: can only select from fixed tables/views". My workaround was to create a view and then grant a select privilege on the view:

create view alert_log as select * from  X$DBGALERTEXT;
grant select on alert_log to meyerm;

Then as user meyerm:

select * from
(select record_id,
        to_char(originating_timestamp,'DD.MM.YYYY HH24:MI:SS'),
        message_text
 from sys.alert_log
 order by RECORD_ID desc)
where rownum <= 20 order by RECORD_ID asc;

Monday, October 8, 2018

Move datafile to another ASM diskgroup

From Oracle 12.1 you can execute SQL commands from within RMAN, which was great for this exercise so I did not have to jump between RMAN and SQLPlus every few minutes.

The steps to move a datafile to another ASM diskgroup is basically to take the datafile offline, copy the datafile to the new diskgroup (using RMAN commands), rename the datafile (or switch to copy) and bring the datafile online again.

RMAN> alter database datafile '+DG_DATA/ac3duim/datafile/ggatetbs.362.905353545' offline;

RMAN> copy datafile '+DG_DATA/ac3duim/datafile/ggatetbs.362.905353545' to '+DG_C2DATA';

...
output file name=+DG_C2DATA/ac3duim/datafile/ggatetbs.275.923218515 tag=TAG20160922T093514 RECID=1 STAMP=923218516
...

RMAN> alter database rename file '+DG_DATA/ac3duim/datafile/ggatetbs.362.905353545' to '+DG_C2DATA/ac3duim/datafile/ggatetbs.275.923218515';
OR
RMAN> switch datafile '+DG_DATA/ac3duim/datafile/ggatetbs.362.905353545' to copy;

RMAN> recover datafile '+DG_C2DATA/ac3duim/datafile/ggatetbs.275.923218515';

RMAN> alter database datafile '+DG_C2DATA/ac3duim/datafile/ggatetbs.275.923218515' online;

The old datafile can now be deleted (also via RMAN)


RMAN> delete datafilecopy '+DG_DATA/ac3duim/datafile/ggatetbs.362.905353545';

Wednesday, October 3, 2018

Data Guard Failover surprise

Ran into a interesting scenario after reinstating the previous primary DB after an emergency failover.

A Data Guard failover is done during emergencies (an actual DR situation) and the old standby gets activated as a Primary database with RESETLOGS after an incomplete recovery. 

Here's the alert.log from the new Primary database:

Tue Oct 02 16:07:34 2018
 Data Guard Broker: A failover will now be attempted
Tue Oct 02 16:07:34 2018
ALTER DATABASE FAILOVER TO TMSBY
...
Tue Oct 02 16:07:40 2018
ALTER DATABASE SWITCHOVER TO PRIMARY (TMSBY1)
RESETLOGS after incomplete recovery UNTIL CHANGE 14297961467966 time 10/02/2018 16:02:10
...
Tue Oct 02 16:07:40 2018
Setting recovery target incarnation to 3
Tue Oct 02 16:07:41 2018
Switchover: Complete - Database mounted as primary

A few hours later the old Primary DB (TMPRD) was fixed, started in MOUNT mode and reinstated via the Broker:

DGMGRL> show configuration
Configuration - tmdg
  Protection Mode: MaxPerformance
  Members:
  TMSBY  - Primary database
    TMPRD - Physical standby database (disabled)
      ORA-16661: the standby database needs to be reinstated
Fast-Start Failover: DISABLED
Configuration Status:
SUCCESS   (status updated 26 seconds ago)
DGMGRL> reinstate database TMPRD
Reinstating database "TMPRD", please wait...
Reinstatement of database "TMPRD" succeeded

Reinstatement completed successfully, but the new standby (TMPRD) was not applying logs, the alert.log confirmed that it was waiting for an Archivelog that was only created on the previous standby site. 

Media Recovery Waiting for thread 2 sequence 358 branch(resetlogs_id) 987409798

But that archivelog has already been backed up and deleted via RMAN. A straight forward restore on TMSBY should work but then it failed with: 

RMAN> restore archivelog sequence 358 thread 2;
Starting restore at 03-OCT-18 08:40
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=232 instance=TMSBY1 device type=DISK
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of restore command at 10/03/2018 08:40:15
RMAN-20242: specification does not match any archived log in the repository

Remember the Failover forced a RESETLOGS so the incarnation of the DB changed, so the trick is:

RMAN> list incarnation;
List of Database Incarnations
DB Key  Inc Key DB Name  DB ID            STATUS  Reset SCN  Reset Time
------- ------- -------- ---------------- --- ---------- ----------
1       1       TMPRD   645377540        PARENT  1594143    31-AUG-18 10:37
2       2       TMPRD   645377540        PARENT  5368114    21-SEP-18 08:29
3       3       TMPRD   645377540        CURRENT 14297961467967 02-OCT-18 16:07

RMAN> restore archivelog sequence 358 thread 2 Incarnation 2;

Starting restore at 03-OCT-18 08:44
using channel ORA_DISK_1
channel ORA_DISK_1: starting archived log restore to default destination
channel ORA_DISK_1: restoring archived log
archived log thread=2 sequence=358
channel ORA_DISK_1: reading from backup piece +DG_FRA/TMSBY/BACKUPSET/2018_10_02/annnf0_TMSBY1_arch_0.337.988488123
channel ORA_DISK_1: piece handle=+DG_FRA/TMSBY/BACKUPSET/2018_10_02/annnf0_TMSBY1_arch_0.337.988488123 tag=TMSBY1_ARCH
channel ORA_DISK_1: restored backup piece 1
channel ORA_DISK_1: restore complete, elapsed time: 00:00:15
Finished restore at 03-OCT-18 08:45

And within a few minutes the restored Archivelog gets picked up by the FAL process, transfered and applied on the new standby (TMPRD). Alert.log contents:

RFS[3]: Opened log for thread 2 sequence 358 dbid 645377540 branch 987409798
...
Archived Log entry 1627 added for thread 2 sequence 358 rlc 987409798 ID 0x0 dest 2:
...
Media Recovery Log +DG_ARCHIVE/TMPRD/ARCHIVELOG/2018_10_03/thread_2_seq_358.49304.988533935

All in all I was impressed with the smooth failover and reinstatement via the broker, but even then, not everything can get automated and there's always some hicup where the DBA should roll up their sleeves and dig in in order to get things back to normal again.

Saturday, September 29, 2018

Connection Pooling for Scalability and Performance

Got a call from our web support team, seems that a critical database transaction with a 2 second SLA is sometimes taking 6 seconds to complete. They have been investigating the problem for a day and have concluded that it is a Database problem. They thought there must be some type of short spike in the database that affects their transactions.

First step was to look at the health of the database, the load profile and wait events of ACTIVE sessions, nothing jumped out as a problem. I even did a session trace of one of the web service sessions in question and did not find any SQL query taking longer than a few milliseconds.

An AWR report showed an average of 50 logons per second over a 20 minute period. That set off the alarm bells as database connections are expensive (just think of the overhead of establishing a network connection and then also initializing a database connection which requires user authentication, authentication protocols, establishing transactional contexts etc.).

The web development team assured me that they are using connection pooling and the number of sessions was far below the processes parameter, but it was when I queried 2 sources: the DBA_AUDIT_SESSION view and the listener.ora log file that my suspicions where confirmed. Hundreds of LOGONS and LOGOFFS per minute for one specific database username connecting from 4 userhosts/clients.

select os_username, username, userhost, action_name, count(*)
from DBA_AUDIT_SESSION
where timestamp > sysdate - 1/24
having count(*) > 100
group by  os_username, username, userhost, action_name
order by count(*);

As I said nothing stood out in my initial investigation and count from V$SESSION, but now that I have the offending username and userhost I could look more closely at V$SESSION. Although the general count of sessions for this spesific username stayed constant at about 50, the LOGON_TIME was never more than 2 seconds older than the current time! The database sessions had a very short life: logon, executes one or a few queries, logoff. Everytime I queries V$SESSION I was looking at 50 different sessions than a few seconds before.

select to_char(sysdate, 'yyyy-mm-dd hh24:mi:ss') right_now,
to_char(logon_time, 'yyyy-mm-dd hh24:mi:ss') logon_time
from V$SESSION
where username = ‘XX’
and userhost in (‘HOST1’,’HOST2’,’HOST3’,’HOST4’);

It turned out that we have multiple apps and web teams using this database and a totally separate team added a transaction without using connection pooling! This affected all new database connections. The original team who brought the issue to my attention where using connection pooling, but every now and again even they had to create new connections, and it is while establishing these new connections on a already overloaded database connection function that slowed their transactions down. After alerting the teams of my findings, they made the desired changes and within a few hours the original team gave the thumbs up and wanted to know what I did to make the database faster.

24 hours later I used this query from DBA_HIST_SYSMETRIC_SUMMARY to create a graph to show how the Logons per Second metric came down, the timing matched a graph from the app team that showed when all web transactions stabilized to less than 1 second again.

select to_char(s.end_interval_time,'mm-dd_hh24:mi') snap_time, sy.average
from DBA_HIST_SNAPSHOT s, DBA_HIST_SYSMETRIC_SUMMARY sy
where s.dbid = sy.dbid
and s.instance_number = sy.instance_number
and s.snap_id = sy.snap_id
and sy.metric_name = 'Logons Per Sec'
and s.end_interval_time > trunc(sysdate-2)
order by snap_time
/

Friday, September 21, 2018

Cloning a RAC DB on the same cluster

We are using OMF in ASM so will include the following 3 lines in the RMAN DUPLICATE TARGET command:
SET DB_FILE_NAME_CONVERT '+DG_DATA','+DG_DATA'
SET LOG_FILE_NAME_CONVERT '+DG_DATA','+DG_DATA'
SET CONTROL_FILES='+DG_DATA','+DG_DATA'

Do NOT use NOFILENAMECHECK when closing to the same server, any error in your script and RMAN might overwrite the target DBs db files!

Also need to SET CLUSTER_DATABASE='FALSE' in the RMAN DUPLICATE TARGET command because I will clone a RAC to single instance and then convert the single instance to a RAC.

Prep includes creating a pwd file and a init.ora file with 3 lines
db_name=TMTEST
db_unique_name=TMTEST
compatible=12.1.0.2.0

I added the 3rd line after getting “RMAN-06136: ORACLE error from auxiliary database: ORA-00201: control file version 12.1.0.2.0 incompatible with ORACLE version 12.0.0.0.0” on my first clone attempt

Set ORACLE_SID to the new DB (TMTEST) and startup nomount. When launching RMAN connect to the TARGET via a TNS entry

rman target sys/XX@DLIOWCC auxiliary sys/XX
RMAN> DUPLICATE TARGET DATABASE TO TMTEST FROM ACTIVE DATABASE
SPFILE
SET CONTROL_FILES='+DG_DATA','+DG_DATA';
SET DB_FILE_NAME_CONVERT '+DG_DATA','+DG_DATA'
SET LOG_FILE_NAME_CONVERT '+DG_DATA','+DG_DATA'
SET CLUSTER_DATABASE='FALSE';

10 minutes later I had a Cloned DB with ORACLE_SID=TMTEST

Some post cloning tasks in order to make it a RAC:

SQL> show parameter spfile
NAME    TYPE    VALUE
------- ------- ---------------------------------------------------
spfile  string  /u01/oracle/product/12.1.0/db_1/dbs/spfileTMTEST.ora
SQL> alter system set cluster_database=TRUE scope=spfile sid='*';
SQL> create pfile ‘initTMTESTcloned.ora’ from spfile;
vi initTMTESTcloned.ora

Found a few entries I had to change from:
DLIOWCC2.instance_number=2
DLIOWCC1.instance_number=1
DLIOWCC2.thread=2
DLIOWCC1.thread=1
DLIOWCC1.undo_tablespace='UNDOTBS1'
DLIOWCC2.undo_tablespace='UNDOTBS2'
To
TMTEST2.instance_number=2
TMTEST1.instance_number=1
TMTEST2.thread=2
TMTEST1.thread=1
TMTEST1.undo_tablespace='UNDOTBS1'
TMTEST2.undo_tablespace='UNDOTBS2'

SQL> create spfile='+DG_DATA/TMTEST/spfileTMTEST.ora' from pfile='initTMTESTcloned.ora';
vi initTMTEST1.ora
add line spfile='+DG_DATA/TMTEST/spfileTMTEST.ora'
copy the init.ora and orapw file to node 2

Register the database and modify the configuration
srvctl add database -db TMTEST -oraclehome /u01/app/oracle/product/12.1.0/dbhome_1 -spfile +DG_DATA/TMTEST/spfileTMTEST.ora
srvctl add instance -db TMTEST -i TMTEST1 -node node1za
srvctl add instance -db TMTEST -i TMTEST2 -node node2za
srvctl config database -db TMTEST

shutdown the instance (remember ORACLE_SID=TMTEST)
and start the DB from srvctl
srvctl start database -db TMTEST

Remember to change ORACLE_SID from TMTEST to TMTEST1 if you want to connect via sqlplus on the DB node node1za

Monday, September 17, 2018

ORA-16047 - DGID mismatch between destination setting and target database

A few days after a DG planned switchover I got a call from the DBAs. OEM and v$dataguard_stats shows no lag, but archives have not shipped from the Primary DB to the Physical Standby DB since the switchover.


After confirming their suspicions by querying V$archived_log and v$dataguard_stats I went to the 3 primary sources for DG troubleshooting.

1.) DGMGRL

the command "show configuration" indicated that the Standby database was disabled.

2.) Primary and Standby Alert.log files

The Primary DB had this peculiar error in its Alert.log file:
Fri Sep 14 13:55:31 2018
ORA-16047: DGID mismatch between destination setting and target database

using the oerr utility I found the full descrition of the error:

Error:  ORA-16047 DGID mismatch between destination setting and target database 
---------------------------------------------------------------------------
Cause:  The DB_UNIQUE_NAME specified for the destination did not match the 
 DB_UNIQUE_NAME at the target database. 
Action: Make sure the DB_UNIQUE_NAME specified in the LOG_ARCHIVE_DEST_n 
 parameter matches the DB_UNIQUE_NAME parameter defined at the 
 destination. 

But the DB_UNIQUE_NAME was correct in the Primary's LOG_ARCHIVE_DEST_2 parameter.

Next I looked at the Standby DBs Alert.log file, especially around the timestamp of the ORA-16047 error in the Primary DBs Alert.log. It seems that the Broker was not happy and decided to deconfigure the LOG_ARCHIVE_CONFIG parameter!

Fri Sep 14 13:54:23 2018
ALTER SYSTEM SET log_archive_config='nodg_config' SCOPE=BOTH;
...
Completed: Data Guard Broker shutdown
Fri Sep 14 13:54:59 2018
ALTER SYSTEM SET dg_broker_start=FALSE SCOPE=BOTH;
ALTER SYSTEM SET dg_broker_start=TRUE SCOPE=BOTH;


3.) Primary and Standby DG Alert log files

Lets look at the DG Alert.log files around the same timestamp.
The Standby DG Alert.log had hundreds of repeating ORA-16501 error messages until the timestamp in question when we got:

...
Failed to send message to site pc3dfmw2tc. Error code is ORA-16501.
...
09/14/2018 13:54:06
REMOVE CONFIGURATION [PRESEVERE DESTINATIONS]

So I was correct, the Broker was not happy (as can be seen by the repeated ORA-16501 errors) and eventually deconfigured the LOG_ARCHIVE_CONFIG parameter!

Error:  ORA-16501 the Data Guard broker operation failed
---------------------------------------------------------------------------
Cause:  the Data Guard broker operation failed
Action: See accompanying messages for details.

Not a lot of info for the ORA-16501 error, it seems to be a generic error. Investigating further back in the Standby's DG Alert.log file I found that the switchover completed with errors. The DBA was satisfied that the switchover worked, but did not do due diligence in investigating the accompanied error, and a few hours later DG stopped shipping Archivelogs altogether.
The solution was to correct the log_archive_config setting to what it was before the switchover, then to re-enable the Standby database in the DG Broker config. That fixed the lag, archives are now transported and applied to the Standby DB. That also corrected the Apply Lag reported in v$dataguard_stats.The Standby DB is now in sync and is usable in case of a DR, but further troubleshooting is necessary to determine why the Broker configuration was not 100% happy in the first place.