Wednesday, October 3, 2018

Data Guard Failover surprise

Ran into an interesting scenario after reinstating the previous primary DB following an emergency failover.

A Data Guard failover is performed during emergencies (an actual DR situation): the old standby is activated as the new primary database with a RESETLOGS after incomplete recovery.
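For reference, a Broker-managed failover like this one is usually kicked off from DGMGRL while connected to the standby side; roughly like this (a sketch, using the member names from this configuration):

DGMGRL> connect sys@TMSBY
DGMGRL> failover to TMSBY;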

Here's the alert.log from the new Primary database:

Tue Oct 02 16:07:34 2018
 Data Guard Broker: A failover will now be attempted
Tue Oct 02 16:07:34 2018
ALTER DATABASE FAILOVER TO TMSBY
...
Tue Oct 02 16:07:40 2018
ALTER DATABASE SWITCHOVER TO PRIMARY (TMSBY1)
RESETLOGS after incomplete recovery UNTIL CHANGE 14297961467966 time 10/02/2018 16:02:10
...
Tue Oct 02 16:07:40 2018
Setting recovery target incarnation to 3
Tue Oct 02 16:07:41 2018
Switchover: Complete - Database mounted as primary
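A quick sanity check on the new primary at this point could look something like this (a sketch) to confirm the role change:

SQL> select database_role, open_mode from v$database;

which should now report DATABASE_ROLE = PRIMARY.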

A few hours later the old Primary DB (TMPRD) was fixed, started in MOUNT mode and reinstated via the Broker:

DGMGRL> show configuration
Configuration - tmdg
  Protection Mode: MaxPerformance
  Members:
  TMSBY  - Primary database
    TMPRD - Physical standby database (disabled)
      ORA-16661: the standby database needs to be reinstated
Fast-Start Failover: DISABLED
Configuration Status:
SUCCESS   (status updated 26 seconds ago)
DGMGRL> reinstate database TMPRD
Reinstating database "TMPRD", please wait...
Reinstatement of database "TMPRD" succeeded
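Worth noting: the Broker can only reinstate the old primary if Flashback Database was enabled on it before the failover; a quick check on TMPRD would be (sketch):

SQL> select flashback_on from v$database;

If that returns NO, the old primary cannot be reinstated and has to be rebuilt as a standby from a backup or an RMAN duplicate.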

Reinstatement completed successfully, but the new standby (TMPRD) was not applying redo; the alert.log confirmed it was waiting for an archivelog that had only ever existed on the former standby site (TMSBY):

Media Recovery Waiting for thread 2 sequence 358 branch(resetlogs_id) 987409798
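To see where managed recovery is stuck, a query along these lines on the standby shows the MRP state (a sketch, using the v$managed_standby view):

SQL> select process, status, thread#, sequence# from v$managed_standby where process like 'MRP%';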

But that archivelog had already been backed up and deleted via RMAN. A straightforward restore on TMSBY should have done the job, but it failed with:

RMAN> restore archivelog sequence 358 thread 2;
Starting restore at 03-OCT-18 08:40
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=232 instance=TMSBY1 device type=DISK
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of restore command at 10/03/2018 08:40:15
RMAN-20242: specification does not match any archived log in the repository

Remember that the failover forced a RESETLOGS, so the incarnation of the database changed. The trick is to tell RMAN which incarnation to restore from:

RMAN> list incarnation;
List of Database Incarnations
DB Key  Inc Key DB Name  DB ID            STATUS  Reset SCN  Reset Time
------- ------- -------- ---------------- --- ---------- ----------
1       1       TMPRD   645377540        PARENT  1594143    31-AUG-18 10:37
2       2       TMPRD   645377540        PARENT  5368114    21-SEP-18 08:29
3       3       TMPRD   645377540        CURRENT 14297961467967 02-OCT-18 16:07

RMAN> restore archivelog sequence 358 thread 2 Incarnation 2;

Starting restore at 03-OCT-18 08:44
using channel ORA_DISK_1
channel ORA_DISK_1: starting archived log restore to default destination
channel ORA_DISK_1: restoring archived log
archived log thread=2 sequence=358
channel ORA_DISK_1: reading from backup piece +DG_FRA/TMSBY/BACKUPSET/2018_10_02/annnf0_TMSBY1_arch_0.337.988488123
channel ORA_DISK_1: piece handle=+DG_FRA/TMSBY/BACKUPSET/2018_10_02/annnf0_TMSBY1_arch_0.337.988488123 tag=TMSBY1_ARCH
channel ORA_DISK_1: restored backup piece 1
channel ORA_DISK_1: restore complete, elapsed time: 00:00:15
Finished restore at 03-OCT-18 08:45
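For the record, the same incarnation history is also visible from SQL, e.g. (sketch):

SQL> select incarnation#, resetlogs_change#, resetlogs_time, status from v$database_incarnation;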

Within a few minutes the restored archivelog was picked up by the FAL process, transferred and applied on the new standby (TMPRD). Alert.log contents:

RFS[3]: Opened log for thread 2 sequence 358 dbid 645377540 branch 987409798
...
Archived Log entry 1627 added for thread 2 sequence 358 rlc 987409798 ID 0x0 dest 2:
...
Media Recovery Log +DG_ARCHIVE/TMPRD/ARCHIVELOG/2018_10_03/thread_2_seq_358.49304.988533935
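To double-check that the gap is closed, the applied status of that sequence can be queried on TMPRD, something like (sketch):

SQL> select thread#, sequence#, applied from v$archived_log where thread# = 2 and sequence# = 358 and resetlogs_id = 987409798;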

All in all I was impressed with the smooth failover and reinstatement via the Broker, but even then, not everything can be automated and there's always some hiccup where the DBA has to roll up their sleeves and dig in to get things back to normal again.
