Due to security restrictions it has become cumbersome to log in to a DB server, or even to log in as SYS to the DB.
I've had to adapt quickly and found the following views and queries useful when querying log files that I normally accessed from the OS:
To view the rman log for the last rman session:
select output
from v$rman_output
where session_recid = (select max(session_recid) from v$rman_status)
order by recid ;
The v$rman_output view is memory-based, so it will not persist through a database restart. The view provides the exact output from RMAN as it would appear on the screen or in a log file.
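If you need the output of an earlier session rather than the last one, recent RMAN sessions can first be listed from v$rman_status (a minimal sketch; v$rman_status is controlfile-based, so it survives a restart):
select session_recid, operation, status,
to_char(start_time,'DD.MM.YYYY HH24:MI:SS') start_time,
to_char(end_time,'DD.MM.YYYY HH24:MI:SS') end_time
from v$rman_status
order by session_recid desc;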
To query the alert.log:
select * from
(select record_id,
to_char(originating_timestamp,'DD.MM.YYYY HH24:MI:SS'),
message_text
from x$dbgalertext
order by RECORD_ID desc)
where rownum <= 20 order by RECORD_ID asc;
This will fail with "ORA-00942: table or view does not exist" if you are not logged in as SYS, and you cannot grant select access on the X$ table directly to another user; the grant fails with "ORA-02030: can only select from fixed tables/views". My workaround was to create a view and then grant select on the view:
create view alert_log as select * from X$DBGALERTEXT;
grant select on alert_log to meyerm;
Then as user meyerm:
select * from
(select record_id,
to_char(originating_timestamp,'DD.MM.YYYY HH24:MI:SS'),
message_text
from sys.alert_log
order by RECORD_ID desc)
where rownum <= 20 order by RECORD_ID asc;
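The same view is also handy for spotting recent errors. A hedged variation that filters the last 24 hours for ORA- messages (on later versions the documented V$DIAG_ALERT_EXT view exposes the same data, if it is available to you):
select to_char(originating_timestamp,'DD.MM.YYYY HH24:MI:SS') log_time,
message_text
from sys.alert_log
where originating_timestamp > sysdate - 1
and message_text like '%ORA-%'
order by originating_timestamp;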
Tuesday, October 30, 2018
Monday, October 8, 2018
Move datafile to another ASM diskgroup
From Oracle 12.1 you can execute SQL commands from within RMAN, which was great for this exercise as I did not have to jump between RMAN and SQL*Plus every few minutes.
The steps to move a datafile to another ASM diskgroup are basically: take the datafile offline, copy it to the new diskgroup (using RMAN commands), rename the datafile (or switch to the copy) and bring the datafile online again.
RMAN> alter database datafile '+DG_DATA/ac3duim/datafile/ggatetbs.362.905353545' offline;
RMAN> copy datafile '+DG_DATA/ac3duim/datafile/ggatetbs.362.905353545' to '+DG_C2DATA';
...
output file name=+DG_C2DATA/ac3duim/datafile/ggatetbs.275.923218515 tag=TAG20160922T093514 RECID=1 STAMP=923218516
...
RMAN> alter database rename file '+DG_DATA/ac3duim/datafile/ggatetbs.362.905353545' to '+DG_C2DATA/ac3duim/datafile/ggatetbs.275.923218515';
OR
RMAN> switch datafile '+DG_DATA/ac3duim/datafile/ggatetbs.362.905353545' to copy;
RMAN> recover datafile '+DG_C2DATA/ac3duim/datafile/ggatetbs.275.923218515';
RMAN> alter database datafile '+DG_C2DATA/ac3duim/datafile/ggatetbs.275.923218515' online;
The old datafile can now be deleted (also via RMAN):
RMAN> delete datafilecopy '+DG_DATA/ac3duim/datafile/ggatetbs.362.905353545';
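As a quick sanity check (a minimal sketch, assuming the copy landed under the OMF path shown in the output above), confirm the datafile's new location from v$datafile:
select file#, name, status
from v$datafile
where name like '+DG_C2DATA%';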
Wednesday, October 3, 2018
Data Guard Failover surprise
Ran into an interesting scenario after reinstating the previous primary DB following an emergency failover.
A Data Guard failover is done during emergencies (an actual DR situation) and the old standby gets activated as a Primary database with RESETLOGS after an incomplete recovery.
Here's the alert.log from the new Primary database:
Tue Oct 02 16:07:34 2018
Data Guard Broker: A failover will now be attempted
Tue Oct 02 16:07:34 2018
ALTER DATABASE FAILOVER TO TMSBY
...
Tue Oct 02 16:07:40 2018
ALTER DATABASE SWITCHOVER TO PRIMARY (TMSBY1)
RESETLOGS after incomplete recovery UNTIL CHANGE 14297961467966 time 10/02/2018 16:02:10
...
Tue Oct 02 16:07:40 2018
Setting recovery target incarnation to 3
Tue Oct 02 16:07:41 2018
Switchover: Complete - Database mounted as primary
A few hours later the old Primary DB (TMPRD) was fixed, started in MOUNT mode and reinstated via the Broker:
DGMGRL> show configuration
Configuration - tmdg
Protection Mode: MaxPerformance
Members:
TMSBY - Primary database
TMPRD - Physical standby database (disabled)
ORA-16661: the standby database needs to be reinstated
Fast-Start Failover: DISABLED
Configuration Status:
SUCCESS (status updated 26 seconds ago)
DGMGRL> reinstate database TMPRD
Reinstating database "TMPRD", please wait...
Reinstatement of database "TMPRD" succeeded
Reinstatement completed successfully, but the new standby (TMPRD) was not applying logs. The alert.log confirmed that it was waiting for an archivelog that was only created on the previous standby site:
Media Recovery Waiting for thread 2 sequence 358 branch(resetlogs_id) 987409798
But that archivelog had already been backed up and deleted via RMAN. A straightforward restore on TMSBY should have worked, but it failed with:
RMAN> restore archivelog sequence 358 thread 2;
Starting restore at 03-OCT-18 08:40
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=232 instance=TMSBY1 device type=DISK
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of restore command at 10/03/2018 08:40:15
RMAN-20242: specification does not match any archived log in the repository
Remember that the failover forced a RESETLOGS, so the incarnation of the DB changed. The trick is to restore from the previous incarnation:
RMAN> list incarnation;
List of Database Incarnations
DB Key Inc Key DB Name DB ID STATUS Reset SCN Reset Time
------- ------- -------- ---------------- --- ---------- ----------
1 1 TMPRD 645377540 PARENT 1594143 31-AUG-18 10:37
2 2 TMPRD 645377540 PARENT 5368114 21-SEP-18 08:29
3 3 TMPRD 645377540 CURRENT 14297961467967 02-OCT-18 16:07
RMAN> restore archivelog sequence 358 thread 2 Incarnation 2;
Starting restore at 03-OCT-18 08:44
using channel ORA_DISK_1
channel ORA_DISK_1: starting archived log restore to default destination
channel ORA_DISK_1: restoring archived log
archived log thread=2 sequence=358
channel ORA_DISK_1: reading from backup piece +DG_FRA/TMSBY/BACKUPSET/2018_10_02/annnf0_TMSBY1_arch_0.337.988488123
channel ORA_DISK_1: piece handle=+DG_FRA/TMSBY/BACKUPSET/2018_10_02/annnf0_TMSBY1_arch_0.337.988488123 tag=TMSBY1_ARCH
channel ORA_DISK_1: restored backup piece 1
channel ORA_DISK_1: restore complete, elapsed time: 00:00:15
Finished restore at 03-OCT-18 08:45
Within a few minutes the restored archivelog was picked up by the FAL process, transferred and applied on the new standby (TMPRD). Alert.log contents:
RFS[3]: Opened log for thread 2 sequence 358 dbid 645377540 branch 987409798
...
Archived Log entry 1627 added for thread 2 sequence 358 rlc 987409798 ID 0x0 dest 2:
...
Media Recovery Log +DG_ARCHIVE/TMPRD/ARCHIVELOG/2018_10_03/thread_2_seq_358.49304.988533935
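To confirm on the standby that the gap was closed, a hedged check against v$archived_log (thread and sequence as in this example):
select thread#, sequence#, applied
from v$archived_log
where thread# = 2
and sequence# = 358;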
All in all I was impressed with the smooth failover and reinstatement via the Broker, but even then, not everything can be automated and there is always some hiccup where the DBA has to roll up their sleeves and dig in to get things back to normal again.
Saturday, September 29, 2018
Connection Pooling for Scalability and Performance
Got a call from our web support team: it seems that a critical database transaction with a 2 second SLA is sometimes taking 6 seconds to complete. They had been investigating the problem for a day and concluded that it is a database problem; they thought there must be some type of short spike in the database that affects their transactions.
First step was to look at the health of the database, the load profile and wait events of ACTIVE sessions; nothing jumped out as a problem. I even did a session trace of one of the web service sessions in question and did not find any SQL query taking longer than a few milliseconds.
An AWR report showed an average of 50 logons per second over a 20-minute period. That set off the alarm bells, as database connections are expensive (just think of the overhead of establishing a network connection and then initializing a database session, which requires user authentication, authentication protocols, establishing transactional contexts, etc.).
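For a near real-time view of the logon rate you can also look at v$sysmetric. A minimal sketch; the GROUP_ID filter for the 60-second metric group is an assumption that may differ between versions:
select begin_time, end_time, round(value,1) logons_per_sec
from v$sysmetric
where metric_name = 'Logons Per Sec'
and group_id = 2;  -- assumed to be the 60-second metric group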
The web development team assured me that they were using connection pooling and that the number of sessions was far below the PROCESSES parameter, but when I queried two sources, the DBA_AUDIT_SESSION view and the listener log file, my suspicions were confirmed: hundreds of LOGONs and LOGOFFs per minute for one specific database username connecting from 4 userhosts/clients.
select os_username, username, userhost, action_name, count(*)
from DBA_AUDIT_SESSION
where timestamp > sysdate - 1/24
group by os_username, username, userhost, action_name
having count(*) > 100
order by count(*);
As I said, nothing stood out in my initial investigation and session count from V$SESSION, but now that I had the offending username and userhost I could look more closely at V$SESSION. Although the general count of sessions for this specific username stayed constant at about 50, the LOGON_TIME was never more than 2 seconds older than the current time! The database sessions had a very short life: logon, execute one or a few queries, logoff. Every time I queried V$SESSION I was looking at 50 different sessions from the ones a few seconds before.
select to_char(sysdate, 'yyyy-mm-dd hh24:mi:ss') right_now,
to_char(logon_time, 'yyyy-mm-dd hh24:mi:ss') logon_time
from V$SESSION
where username = 'XX'
and machine in ('HOST1','HOST2','HOST3','HOST4');  -- V$SESSION exposes the client host as MACHINE, matching the audit trail's USERHOST values
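A hedged follow-up that shows the churn in one number (same hypothetical username and hosts as above): if the pool were healthy, the oldest logon would be minutes or hours old, not seconds.
select count(*) session_count,
round((sysdate - min(logon_time)) * 24 * 60 * 60) oldest_logon_age_sec
from V$SESSION
where username = 'XX'
and machine in ('HOST1','HOST2','HOST3','HOST4');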
It turned out that we have multiple app and web teams using this database, and a totally separate team had added a transaction without using connection pooling! This affected all new database connections. The original team who brought the issue to my attention were using connection pooling, but every now and again even they had to create new connections, and it was the creation of these new connections against an already overloaded logon process that slowed their transactions down. After alerting the teams to my findings, they made the required changes, and within a few hours the original team gave the thumbs up and wanted to know what I did to make the database faster.
24 hours later I used this query against DBA_HIST_SYSMETRIC_SUMMARY to create a graph showing how the Logons Per Sec metric came down; the timing matched a graph from the app team that showed when all web transactions stabilized at less than 1 second again.
select to_char(s.end_interval_time,'mm-dd_hh24:mi') snap_time, sy.average
from DBA_HIST_SNAPSHOT s, DBA_HIST_SYSMETRIC_SUMMARY sy
where s.dbid = sy.dbid
and s.instance_number = sy.instance_number
and s.snap_id = sy.snap_id
and sy.metric_name = 'Logons Per Sec'
and s.end_interval_time > trunc(sysdate-2)
order by snap_time
/
Friday, September 21, 2018
Cloning a RAC DB on the same cluster
We are using OMF in ASM, so I will include the following 3 lines in the RMAN DUPLICATE TARGET command:
SET DB_FILE_NAME_CONVERT '+DG_DATA','+DG_DATA'
SET LOG_FILE_NAME_CONVERT '+DG_DATA','+DG_DATA'
SET CONTROL_FILES='+DG_DATA','+DG_DATA'
Do NOT use NOFILENAMECHECK when cloning to the same server; any error in your script and RMAN might overwrite the target DB's datafiles!
I also need to SET CLUSTER_DATABASE='FALSE' in the RMAN DUPLICATE TARGET command because I will clone the RAC to a single instance and then convert the single instance back to a RAC.
Prep includes creating a password file and an init.ora file with 3 lines:
db_name=TMTEST
db_unique_name=TMTEST
compatible=12.1.0.2.0
I added the 3rd line after getting "RMAN-06136: ORACLE error from auxiliary database: ORA-00201: control file version 12.1.0.2.0 incompatible with ORACLE version 12.0.0.0.0" on my first clone attempt.
Set ORACLE_SID to the new DB (TMTEST) and startup nomount. When launching RMAN, connect to the TARGET via a TNS entry:
rman target sys/XX@DLIOWCC auxiliary sys/XX
RMAN> DUPLICATE TARGET DATABASE TO TMTEST FROM ACTIVE DATABASE
SPFILE
SET CONTROL_FILES='+DG_DATA','+DG_DATA'
SET DB_FILE_NAME_CONVERT '+DG_DATA','+DG_DATA'
SET LOG_FILE_NAME_CONVERT '+DG_DATA','+DG_DATA'
SET CLUSTER_DATABASE='FALSE';
10 minutes later I had a cloned DB with ORACLE_SID=TMTEST.
Some post-cloning tasks in order to make it a RAC:
SQL> show parameter spfile
NAME TYPE VALUE
------- ------- ---------------------------------------------------
spfile string /u01/oracle/product/12.1.0/db_1/dbs/spfileTMTEST.ora
SQL> alter system set cluster_database=TRUE
scope=spfile sid='*';
SQL> create pfile='initTMTESTcloned.ora' from spfile;
vi initTMTESTcloned.ora
Found a few entries I had to change from:
DLIOWCC2.instance_number=2
DLIOWCC1.instance_number=1
DLIOWCC2.thread=2
DLIOWCC1.thread=1
DLIOWCC1.undo_tablespace='UNDOTBS1'
DLIOWCC2.undo_tablespace='UNDOTBS2'
To
TMTEST2.instance_number=2
TMTEST1.instance_number=1
TMTEST2.thread=2
TMTEST1.thread=1
TMTEST1.undo_tablespace='UNDOTBS1'
TMTEST2.undo_tablespace='UNDOTBS2'
SQL> create spfile='+DG_DATA/TMTEST/spfileTMTEST.ora'
from pfile='initTMTESTcloned.ora';
vi initTMTEST1.ora
add line spfile='+DG_DATA/TMTEST/spfileTMTEST.ora'
copy the init.ora and orapw file to node 2
Register the database and modify the configuration
srvctl add database -db TMTEST -oraclehome /u01/app/oracle/product/12.1.0/dbhome_1 -spfile +DG_DATA/TMTEST/spfileTMTEST.ora
srvctl add instance -db TMTEST -i TMTEST1 -node node1za
srvctl add instance -db TMTEST -i TMTEST2 -node node2za
srvctl config database -db TMTEST
Shutdown the instance (remember ORACLE_SID=TMTEST) and start the DB from srvctl:
srvctl start database -db TMTEST
Remember to change ORACLE_SID from TMTEST to TMTEST1 if you want to connect via sqlplus on DB node node1za.
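As a final sanity check that the conversion back to RAC worked (a minimal sketch), both instances should be visible and OPEN:
select inst_id, instance_name, status
from gv$instance;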
Monday, September 17, 2018
ORA-16047 - DGID mismatch between destination setting and target database
A few days after a planned DG switchover I got a call from the DBAs: OEM and v$dataguard_stats showed no lag, but archives had not shipped from the Primary DB to the Physical Standby DB since the switchover.
After confirming their suspicions by querying v$archived_log and v$dataguard_stats, I went to the 3 primary sources for DG troubleshooting:
1.) DGMGRL
The command "show configuration" indicated that the Standby database was disabled.
2.) Primary and Standby Alert.log files
The Primary DB had this peculiar error in its Alert.log file:
Fri Sep 14 13:55:31 2018
ORA-16047: DGID mismatch between destination setting and target database
Using the oerr utility I found the full description of the error:
Error: ORA-16047 DGID mismatch between destination setting and target database
---------------------------------------------------------------------------
Cause: The DB_UNIQUE_NAME specified for the destination did not match the
DB_UNIQUE_NAME at the target database.
Action: Make sure the DB_UNIQUE_NAME specified in the LOG_ARCHIVE_DEST_n
parameter matches the DB_UNIQUE_NAME parameter defined at the
destination.
But the DB_UNIQUE_NAME was correct in the Primary's LOG_ARCHIVE_DEST_2 parameter.
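A quick hedged check of the relevant settings on both Primary and Standby (the values are obviously site-specific):
select name, value
from v$parameter
where name in ('db_unique_name','log_archive_config','log_archive_dest_2');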
Next I looked at the Standby DB's alert.log file, especially around the timestamp of the ORA-16047 error in the Primary DB's alert.log. It seems the Broker was not happy and decided to deconfigure the LOG_ARCHIVE_CONFIG parameter!
Fri Sep 14 13:54:23 2018
ALTER SYSTEM SET log_archive_config='nodg_config' SCOPE=BOTH;
...
Completed: Data Guard Broker shutdown
Fri Sep 14 13:54:59 2018
ALTER SYSTEM SET dg_broker_start=FALSE SCOPE=BOTH;
ALTER SYSTEM SET dg_broker_start=TRUE SCOPE=BOTH;
3.) Primary and Standby DG Alert log files
Let's look at the DG alert.log files around the same timestamp.
The Standby DG Alert.log had hundreds of repeating ORA-16501 error messages until the timestamp in question when we got:
...
Failed to send message to site pc3dfmw2tc. Error code is ORA-16501.
...
09/14/2018 13:54:06
REMOVE CONFIGURATION [PRESERVE DESTINATIONS]
So I was correct: the Broker was not happy (as can be seen from the repeated ORA-16501 errors) and eventually deconfigured the LOG_ARCHIVE_CONFIG parameter!
Error: ORA-16501 the Data Guard broker operation failed
---------------------------------------------------------------------------
Cause: the Data Guard broker operation failed
Action: See accompanying messages for details.
Not a lot of info for the ORA-16501 error; it seems to be a generic error. Investigating further back in the Standby's DG alert.log file I found that the switchover had completed with errors. The DBA was satisfied that the switchover worked, but did not do due diligence in investigating the accompanying errors, and a few hours later DG stopped shipping archivelogs altogether.
The solution was to correct the log_archive_config setting to what it was before the switchover, then to re-enable the Standby database in the DG Broker configuration. That fixed the lag; archives are now transported and applied to the Standby DB, and the Apply Lag reported in v$dataguard_stats is correct again.
The Standby DB is now in sync and is usable in case of a DR, but further troubleshooting is necessary to determine why the Broker configuration was not 100% happy in the first place.
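For reference, a hedged sketch of the fix; prim_db and stby_db are hypothetical placeholders, not the real DB_UNIQUE_NAMEs in this configuration:
-- on the standby, restore the pre-switchover setting (hypothetical names)
alter system set log_archive_config='dg_config=(prim_db,stby_db)' scope=both;
Followed by ENABLE DATABASE 'stby_db' in DGMGRL to bring the disabled standby back into the Broker configuration.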