Wednesday, 28 January 2015

Physical Standby out of sync ?

So Physical Standby out of sync ?



When you are using Data guard, there are several scenarios when physical standby can go out of sync with the primary database.

Before doing anything to correct the problem, we need to verify that why standby is not in sync with the primary. In this particular article, we are covering the scenario where a log is missing from the standby but apart from the missing log, all logs are available.

Verify from v$archived_log that there is a gap in the sequence number. All the logs up to that gap should have APPLIED=YES and all the sequence# after the missing log sequence# are APPLIED=NO. This means that due to the missing log, MRP is not applying the logs on standby but the logs are still being transmitted to the standby and are available.

SQL> SELECT SEQUENCE#, APPLIED FROM V$ARCHIVED_LOG;

So for example, if the missing log sequence# is 400, then the above query should show that up to sequence#399, all have APPLIED=YES and starting from 401, all are APPLIED=NO.
There are few steps to be performed when the standby is not in sync with the primary because there is a gap of logs on standby.

These steps are:

STEP #1: Take an incremental backup of primary from the SCN where standby is lagging behind and apply on the standby server

STEP #2: If step#1 is not able to sync up, then re-create the controlfile of standby database from the primary

STEP #3: If after step#2, you still find that logs are not being applied on the standby, check the alert log and you may need to re-register the logs with the standby database.

*******************************************************************************************
STEP#1

1. On STANDBY database query the v$database view and record the current SCN of the standby database:

SQL> SELECT CURRENT_SCN FROM V$DATABASE;

CURRENT_SCN
-----------
1.3945E+10
SQL> SELECT to_char(CURRENT_SCN) FROM V$DATABASE;
TO_CHAR(CURRENT_SCN)
----------------------------------------
13945141914

2. Stop Redo Apply on the standby database:

SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL
*
ERROR at line 1:
ORA-16136: Managed Standby Recovery not active
If you see the above error, it means Managed Recovery is already off
You can also confirm from the view v$managed_standby to see if the MRP is running or not

SQL> SELECT PROCESS, STATUS FROM V$MANAGED_STANDBY;

3. Connect to the primary database as the RMAN target and create an incremental backup from the current SCN of the standby database that was recorded in step 1:

For example,

BACKUP INCREMENTAL FROM SCN 13945141914 DATABASE FORMAT '/tmp/ForStandby_%U' tag 'FOR STANDBY'

You can choose a location other than /tmp also.

4. Do a recovery of the standby database using the incremental backup of primary taken above:
On the Standby server, without connecting to recovery catalog, catalog the backupset of the incremental backup taken above. Before this, of course you need to copy the backup piece of the incremental backup taken above to a location accessible to standby server.

$ rman nocatalog target /

RMAN> CATALOG BACKUPPIECE '/dump/proddb/inc_bkup/ForStandby_1qjm8jn2_1_1';
Now in the same session, start the recovery

RMAN> RECOVER DATABASE NOREDO;

You should see something like:
Starting recover at 2008-09-17 04:59:57
allocated channel: ORA_DISK_1
channel ORA_DISK_1: sid=309 devtype=DISK
channel ORA_DISK_1: starting incremental datafile backupset restore
channel ORA_DISK_1: specifying datafile(s) to restore from backup set
....
..
..
.
channel ORA_DISK_1: reading from backup piece /dump/proddb/inc_bkup/ForStandby_1qjm8jn2_1_1
channel ORA_DISK_1: restored backup piece 1
piece handle=/dump/proddb/inc_bkup/ForStandby_1qjm8jn2_1_1 tag=FOR STANDBY
channel ORA_DISK_1: restore complete, elapsed time: 01:53:08
Finished recover at 2008-07-25 05:20:3

Delete the backup set from standby:

RMAN> DELETE BACKUP TAG 'FOR STANDBY';
using channel ORA_DISK_1
List of Backup Pieces

BP Key  BS Key  Pc# Cp# Status      Device Type Piece Name
------- ------- --- --- ----------- ----------- ----------
17713   17713   1   1   AVAILABLE   DISK        /dump/proddb/inc_bkup/ForStandby_1qjm8jn2_1_1
Do you really want to delete the above objects (enter YES or NO)? YES
deleted backup piece
backup piece handle=/dump/proddb/inc_bkup/ForStandby_1qjm8jn2_1_1 recid=17713 stamp=660972421
Deleted 1 objects

5. Try to start the managed recovery.

ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT FROM SESSION;
If you get an error here, you need to go to STEP#2 for bringing standby in sync.
If no error, then using the view v$managed_standby, verify that MRP process is started and has the status APPLYING_LOGS.

6. After this, check whether the logs are being applied on the standby or not:

SQL> SELECT SEQUENCE#, APPLIED FROM V$ARCHIVED_LOG;

After doing a recovery using the incremental backup, you will not see the sequence#'s which were visible earlier with APPLIED=NO because they have been absorbed as part of the incremental backup and applied on standby during recovery.

The APPLIED column starts showing YES for the logs which are being transmitted now, this means logs are being applied.

Check the status of MRP process in the view v$managed_standby. The status should be APPLYING_LOGS for the duration that available logs are being applied and once all available logs have been applied, the status should be WAITING_FOR_LOGS

7. Another check to verify that primary and standby are in sync. Run the following query on both standby and primary:

SQL> select max(sequence#) from v$log_history.
Output should be same on both databases.

*******************************************************************************************
STEP #2:

Since Managed recovery failed after applying the incremental backup, we need to recreate the controlfile of standby. The reason for recreating the controlfile is that the state of the database was same because the database_scn was not updated in the control file after applying the incremental backup while the scn for datafiles were updated. Consequently, the standby database was still looking for the old file to apply.

A good MOSC note for re-creating the controlfile in such a scenario is 734862.1.
Steps to recreate the standby controlfile and start the managed recovery on standby:

1. Take the backup of controlfile from primary
rman target sys/oracle@proddb catalog rman/cat@emrep
backup current controlfile for standby;

2. Copy the controlfile backup to the standby system (or if it is on the common NFS mount, no need to transfer or copy) and restore the controlfile onto the standby database
Shutdown all instances (If standby is RAC) of the standby.

sqlplus / as sysdba
shutdown immediate
exit
Startup nomount, one instance.
sqlplus / as sysdba
startup nomount
exit
Restore the standby control file.
rman nocatalog target /
restore standby controlfile from '/tmp/o1_mf_TAG20070220T151030_.bkp';
exit

3. Startup the standby with the new control file.

sqlplus / as sysdba
shutdown immediate
startup mount
exit

4.  Restart managed recovery in one instance (if standby is RAC) of the standby database:
sqlplus / as sysdba

ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT;

The above statement may succeed without errors but the MRP process will still not start. The reason is that since the controlfile has been restored from the primary, it is looking for datafiles at the same location as are in primary instead of standby. For example, if the primary datafiles are located at '+DATA/proddb_1/DATAFILE' and standby datafiles are at '+DATA/proddb_2/DATAFILE', the new controlfile will show the datafile?s location as '+DATA/proddb_1/DATAFILE'. This can be verified from the query "select name from v$datafile" on the standby instance. We need to rename all the datafiles to reflect the correct location.

There are two ways to rename the datafiles:

1. Without using RMAN

Change the parameter standby_file_management=manual in standby?s parameter file.
ALTER DATABASE RENAME FILE '+DATA/proddb_1/datafile/users.310.620229743' TO '+DATA/proddb_2/datafile/USERS.1216.648429765';

2. Using RMAN

rman nocatalog target /
Catalog the files, the string specified should refer to the diskgroup/filesystem destination of the standby data files.

RMAN> catalog start with '+diskgroup/<dbname>/datafile/';
e.g.:
RMAN> catalog start with '+DATA/proddb_2/datafile/';

This will give the user a list of files and ask if they should all be cataloged. The user should review and say YES if all the datafiles are properly listed.

Once that is done, then commit the changes to the controlfile

RMAN> switch database to copy;
Now start the managed recovery as:

ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT
and check for processes in the view v$managed_standby.  MRP process should be there. It will also start applying all the archived logs that were missing since last applied log.  This process might take hours.

5. Another check to verify that primary and standby are in sync:
Run the following query on both standby and primary after all logs in v$archived_log show APPLIED=YES:

SQL> select max(sequence#) from v$log_history.
Output should be same on both databases.

*****************************************************************************************
STEP #3

After recreating the controlfile, you still find that logs are being transmitted but not being applied on the standby. Check the alert log of standby. For example, see if you find something similar to below snippet:

Fetching gap sequence in thread 1, gap sequence 74069-74095
Wed Sep 17 06:45:47 2008
RFS[1]: Archived Log: '+DATA/ipwp_sac1/archivelog/2008_09_17/thread_1_seq_74093.259.665649929'
Wed Sep 17 06:45:55 2008
Fetching gap sequence in thread 1, gap sequence 74069-74092
Wed Sep 17 06:45:57 2008
RFS[1]: Archived Log: '+DATA/proddb_2/archivelog/2008_09_17/thread_1_seq_74094.258.665649947'
Wed Sep 17 06:46:16 2008
RFS[1]: Archived Log: '+DATA/proddb_2/archivelog/2008_09_17/thread_1_seq_74095.256.665649957'
Wed Sep 17 06:46:26 2008
FAL[client]: Failed to request gap sequence
 GAP - thread 1 sequence 74069-74092

The contents of alert log shows that logs sequence# from 74069 to 74092 may have been transmitted but not applied. The view v$archived_log shows the sequence# starting from 74093 and APPLIED=NO.

So this situation means that logs up to 74068 were applied as part of the incremental backup and from 74069 to 74093 have been transferred to standby server but they must have failed to register with standby database. Try the following steps: 

  1. Locate the log sequence# shown in alert log (for example 74069 to 74092). For example, +DATA/proddb_2/archivelog/2008_09_17/thread_1_seq_74069.995.665630861
  2. Register all these archived logs with the standby database.
alter database register logfile '+DATA/proddb_2/archivelog/2008_09_17/thread_1_seq_74069.995.665630861';
alter database register logfile '+DATA/proddb_2/archivelog/2008_09_17/thread_1_seq_74070.998.665631405';
alter database register logfile '+DATA/proddb_2/archivelog/2008_09_17/thread_1_seq_74071.792.665633755';
alter database register logfile '+DATA/proddb_2/archivelog/2008_09_17/thread_1_seq_74072.263.665633713';
??..
?.and so on till the last one.
  1. Now check the view v$archived_log and finally should see the logs being applied. The status of MRP should change from ARCHIVE_LOG_GAP to APPLYING_LOGS and eventually WAITING_FOR_LOGS.


*******************************************************************************************
Contributed by:
 
Name: Manpreet Singh
Current title : Solution Architecture (DBA)
Overall DBA Experience : 11+ yrs 
  

Key Skills: RMAN, DATAGUARD, RAC 11GR2 Installations and Upgrades, Oracle ASM , Oracle RAC node expansion

No comments:

Post a Comment