Wednesday, 28 January 2015

Physical Standby out of sync ?

So Physical Standby out of sync ?



When you are using Data guard, there are several scenarios when physical standby can go out of sync with the primary database.

Before doing anything to correct the problem, we need to verify that why standby is not in sync with the primary. In this particular article, we are covering the scenario where a log is missing from the standby but apart from the missing log, all logs are available.

Verify from v$archived_log that there is a gap in the sequence number. All the logs up to that gap should have APPLIED=YES and all the sequence# after the missing log sequence# are APPLIED=NO. This means that due to the missing log, MRP is not applying the logs on standby but the logs are still being transmitted to the standby and are available.

SQL> SELECT SEQUENCE#, APPLIED FROM V$ARCHIVED_LOG;

So for example, if the missing log sequence# is 400, then the above query should show that up to sequence#399, all have APPLIED=YES and starting from 401, all are APPLIED=NO.
There are few steps to be performed when the standby is not in sync with the primary because there is a gap of logs on standby.

These steps are:

STEP #1: Take an incremental backup of primary from the SCN where standby is lagging behind and apply on the standby server

STEP #2: If step#1 is not able to sync up, then re-create the controlfile of standby database from the primary

STEP #3: If after step#2, you still find that logs are not being applied on the standby, check the alert log and you may need to re-register the logs with the standby database.

*******************************************************************************************
STEP#1

1. On STANDBY database query the v$database view and record the current SCN of the standby database:

SQL> SELECT CURRENT_SCN FROM V$DATABASE;

CURRENT_SCN
-----------
1.3945E+10
SQL> SELECT to_char(CURRENT_SCN) FROM V$DATABASE;
TO_CHAR(CURRENT_SCN)
----------------------------------------
13945141914

2. Stop Redo Apply on the standby database:

SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL
*
ERROR at line 1:
ORA-16136: Managed Standby Recovery not active
If you see the above error, it means Managed Recovery is already off
You can also confirm from the view v$managed_standby to see if the MRP is running or not

SQL> SELECT PROCESS, STATUS FROM V$MANAGED_STANDBY;

3. Connect to the primary database as the RMAN target and create an incremental backup from the current SCN of the standby database that was recorded in step 1:

For example,

BACKUP INCREMENTAL FROM SCN 13945141914 DATABASE FORMAT '/tmp/ForStandby_%U' tag 'FOR STANDBY'

You can choose a location other than /tmp also.

4. Do a recovery of the standby database using the incremental backup of primary taken above:
On the Standby server, without connecting to recovery catalog, catalog the backupset of the incremental backup taken above. Before this, of course you need to copy the backup piece of the incremental backup taken above to a location accessible to standby server.

$ rman nocatalog target /

RMAN> CATALOG BACKUPPIECE '/dump/proddb/inc_bkup/ForStandby_1qjm8jn2_1_1';
Now in the same session, start the recovery

RMAN> RECOVER DATABASE NOREDO;

You should see something like:
Starting recover at 2008-09-17 04:59:57
allocated channel: ORA_DISK_1
channel ORA_DISK_1: sid=309 devtype=DISK
channel ORA_DISK_1: starting incremental datafile backupset restore
channel ORA_DISK_1: specifying datafile(s) to restore from backup set
....
..
..
.
channel ORA_DISK_1: reading from backup piece /dump/proddb/inc_bkup/ForStandby_1qjm8jn2_1_1
channel ORA_DISK_1: restored backup piece 1
piece handle=/dump/proddb/inc_bkup/ForStandby_1qjm8jn2_1_1 tag=FOR STANDBY
channel ORA_DISK_1: restore complete, elapsed time: 01:53:08
Finished recover at 2008-07-25 05:20:3

Delete the backup set from standby:

RMAN> DELETE BACKUP TAG 'FOR STANDBY';
using channel ORA_DISK_1
List of Backup Pieces

BP Key  BS Key  Pc# Cp# Status      Device Type Piece Name
------- ------- --- --- ----------- ----------- ----------
17713   17713   1   1   AVAILABLE   DISK        /dump/proddb/inc_bkup/ForStandby_1qjm8jn2_1_1
Do you really want to delete the above objects (enter YES or NO)? YES
deleted backup piece
backup piece handle=/dump/proddb/inc_bkup/ForStandby_1qjm8jn2_1_1 recid=17713 stamp=660972421
Deleted 1 objects

5. Try to start the managed recovery.

ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT FROM SESSION;
If you get an error here, you need to go to STEP#2 for bringing standby in sync.
If no error, then using the view v$managed_standby, verify that MRP process is started and has the status APPLYING_LOGS.

6. After this, check whether the logs are being applied on the standby or not:

SQL> SELECT SEQUENCE#, APPLIED FROM V$ARCHIVED_LOG;

After doing a recovery using the incremental backup, you will not see the sequence#'s which were visible earlier with APPLIED=NO because they have been absorbed as part of the incremental backup and applied on standby during recovery.

The APPLIED column starts showing YES for the logs which are being transmitted now, this means logs are being applied.

Check the status of MRP process in the view v$managed_standby. The status should be APPLYING_LOGS for the duration that available logs are being applied and once all available logs have been applied, the status should be WAITING_FOR_LOGS

7. Another check to verify that primary and standby are in sync. Run the following query on both standby and primary:

SQL> select max(sequence#) from v$log_history.
Output should be same on both databases.

*******************************************************************************************
STEP #2:

Since Managed recovery failed after applying the incremental backup, we need to recreate the controlfile of standby. The reason for recreating the controlfile is that the state of the database was same because the database_scn was not updated in the control file after applying the incremental backup while the scn for datafiles were updated. Consequently, the standby database was still looking for the old file to apply.

A good MOSC note for re-creating the controlfile in such a scenario is 734862.1.
Steps to recreate the standby controlfile and start the managed recovery on standby:

1. Take the backup of controlfile from primary
rman target sys/oracle@proddb catalog rman/cat@emrep
backup current controlfile for standby;

2. Copy the controlfile backup to the standby system (or if it is on the common NFS mount, no need to transfer or copy) and restore the controlfile onto the standby database
Shutdown all instances (If standby is RAC) of the standby.

sqlplus / as sysdba
shutdown immediate
exit
Startup nomount, one instance.
sqlplus / as sysdba
startup nomount
exit
Restore the standby control file.
rman nocatalog target /
restore standby controlfile from '/tmp/o1_mf_TAG20070220T151030_.bkp';
exit

3. Startup the standby with the new control file.

sqlplus / as sysdba
shutdown immediate
startup mount
exit

4.  Restart managed recovery in one instance (if standby is RAC) of the standby database:
sqlplus / as sysdba

ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT;

The above statement may succeed without errors but the MRP process will still not start. The reason is that since the controlfile has been restored from the primary, it is looking for datafiles at the same location as are in primary instead of standby. For example, if the primary datafiles are located at '+DATA/proddb_1/DATAFILE' and standby datafiles are at '+DATA/proddb_2/DATAFILE', the new controlfile will show the datafile?s location as '+DATA/proddb_1/DATAFILE'. This can be verified from the query "select name from v$datafile" on the standby instance. We need to rename all the datafiles to reflect the correct location.

There are two ways to rename the datafiles:

1. Without using RMAN

Change the parameter standby_file_management=manual in standby?s parameter file.
ALTER DATABASE RENAME FILE '+DATA/proddb_1/datafile/users.310.620229743' TO '+DATA/proddb_2/datafile/USERS.1216.648429765';

2. Using RMAN

rman nocatalog target /
Catalog the files, the string specified should refer to the diskgroup/filesystem destination of the standby data files.

RMAN> catalog start with '+diskgroup/<dbname>/datafile/';
e.g.:
RMAN> catalog start with '+DATA/proddb_2/datafile/';

This will give the user a list of files and ask if they should all be cataloged. The user should review and say YES if all the datafiles are properly listed.

Once that is done, then commit the changes to the controlfile

RMAN> switch database to copy;
Now start the managed recovery as:

ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT
and check for processes in the view v$managed_standby.  MRP process should be there. It will also start applying all the archived logs that were missing since last applied log.  This process might take hours.

5. Another check to verify that primary and standby are in sync:
Run the following query on both standby and primary after all logs in v$archived_log show APPLIED=YES:

SQL> select max(sequence#) from v$log_history.
Output should be same on both databases.

*****************************************************************************************
STEP #3

After recreating the controlfile, you still find that logs are being transmitted but not being applied on the standby. Check the alert log of standby. For example, see if you find something similar to below snippet:

Fetching gap sequence in thread 1, gap sequence 74069-74095
Wed Sep 17 06:45:47 2008
RFS[1]: Archived Log: '+DATA/ipwp_sac1/archivelog/2008_09_17/thread_1_seq_74093.259.665649929'
Wed Sep 17 06:45:55 2008
Fetching gap sequence in thread 1, gap sequence 74069-74092
Wed Sep 17 06:45:57 2008
RFS[1]: Archived Log: '+DATA/proddb_2/archivelog/2008_09_17/thread_1_seq_74094.258.665649947'
Wed Sep 17 06:46:16 2008
RFS[1]: Archived Log: '+DATA/proddb_2/archivelog/2008_09_17/thread_1_seq_74095.256.665649957'
Wed Sep 17 06:46:26 2008
FAL[client]: Failed to request gap sequence
 GAP - thread 1 sequence 74069-74092

The contents of alert log shows that logs sequence# from 74069 to 74092 may have been transmitted but not applied. The view v$archived_log shows the sequence# starting from 74093 and APPLIED=NO.

So this situation means that logs up to 74068 were applied as part of the incremental backup and from 74069 to 74093 have been transferred to standby server but they must have failed to register with standby database. Try the following steps: 

  1. Locate the log sequence# shown in alert log (for example 74069 to 74092). For example, +DATA/proddb_2/archivelog/2008_09_17/thread_1_seq_74069.995.665630861
  2. Register all these archived logs with the standby database.
alter database register logfile '+DATA/proddb_2/archivelog/2008_09_17/thread_1_seq_74069.995.665630861';
alter database register logfile '+DATA/proddb_2/archivelog/2008_09_17/thread_1_seq_74070.998.665631405';
alter database register logfile '+DATA/proddb_2/archivelog/2008_09_17/thread_1_seq_74071.792.665633755';
alter database register logfile '+DATA/proddb_2/archivelog/2008_09_17/thread_1_seq_74072.263.665633713';
??..
?.and so on till the last one.
  1. Now check the view v$archived_log and finally should see the logs being applied. The status of MRP should change from ARCHIVE_LOG_GAP to APPLYING_LOGS and eventually WAITING_FOR_LOGS.


*******************************************************************************************
Contributed by:
 
Name: Manpreet Singh
Current title : Solution Architecture (DBA)
Overall DBA Experience : 11+ yrs 
  

Key Skills: RMAN, DATAGUARD, RAC 11GR2 Installations and Upgrades, Oracle ASM , Oracle RAC node expansion

How to Tune Shared Pool Cache





How to Tune Shared Pool Cache ?


The one basic rule for tuning the shared pool is to
avoid or minimize parses of any kind, whether they be
soft parses or hard parses.


Parses can only be avoided by your developers by efficiently coding
the application.

As a DBA you can only minimize parses.

So, if you have lot have hard parses, Then you have to convert
them to soft one's.

And, if you have lot of soft parses then you have to minimize them.

This article focuses on tuning your shared pool using AWR or stats pack reports
along with some init.ora parameters.



Load profile

Load Profile
~~~~~~~~~~~~                            Per Second       Per Transaction
                                   ---------------       ---------------
   DB Time(s):         0.8                   2.5
      DB CPU(s):                   0.8                   2.5
                  Redo size:              26,727.5              85,874.4
              Logical reads:              14,003.8              44,993.8
              Block changes:                  53.2                 171.0
             Physical reads:                 111.3                 357.7
            Physical writes:                  11.2                  36.0
                 User calls:                   7.1                  22.7
                     Parses:                 820.8               2,637.1
                Hard parses:                 743.5               2,388.9
    W/A MB processed:   828,525.2      2,662,020.7
                     Logons:                   0.1                   0.3
                   Executes:               1,025.7               3,295.4
    Rollbacks:                   0.0                   0.0
               Transactions:                   0.3


The important values to check here are "Parses & Hard Parses".
So, out of total 90% of the parses are hard parses (90.582% to be precise).

Possible reason for that is maybe cursors are not being shared.

The general rule is that

In an OLTP system number of hard parses should be few.

and

In and DWH and DSS environments the percentage of hard parses is normally higher.



Instance Efficiency

Instance Efficiency Percentages (Target 100%)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            Buffer Nowait %:  100.00       Redo NoWait %:  100.00
            Buffer  Hit   %:   99.21    In-memory Sort %:  100.00
            Library Hit   %:   65.63        Soft Parse %:    9.41
         Execute to Parse %:   19.98         Latch Hit %:   99.98
Parse CPU to Parse Elapsd %:    0.01     % Non-Parse CPU:   22.67


Indicators which mean that there is problem with shared pool.

Library Hit% - Shows you in percentage the number of times a requested object
was found in the shared pool(ideal range (95 to 100)%).

Soft Parse % - Shows you in percentage the number of times a cursor was found
and reused (ideal range (95 to 100)%).  

A low percentage means that cursors are not being reused.

Execute to Parse%  - The execute to parse ratio should be very high in a ideal database (ideal range (95 to 100)%).

The execute to parse ratio is basically a measure between the number
Of times a sql is executed versus the number of times it is parsed.

The ratio will move higher as the number of executes go up, while
The number of parses either go down or remain the same.

The ratio will be close to zero if the number of executes and parses
Are almost equal.

The ratio will be negative executes are lower but the parses are higher.


Parse CPU to Parse Elasped% - Parse CPU means amount of CPU time used for parsing.
Parse Elapsed means amount of clock time used for parsing.

So, if for example if this percentage is high
say 87.88% then (1/.8788 = 1.13791534) that means for every 1 cpu second 1.13 seconds
of wall clock time has elapsed in order to do parses. 

Low percentage in this ratio may be an indicator of latch problems.

Ideal Range is between 95% and 100%.


Top 5 Timed Events

Top 5 Timed Events                                         Avg %Total
~~~~~~~~~~~~~~~~~~                                        wait   Call
Event                                 Waits    Time (s)   (ms)   Time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
DB CPU                                               67         98.21
db file sequential read               8,371           0      0   0.52 User I/O
latch: row cache objects                 16           0      8   0.19 Concurrency 
latch: shared pool                      956           0      0   0.15 Concurrency
log file sync                            25           0      2   0.06 Commit

          -------------------------------------------------------------

Both latch: row cach objects & latch: shared pool indicate that there is 
some issue with shared pool (latch contention).

So, according to the given slide  (0.19 + 0.15) 0.34% of the CPU time
is being eaten by latch contention. 

"Row cache objects" is a latch which is used to protect the access to
data dictionary cache in the SGA.

A high value for this may indicate the following things.

1) There is excessive use of data dictionary information.
2) There is lot of hard parsing.

General solution to this problem is to increase the size of shared pool.


Latch: shared pool

Shared pool latch is used to protect memory allocation.

shared pool latch contention may indicate the following:

1) There is excessive hard parsing because the application is using literals
   instead of bind variables.
2) Cursors are not being shared.


Time Model Statistics

Time related statistics presents the various operations which are consuming most of the database time.



Statistic Name                                       Time (s) % of DB Time
------------------------------------------ ------------------ ------------
DB CPU                                                    67.39      98.21
sql execute elasped time                                  61.13      89.09
parse time elasped                                        56.36      82.14
hard parse elasped time                                   49.99      72.86
PL/SQL execution elasped time                              4.30       6.26
PL/SQL compilation elasped time                            0.76       1.10
connection management call elasped time                    0.26       0.38
hard parse (sharing criteria) elasped time                 0.15       0.22
hard parse (bind mismatch) elasped time                    0.09       0.13
repeated bind elasped time                                 0.06       0.08
sequence load elasped time                                 0.03       0.04
DB time                                                   68.62    
background elasped time                                    1.50                  
background cpu time                                        0.12                     

          -------------------------------------------------------------


According to oracle docs

Parse time elapsed - Amount of elapsed time spent parsing SQL statements. It includes both soft and 

hard parse time.

Hard parse elapsed time - Amount of elapsed time spent hard parsing SQL statements.

 
So, according to the slide ((72.86*100)/82.14) 88.70% of DB time is spent on hard parses
which is not good.



Library Cache Activity.

Namespace     Get Requests   Pct Miss  Pin Requests Pct Miss  Reloads  Invalidations
----------    ------------   --------  ------------ --------  -------  -------------
BODY                     26      30.77           354     6.50       15              0  
CLUSTER                  576       1.04           322     1.86        0              0                  
INDEX                 35,320       0.07        53,320     0.07        0              0                  
SQL AREA             130,580      97.43       225,838    85.78      705            479                  

TABLE/PROCEDURE      405,501       0.35       313,834     1.31      908              0                 
TRIGGER                   90      15.56           100    14.00        0              0          


All the compiled cursors are stored in SQL Area.

High Pct Miss can indicate the following things.

1) Cursors ar not being used.
2) Cursors are being aged out or reloaded frequently because of low memory.

The number of reloads should not be more than 1% of the number of pins.

In our case it is ((705*100)/225838) 0.31%.

The reloads to pin ration can be in excess of 1% in the following scenarios.

1) Shared parsed areas have been aged-out because of lack of space. Main solution
   to this problem is to increase the shared pool size.

2) Shared parsed areas are invalidated. Possible solution for this is to do 
   house keeping service like index creation & gathering statistics when there is
   low database activity.


So, in order to tune shared pool we have to avoid hard parses.

There are three main methods to do it.

1) Make sure your shared pool is sufficiently sized.
2) Tell your developers to use bind variables instead of literals.
3) The third method is to use cursor_sharing only if the application
   has already been developed and does not use bind variables.
   This method should be used as a last resort.



Sharing your Cursors.

Cursors can be shared by using the init parameter "cursor_sharing".

It can be set to three values.

1) EXACT(default) - cursors can be shared only when the sql statements text are exactly similar.

2) Similar(Deprecated in 11g) - cursors are shared when sql statements are identical.
   The execution plan can be different depending on the literal value.

3) FORCE - SQL statements that are similar will share cursors and their is going
   to be only one execution plan for all the sql statements. 

Recommendations

1) cursor_sharing should be set to "EXACT" for DSS environments if complex
   queries are being used.
2) cursor_sharing should be set to "FORCE" for OLTP environments if bind
   variables are not being used.



Adaptive Cursor sharing (11g)

It is always recommended to use bind variables while developing applications
in order to gain performance benefits.

But sometimes instead of improving, The performance is degraded because of a combination of 
bind peeking and skewness of data in certain columns.

In order to solve this problem oracle 11g has come up with a new feature
called "adaptive cursor sharing".

For this to work properly histograms should be created on columns where the
skewness of data is high. They are collected automatically in 11g.

Adaptive cursor sharing works by observing the execution plans of statements across
various executions. If it finds any suboptimal plans, then it allows the bind variables
to use different execution plans for the same statement in order to improve performance.

Minimizing overhead of soft parses.

SESSION_CACHED_CURSORS

The overhead on soft parses and aging out of cursors can be avoided by using the parameter "session_cached_cursors".

This works by keeping a cache of the closed cursors in the session memory.
So when a query is executed, The session searches the session cache first and if
the cursor is found, the soft parse is avoided.

NOTE: Even if the cursor is found in the session cache, It will still require 
validation. It does that by validating that the opened cursor points to the right sql statement
in the shared pool. So, in essence its more like a "softer" soft parse. 



SHARED_POOL_RESERVED_SIZE

The SHARED_POOL_RESERVED_SIZE parameter specifies the shared pool space that is reserved for large contiguous 
requests for shared pool memory. This static parameter should be set high enough to avoid performance 
degradation in the shared pool from situations where pool fragmentation forces Oracle to search for 
free chunks of unused pool to satisfy the current request.

Ideally, this parameter should be large enough to satisfy any request scanning for memory on the reserved list 
without flushing objects from the shared pool.

The default value is 5% of the shared pool size, while the maximum value is 50% of the shared pool size.
For interMedia applications, a value at or close to the maximum can provide performance benefits.

Recommended for databases with large chunks of data or LOB objects(audio, video, images).



SQL query result cache (11g).

SQL query result cache enables the database to store the query result sets in the shared pool.

In brief it stores the result of a query for future reuse.

This cache can be used by multiple statements from the same session as well as 
by multiple sessions.

For example if the first session executes, after fething the data from the database it caches the result 
in the SQL query result.

So, if any other instance executes the exact query, the query result will be fetched
from the SQL query result caches instead of the datafiles.


SQL query result cache in RAC.

Each RAC node has it's own private SQL query result cache.
One node cannot share the SQL query result cache of another node.


Initialization parameters.

RESULT_CACHE_MODE

MANUAL: You have to specify the RESULT_CACHE hint in the SQL statement in
order to store result in cache i.e "select /*+ RESULT_CACHE */ dept_id, emp_id, sal from employees;".


FORCE: All the result sets are stored in the cache by default.
if with setting you do not want a particular query to use this feature then
mention the NO_RESULT_CACHE hint i.e "select /*+ NO_RESULT_CACHE */ dept_id, emp_id, sal from employees".


RESULT_CACHE_MAX_SIZE

This parameter defines the memory allocation of the result cache.

Value of "0" means it is disabled.
Maxlimit is 75% of shared pool.

Default value is derived as following.

0.25% of memory_target or
0.5% of sga_target or
1% of shared_pool_size.


RESULT_CACHE_MAX_RESULT

Defines the max memory allocation for a single result.
Default is 5% of RESULT_CACHE_MAX_SIZE 


DBMS_RESULT_CACHE package.

To check the status of the cache.

SQL> select dbms_result_cache.status from dual;

STATUS
--------------------------------------------------------------------------------
ENABLED


To check the cache memory usage.

SQL> set serveroutput on

SQL> exec dbms_result_cache.memory_report(detailed => true);

R e s u l t   C a c h e   M e m o r y   R e p o r t
[Parameters]
Block Size          = 1K bytes
Maximum Cache Size  = 672K bytes (672 blocks)
Maximum Result Size = 33K bytes (33 blocks)
[Memory]
Total Memory = 107836 bytes [0.135% of the Shared Pool]
... Fixed Memory = 9440 bytes [0.012% of the Shared Pool]
....... Memory Mgr = 124 bytes
....... Bloom Fltr = 2K bytes
....... Cache Mgr  = 4416 bytes
....... State Objs = 2852 bytes
... Dynamic Memory = 98396 bytes [0.123% of the Shared Pool]
....... Overhead = 65628 bytes
........... Hash Table    = 32K bytes (4K buckets)
........... Chunk Ptrs    = 12K bytes (3K slots)
........... Chunk Maps    = 12K bytes
........... Miscellaneous = 8284 bytes
....... Cache Memory = 32K bytes (32 blocks)
........... Unused Memory = 23 blocks
........... Used Memory = 9 blocks
............... Dependencies = 3 blocks (3 count)
............... Results = 6 blocks
................... SQL     = 6 blocks (6 count)

PL/SQL procedure successfully completed.


To remove the contents from result cache.

SQL> execute dbms_result_cache.flush;

PL/SQL procedure successfully completed.

NOTE: The contents from the query result cache will only be removed
if the cache is not in use. To purge successfully set "RESULT_CACHE_MAX_SIZE=0".

In order to reactivate the previous parameter you will have to restart
the instance after giving it a non zero value.

Things to consider for Query Result Cache.

Result cache is disabled for queries containing:

1) Temporary or dictionary tables
2) Nondeterministic PL/SQL functions.
3) CURRVAL and NEXTVAL(sequences).
4) SQL functions i.e current_date,sysdate etc.


Flashback queries can also be cached.


Result cache grows until it reaches its maximum size.


Objects or results in result cache are aged out based on the LRU algorithm.


References:

Oracle documentation.
Tuning shared pool cache

Steps to Analyze AWR Report in Oracle



Recommendations before getting an AWR Report.


1. Collect Multiple AWR Reports: It's always good to have two AWR Reports, one for good time (when database was performing well), second when performance is poor. This way Remote DBA can easily compare good and bad report to find out the culprit.

2. Stick to Particular Time: "Database is performing slow" will not help anymore to resolve performace issues. We have to have a specific time like Database was slow yesterday at 1 Pm and continue till 4Pm. Here, DBA will get a report for these three hours.

3. Split Large AWR Report into Smaller Reports: Instead of having one report for long time like one report for 4hrs. it's is better to have four reports each for one hour. This will help to isolate the problem.

In case of RAC env. generate one report for each instance. Once, you have generated AWR report. Now, it's time of analyze the report. Since, AWR report is a huge report and area to look into AWR is also depends on problem to problem. Here, I am list most common area for a DBA to look into which will give a clear picture of the issue.

Steps to Analyze AWR Report

1. Database Details:

After getting an AWR Report This is first and Top part of the report. In this part cross check for database and instance and and database version with the Database having performance issue.This report also show RAC=YES if it's an RAC database.

 

2. Host Configuration:

This will give you name, platform CUP, socket and RAM etc. Important thing to notice is number of cores into the system. In this example there are 12 CUP's in Cores.



3. Snap Shot Detail:

This are the detail about snap shot taken, Snap start time and end time. Difference between them is as "Elapsed". Here is a new term "DB Time"

DB Time= session time spent in database.

DB Time= CPU Time + Non IDLE wait time.

You can find, DB time is very large as compared to Elapse time, which is not a concern. Check if you have taken a report for the time having performance problem. If yes fine, other wise take a report for performance problem time.

Next is Cache Sizes, which is just detail about SGA components.

4. Load Profile:

Here are few important stats for a DBA to look into. Fist is "DB CPU(s)" per second. Before that let's understand how DB CUP's work. Suppose you have 12 cores into the system. So, per wall clock second you have 12 seconds to work on CPU.


 So, if "DB CPU(s)" per second in this report > cores in (Host Configuration (#2)).

means env is CPU bound and either need more CPU's or need to further check is this happening all the time or just for a fraction of time. As per my experience there are very few cases, when system is CPU bound.

In this case, machine has 12 cores and DB CPU(s) per second is 6.8. So, this is not a CPU bound case.

Next stat to look at are Parses and Hard parses. If the ratio of hard parse to parse is high, this means Database is performing more hard parse. So, needs to look at parameters like cursor_sharing and application level for bind variables etc.

5. Instance Efficiency Percentages:


In these statistics, you have to look at "% Non-Parse CPU". If this value is near 100% means most of the CPU resources are used into operations other than parsing, which is good for database health.


6. Top 5 Timed Foreground Events:

This is another most important stats to consider while looking at AWR Report for any database performance related issue. This has a list of top 5 foreground wait events.


Here, first of all check for wait class if wait class is  User I/O , System I/O,  Others etc this could be fine but if wait class has value "Concurrency" then there could be some serious problem. Next to look at is Time (s) which show how many times DB was waiting in this class and then Avg Wait (ms). If Time(s) are high but  Avg Wait (ms) is low then you can ignore this. If both are high or Avg Wait (ms) is high then this has to further investigate.

In the above screen shot, most of the resource are taken by DB CPU = 64% DB time. Taking resource by DB CUP is a normal situation.

Let's take an example,  In which event is "log file switch (checkpoint incomplete) " which has high waits, huge Time (s) and large values in Avg Wait (ms) and wait class is configuration. So, here you have to investigate and resolve log file switch (checkpoint incomplete).

Host CPU, Instance CPU and Memory Statistics are self explanatory.  Next is RAC Statistics, I did not find any issue in these stats most of the time.

7. Time Model Statistics:

This is a detailed explanations of system resource consumptions. Stats are order by Time (s) and % of DB Time.
 
A noticeable result Sum of all  % of DB time is > 100%. why is this ?

Because this is cumulative time i.e. In this case SQL execute elapsed time is taking 89% of DB time, which includes it sub parts like parse time elapsed, hard parse elapsed time etc. So, if you find Hard parse time elapsed is taking more %. So investigate further so on and so forth.

DBA has to look for stat which is taking abnormal % of DB time. 

8. Operating System Statistics - Detail:

This is the information related to OS, what is the load status on System shown here.


This report shows, system is 62 and 70% idle at time of report taken, So, there is no resource crunch at system level. But if, you found very high busy, user or sys % and indeed this will led to low idle %. Investigate what is causing this. OS Watcher is the tool which can help in this direction.

Next, very crucial part of AWR report for a DBA is SQL Statistics. Which has all sql query details executed during report time interval.

We will explore few of them, To understand, how to analyzed these reports. Let's start with

9. SQL Ordered by Elapsed Time:

As explained by name itself, this lists SQL queries ordered by Elapsed time into reported time interval.


In this report, look for query has low executions and high Elapsed time per Exec (s) and this query could be a candidate for troubleshooting or optimizations. In above report, you can see first query has maximum Elapsed time but no execution. So you have to investigate this.

In Important point, if executions is 0, it doesn't means query is not executing, this might be the case when query was still executing and you took AWR report. That's why query completion was not covered in Report.

10. SQL Ordered by CUP Time:

In this report, SQL queries are listed on the basis of CPU taken by the query i.e. queries causing high load on the system. The top few queries could be the candidate query for optimization.








From above stat, look for queries using highest CPU Times, If a query shows executions 0, this doesn't means query is not executing. It might be same case as in SQL queries ordered by Elapsed time. The query is still executing and you have taken the snapshot.

However, There are so many other stats in AWR Report which a DBA needs to consider, I have listed only ten of them but these are the most commonly used stats for any performance related information.

Please share you view about this article, Does it helps you to understand, How to analyze AWR Report.

Friday, 23 January 2015

Scanning FC/SAN LUNS in Solaris





Before scanning LUNS, we see some useful commands which are related to Fibre Channel(FC).





1.List the connected HBA’s.
root@Unixarena-SOL11:~# luxadm -e port |grep CONNECTED
/devices/pci@1d,700000/SUNW,qlc@1,1/fp@0,0:devctl CONNECTED
/devices/pci@1d,700000/SUNW,qlc@3/fp@0,0:devctl CONNECTED




2.Verify FC channel ‘s are connected and configured.
Unix@sol# cfgadm -al -o show_FCP_dev |grep fc-fabric
c2                             fc-fabric    connected    configured   unknown
c4                             fc-fabric    connected    configured   unknown




3.To find the HBA’s World wide Node number (WWN),use fcinfo command,
Unix@sol# fcinfo hba-port |grep Port
HBA Port WWN: 10000000c884bb48
HBA Port WWN: 10000000c884bb49
HBA Port WWN: 10000000c884b85c
HBA Port WWN: 10000000c884b85d




4.You can also find the WWN form luxadm command, if  HBA is  already connected to FC switch.
Unix@sol# luxadm -e dump_map /dev/cfg/c4
Pos  Port_ID Hard_Addr Port WWN         Node WWN         Type
0    29900   0        50080e8008cfb814 50080e8008cfb814 0x0  (Disk device)
1    27400   0        10000000c884b85c 20000000c884b85c 0x1f (Unknown Type,Host Bus Adapter)
Unix@sol#
From the above output, the last line shows the HBA information.In the same you can find the other controller information as well.

5.Zoning can be verified using the below command.
Unix@sol# cfgadm -al -o show_FCP_dev c2 c4 

6.If you see any controller port “WWN” showing as “unconfigured” ,then  you can initiate FC session using below mentioned command.
Unix@sol# cfgadm -c configure c2::50080e8008cfb814
Unix@sol# cfgadm -c configure c4::50080e8008cfb814

Let’s see how we can re-scan the SAN/FC luns on solaris 10/11 hosts.
To scan new FC luns, Just execute the below commands
table.tableizer-table { border: 1px solid #CCC; font-family: Arial, Helvetica, sans-serif font-size: 12px; } .tableizer-table td { padding: 4px; margin: 3px; border: 1px solid #ccc; } .tableizer-table th { background-color: #104E8B; color: #FFF; font-weight: bold; }


Scanning FC/SAN LUNS
1cfgadm -al                      To scan FC luns
2devfsadm -c disk                 To make sure all the device files are created
3tail /var/adm/messages          To see the new LUN’s information
4echo |format                    To get the new LUN’s information
5ls -lrt /dev/rdsk |grep s2|tailTo get the new LUN’s information
“luxadm probe” also used to scan FC luns but i am happy to use cfgadm.

If you still not able to see the new LUNS/DISK ,then you can try to reset the HBA as last option but its very dangerous.Sometimes system loose SAN paths to that specific HBA. 

1.List the connected HBA.

root@Unixarena-SOL11:~# luxadm -e port |grep CONNECTED
/devices/pci@1d,700000/SUNW,qlc@1,1/fp@0,0:devctl CONNECTED
/devices/pci@1d,700000/SUNW,qlc@3/fp@0,0:devctl CONNECTED

2.Reset the HBA using forcelip option.

root@Unixarena-SOL11:~# luxadm -e forcelip /devices/pci@1d,700000/SUNW,qlc@1,1/fp@0,0:devctl 
Forcelip can be issued to the controller names as well.
Unix@sol# cfgadm -al -o show_FCP_dev |grep fc-fabric
c2                             fc-fabric    connected    configured   unknown
c4                             fc-fabric    connected    configured   unknown
Unix@sol#
Unix@sol# luxadm -e forcelip /dev/cfg/c2


3.Verify the controller status using “cfgadm -al”.Make sure disks didn;t loose any SAN paths after the HBA reset. If everything seems to be okay ,then isssue the forcelip to the another controller.


If still you are not able to see the new FC/SAN LUNS ,then reboot the server and try.
Once you see the new lun, make sure that  multiple FC paths are enabled to that.
Minimum two FC paths required for SAN disks.
” luxadm display /dev/rdsk/c1txxxxxxd0s2″  – To verify the FC lun details and multipathing.

If you are using veritas volume manager,then you can verify the multipathing using vxdmp commands.

Hope this post is informative for you .Please leave a comment if you have any doubt. I will get back to you.