| |
| | Oracle Database 10g wait interface | Oracle 10g Wait Interface Contents Introduction Oracle Wait Interface Concepts Investigating Common I/O Wait events Appendix A Introduction The purpose of this document is to understand oracle data dictionary internals related to performance tuning statistics in order to efficiently focus on performing root cause analysis, thereby solving most of the Database performance issues and bottlenecks. I have managed several high-end OLTP databases and felt that there was a need to have a sort of key guidelines for DBAs to start looking into Oracle dictionary tables effectively. Most of the DBAs I have come across mainly rely on scripts that are available at different web sites for capturing oracle performance statistics. While these scripts are handy, yet not providing the DBAs a complete picture of the problem in hand. A typical scenario for handling major performance issues revolves around the following lines. - Running several scripts against the troublesome database which makes things even worse.
-
At last the DBA resorts to opening a Service Request with Oracle which starts up the typical path of providing N number of script outputs, RDA reports etc. Certainly Oracle Engineers will do their best to identify the problem(s), in most cases I have observed that the two end up applying or making the production database up-to date with the latest patches and somehow the problem is solved or at least delayed until it appears again some time later and then the same cycle would start again. At the end it was identified as an Oracle BUG and that’s it!. No one took the time to really understand the actual cause of the problem. -
At times when a DBA experiences severe production issues (relating to performance), he would be asked to provide system dumps, trace dumps, error stacks etc. The irony of the situation is that most of the times a DBA is unable to execute these scripts because doing so will render the database completely unavailable to the end users. Previously it was performing with say 50% less power but running these scripts make situations even worse. This was the case with me for one database that had 1200 connections and performing trace dumps literally choked the database as it had to dump the trace information for all sessions. On the other hand Oracle support would not be able to provide any assistance if you did not provide them with all of the information. -
Most of the critical performance issues are not re-producible as they depend on several factors related to application load. If you can reproduce the problem at will, that means you have basically nailed down the problem. In earlier case, a DBA will never be able to predict when the problem happens again and he can only rely on Performance stats or AWR reports, sort of finding database health checks at the problematic time. Interpreting Statspack or AWR reports is not an easy task. You can not just act on one section of these reports and perform changes to your production database. This would be like closing one hole and creating pressure on other holes to burst. You got to understand the whole report and put things in the right perspectives. -
And finally a DBA would take the word of OS and Network Admins and usually that means no issues on those areas, look inside your database. Had the DBA enough knowledge to review the statistics from Network and OS sides, he would be more comfortable in understanding the issue in hand. I always advocate DBAs to perform end-to-end performance tuning. With the complex architecture around the database (couple of middleware, many application servers and then application code being wrapped) could easily make life of a DBA very difficult in terms of finding the root cause. While the demand from all sorts of people will be to exponentially solved the performance issue. There could be more areas in the list, however I feel these are the major ones and based on my personal experience working as a DBA need to understand the following points: -
Solving Performance issues is an art of Programming, I always termed Developers turned DBAs as the best of them all. Reason being they look at things with structured logical approach. Database is a placeholder of application data. Yes there are logics built inside database in terms of stored procedures but there is no difference between a stored procedure or a C program, both manipulate data. While solving performance issues, you need to understand that most of the problems originate from bad application design. In my experience, 70-80% performance issues arise from application not being able to handle exceptions or create blocking-locking scenarios when the application is put to stress test in real live environment. There is 5-10% chance that the problem is due to oracle database and underlying Operating System not being able to perform well for the amount of application load subjected upon. This should be resolved with proper tuning of oracle instance and OS parameters. I would attribute rest of the 10% to those applied solutions which solved one area of problem yet create many more areas waiting to explode. One example would be to overdo in reducing physical IO and have everything performed as logical IOs. If you stretch this concept too far, you will likely to end up with latches (or mutex in 10g). Let the disk sub system also performs its part which is performing physical IOs to some extent. -
Build skills in underlying OS knowledge from performance point of view. I am not talking about learning few UNIX commands here, nor am I talking about becoming a Unix Administrator. What a DBA should do is to know about OS performance parameters and how to monitor them. Let me give you an example. Let’s say it is an AIX or HP operating system. You need to know kernel parameters which are required for oracle to work well, and also monitor them for arriving at the best value. Ask UNIX admin to provide you with the access or have they sent you the report. Understand the SAR and VMTUNE command parameters and use them very often at the peak time of your database. Create a baseline for these parameters so that during the problem time, you can create a comparison. Most of the OS has tools to provide statistics per minute for these parameters. Create a better relationship with UNIX admin and it does not harm if you sit with him while he installs and creates a new server for your database. -
Same goes for Network areas. There is a nice network tool at www.Ethereal.com. Down load and use it against your database. Understand firewall concepts, packet size, latency time for your network. If you are managing a RAC or Data Guard environment then you just simply can not monitor them effectively without understanding the underlying network parameters. -
If you have not come from development background, then go and sit for the PL/SQL course and write some of the programs yourself. I would not really rely on a DBA who does not know how to read and fine tune a PL/SQL block. There are tools available from oracle to trace SQL and PL/SQLs. Use these tools more often and coordinate with the application developers on a weekly (if not daily) basis to understand the application both from business and coding side. If it is vendor supported, then you should definitely build a good working relationship with them and ask about their experiences with other clients. You can even trace database sessions and then use tkprof to create a report. I always used to send the application vendor some of tkprof reports and they really appreciated it and come back with the changes that will help us and also help them at other clients. -
Finally the most important aspect of all is to fully understand oracle dictionary views that store performance data and to be able to experiment testing database with different loads and understand the changes in these performance views. This is called Oracle Wait Analysis, which is the core concept that every DBA must know. Oracle Wait Interface Concepts Oracle wait interface is a framework which allows you to identify bottlenecks better knows as waits in the database. These waits can be network wait, IO wait, CPU wait, and latches and so on. Oracle has well defined data dictionary objects which stores information about these waits for every database sessions when the sessions encounters waits. Off course waits will always be there otherwise database is idle. It is the interpretation of the wait statistics that will help you to pinpoint the actual problematic area. For example while reading from a table with full scan, you may have “scattered read” wait in the database but that does not mean there is an issue. You need to establish a baseline health of your database and then keep monitoring how the database deviates from it. I have developed an application Oracle Tuner, which is based on Oracle Wait Interface. Information about the application is provided in Appendix A. Some of the common wait events are summarized below: -
Db file sequential read: Occurs when a session is busing performing single block read from index, undo, and table access by row id. You should note that for a database which is free from contention on latches and locking issue, your top 2 waits should be the db file sequential and scattered read which is in actual face a sign of a healthy database. However if your database is an OLTP type with small sized transaction and you have tuned the SGA properly these waits will appear in low waits and you should see another wait on the top of the list which is log file sync which is explained below. -
Latch free: Occurs when a process waits to acquire a latch (method/list/pointer to protect data structure) which is on hold by another process. However the waiting process waits for a short time, and then timed out. This short time is called “spin”. The process will again try and this continues. I have had some bad experiences with this latch as the “spin” time consumes CPU and if you have many sessions waiting for a latch which is on hold by 1 or 2 sessions, your database server can consume all of the CPU and it might reach 100% usage. This will definitely cause database slow down near to hung state. Killing the session holding the latch temporarily resolves the issue. However you should check with oracle for any bugs related to the issue. This is one of the few cases where I got help from applying oracle patches. -
Log file parallel write/log file sync: Occurs when you have a heavy OLTP databases and the rate of commits are relatively higher like 1000+ trans/sec. You need to make sure to size the redo logs large enough, and also to spread them over the disk storage which offers direct access method like IBM AIX “CIO” option. There is a new feature in 10g which basically makes the log synch process from log buffer to redo log in asynchronous mode. I used that for a batch process and was able to reduce the batch time to half. You should not use the feature for online transactions unless you know the repercussion. Investigating Common I/O Wait events -
Db file sequential read (Single Block IO): As explained before, this wait is caused by SQL statements that performs single block Read operations against Index blocks, undo, table access via rowid, control files and data files header. Most of the databases has this wait appear in the top5 wait events in the Statspack report. There is no straight forward method to identify when this wait becomes problematic. One indicator would be when this wait represents a large percentage of the total waits or when you hear users complaining about application being slow. All of this points to the face that a DBA should have enough knowledge about the databases he is monitoring so that he has the sort of baseline setup when all is well. Should you need to resolve this wait, you must optimize the SQLs in terms of reducing the physical or logical reads. If this wait is caused by excessive indexes reads then this could be an application issue, look at the SQLs whether full table scan is better etc. On the other hand disk subsystem could be slow, check for fragmentations levels in the index and at OS level. To give you an example, clustering factor plays an important role as it tells optimizer how many IOs are required for the whole operation. If the Clustering factor column value from dba_indexes for that particular index is close to number of blocks, then it means rows in the table are ordered, if the value is close to number of rows, that means rows are randomly distributed and you likely to have more IO scans. In the later case you need to rebuild the table and then the index. -
Db file scattered read (Multi Block IOs): This wait is caused by SQL statements that are waiting for multiple contiguous blocks (up to DB_FILE_MULTIBLOCK_READ_COUNT) value, to be read from Disk into SGA. Multi block requests are usually associated with Full table scans. One obvious solution is to tune the SQL statement to use an index scan, however during the batch and depending on the application requirements; a full traverse of the table is a must. If all of a sudden your database starts to wait on this event and the application users complaint of slow response, yet there was no coding change, more likely your database may have lost an index or index might have become unusable. -
Direct path read: This wait is caused by SQL statements that are performing reads into session’s PGA bypassing SGA. This maybe due to operations of blob, order by, group by, union, distinct. You can also find some background processes like CKPT, LGWR, DBWR waiting on these waits as well. -
Db file parallel write: This wait belongs to DBWR background process and could indicate slow IO performance which will cause foreground processes to get impacted. Lot of waits for this type points to I/O issues at OS level. You should find out the I/O throughput for your database and it should be lot less than 10 ms. There is an SQL in the appendix for finding this information for this event. You can improve the response time for DBWR by enabling A-sync IO, mounting Direct IO mount points on UNIX, spawning multiple DBWR processes. -
Log file parallel write/log file sync: This wait belongs to LGWR background process and could indicate slow IO performance which will cause foreground processes to get impacted upon commits. Lot of waits for this type points to an IO issue. The average waits should be less than 10 ms. High end OLTP databases can have significant number of commits per second which can cause higher waits for log file sync as well as log file parallel write. There are several recommendations to reduce these waits which include moving redo log files to disks which allow A-sync IO or Direct Read operations by passing OS buffer cache, increasing the number as well as size of redo log files, setting up batch write parameter to value of “BATCH” in 10g which will make LGWR act as asynchronously in writing redo data from redo log buffer to disk. Log buffer size parameter should not be set more than 3MB. It is important to note that higher log switches can also cause another wait event which is control file parallel write; therefore increasing size of redo log files will also reduce this wait. -
Latch free: Latches only apply to memory structures in SGA and not to database objects like tables, indexes etc. Oracle SGA has many types of latches to protect data from being corrupted by concurrent access. Latch wait means an oracle session is waiting to acquire a certain latch which is currently being held by another session, which also points to the fact that the latch access is a serialized mechanism. Latches are somewhat locks to these memory structures and oracle has to latch a certain memory structure to prevent it from being corrupted. However Latches differ from locks (Enqueue); Latch purpose is very straightforward i.e. allow a session to have exclusive single access to memory structure whereas locks allow multiple sessions to share the locks resources (database objects); latch operates at instance level and not at database level so its information is valid only in the instance and not across databases unlike Enqueue(locks); latches are supposed to be held for very short time like micro seconds as they are serialized process while locks can be held for extended period of time depending on user transactions; latches are not subjected to latches while locks are. Latches are not the callers but they are acquired by oracle process. A process/session may request a certain latch in two waiting modes; willing-to-wait or no-wait. DBAs usually query V$LATCH, V$LATCH_PARENT, V$LATCH_CHILDREN, V$LATCH_MISSES dictionary views. Willing-to-wait statistics are stored in GETS and MISSES columns while no-wait statistics are stored in IMMEDIATE-GETS and IMMEDIATE-MISSES columns. NO-WAIT wait type is normally limited to few latches like redo copy latches and are associated with multiple children latches. So for example a latch was requested in no-wait state and was not granted then that process will request for the next child latch and so on until all child requests are denied, only then the willing-to-wait state will be requested after the last child latch request in no-wait failed. GETS or IMMEDIATE-GETS column is incremented by 1 if a latch request was made in either state. If the latch is acquired in the first attempt, the process before modifying the memory structure, writes current info in the latch recovery area, in case process dies and PMON had to clear the latch lock as part of cleanup method, But if the latch is not acquired in the first attempt, requesting process will spin on CPU for short interval before retrying again. This request and spin-wait on CPU cycle goes on and on until n number of times. Value of n depends on a hidden init.ora parameters called _SPIN_COUNT which is default to 20000. In short, if the latch is obtained in one of the above (2-20000) tries, the process also increments value of SPIN_GETS and MISSES columns by 1, so that we know how long it took to acquire the latch and in how many spins. However if not then the process will post a wait event in the v$session_wait which you normally see as latch waits. During this the process goes to sleep (remember CPU might still be on hold) and you continue to see the latch wait events). Sleep period is short but after it is over, the process again retries to acquire latch for another _SPIN_COUNT times. The SPIN, RETRY, SLEEP and wakeup operations continue until process gets hold of the latch. All this time you may likely see the CPU being used. Remember that columns SLEEP, SLEEP1... SLEEPS3 (SLEEP4 ... SLEEPS11 ARE RESERVED FOR now and not updated) are only updated when the latch is acquired. So you can only establish some sort of performance degradation that already happened by viewing these sleeps columns ONLY when latch is acquired (meaning latch GETS succeeded). As you can see it’s a dead end as the only way out for a process is acquire the latch otherwise it will keep trying. Now if you kill that process, PMON will perform clean up. Every latch is divided into sub-levels from 0 to 13. It is interesting to note the sleep algorithm oracle uses. For example most latches are short-wait latches and other processes really should not have to wait for so long. Oracle sleeps in exponential fashion i.e. 1,2,4,8, 16 centi-seconds; however the max sleep time is like 2-4 seconds after that the sleep algorithm will reset the time. There are only few long-wait latches i.e. these can get held up for longer time. Examples are shared pool and library cache latches. So what if a process goes to sleep (after so many tries~2000) while waiting to acquire a latch which is in the category of long-wait latch. In this case the 1, 2, 4 formulas does not hold true then this sleeping process will depend on another internal process to wake it up and try again. This concept is known as latch-posting. Latches that are subjected or use latch-posting have statistics in waiters woken column. Which means to find such latches you should have a filter criteria where this column > 0. When you see latch waits in the v$session_wait, it simply means a process failed to acquired a latch after trying _SPIN_COUNT times and went to sleep mode. This is for willing to wait request mode only. During this time CPU resources will be heavily used because of spinning. So the total waits column in the v$session_wait tells number of time processes failed to acquire a latch in the willing to wait mode. SLEEPS column in V$LATCH means number of times the process sleeps on that latch. Because a process will have nothing to do except request latch for 20000 times and then sleeps, therefore the total waits and sleeps column should have equal values. You should be worried only when time waited column is relatively higher in v$session_Wait. Shared pool and library cache latches can be reduced by eliminating hard parsing of SQLs. Find those SQLs candidate for bind variables and then modify the code(get rid of literals) to reduce hard parsing. Cache buffer chain latches are most common which can be explained as; Data which is put in buffer cache have memory addresses which are placed in a buffer (linked lists/hash chains). This memory structure is protected by a latch called cache buffer chains. Inefficient SQLs cause cache buffer chain latches. As you know latches are in Memory only, therefore if your database is almost performing logical IOs, you may incur this latch more often. That means you should try to reduce the number of logical reads for such SQLs if you do not want to introduce some level of physical IO. There is another somewhat non-application reason for this latch which is hot block issue which happens when multiple sessions repeatedly need to access same block(s) of data. So if you have a cache buffer chain latch waits, you need to isolate the issue by finding hot blocks issue to isolate the problem. If the sessions are waiting on the same latch address in v$latch (you need to find the p1p2, p1raw value from v$session_Wait then map to v$latch). Check if there is a better alternative for application to avoid this situation, usually not. Otherwise you need to distribute the hot blocks into different locations by delete and insert same rows by rowid, export the table, increase pctfree and import back, minimize number of rows per block, building indexes with higher pctfree (which will increase height of index which is not a good way also), reducing database block size, or move those tables to smaller block size tablespaces. There are other latches like Row cache latches which gets reduced by less dictionary access like caching sequences etc. Main idea behind these solutions is also reduce the number of logical and physical reads for sqls that are associated with these latches. Appendix A Find out objects that are part of the SQLs waiting on specified wait event. You can always run the following SQL on production system to see what wait/event/operation Database is currently performing. In fact the heart of tuning an oracle database starts from this query. | SELECT EVENT,SID,P1,P1RAW,P2,P2RAW,P3,P3RAW,SECONDS_IN_WAIT FROM V$SESSION_WAIT where event not like '%timer' and event not like 'rdbms%' and event not like 'SQL*%client%' and event not like 'unspecified event%' and event not like 'Streams AQ%' and event not like 'LNS ASYNC%'; | You can replace ‘db file sequential read’ with any other wait event. | select b.sid, nvl(substr(a.object_name,1,30), 'P1='||b.p1||' P2='||b.p2||' P3='||b.p3) object_name, a.subobject_name, a.object_type from dba_objects a, v$session_wait b, x$bh c where c.obj = a.object_id(+) and b.p1 = c.file#(+) and b.p2 = c.dbablk(+) and b.event = 'db file sequential read' union select b.sid, nvl(substr(a.object_name,1,30), 'P1='||b.p1||' P2='||b.p2||' P3='||b.p3) object_name, a.subobject_name, a.object_type from dba_objects a, v$session_wait b, x$bh c where c.obj = a.data_object_id(+) and b.p1 = c.file#(+) and b.p2 = c.dbablk(+) and b.event = 'db file sequential read' order by 1; | Find out sessions full table scan rate. | select a.sid, b.name, a.value from v$sesstat a, v$statname b where a.statistic# = b.statistic# and a.value <> 0 and b.name = 'table scan blocks gotten' order by 3,1; | Find out sessions waited on direct path read | select a.name, b.sid, b.value, round((sysdate - c.logon_time) * 24) hours_connected from v$statname a, v$sesstat b, v$session c where b.sid = c.sid and a.statistic# = b.statistic# and b.value > 0 and a.name = 'physical reads direct' order by b.value; | Find out IO throughput for db file parallel write | select event, time_waited, average_wait from v$system_event where event in ('db file parallel write','free buffer waits', 'write complete waits' ,'log file parallel write','log file sync'); | Find out latches stats summary | select name, gets, misses, immediate_gets, immediate_misses, sleeps from v$latch order by sleeps desc | Latches contentions finding sqls | Chech hard parses ration which also shared pool latches select a.*, sysdate-b.startup_time days_old from v$sysstat a, v$instance b where a.name like 'parse%' select a.sid, c.username, b.name, a.value, round((sysdate - c.logon_time)*24) hours_connected from v$sesstat a, v$statname b, v$session c where c.sid = a.sid and a.statistic# = b.statistic# and a.value > 0 and b.name = ’parse count (hard)’ order by a.value; select hash_value, substr(sql_text,1,80) from v$sqlarea where substr(sql_text,1,40) in (select substr(sql_text,1,40) from v$sqlarea having count(*) > 4 group by substr(sql_text,1,40)) order by sql_text; Find if latch waits are causes by hot block where address will be same for all waits select sid, p1raw, p2, p3, seconds_in_wait, wait_time, state from v$session_wait where event = 'latch free' order by p2, p1raw; Next find which is the hot block: Use the P1RAW from the above select a.hladdr, a.file#, a.dbablk, a.tch, a.obj, b.object_name from x$bh a, dba_objects b where (a.obj = b.object_id or a.obj = b.data_object_id) and a.hladdr = '<p1raw>' union select hladdr, file#, dbablk, tch, obj, null from x$bh where obj in (select obj from x$bh where hladdr = '<p1raw>' minus select object_id from dba_objects minus select data_object_id from dba_objects) and hladdr = '<p1raw>' order by 4; | Smart Oracle Solutions www.oraclefusions.com support@oraclefusions.com |
|
| |
|
| |
| |
|
| |
|
|