I mentioned at the very end of part 1 that ‘log file parallel write’ wait times are in fact unlikely to be close to ‘log file sync’ wait times. There have been many questions and blog entries on this topic over the years. One common starting point seems to be Metalink article 34592.1. Another good article is this one:
However, it seems that everybody has missed a fundamental complication of the changes in commit processing introduced in 10g. The point has probably been missed because it is normally very hard to demonstrate.
In Oracle 9i and below, Oracle had the concept of a single ‘log buffer’. You would be in good company if you thought that later versions had a log buffer as well: after all, there is still an initialisation parameter, LOG_BUFFER, that specifies how big it is.
In reality, Oracle now uses a combination of ‘private redo strands’ and ‘public redo strands’.
Private redo strands do not exist in instances within a RAC cluster. In a non-RAC environment each transaction gets its own redo strand (subject to availability). These strands are used in combination with in-memory undo (IMU) to reduce the amount of data that is logged but, more importantly, to reduce latching conflicts; data is normally copied from a private redo strand only at commit time. Data will also be copied out if the private redo strand fills up before a commit is issued. These private redo strands are allocated from the shared pool.
RAC or no RAC, there is also a minimum of two public redo strands. In larger systems there may be more (the number of CPUs divided by 16). These public redo strands also contribute to latch reduction. The number of strands can be changed by use of the hidden parameter “_log_parallelism_max”.
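As a rough sketch of the rule just described (the real calculation is internal and undocumented, so treat this as an assumption based on the ‘CPUs divided by 16, minimum two’ behaviour, not as Oracle's actual code):

```python
def public_redo_strands(cpu_count, log_parallelism_max=None):
    """Model the default number of public redo strands: at least
    two, and one strand per 16 CPUs on larger systems.  This mirrors
    the rule described in the text; the real calculation is internal
    and version-dependent."""
    strands = max(2, cpu_count // 16)
    # The hidden parameter _log_parallelism_max caps the count.
    # Changing it is unsupported -- it appears here only as a model.
    if log_parallelism_max is not None:
        strands = min(strands, log_parallelism_max)
    return strands
```

On the 64-CPU system discussed below this model gives 64 / 16 = 4 strands, which matches the value observed before the parameter was changed.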
In relatively quiet systems only one public redo strand may be in use, but with as few as two concurrent sessions busy generating redo you can see the second strand come into play. When that happens, LGWR writes out the data in the public redo strands sequentially!
How do I know this?
Well, I was performing a rather artificial insert-only performance test on one instance of an 11.2 RAC database on a 64-CPU system with a suspected faulty I/O subsystem (don’t ask). LGWR was spending 90% of its time in ‘log file parallel write’ (LFPW) and about 10% of the time on the CPU. I had taken the advice in Kevin’s article above and “reniced” LGWR so I was fairly sure that LGWR rescheduling was not a big issue: throughput was limited by the I/O subsystem.
Nevertheless, ‘log file sync’ (LFS) wait times averaged 4 or 5 times the LFPW times, despite average CPU utilisation of only about 25%. Theoretically, in a system limited by LGWR I/O, one would expect the average LFS time to be somewhat more than 50% higher than LFPW: if LGWR has just started an I/O at the moment you commit, you have to wait for that I/O to finish as well as for your own that follows; if you are luckier, LGWR finishes an I/O just as you commit and you only have to wait for your own. On average this gives an LFS/LFPW ratio of 1.5. Given that a session is still in LFS while LGWR is on the CPU doing its post-processing, I thought LFS should be a little more than 50% above LFPW, but certainly not 300% or 400% higher.
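The 1.5 ratio can be checked with a toy model (a sketch of the reasoning above, not Oracle code): assume LGWR issues equal-sized writes back to back, and a commit arrives at a uniformly random point during the in-flight write; the session waits for the remainder of that write plus one full write of its own.

```python
import random

def simulate_lfs_ratio(trials=100_000, seed=42):
    """Toy model: LGWR issues back-to-back writes of duration 1.0
    (so LFPW = 1.0).  A committing session arriving at a random
    moment waits for the rest of the in-flight write plus its own
    write.  Returns the average LFS/LFPW ratio."""
    rng = random.Random(seed)
    total_wait = 0.0
    for _ in range(trials):
        remaining = 1.0 - rng.random()  # leftover of the in-flight write
        total_wait += remaining + 1.0   # plus the session's own write
    return total_wait / trials

print(simulate_lfs_ratio())  # a value very close to 1.5
```

The expected remaining time of the in-flight write is half a write, hence the 50%-above-LFPW expectation; the simulation simply confirms the arithmetic.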
Explanations revolving around ‘CPU storms’ didn’t seem plausible, so I had to dig deeper. By ‘digging deeper’ I mean (a) posting a question on OTN and asking for help from ‘knowledgeable friends’ and, when that failed, (b) guessing. I eventually guessed correctly when I set “_log_parallelism_max” to 1 (it had previously been 4 on my 64-CPU system).
The faulty I/O subsystem had (has) a characteristic that throughput is several times higher when the I/O size is larger. The reason for this is not yet known. However, when the initialisation parameter change was made:
- The average size of each I/O increased several times.
- The (faulty) I/O subsystem’s throughput increased severalfold.
- The LFS time came back down to just over 50% above the LFPW time.
- LGWR started waiting on the idle event ‘rdbms ipc message’; the system was no longer limited by I/O.
- Latching increased, proving that the long-term solution was to fix the I/O subsystem, not to make unsupported changes to initialisation parameters!
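A simple latency-plus-bandwidth model illustrates why collapsing several sequential strand writes into one larger write can help. The figures below are entirely hypothetical (my faulty subsystem behaved far worse than this, for reasons still unknown); the point is only that serialised small writes pay the per-I/O cost once per strand:

```python
def write_time(strands, total_bytes, latency_s, bandwidth_bps):
    """Time for LGWR to write total_bytes split across `strands`
    sequential I/Os.  Each I/O pays a fixed latency, so with one
    strand the latency is paid once; with several strands it is
    paid once per strand, because the writes are serialised."""
    per_io = total_bytes / strands / bandwidth_bps + latency_s
    return per_io * strands

# Hypothetical figures: 1 MB of redo, 1 ms per-I/O latency, 200 MB/s.
one_big    = write_time(1, 1_000_000, 0.001, 200_000_000)
four_small = write_time(4, 1_000_000, 0.001, 200_000_000)
print(one_big, four_small)  # the four sequential writes take longer
```

Even in this mild model the four serialised writes take 50% longer than one large write; a subsystem whose throughput degrades with small I/O sizes amplifies the effect further.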
I then set about trying to devise an experiment that others could repeat to verify my conclusions. The problem is that the only way to get more than one public redo strand into play is to generate enough redo that a redo allocation latch collision occurs. That scenario is easy to create. You can also show that the number of transactions committed per write (redo size divided by commit size) is much less than the number of processes waiting in LFS, but I can’t think of a way to prove that this discrepancy is caused by I/O serialisation without a faulty I/O subsystem. Sure, you can run a trace and see multiple LFPW events between idle ‘rdbms ipc message’ events, but that is hardly conclusive.
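One variant of the transactions-per-write arithmetic can be done from deltas of two V$SYSSTAT snapshots using the ‘user commits’ and ‘redo writes’ statistics. A sketch (the snapshot deltas below are made up for illustration):

```python
def commits_per_write(user_commits_delta, redo_writes_delta):
    """Average number of commits batched into each LGWR write,
    computed from the deltas of the V$SYSSTAT statistics
    'user commits' and 'redo writes' between two snapshots."""
    if redo_writes_delta == 0:
        return 0.0
    return user_commits_delta / redo_writes_delta

# Made-up deltas: 12,000 commits serviced by 3,000 redo writes.
print(commits_per_write(12_000, 3_000))  # 4.0 commits per write
```

If that figure is much smaller than the number of sessions simultaneously waiting in LFS, something other than simple write batching is going on, which is exactly the discrepancy described above.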
Any ideas on how to devise an experiment would be warmly welcomed. In the meantime I guess you will just have to take my word for it!