Tony’s Oracle Tips

July 24, 2011

‘log file sync’ and ‘log file parallel write’ – part 2

Filed under: Uncategorized — tonyhasler @ 8:24 pm

I mentioned at the very end of part 1 that ‘log file parallel write’ wait times are in fact unlikely to be close to ‘log file sync’ wait times.  There have been many questions and blog entries on this topic over the years.  One common starting point seems to be Metalink article 34592.1.  Another good article is this one:

http://kevinclosson.wordpress.com/2007/07/21/manly-men-only-use-solid-state-disk-for-redo-logging-lgwr-io-is-simple-but-not-lgwr-processing/.

However, it seems that everybody has missed a fundamental complication of the changes in commit processing introduced in 10g.  The point has probably been missed because it is normally very hard to demonstrate.

In Oracle 9i and below, Oracle had the concept of a ‘log buffer’.  You would be in good company if you thought that later versions had a log buffer as well.  After all, there is an initialisation parameter LOG_BUFFER that specifies how big it is.

In reality, Oracle now uses a combination of ‘private redo strands’ and ‘public redo strands’.

Private redo strands do not exist in instances within a RAC cluster.  In a non-RAC environment each transaction gets its own redo strand (subject to availability).  These strands are used in combination with in-memory undo (IMU) to reduce the amount of data that is logged but, more importantly, to reduce latching conflicts; data is normally copied from a private redo strand only at commit time.  Data will also be copied out if the private redo strand fills up before a commit is issued.  These private redo strands are allocated from the shared pool.

RAC or no RAC, there is also a minimum of two public redo strands.  In larger systems there may be more (the number of CPUs divided by 16).  These public redo strands also contribute to latch reduction.  The number of these strands can be changed by use of the hidden parameter “_log_parallelism_max”.
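A minimal sketch of the sizing heuristic just described (my own illustration of the behaviour as stated above; the exact formula is version-dependent and not guaranteed):

```python
def default_public_strands(cpu_count):
    """Default number of public redo strands per the heuristic above:
    a minimum of two, with more on larger systems at roughly one
    strand per 16 CPUs.  (A sketch of the described behaviour only;
    the real formula may differ between Oracle versions.)"""
    return max(2, cpu_count // 16)
```

On a 64-CPU system this gives max(2, 64 // 16) = 4, which matches the value of “_log_parallelism_max” I report further down.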

In relatively quiet systems only one public redo strand may be in use but even with as few as two concurrent sessions busy generating redo you can see the second redo strand being used.  When that happens, LGWR writes out the data in the public redo strands sequentially!

How do I know this?

Well, I was performing a rather artificial insert-only performance test on one instance of an 11.2 RAC database on a 64-CPU system with a suspected faulty I/O subsystem (don’t ask).  LGWR was spending 90% of its time in ‘log file parallel write’ (LFPW) and about 10% of the time on the CPU.  I had taken the advice in Kevin’s article above and “reniced” LGWR so I was fairly sure that LGWR rescheduling was not a big issue: throughput was limited by the I/O subsystem.

Nevertheless, ‘log file sync’ (LFS) wait events averaged 4 or 5 times the LFPW times despite the average CPU utilisation only being about 25%.  Theoretically, one would expect that the average LFS time would be somewhat more than 50% higher than LFPW in a system that is limited by LGWR I/O:  if LGWR has just started an I/O at the time that you commit then you have to wait for that I/O to finish as well as your own that follows.  If you are luckier LGWR may just finish an I/O as you commit and now you just have to wait for your own I/O. This gives an average LFS/LFPW ratio of 1.5.  Given the fact that you are still in LFS when LGWR is on the CPU doing its post processing I thought that LFS should be a little higher than 50% above LFPW but certainly not 300% or 400% higher.
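That back-of-the-envelope argument can be checked with a toy Monte Carlo model (entirely my own sketch, not Oracle’s actual algorithm): assume LGWR is I/O-bound and issues back-to-back write cycles, where each cycle writes k strands sequentially and each strand write is one LFPW.  A committer arriving at a uniformly random point waits out the in-flight cycle and then one full cycle carrying its own redo.  With k = 1 the LFS/LFPW ratio comes out at the expected 1.5; writing k strands sequentially pushes it to roughly 1.5k.

```python
import random

def lfs_over_lfpw(strands, n_commits=200_000, seed=1):
    """Toy model: LGWR issues back-to-back write cycles; each cycle
    writes `strands` public strands sequentially, each strand write
    taking one unit of time (one LFPW).  A committer arrives at a
    uniformly random point, waits for the in-flight cycle to finish,
    then for a whole further cycle containing its own redo.
    Returns mean 'log file sync' / mean 'log file parallel write'."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_commits):
        residual = strands * rng.random()  # remainder of the in-flight cycle
        total += residual + strands        # plus one full cycle of its own
    return total / n_commits               # mean LFPW is 1 unit by construction
```

With strands=1 this returns roughly 1.5; with strands=4 roughly 6 — the same order of inflation as the 4 or 5 times ratio I was seeing.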

Explanations revolving around ‘CPU storms’ didn’t seem plausible so I had to dig deeper.  Well, by ‘digging deeper’ I mean a) posting a question on OTN and asking for help from ‘knowledgeable friends’ and, when that failed, b) guessing.  I eventually guessed correctly when I set “_log_parallelism_max” to 1 (it was previously 4 on my 64-CPU system).

The faulty I/O subsystem had (has) a characteristic that throughput is several times higher when the I/O size is larger.  The reason for this is not yet known.  However, when the initialisation parameter change was made:

  • The average size of the I/O increased several times.
  • The (faulty) I/O subsystem throughput increased several fold.
  • The LFS time came back down to just over 50% more than the LFPW time.
  • LGWR started waiting on the idle event ‘rdbms ipc message’; the system was no longer limited by I/O.
  • Latching increased, proving that the long-term solution was to fix the I/O subsystem, not to make unsupported changes to initialisation parameters!

I then set about trying to devise some kind of experiment that others can repeat to verify my conclusions.  The problem is that the only way to get more than one public redo strand in play is to have sufficient redo being generated that a redo allocation latch collision occurs.  You can easily generate this scenario.  You can also prove that the number of transactions being committed each write (redo size/commit size) is much less than the number of processes in LFS, but I can’t think of a way to prove that this discrepancy is caused by I/O serialisation without a faulty I/O subsystem.  Sure, you can run a trace and see that there are multiple LFPW events between idle ‘rdbms ipc message’ events, but that is hardly conclusive.
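For what it’s worth, the commits-per-write figure mentioned above can be computed from interval deltas of two real v$sysstat statistics, ‘user commits’ and ‘redo writes’ (the statistic names are real; the sample numbers in the usage line below are invented):

```python
def commits_per_write(user_commits_delta, redo_writes_delta):
    """Average number of transactions batched into each LGWR write,
    from deltas of the v$sysstat statistics 'user commits' and
    'redo writes' taken over the same interval.  A value well below
    the number of sessions waiting in 'log file sync' is the
    discrepancy described above."""
    return user_commits_delta / redo_writes_delta
```

For example, commits_per_write(12000, 4000) gives 3.0 commits per write, which you would then compare with the average session count in LFS over the same interval.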

Any ideas on how to devise an experiment would be warmly welcomed.  In the meantime I guess you will just have to take my word for it!


4 Comments

  1. Hi Tony,

    So, what’s your environment – platform, and where are the log files located: ASM or filesystem?

    The first thing would be to check whether Oracle would attempt asynchronous IO with LGWR. LGWR is capable of doing asynch IO, for example in cases where you have multiple log members in each group.

    So, if you’re using a filesystem for log files – what’s the filesystemio_options parameter value?

    The next step would be to run strace / truss (in non-production) on LGWR and see which syscalls it’s using.

    Comment by Tanel Poder — July 24, 2011 @ 11:58 pm

    • Tanel,

      The system under test is 11.2 RAC running ASM. I have tried the tests with filesystemio_options not set (none) and also set to SETALL. The results are the same.

      I have run SQL trace and can see multiple ‘log file parallel write’ wait events between ‘rdbms ipc message’ wait events but the difficulty is demonstrating that the writes are from different public strands and that the second I/O could have been started earlier. I fear that strace / truss would not help me here.

      I realise that the evidence is somewhat circumstantial and that there is no obvious explanation as to why this serialisation would be required, so if you have any more ideas I would welcome them!

      Comment by tonyhasler — July 25, 2011 @ 4:47 pm

  2. Tony,

    The code for handling multiple log buffers gets fairly complicated – especially since the number of active log buffers varies dynamically with load. A key point, though, is that the starting blocks in the free space are mapped to the log file – which means the individual block buffers are pre-formatted and labelled for specific blocks in the file. You can check this by looking at x$kcrfstrand, which shows you the current start of free space in each public log buffer, and then dumping a few blocks of memory from that point onwards (assuming you can freeze your processing).

    It’s possible, therefore, that when the log writer is posted to write, the N separate buffers do not hold a contiguous area of the file that has to be written. This would explain the strategy of writing each log buffer with a separate call. It’s a little surprising that your tests suggest that lgwr isn’t able to dispatch all these writes asynchronously and poll for completion – because that way you wouldn’t have to worry about it being one write or N writes – but maybe it is doing this anyway and it’s just the anomaly in your I/O subsystem that’s introducing the unexpected delay.

    Comment by Jonathan Lewis — August 13, 2011 @ 8:51 pm

    • Jonathan,

      I agree with everything you say – including the problem of drawing conclusions from a broken I/O subsystem. I will be seeing Tom Kyte in a few days. I’ll see if I can get an authoritative answer as to what actually happens.

      Comment by tonyhasler — August 14, 2011 @ 5:38 pm

