INF: SQL Server 7.0 Logging and Data Storage Algorithms Extend Data Reliability

ID: Q230785


The information in this article applies to:


SUMMARY

SQL Server 7.0 restructured and redesigned the logging and data algorithms from earlier Microsoft SQL Server releases to improve data reliability and integrity.

To learn more about the underlying concepts of the SQL Server 7.0 engine, see "ARIES (Algorithm for Recovery and Isolation Exploiting Semantics): A Transaction Recovery method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging", of ACM Transactions on Database Systems. This document was written by Chunder Mohan.

This document addresses the SQL Server 7.0 techniques to extend data reliability and integrity as related to failures.

It is recommended that you read the following articles in the Microsoft Knowledge Base for further clarification on caching and alternate failure mode discussions:

Q86903 INF: SQL Server and Caching Disk Controllers
Q46091 INF: Using Hard Disk Controller Caching with SQL Server
Q234656 INF: Using Disk Drive Caching with SQL Server


MORE INFORMATION

Before beginning the in-depth discussion, some of the terms as used throughout this article are defined in the following section.

Term Definition
Battery-backed Separate and localized battery backup facility directly available and controlled by the caching mechanism to prevent data loss.
NOTE: This is NOT an uninterruptible power supply (UPS). A UPS does not guarantee any write activities and can be disconnected from the caching device.
Cache Intermediary storage mechanism used to optimize physical I/O operations and improve performance.
Dirty Page Page containing data modifications that have yet to be flushed to stable storage. For more information pertaining to dirty page buffers, see the SQL Server 7.0 Books Online documentation.
Failure Anything that might cause an unexpected outage of the SQL Server process. Examples include: power outage, computer reset, memory errors, other hardware issues, bad sectors, drive outages, OS failures, and so forth.
Flush Forcing of a cache buffer to stable storage.
Latch Synchronization object used to protect physical consistency of a resource.
Nonvolatile storage Any medium that remains available across system failures.
Pinned page Page that remains in data cache and cannot be flushed to stable storage until all associated log records are secured in a stable storage location.
Stable storage Same as nonvolatile storage.
Volatile storage Any medium that will not remain intact across failures.

SQL Server 7.0, earlier versions of SQL Server, and many mainstream database products on the market today use the Write-Ahead Logging (WAL) protocol.

Write-Ahead Logging (WAL) Protocol

The term protocol is an excellent way to describe WAL. It is a specific and defined set of implementation steps necessary to ensure data is stored and exchanged properly and can be recovered to a known state in the event of a failure. Just as a network contains a defined protocol to exchange data in a consistent and protected manner, so too does the WAL describe the protocol to protect data.

The ARIES document defines the WAL as:
The WAL protocol asserts that the log records representing changes to some data must already be in stable storage before the changed data is allowed to replace the previous version of the data in nonvolatile storage. That is, the system is not allowed to write an updated page to the nonvolatile storage version of the page until at least the undo portions of the log records which describe the updates to the page have been written to stable storage.
For more information about Write-Ahead Logging, see the SQL Server 7.0 Books Online documentation.

SQL Server and the WAL

SQL Server 7.0 and earlier SQL Server releases use the WAL protocol. To ensure proper committal of a transaction, all log records associated with the transaction must be secured in stable storage.

To clarify this, consider the following specific example (for this example assume that there is no index and the page affected is page 150).

BEGIN TRANSACTION
   INSERT INTO tblTest VALUES (1)
COMMIT TRANSACTION 
Now break the activity down into simplistic logging steps:

Statement Actions performed
BEGIN TRANSACTION Written to the log cache area but there is no need to flush to stable storage because the SQL Server has not made any physical changes.
INSERT INTO tblTest
  1. Data page 150 is retrieved into SQL Server data cache, if not already available.


  2. The page is latched, pinned, and marked dirty, and appropriate locks are obtained.


  3. An Insert Log record is built and added to the log cache.


  4. A new row is added to the data page.


  5. The latch is released.


  6. The log records associated with the transaction or page does not need to be flushed at this point because all changes remain in volatile storage.


COMMIT TRANSACTION
  1. A Commit Log record is formed and the log records associated with the transaction must be written to stable storage. The transaction is not considered committed until the log records are correctly assigned to stable storage.


  2. Data page 150 remains in SQL Server data cache and is not immediately flushed to stable storage. When the log records are properly secured recovery can redo the operation if necessary.


  3. Transactional locks are released.




Do not be confused with locking and logging. Although important, locking and logging are separate issues when dealing with the WAL. In the example above, SQL Server 7.0 generally holds the latch on page 150 for the time necessary to perform the physical insert changes on the page, not the entire time of the transaction. The appropriate lock type is established to protect the row, range, page, or table as necessary. Refer to the SQL Server 7.0 Books Online locking sections for more details on lock types.

Looking at the example in more detail, you might ask what happens when the LazyWriter or CheckPoint processes execute. SQL Server 7.0 issues all appropriate flushes to stable storage for transactional log records associated with the dirty and pinned page. This ensures the WAL protocol a data page can never be written to stable storage until the associated transactional log records have been flushed.

SQL Server and Stable Storage

SQL Server 7.0 enhances log and data page operations by including the knowledge of disk sector sizes (commonly 512 bytes).

To maintain the ACID properties of a transaction, the SQL Server must account for failure points. During a failure many disk drive specifications only guarantee a limited amount of sector write operations. Most specifications guarantee completion of a single sector write when a failure occurs.

SQL Server 7.0 uses 8-KB data pages and the log (if flushed) on multiples of the sector size. (Most disk drives use 512 bytes as the default sector size.) In the case of a failure, SQL Server can account for write operations larger than a sector by employing log parity and torn write techniques.

Torn Page Detection

The following section comes from the SQL Server 7.0 Books Online describing torn page detection:
This option allows SQL Server to detect incomplete I/O operations caused by power failures or other system outages. When true, it causes a bit to be flipped for each 512-byte sector in an 8-kilobyte (KB) database page whenever the page is written to disk. If a bit is in the wrong state when the page is later read by SQL Server, then the page was written incorrectly; a torn page is detected. Torn pages are usually detected during recovery because any page that was written incorrectly is likely to be read by recovery.

Although SQL Server database pages are 8 KB, disks perform I/O operations using a 512-byte sector. Therefore, 16 sectors are written per database page. A torn page can occur if the system fails (for example, due to power failure) between the time the operating system writes the first 512-byte sector to disk and the completion of the 8 KB I/O operation. If the first sector of a database page is successfully written before the failure, the database page on disk will appear as updated, although it may not have succeeded.

Using battery-backed disk controller caches can ensure that data is successfully written to disk or not written at all. In this case, do not set torn page detection to true, for it is not needed.
NOTE: Torn page detection is not enabled by default in SQL Server 7.0. See sp_dboption for how to enable the detection on your system.

Log Parity

Log parity checking is very similar to torn page detection. Each 512-byte sector contains parity bits. These parity bits are always written with the log record and evaluated when the log record is retrieved. By forcing log writes on a 512-byte boundary, SQL Server 7.0 can ensure that committal operations are completely written to the physical disk sectors.

The changes provide increased data consistency, even over prior versions of SQL Server.

Versions of SQL Server Earlier Than 7.0

Versions of SQL Server earlier than 7.0 did not provide log parity or torn bit detection facilities. In fact, those versions can write the same log page multiple times until the log records fill the 2-KB log page. This can expose transactions that have successfully committed. If the log page is being rewritten during a failure, a sector with the committed transaction may not get rewritten properly.

Performance Impacts

All versions of SQL Server open the log and data files using the Win32 CreateFile function. The dwFlagsAndAttributes member includes the FILE_FLAG_WRITE_THROUGH option when opened by SQL Server.
FILE_FLAG_WRITE_THROUGH
Instructs the system to write through any intermediate cache and go directly to disk. The system can still cache write operations, but cannot lazily flush them.

The FILE_FLAG_WRITE_THROUGH option ensures that when a write operation returns successful completion the data is correctly stored in stable storage. This aligns with the WAL protocol ensuring the data.
Many disk drives (SCSI and IDE) contain onboard caches of 512 KB, 1 MB, or larger. However, the drive caches usually rely on a capacitor and not a battery-backed solution. These caching mechanisms cannot guarantee writes across a power cycle or similar failure point. They only guarantee the completion of the sector write operations. This is specifically why the torn write and log parity detection was built into SQL Server 7.0. As the drives continue to grow in size, the caches become larger, and they can expose larger amounts of data during a failure.

Many hardware vendors provide battery-backed disk controller solutions. These controller caches can maintain the data in the cache for several days and even allow the caching hardware to be placed in a second computer. When power is properly restored, the unwritten data is completely flushed before the further data access is allowed. Many of them allow a percentage of read versus write cache to be established for optimal performance. Some contain large memory storage areas. In fact, for a very specific segment of the market, some hardware vendors provide high-end battery-backed disk caching controller systems with 6 GB of cache. These can significantly improve database performance.

Advanced caching implementations will handle the FILE_FLAG_WRITE_THROUGH request by not disabling the controller cache because they can provide true rewrite capabilities in the event of a system reset, power failure, or other failure point.

I/O transfers without the use of a cache can be significantly longer due to the mechanical time needed to move the drive heads, spin rates, and other limiting factors.

Sector Ordering

A common technique used to increase I/O performance is sector ordering. To avoid mechanical head movement the read/write requests are sorted, allowing a more consistent motion of the head to retrieve or store data.

The cache can hold multiple log and data write requests at the same time. The WAL protocol and the SQL Server implementation of the WAL protocol require flushing of the log writes to stable storage before the page write can be issued. However, use of the cache might return success from a log write request without the data being written to the actual drive (that is, written to stable storage). This may lead to SQL Server issuing the data page write request.

With the write cache involvement, the data is still considered to be in volatile storage. However, from the Win32 API WriteFile call, exactly how SQL Server sees the activity, a successful return code was obtained. SQL Server or any process using the WriteFile API call can only deduce the data has correctly obtained stable storage.

For discussion purposes, assume that all sectors of the data page are sorted to write before the sectors of the matching log records. This immediately violates the WAL protocol. The cache is writing a data page before the log records. Unless the cache is fully battery-backed, a failure can cause catastrophic results.

When evaluating the optimal performance factors for a database server, there are many factors to consider. The foremost of these considerations is "Does my system allow valid FILE_FLAG_WRITE_THROUGH capabilities?"

NOTE: Any cache you are using must fully support a battery-backed solution. All other caching mechanisms are suspect to data corruption and data loss. SQL Server makes every effort to ensure the WAL by enabling FILE_FLAG_WRITE_THROUGH.

Testing has shown that many disk drive configurations may contain write caching without proper battery backup. SCSI, IDE, and EIDE drives take full advantage of write caches.

In many configurations, the only way to properly disable the write caching of an IDE or EIDE drive is with a specific manufacturer utility or by using jumpers located on the drive itself. To ensure that the write cache is disabled for the drive itself, contact the drive manufacturer.

SCSI drives also have write caches but these caches can commonly be disabled by the operating system. If there is any question, contact the drive manufacturer for appropriate utilities.

Write Cache Stacking

Write Cache Stacking is similar to Sector Ordering. The following definition was taken directly from a leading IDE drive manufacturer Web site:
Normally, this mode is active. Write cache mode accepts the host write data into the buffer until the buffer is full or the host transfer is complete.

A disk write task begins to store the host data to disk. Host write commands continue to be accepted and data transferred to the buffer until either the write command stack is full or the data buffer is full. The drive may reorder write commands to optimize drive throughput.

Automatic Write Reallocation (AWR)

Other common technique used to protect data is to detect bad sectors during data manipulation. The following explanation comes from the same leading IDE drive manufacturer Web site:
This feature is part of the write cache and reduces the risk of data loss during deferred write operations. If a disk error occurs during the disk write process, the disk task stops and the suspect sector is reallocated to a pool of alternate sectors located at the end of the drive. Following the reallocation, the disk write task continues until it is complete.
This can be a very powerful feature if battery backup is provided for the cache, allowing proper modification upon restart. It is preferable to detect the disk errors, but the data security of the WAL protocol would again require this to be done real time and not in a deferred manner. Within the WAL parameters, the AWR technique cannot account for a situation where a log write fails due to a sector error but the drive is full. The database engine must immediately know about the failure so the transaction can be properly aborted, the administrator can be alerted, and proper steps taken to secure the data and correct the media failure situation.

Data Safety

There are several precautions a database administrator should take to ensure the safety of the data.

Testing Drives

To fully secure your data, you should ensure that all data caching is properly handled. In many situations, this means you must disable the write caching of the disk drive.

NOTE: Ensure that an alternate caching mechanism can properly handle multiple types of failure.

Microsoft has performed testing on several SCSI and IDE drives using the SQL70IOStress utility. This utility simulates heavy asynchronous read/write activity to a simulated data device and log device. Test performance statistics show the average write operations per second between 50 and 70 for a drive with disabled write caching and an RPM range between 5,200 and 7,200.

For additional information and complete details about SQL70IOStress, please see the following article in the Microsoft Knowledge Base:
Q231619 INF: SQL70IOStress Utility to Stress Disk Subsystem
Many PC manufacturers (for example, Compaq, Dell, Gateway, HP, and so on) order the drives with the write cache disabled. However, testing shows that this may not always be the case so always test completely.

Data Devices

In all but non-logged situations, SQL Server will require only the log records to be flushed. When doing non-logged operations, the data pages must also be flushed to stable storage; there are no individual log records to regenerate the actions in the case of a failure.

The data pages can remain in cache until the LazyWriter or CheckPoint process flushes them to stable storage. Using the WAL protocol to ensure that the log records are properly stored ensures that recovery can recover a data page to a known state.

This does not mean that it is advisable to place data files on a cached drive. When the SQL Server flushes the data page(s) to stable storage, the log records can be truncated from the transaction log. If the data pages are stored on volatile cache, it is possible to truncate log records that would be used to recover a page in the event of a failure. Ensure that both your data and log devices accommodate stable storage properly.

Increasing Performance

The initial question that arises is: "I have an IDE drive that was caching but when I disabled it, my performance became significantly less than expected -- why?"

Many of the IDE drives tested by Microsoft run at an RPM rate of 5,200, and the SCSI drives at an RPM of 7,200. When you disable the write caching of the IDE drive the mechanical performance can become a factor.

There is a very clear area to address the performance difference: "Address the transaction rate."

There are many online transaction processing (OLTP) systems that require a high transaction rate. For these systems, consider using a caching controller that can properly support a write cache and provide the performance boost while ensuring data integrity.

To significantly encounter performance changes with SQL Server on a caching drive, the transaction rate was increased using small transactions.

Testing shows that high write activity of buffers smaller than 512 KB or above 2 MB can cause slow performance. Consider the following example:

CREATE TABLE tblTest ( iID int IDENTITY(1,1), strData char(10))
GO

SET NOCOUNT ON
GO

INSERT INTO tblTest VALUES ('Test')
WHILE @@IDENTITY < 10000
   INSERT INTO tblTest VALUES ('Test') 
The following are sample test results for SQL Server:
SCSI(7200 RPM) 84 seconds
SCSI(7200 RPM) 15 seconds (Caching controller)

IDE(5200 RPM) 14 seconds (Drive cache enabled)
IDE(5200 RPM) 160 seconds
Wrapping the entire series of INSERT operations in a single transaction runs in approximately 4 seconds in all configurations.

The reason is the number of log flushes required. Without the transaction, each INSERT is a transaction in and of itself, and each time the log records for the transaction must be flushed. Each flush is 512 bytes in size, which requires significant mechanical drive intervention.

When a single transaction is used, the log records for the transaction can be bundled and a single, larger write can be used to flush the gathered log records. The mechanical intervention is significantly reduced.

WARNING: It is not recommended that you increase your transaction scope. Long-running transactions can lead to excessive and unwanted blocking as well as increased overhead. Use the SQL Server 7.0 performance counters SQL Server:Databases to view the transaction log-based counters. Specifically, Log Bytes Per Flush? can indicate many small transactions leading to high mechanical disk activity.

Look at the statements associated with the log flush and determine if the number of log flushes can be reduced. In the above example, a single transaction was implemented. However, in many scenarios this can lead to undesired locking behavior. Look at the design of the transaction. Perhaps something like the following to perform batches to reduce the frequent and small log flush activity:

BEGIN TRAN
GO

INSERT INTO tblTest VALUES ('Test')
WHILE @@IDENTITY < 50
BEGIN
   INSERT INTO tblTest VALUES ('Test')

   if(0 = cast(@@IDENTITY as int) % 10)
   BEGIN
      PRINT 'Commit tran batch'
      COMMIT TRAN
      BEGIN TRAN
   END
END
GO

COMMIT TRAN
GO 
SQL Server 6.x may not see the same performance impact from frequent and small transaction log writes. SQL Server 6.x rewrites the same 2-KB log page as transactions are committed. This can reduce the size of the log significantly compared to the 512-byte sector boundary flushes in SQL Server 7.0. Reducing the size of the log directly relates to the amount of mechanical drive activity. However, as explained above, the SQL Server 6.x algorithm may expose committed transactions.

Additional query words:


Keywords          : 
Version           : winnt:7.0
Platform          : winnt 
Issue type        : kbinfo 

Last Reviewed: July 20, 1999