INF: How to Determine When SQL Server Causes a Windows NT Blue Screen

ID: Q170576


The information in this article applies to:


SUMMARY

Infrequently Windows NT may halt with a STOP screen, commonly called a "blue screen", or it may hard hang, where the console is completely frozen and non-responsive. This may sometimes happen on a computer where SQL Server is running, or may coincide with a particular SQL Server operation such as a DUMP or LOAD, BCP, a long-running query, and so on.

The vast majority of time, this indicates an operating system, device driver, or hardware problem and should be pursued as such. Windows NT user or kernel mode process isolation ensures that a user mode application problem will not cause the operating system to stop responding. This article discusses exceptions to this and ways to determine whether to troubleshoot the problem at the system or application layer.

Sometimes the cause of a machine hard hang or blue screen may be an NMI (non-maskable interrupt) error. This is sometimes visible as an error code stating NMI, parity check or I/O parity check. NMI errors are almost always hardware. Usually they are caused by a memory failure but can originate in other hardware subsystems such as video boards. Even if the NMI error only happens during certain SQL Server operations, and if the system passes initial hardware diagnostics, it should still be considered a hardware problem and pursued as such. It may be necessary to use a dedicated memory SIMM testing device which can often find a transient memory error that eludes software-based diagnostics. For more information see the Windows NT Resource Kit under the heading "Memory Problems", and the following article in the Microsoft Knowledge Base:

Q101272 : "Memory Parity Errors: Causes and Suggestions"


MORE INFORMATION

Processes exist on Windows NT in either user mode or kernel mode (sometimes called supervisor or privileged mode). In the Intel i386 architecture, user mode maps to ring 3 and kernel mode to ring 0 of the 4-ring protection system. The i386 architecture has been carried forward with little change in all Intel and compatible processors to date, including the Pentium Pro and Pentium II. RISC processors such as the Alpha AXP likewise typically have unprivileged and privileged modes.

Kernel mode is a privileged processor mode in which a thread has access to system-wide memory (including that of all user-mode processes) and to hardware. By contrast, user mode is a nonprivileged processor mode in which a thread can only access system resources by calling system services.

A user mode process cannot access kernel mode memory, nor can it access memory of another user mode process. This is enforced by processor hardware, in conjunction with kernel mode data structures such as Page Tables. For information on this see the 80386 Programmer's Reference Manual, the 80386 System Software Writer's Guide, or equivalent Alpha AXP documentation.

As a result of this protection system, a user mode application generally cannot stop responding, cause a blue screen, or otherwise cause a failure in the Windows NT operating system. Such problems should be primarily pursued at the system layer as an operating system, device driver, or hardware issue.

While an application error cannot cause a failure in the operating system, an operating system error can cause an application to stop responding. This is because of the general rule: applications must call inward (to kernel mode), but the operating system can reference outward to user mode freely at any time. A microkernel-influenced architecture like Windows NT may in turn dispatch certain work to a user-mode system process rather than perform the work in kernel mode. However, the overall principal remains the same: processor hardware enforces process context isolation, which prevents one process from causing a failure in another, whether one or both are in user mode.

If a user mode application passes an invalid parameter in a Win32 API call, it is the operating system's responsibility to validate this parameter. In very rare cases, passing an invalid parameter may cause a Windows NT blue screen error. However, this is an operating system issue, and should be debugged and pursued as such.

There are a few narrow exceptions to the above guidelines. These exceptions can be easily and quickly eliminated:

A Winlogon Problem Caused by SQL Extensible Performance Counters

Current Windows NT architecture stipulates that any extensible performance counters added by a service will run in the process context of the Windows NT Winlogon process. Because Winlogon is a vital component of the operating system, a bug or resource leak in any performance counter DLL may disrupt Winlogon, and hence the operating system. The SQL Server extensible performance counter DLL is called Sqlctr60.dll, and it exports several SQL Server-specific objects from SQL Server to the operating system. You cam use Performance Monitor to monitor these objects. For more details on extensible performance counters, see volume 4 of the Windows NT Resource Kit, titled "Optimizing Windows NT," by Russ Blake.

Although it is very rare for Sqlctr60.dll to cause a Winlogon problem, you can expedite problem identification when pursuing a Windows NT failure or blue screen problem on a computer running SQL Server by renaming this DLL. Doing this eliminates use of SQL Server performance counters, but you can still use Performance Monitor to monitor SQL Server by using regular Windows NT performance counters (such as threads, process, memory, and so on).

If renaming Sqlctr60.dll fixes the problem, and if this is confirmed by reinstating and removing the DLL several times, the problem should be pursued as a SQL Server issue. Otherwise, it should be pursued as a system- layer issue.

A Resource Leak

If a resource leak continues for a period of time, the operating system should return the appropriate return code to the application, which should log this. For example, if you receive operating system error 8 "not enough storage," the operating system should handle the situation gracefully by not granting further resource requests. However a continued application resource leak may not be handled gracefully by the operating system under all conditions, resulting in a blue screen or operating system or application failure.

Almost all resource leaks will manifest themselves as a gradual increase in consumption of some resource, such as handles, virtual memory, private bytes, and so on. Therefore, the easiest way to rule in or out a resource leak is to run Performance Monitor and log all objects to a file. When the problem occurs, examine the logged performance data for signs of a leak. Some good counters to examine are: handle count, page file bytes, pool paged bytes, pool nonpaged bytes, private bytes, thread count, virtual bytes, and working set for each process running on the computer.

It is not necessary to classify certain values as normal or abnormal. Focus on identifying leaks by the continuous nature of the increase, not by the absolute value at a given time. Remember it's normal for the perfmon "private bytes" counter for SQL Server to start well below the configured sp_conigure "memory" value, then increase with activity until it roughly approaches, but doesn't significantly exceed, that value.

If one of the logged Performance Monitor counters continuously increases for the Sqlservr.exe process, and if reaching a certain value repeatedly coincides with a Windows NT blue screen or operating system failure, it should be pursued temporarily as a SQL Server issue until the cause of the continuous SQL Server resource leak is understood. Otherwise, it should be pursued as a system layer problem.

CPU Monopolization

If a process spawns high priority threads that are continuously in a runnable state, this process can dominate the computer and prevent the operating system from running. A properly configured SQL Server will not cause this problem. However if SQL Server "SMP concurrency" is set to -1, and if a number of long-duration CPU intensive queries is run equal to or greater than the number of processors, under some conditions the operating system may appear to stop responding. In general, you should leave SQL Server the "SMP concurrency" option at the default setting of 0. Likewise, leave the "priority boost" setting at the default. For more information, see the following articles in the Microsoft Knowledge Base:
Q111405 : SQL Server and Windows NT Thread Scheduling

Q166967 : Proper SQL Server Configuration Settings

When pursuing an operating system hang problem, first ensure that the SQL Server configuration settings mentioned above are at their default values. After making this check, if the operating system or application failure recurs, it should be pursued as a system layer problem.

REFERENCES

For more information, see the following documents:

Additional query words: hang hung bulk copy program crash app winnt perfmon


Keywords          : kbenv kbinterop SSrvGen 
Version           : WINNT:6.5
Platform          : winnt 
Issue type        : kbhowto 

Last Reviewed: April 14, 1999