Timeout Detection and Recovery of GPUs through WDDM

One of the most common stability problems in graphics occurs when the system appears completely "frozen" or "hung" while processing an end-user command or operation. Users generally wait a few seconds and then reboot the system by pressing the Power button. Usually the graphics processing unit (GPU) is "busy" processing intensive graphical operations, typically during gameplay. Nothing is updated on the screen, so the system appears frozen to the user.

Timeout Detection and Recovery

Windows Vista attempts to detect these problematic hang situations and recover a responsive desktop dynamically. In this process, the Windows Display Driver Model (WDDM) driver is reinitialized and the GPU is reset. No reboot is necessary. The only visible artifact from the hang detection to the recovery is a screen flicker, which results from resetting some portions of the graphics stack, causing a screen redraw. Some older Microsoft DirectX applications may render to a black screen at the end of this recovery.

The following is a brief overview of the timeout detection and recovery (TDR) process:

  1. Timeout detection: The Video Scheduler component of the Windows Vista graphics stack detects that the GPU is taking more than the permitted quantum time to execute the particular task and tries to preempt this particular task. The preempt operation has a "wait" timeout—the actual "TDR timeout." This step is thus the "timeout detection" phase of the process. The default timeout period in Windows Vista is 2 seconds. If the GPU cannot complete or preempt the current task within the TDR timeout, then the GPU is diagnosed as hung.
  2. Preparation for recovery: The operating system informs the WDDM driver that a timeout has been detected and it must reset the GPU. The driver is told to stop accessing memory and should not access hardware after this time. The operating system and the WDDM driver collect hardware and other state information that could be useful for post-mortem diagnosis.
  3. Desktop recovery: The operating system resets the appropriate state of the graphics stack. The Video Memory Manager component of the graphics stack purges all allocations from video memory. The WDDM driver resets the GPU hardware state. The graphics stack takes the final actions and restores the desktop to the responsive state. As mentioned earlier, some older DirectX applications may now render just black, and the user may be required to restart these applications. Well-written DirectX 9Ex and DirectX 10 applications that handle "Device Remove" continue to work correctly. The application must release and then recreate its Microsoft Direct3D device and all of its objects. DirectX application programmers can find more information in the Windows SDK.
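
The detection decision in step 1 can be sketched as a small model. This is only an illustration of the rule described above, not the Video Scheduler's actual logic: the function name `classify_task`, the `Outcome` enum, and the quantum and preemption-latency values are hypothetical, and only the 2-second default comes from the text.

```python
from enum import Enum

TDR_TIMEOUT_S = 2.0  # default TDR timeout in Windows Vista, as stated in the text


class Outcome(Enum):
    COMPLETED = "task finished; no recovery needed"
    PREEMPTED = "task preempted; no recovery needed"
    HUNG = "GPU diagnosed as hung; recovery starts"


def classify_task(run_time_s, quantum_s, preempt_latency_s,
                  tdr_timeout_s=TDR_TIMEOUT_S):
    """Model the timeout-detection phase for a single GPU task.

    run_time_s:        how long the task would run if left alone (hypothetical input)
    quantum_s:         permitted quantum before the scheduler tries to preempt
                       (hypothetical value; the text does not give one)
    preempt_latency_s: how long the GPU needs to honor a preempt request
                       (hypothetical input)
    """
    if run_time_s <= quantum_s:
        # The task finishes inside its quantum; no preemption is attempted.
        return Outcome.COMPLETED

    # The scheduler requests preemption once the quantum expires and then
    # waits up to tdr_timeout_s for the task to complete or be preempted.
    overrun_s = run_time_s - quantum_s
    if overrun_s <= tdr_timeout_s:
        return Outcome.COMPLETED
    if preempt_latency_s <= tdr_timeout_s:
        return Outcome.PREEMPTED
    return Outcome.HUNG


if __name__ == "__main__":
    print(classify_task(10.0, 0.05, 0.5))  # long task, GPU preempts in time
    print(classify_task(10.0, 0.05, 5.0))  # long task, preempt never lands in time
```

Under these assumptions, only a task that neither finishes nor yields within the TDR timeout leads to the preparation and recovery phases in steps 2 and 3.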

Minor changes were made to improve the user experience in cases of frequent and rapidly occurring GPU hangs. Repetitive GPU hangs indicate that the graphics hardware has not recovered successfully. In these instances, the system must be shut down and restarted to fully reset the graphics hardware. If the operating system detects that six or more GPU hangs and subsequent recoveries occur within 1 minute, then the following GPU hang is treated as a system bug check.
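
The repeated-hang rule can be sketched as a sliding-window counter. This is a minimal illustration that assumes one reading of the rule above (once six recoveries fall within the last 60 seconds, the next hang becomes a bug check); the class name `HangTracker`, its method names, and the exact boundary handling are hypothetical and are not taken from Windows.

```python
from collections import deque


class HangTracker:
    """Track recent GPU recoveries and decide how to treat the next hang."""

    def __init__(self, limit_count=6, window_s=60.0):
        self.limit_count = limit_count  # six recoveries, per the text
        self.window_s = window_s        # within 1 minute, per the text
        self._recoveries = deque()      # timestamps of recent recoveries

    def on_hang(self, now_s):
        """Return "bugcheck" or "recover" for a hang detected at now_s."""
        # Forget recoveries that have fallen out of the time window.
        while self._recoveries and now_s - self._recoveries[0] > self.window_s:
            self._recoveries.popleft()

        if len(self._recoveries) >= self.limit_count:
            # Too many recent recoveries: assume the hardware did not recover.
            return "bugcheck"

        self._recoveries.append(now_s)
        return "recover"


if __name__ == "__main__":
    tracker = HangTracker()
    for t in (0, 5, 10, 15, 20, 25, 30):
        print(t, tracker.on_hang(t))
```

In this sketch, hangs that are spaced far enough apart keep aging out of the window, so an occasional hang never escalates.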

Error Messaging

Throughout the process of GPU hang detection and recovery, the desktop is unresponsive and thus unavailable to the user. In the final stages of recovery, a brief screen flash occurs that is similar to the one when the screen resolution is changed. After the desktop has been successfully recovered, the following informational message appears.

[Image: nvidiacrash (screenshot of the informational message)]

Preventing this error

  • Ensure that graphics operations (that is, DMA buffer completion) take no more than 2 seconds in end-user scenarios such as productivity and gameplay.
  • Ensure that the DirectX graphics application does not run at a low frames per second (FPS) rate. As the FPS decreases, the likelihood of the GPU getting reset increases. If the application is running at 10 FPS or lower and a complex graphics operation is about to start, then a flush can be inserted.
  • For running benchmark tests on low-end GPUs, use the registry keys that control the TDR timeout. Remember that they should not be used in production systems because changing them affects overall system stability and robustness. Use these keys only as a final solution.
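
The low-FPS guidance above can be sketched as a simple check that an application might run before submitting heavy work. This is only an illustration: the function name `should_flush_before`, its parameters, and the use of an average over recent frame times are assumptions, and the actual flush call depends on the graphics API in use.

```python
LOW_FPS_THRESHOLD = 10.0  # "10 FPS or lower", per the text


def should_flush_before(recent_frame_times_s, complex_op_pending):
    """Decide whether to flush queued GPU work before a complex operation.

    recent_frame_times_s: durations of recent frames in seconds (hypothetical input)
    complex_op_pending:   True if a complex graphics operation is about to start
    """
    if not complex_op_pending or not recent_frame_times_s:
        return False

    avg_frame_time_s = sum(recent_frame_times_s) / len(recent_frame_times_s)
    if avg_frame_time_s <= 0:
        return False  # no meaningful timing data

    fps = 1.0 / avg_frame_time_s
    return fps <= LOW_FPS_THRESHOLD


if __name__ == "__main__":
    print(should_flush_before([0.2, 0.2, 0.2], True))   # about 5 FPS
    print(should_flush_before([0.016, 0.017], True))    # about 60 FPS
```

The idea is that flushing splits the pending work into smaller submissions, which makes it less likely that a single DMA buffer runs past the 2-second limit.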

Comments

Anonymous said…
Timeout detection and recovery

I have 2 computers with the same graphics cards, one with XP and the other with Vista64, and both have the same problem. I installed the OLD NVIDIA driver 191.07 and that solved the problems. Good luck to the rest of you with that solution.