The OOPS syndrome
What is OOPS?

An oops is what the kernel reports when a module or other piece of code running in kernel mode hits a fault it cannot recover from. The kernel prints a diagnostic message on the console and, if it cannot safely continue, panics. Anyone programming in kernel mode runs into this sooner or later, especially when working on device drivers or core kernel functionality such as schedulers and memory management.

Reasons for a crash

A crash can have many causes. In general, it occurs when a kernel-level data structure or service is used incorrectly. For example, when allocating a large chunk of memory via kmalloc, the allocation can fail; if the caller then uses the returned NULL pointer without checking it, the kernel will oops. Consider another scenario. Suppose you hold a spinlock, an interrupt arrives while you hold it, and some problem then prevents the lock from being released. The system will hang or panic, because other threads will wait forever for a spinlock that is never released.

OOPS analysis

When an oops occurs, the kernel prints a snapshot of the system state at the point of the crash. This information often does not reach the /var/log/messages file, because the kernel may already be too damaged to write it to disk. A tool called ksymoops takes the information from the oops and combines it with information about your kernel image (vmlinux), the symbol map (System.map), and other sources to give a stack trace and a disassembly of the offending code. The output of ksymoops, combined with the objdump tool, gives you the assembly of your complete code and helps you locate the instruction that caused the crash. To find the real reason for the crash, look a few instructions above the crash point: the fault has usually been introduced earlier and only shows itself at the crash point.

Kernel Debugging techniques

Debugging in user mode is comparatively easy, and many tools are available, such as gdb, ddd and insight. Debugging the kernel is much harder, because it is difficult to run a debugger on the kernel or to trace kernel code. When the kernel crashes, all system services are stalled and most of the information needed to track down the error is inaccessible to the programmer. Debugging can be done in the following ways:

1. Using printks

This is the simplest and crudest, yet most widely used, way of debugging kernel code. Most of the time, printk helps narrow down the problem by giving you the approximate location of the fault. However, it does not reveal the real cause of the crash. printk cannot be used effectively in interrupt handlers, because it slows down interrupt handling, which has side effects of its own. Excessive use of printk can also slow down the system noticeably. Finally, printk calls change the timing of the code, so they can hide existing race conditions or expose new ones.

2. Program Trace

Sometimes simply watching the behavior of an application can help track down minor problems. The strace command shows all the system calls issued by a user application, along with their return values. strace is useful if the kernel code being debugged is reached through a system call, e.g. an open, read or write call handled by a device driver.
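As a rough illustration of how a saved strace log can be narrowed down to the calls worth investigating, the sketch below picks out the system calls that returned an error. It assumes strace's usual "name(args) = result" line layout; the sample lines and the device name /dev/hypothetical0 are made up for the example and do not come from a real trace.

```python
import re

# One strace line: name(args) = result [ERRNO (description)]
LINE_RE = re.compile(
    r"^(?P<name>\w+)\((?P<args>.*)\)\s+=\s+"
    r"(?P<ret>0x[0-9a-f]+|-?\d+|\?)"
    r"(?:\s+(?P<errno>E[A-Z0-9]+)\s+\((?P<desc>.*)\))?"
)


def failed_calls(trace_text):
    """Return (syscall, errno) pairs for every call that reported an error."""
    failures = []
    for line in trace_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match and match.group("errno"):
            failures.append((match.group("name"), match.group("errno")))
    return failures


# Hypothetical sample lines, written by hand in strace's output style.
SAMPLE = """\
open("/dev/hypothetical0", O_RDWR) = 3
read(3, 0xbffff000, 16) = -1 EINVAL (Invalid argument)
write(3, "abc", 3) = 3
close(3) = 0
"""

print(failed_calls(SAMPLE))
```

A failing call such as the read above points at the driver entry point that handles it, which is usually a good place to start adding printk statements.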

3. Using oops dump

A kernel oops [5, 6] always prints a lot of useful debugging information, though it takes some practice to decipher the dump. The most useful item is the EIP (instruction pointer) value. Sometimes klogd maps the EIP value to a probable function name. This may not work if klogd was started before the faulty module was loaded, although klogd can be forced to reload the latest symbols by sending it the SIGUSR1 signal. klogd also gives the position of the faulty instruction relative to the start of the function.
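The mapping from an EIP value to a function name plus offset can be sketched in a few lines. This is only an illustration of the idea, not how klogd itself is implemented: it assumes a symbol table in the System.map layout of "address type name", and the addresses and symbol names below are hypothetical.

```python
import bisect


def parse_system_map(text):
    """Parse "address type name" lines into a sorted list of (address, name)."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols


def resolve(symbols, eip):
    """Return (name, offset) for the closest symbol at or below eip, or None."""
    addresses = [addr for addr, _ in symbols]
    index = bisect.bisect_right(addresses, eip) - 1
    if index < 0:
        return None  # eip lies below the first known symbol
    base, name = symbols[index]
    return name, eip - base


# Hypothetical symbol table, for illustration only.
SAMPLE_MAP = """\
c0100000 T hypothetical_start
c0100400 T hypothetical_read
c0100900 T hypothetical_write
"""

symbols = parse_system_map(SAMPLE_MAP)
name, offset = resolve(symbols, 0xC010042A)
print(f"{name}+{offset:#x}")  # prints "hypothetical_read+0x2a"
```

The offset is the relative position mentioned above: it tells you how far into the function the faulty instruction lies, which you can then look up in a disassembly of that function.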

4. Using ksymoops

ksymoops [6, 7] is another tool that helps track down the problem after an oops. It depends on the main map file, the modules loaded at the time of the crash, and the kernel symbols. The user is expected to save /proc/modules and /proc/ksyms before triggering the crash and to use these files later for post-oops diagnostics. ksymoops generates a trace in assembly (use objdump and netdump). The listing file is particularly useful, as it maps the C code to the corresponding assembly instructions.
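Saving those files before reproducing the crash is easy to forget, so it can help to script it. The following is a minimal sketch, not part of ksymoops: the function name snapshot_proc_files and the timestamped file naming are assumptions made for this example.

```python
import os
import time


def snapshot_proc_files(dest_dir, sources=("/proc/modules", "/proc/ksyms")):
    """Copy each readable source file into dest_dir and return the new paths.

    Files that do not exist are skipped. /proc/ksyms is only present on
    older kernels; newer ones expose the symbol list as /proc/kallsyms.
    """
    os.makedirs(dest_dir, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    saved = []
    for src in sources:
        try:
            with open(src, "rb") as infile:
                data = infile.read()
        except OSError:
            continue  # missing or unreadable on this system
        # "/proc/modules" becomes "proc_modules.<timestamp>"
        name = src.strip("/").replace("/", "_")
        dst = os.path.join(dest_dir, f"{name}.{stamp}")
        with open(dst, "wb") as outfile:
            outfile.write(data)
        saved.append(dst)
    return saved
```

Calling snapshot_proc_files with a directory of your choice just before loading the module under test leaves copies that can be handed to ksymoops after the machine has been rebooted.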

References

[1] http://www.kernelnewbies.org. Contains various FAQs for newbies and links to useful resources, as well as sample kernel code for intercepting system calls and sharing the kernel's memory buffer with a user application.
[2] Linux Device Drivers by Rubini, 2nd Edition. A great book for learning device drivers in Linux. http://www.oreilly.com/catalog/linuxdrive2/chapter/book/index.html points to the free online version.
[3] http://www.moses.uklinux.net/patches/lki.html. Discusses booting of Linux, process management and the VFS implementation in Linux. It also covers system call implementation on x86.
[4] http://www.linuxdoc.org/LDP/tlk/tlk.html. Discusses general Linux kernel architecture. Almost all Linux subsystems are covered in this document. It is fairly old and based on kernel 2.0, but most of the information in it is still relevant.
[5] Linux Debugging Techniques by Ross Mikosh, IBM Linux Change Team
[6] Linux Kernel Debugging by and James Washer
[7] Kernel debugging with netdump and crash by Jeff Moyer

The author of the above article is Nikhil Bhargava. He is currently pursuing his Master's degree in Computer Science. His research interests include system software and computer networks. Comments and suggestions are always welcome.

Copyright © 2005, Nikhil Bhargava.
