Saturday, March 21, 2009

Finding the source of an error in an inlined function using gdb

Yesterday at work, I was trying to find the source of a crash from a core dump using gdb and discovered that if the crash was caused by inlined code, the stack trace that gdb shows may not give you the information you need to determine the exact location. In this case, gdb told me that the crash occurred in a call to std::string::size(). However, there was no call to this method in the next function up in the stack trace. The reason was that std::string::size() was being called from another inlined function, but gdb wasn't showing what that function was. After fumbling around for about 30 minutes, I was able to figure out that the crash was happening inside a std::map::find() call (the map's key was a std::string). I want to document the procedure that I used here.

Let's use a simple program that illustrates the problem (in a file named gdbtest.cpp):

#include <cstdio>

inline void f1(int *p)
{
    (*p)++; // Line 5
}

inline void f2(int *p)
{
    std::printf("%p\n", static_cast<void *>(p)); // Line 10
    f1(p);
}

int main()
{
    int i = 0; // initialized so the first increment is well defined

    f2(&i);
    std::printf("%d\n", i); // Line 19
    f2(0); // null pointer: this call triggers the segmentation fault
    std::printf("%d\n", i);
    f2(&i);
    std::printf("%d\n", i);

    return 0;
}

Here it's obvious which line is going to cause the segmentation fault, but the program is useful to demonstrate the general technique.

Build the executable with:
g++ -g -finline -o gdbtest gdbtest.cpp
The -finline flag tells g++ to inline the functions even in debug mode.

Make sure that core files are enabled by typing:
ulimit -c unlimited

Now run the program:
./gdbtest

The output I get from a run (I'm running 64-bit Linux):
0x7fff2cdee1fc
1
(nil)
Segmentation fault (core dumped)

Start up gdb with the corefile:
gdb gdbtest core

gdb displays its header and other startup messages. At the bottom appears:
#0 0x00000000004005fb in main () at gdbtest.cpp:5
5 (*p)++;
(gdb)

The hex address will probably be different for you. If you type where (or info stack) to see the stack trace, you get (user input follows the (gdb) prompt):
(gdb) where
#0 0x00000000004005fb in main () at gdbtest.cpp:5

This is not very useful: only one stack frame is shown, the one for main. It does not show us which line in the body of main is the source of the segmentation fault. Line 5 is the line in the inlined f1 function, which is called multiple times.

The key to determining the real location is the hex address listed. This is the address of the machine instruction that caused the fault. You can use the info line gdb command to show the source line that maps to a code address.

Typing:
info line *0x4005fb
displays:
Line 5 of "gdbtest.cpp" starts at address 0x4005f7 and ends at 0x400606 .

Again, the actual address values will probably be different for you. Here gdb is telling you that line 5 maps to the code from address 0x4005f7 to 0x400606. For ordinary code, a source line will map to only a single range of code. However, inline code will be mapped to multiple places: every place that the code is inlined into will be mapped back to the inline source. In the example above, line 5 will be mapped to three different places, since it is used in f2, which is itself inlined and used in three different places. The problem is to find which of these three places is causing the segmentation fault.

The key is to look at the code around the faulting code to determine where we are in main. From the above, we see that line 5 starts at address 0x4005f7, so the previous instruction will end at address 0x4005f6. We can use info line to see what line that is:
(gdb) info line *0x4005f6
Line 10 of "gdbtest.cpp" starts at address 0x4005dc and ends at 0x4005f7 .

This tells us that the line executed before the crashing line was line 10. This is just what we expected, but it doesn't tell us which call to f2 caused the crash. If we continue moving backward, using address 0x4005db (the address right before 0x4005dc, where line 10 starts):
(gdb) info line *0x4005db
Line 19 of "gdbtest.cpp" starts at address 0x4005c2 and ends at 0x4005dc .

Line 19 is in the body of main, and this tells us that it is the second call to f2 (line 20) that caused the crash.
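
The backward walk above can also be written down as a small piece of code. The following is only a sketch to make the procedure concrete, not how gdb itself works: it hard-codes the three ranges that info line reported in this session into a table, then steps backward from the faulting address until it reaches a line in the body of main. The names LineRange, sample_table, find_range, and last_caller_line_before are made up for this example, and the threshold of line 14 (where main starts in gdbtest.cpp) is an assumption specific to this file.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One row of a hypothetical line table: the half-open address range
// [start, end) holds the machine code generated for source line `line`.
struct LineRange {
    std::uintptr_t start;
    std::uintptr_t end;
    int line;
};

// The three ranges that "info line" reported in the session above.
inline std::vector<LineRange> sample_table()
{
    return {
        {0x4005c2, 0x4005dc, 19},
        {0x4005dc, 0x4005f7, 10},
        {0x4005f7, 0x400606, 5},
    };
}

// Return the entry whose range contains addr, or nullptr if none does.
inline const LineRange *find_range(const std::vector<LineRange> &table,
                                   std::uintptr_t addr)
{
    for (const LineRange &r : table) {
        if (addr >= r.start && addr < r.end)
            return &r;
    }
    return nullptr;
}

// Starting at the faulting address, keep looking up the address just before
// the current range until the line number is at or past first_caller_line,
// i.e. until we have left the inlined helpers and landed in the caller.
// Returns 0 if the table runs out before such a line is found.
inline int last_caller_line_before(const std::vector<LineRange> &table,
                                   std::uintptr_t fault_addr,
                                   int first_caller_line)
{
    const LineRange *r = find_range(table, fault_addr);
    while (r != nullptr && r->line < first_caller_line) {
        assert(r->start > 0); // guard the subtraction below
        r = find_range(table, r->start - 1);
    }
    return r != nullptr ? r->line : 0;
}
```

Calling last_caller_line_before(sample_table(), 0x4005fb, 14) returns 19, matching the manual walk: the code just before the inlined block belongs to line 19, so the faulting call is the one that follows it on line 20. This relies on the generated code following source order, as it does in this unoptimized build. With optimization enabled, the compiler may reorder code, and the walk could be misleading.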

It appears that gdb has added a new option to the disassemble command, /m, that displays the assembly code, with addresses, mixed in with the source code. This would allow one to easily determine the location. However, this option is not in the most recent version of gdb I have access to (6.8).
