Saturday, March 21, 2009

Finding the source of an error in an inlined function using gdb

Yesterday at work, I was trying to find the source of a crash from a coredump using gdb and discovered that if the crash was caused by inlined code, the stack trace that gdb shows does not give you the information you need to determine the exact location. In this case, gdb told me that the crash occurred in a call to std::string::size(). However, there was no call to this method in the next function up in the stack trace. The reason was that std::string::size() was being called from another inlined function, but gdb wasn't showing what that function was. After fumbling around for about 30 minutes, I was able to figure out that the crash was happening inside a std::map::find() call (the map's key was a std::string). I want to document the procedure that I used here.

Let's use a simple program that illustrates the problem (in a file named gdbtest.cpp):

#include <cstdio>

inline void f1( int *p)
{
    (*p)++; // Line 5
}

inline void f2( int * p)
{
    std::printf("%p\n", static_cast<void *>(p)); // Line 10
    f1(p);
}

int main()
{
    int i = 0;

    f2(&i);
    std::printf( "%d\n", i); // Line 19
    f2(0);
    std::printf( "%d\n", i);
    f2(&i);
    std::printf( "%d\n", i);

    return 0;
}

Here it's obvious which line is going to cause the segmentation fault, but the program is useful to demonstrate the general technique.

Build the executable with:
g++ -g -finline -o gdbtest gdbtest.cpp
The -finline flag tells g++ to inline the functions even in debug mode.

Make sure that core files are enabled by typing:
ulimit -c unlimited

Now run the program:
./gdbtest

The output I get from a run (I'm running 64-bit Linux):
0x7fff2cdee1fc
1
(nil)
Segmentation fault (core dumped)

Start up gdb with the corefile:
gdb gdbtest core

gdb displays its header and other startup output; at the bottom appears:
#0 0x00000000004005fb in main () at gdbtest.cpp:5
5 (*p)++;
(gdb)

The hex address will probably be different for you. If you type where (or info stack) to see the stack trace, you get (user input follows the (gdb) prompt):
(gdb) where
#0 0x00000000004005fb in main () at gdbtest.cpp:5

This is not very useful: only one stack frame is shown, main's. It does not show us which line in the actual body of the main function is the source of the segmentation fault. Line 5 is the line in the inlined f1 function, which is called multiple times.

The key to determining the real location is the hex address listed. This is the address of the machine instruction that caused the fault. You can use the info line gdb command to show the source line that maps to a code address.

Typing:
info line *0x4005fb
displays:
Line 5 of "gdbtest.cpp" starts at address 0x4005f7 and ends at 0x400606 .

Again, the actual address values will probably be different for you. Here gdb is telling you that line 5 maps to the code from address 0x4005f7 to 0x400606. For ordinary code, a source line will map to only a single range of code. However, inline code will be mapped to multiple places. Every place that the code is inlined to will be mapped to the inline source. In the example above, line 5 will be mapped to three different places, since it is used in f2, which itself is inlined and used in three different places. The problem is to find which of these three places is causing the segmentation fault.

The key is to look at the code around the faulting code to determine where we are in main. From the above, we see that line 5 starts at address 0x4005f7, so the previous instruction will end at address 0x4005f6. We can use info line to see what line that is:
(gdb) info line *0x4005f6
Line 10 of "gdbtest.cpp" starts at address 0x4005dc and ends at 0x4005f7 .

This tells us that the line executed before the crashing line was line 10. This is just what we expected, but doesn't tell us which call to f2 caused the crash. If we continue moving backward, using address 0x4005db (the address right before 0x4005dc where line 10 starts):
(gdb) info line *0x4005db
Line 19 of "gdbtest.cpp" starts at address 0x4005c2 and ends at 0x4005dc .

Line 19 is in the body of main, and this tells us that it is the second call to f2 (line 20) that caused the crash.

It appears that gdb has added a new option, /m, to the disassemble command that displays the assembly code, with addresses, mixed in with the source code. This would allow one to easily determine the location. However, this option is not in the most recent version of gdb I have access to (6.8).

Sunday, March 08, 2009

UML Considered Harmful

At work, we use Unified Modeling Language (UML) to document our designs. I think UML has its place as part of the design process and documentation, but I believe that UML is not in itself sufficient to document a non-trivial design. I have been to numerous design reviews where a few UML class and sequence diagrams are the only description of the design. I find these inadequate to communicate more than a vague idea of the overall design.

For a complete description of the design, I feel that a full design document with words and pseudocode is a requirement. I get the impression that UML diagrams are being used as a way to say that a design is documented (for documentation requirements like CMMI) without really providing any meaningful documentation.

If anyone is out there, how is your company documenting design? Do you think your process works?

Monday, December 22, 2008

Jar files and indexing

Recently at work, I ran into an interesting "feature" of how jar files and indexing work with Java and ant.

For background on indexing see:
http://closingbraces.net/2007/05/13/jarclasspathandindex/

On some other developers' machines, some Java applications were failing with NoClassDefFoundError exceptions. However, I personally never saw this error. The jars with the classes that were not being found were listed in the Class-Path entry of the manifest of the jar that used them. I was extremely puzzled by this. I could fix the problem by explicitly adding the jars to the classpath, but I didn't understand why I needed to do this when everything worked fine on my machine.

From the title of this post you can probably guess that the problem involved indexing. Our build process (using ant) creates a jar (let's call it myjar.jar) with the <jar> task, then runs "jar -i" on it to index it (using an ). If all of this happens, everything works fine. However, there seems to be a race condition between when ant closes the jar file that it created with the <jar> task and when the "jar -i" command is run. When the race is lost, the "jar -i" command fails because the file is still locked, and the index is not created.

I will go into exactly why this caused the error in a minute, but first I want to look at how I first tried to fix it, because that also brings up an important point about the problem. The <jar> task has an optional attribute named "index" which can be used to have ant index the file. I tried using this instead of calling "jar -i", but it did not fix the problem. Why didn't it? Because setting index="true" in the <jar> task does not do the same thing as running "jar -i" on the jar! This is very unintuitive! So what is the difference? Using the <jar> task, only the classes in the jar itself are added to the index; using "jar -i", however, the classes in the jar and all the classes used by the jar are added to the index.

Why does this make a difference? The key is in the second link above. Quoting:
The executable jar must either not have a META-INF/INDEX.LST file, or if it does have such an INDEX.LST file this needs to list the contents of both the executable jar and all of the other jars as well. Anything not in this list will not be found on the class path, regardless of the “Class-Path” entry in the manifest...

Two things to note: the filename I see is META-INF/INDEX.LIST (not LST), and the link goes on to say:
and regardless of any command-line “-classpath” (which is ignored anyway).
This did not apply in my case. We are using java -cp file.jar file.MainClass to run our apps, not java -jar file.jar. In a nutshell, if the jar has an index, its Class-Path entry is ignored.

In my case, our application uses jacorb.jar (from JacORB). It uses two other jars: logkit-1.2.jar and avalon-framework-4.1.5.jar. These are the two jars I had to add to the classpath to fix the problem. My jar has jacorb.jar listed in its Class-Path entry.

If everything builds properly and myjar.jar is indexed using "jar -i", everything works because entries for all the classes used by myjar.jar are included in the index (including the ones in logkit-1.2.jar and avalon-framework-4.1.5.jar). The Class-Path entry is ignored since the index is present.

However, if the "jar -i" command fails, then the Class-Path entry is in effect and classes in jacorb.jar can be resolved, but not the classes that jacorb.jar uses. Why not? Because jacorb.jar was indexed using ant. So, it has an index, but it only contains its own classes, not the ones in the other two jars.

If you want to see this problem yourself, you can download the code I used to verify this from here. It has 3 jars: myjar.jar, dependjar.jar and depend2jar.jar. myjar uses dependjar, which uses depend2jar. This is equivalent to myjar.jar, which uses jacorb.jar, which uses logkit-1.2.jar.

The ant script by default will build the jars without indexing and use jar -i to create indexed versions of the jars named imyjar.jar, idependjar.jar and idepend2jar.jar.
If you run ant with -Dindex="yes" it will index the plain jars (myjar.jar, dependjar.jar and depend2jar.jar) using ant's index method.
To recreate the problem I saw, type:
ant
ant build.dependjar -Dindex="yes"
java -cp myjar.jar myjar.MyClass
The first line builds all the plain jars with no indexing. The second line replaces dependjar.jar with one built using ant's index function (just like jacorb.jar). The last line demonstrates the error that occurs. You can also verify that adding dependjar.jar to the classpath doesn't fix the error.

In general I see several ways to fix this problem.
  1. Avoid using indexing
  2. Put all the jars on the classpath
  3. Fully index all jars with "jar -i"
Hopefully this post will help anyone else running into this problem.

Sunday, December 21, 2008

Time to start using this blog?

I guess it might be time to start using this blog.

Here's some background. I'm a software developer and ex-CS professor. When I was teaching, I wrote a book on 32-bit x86 assembly language that I provide free on the web.

In my current job, I mostly program in C++ with some Java. I'm also a big fan of Python. I expect that most of my posts will be dealing with these languages (and assembly of course).