JRE crash in _propagate.so on Linux

Hello,

I have been having trouble with a sporadic error that seems to be linked to the Identify Primary Objects and Identify Secondary Objects modules. I have traced the error to the cpmath files, most commonly _propagate.c. I have also seen an error pointing to "C [python+0xa7f6d]", but this is much less common. I have tried using Sun JRE 1.6 and 1.5.

To work around the problem, we break the image sets into single batches and re-run the program only on the batches where JRE errors occur. Typically, the error does not recur on any single batch. We can reliably reproduce the error given a large enough number of batches, but it does not occur predictably on any given batch.

Thanks for your help,
Chuck

On about 1-2% of jobs, the standard output reports this error:

A fatal error has been detected by the Java Runtime Environment:

SIGSEGV (0xb) at pc=0xca516644, pid=1913, tid=4159952576

JRE version: 6.0_20-b02

Java VM: Java HotSpot™ Server VM (16.3-b01 mixed mode linux-x86)

Problematic frame:

C [_propagate.so+0x8644]

An error report file with more information is saved as:

/nfs/mai/vielhauer/20090817-01test/output/hs_err_pid1913.log

If you would like to submit a bug report, please visit:

java.sun.com/webapps/bugreport/crash.jsp

Subversion revision: 9870
Mon May 17 13:26:42 2010: Image # 1173, module LoadImages # 1: 1.29 sec (bg)
Mon May 17 13:26:43 2010: Image # 1173, module RescaleIntensity # 2: 0.13 sec (bg)
Mon May 17 13:26:43 2010: Image # 1173, module IdentifyPrimaryObjects # 3: 1.51 sec (bg)
Mon May 17 13:26:45 2010: Image # 1173, module Smooth # 4: 0.14 sec (bg)
Mon May 17 13:26:45 2010: Image # 1173, module IdentifyPrimaryObjects # 5: 1.13 sec (bg)
Mon May 17 13:26:46 2010: Image # 1173, module MeasureObjectSizeShape # 6: 2.19 sec (bg)
Mon May 17 13:26:48 2010: Image # 1173, module ExpandOrShrinkObjects # 7: 0.17 sec
Mon May 17 13:26:49 2010: Image # 1173, module Crop # 8: 0.17 sec
Mon May 17 13:26:49 2010: Image # 1173, module IdentifyPrimaryObjects # 9: 0.96 sec (bg)
/bio/tools/5.1/cellprofiler/9871/CellProfilerVirtualenv/CellProfiler/python-2.6.sh: line 18: 1913 Aborted python "$@"

Can you send us the error log it mentions:
/nfs/mai/vielhauer/20090817-01test/output/hs_err_pid1913.log

To make sure I understand, does the error happen if you re-run the exact same batch (i.e., don’t split it into single image sets for those that crash)?

Hi Ray,

We have run CellProfiler and observed the error both when splitting up the image sets and when running all image sets as a single batch. The full error output is attached as a tar archive.

Chuck
hs_err_pi1913.log.tar (70.5 KB)

Thanks. When you built _propagate.so (using python setup.py build_ext, I assume), do you recall whether the C compilation command included "-g" as an option? From the stack trace, it's nearly impossible to tell where it's actually crashing.

The crash is not consistent, correct? If a batch crashes at the third image, will it crash at that same image if you rerun the same batch? This would most likely indicate some sort of memory-management bug on our part.

Thanks.

Correct, the crash is not consistent. This is why we are able to work around the problem simply by re-running the image sets that fail.

The _propagate.c file is compiled with "-g". I dug up the full command line we use for compilation. To get it to compile, we use LD_FLAGS="-m32" CFLAGS="-m32".

gcc -pthread -DNDEBUG -m32 -g -fwrapv -O3 -Wall -Wstrict-prototypes -m32 -fPIC -Isrc -I/bio/tools/5.1/cellprofiler/9871/CellProfilerVirtualenv/lib/python2.5/site-packages/numpy/core/include -I/bio/tools/5.1/python/2.5.4/include/python2.5 -c _propagate.c -o build/temp.linux-x86_64-2.5/_propagate.o -O3

Do you get a core dump, or have some way to run CP under gdb so we can find the actual crashing line? I’m somewhat at a loss as to what to try at this point. We could add some logging output to propagate to try to track down what’s going wrong, but that’s not a very efficient way to proceed.

Hi Ray,

We are having some trouble replicating the error under gdb. It crashes when we try to launch CellProfiler (see output below). I have recompiled Python using "--with-pydebug" to try to capture the Python symbols. However, we still can't get any useful output from the trace. We can run CellProfiler and use gdb to attach to an existing pid, but it's hard to replicate the Java crash this way, since the process is very manual and time-intensive and the error occurs in such a small percentage of cases.

We would like to try adding logging to propagate to identify the problem. Can you recommend good locations in the code and variables to print? Do you know of any gdb tricks for Python debugging that I've overlooked?

When running "gdb python", CellProfiler crashes at an early stage, unrelated to the Java crash we've been experiencing. The output from one such attempt is shown below.

(CellProfilerVirtualenv_debug)~]$ gdb python
GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-23.el5)

Reading symbols from /bio/tools/5.1/cellprofiler/x86/CellProfilerVirtualenv_debug/bin/python…done.
(gdb) run /bio/tools/5.1/cellprofiler/x86/CellProfilerVirtualenv_debug/CellProfiler/CellProfiler.py -p 20090713-01/output/Batch_data.mat -c -r -b -f 1 -l 1
Starting program: /bio/tools/5.1/cellprofiler/x86/CellProfilerVirtualenv_debug/bin/python /bio/tools/5.1/cellprofiler/x86/CellProfilerVirtualenv_debug/CellProfiler/CellProfiler.py -p 20090713-01/output/Batch_data.mat -c -r -b -f 1 -l 1
[Thread debugging using libthread_db enabled]
[New Thread 0xf6bc0b90 (LWP 29024)]
Detaching after fork from child process 29029.
Detaching after fork from child process 29031.
Detaching after fork from child process 29033.
Detaching after fork from child process 29035.
[New Thread 0xf1660b90 (LWP 29161)]
[New Thread 0xc9933b90 (LWP 29162)]
[New Thread 0xc98b2b90 (LWP 29163)]
[New Thread 0xc969bb90 (LWP 29164)]
[New Thread 0xc961ab90 (LWP 29168)]
[New Thread 0xc95c9b90 (LWP 29169)]
[New Thread 0xc9578b90 (LWP 29171)]
[New Thread 0xc9527b90 (LWP 29172)]
[New Thread 0xc94a6b90 (LWP 29173)]
[New Thread 0xc9425b90 (LWP 29174)]
[New Thread 0xc93d4b90 (LWP 29175)]
Detaching after fork from child process 29183.
Detaching after fork from child process 29185.
Subversion revision: 9982

Program received signal SIGSEGV, Segmentation fault.
0xedc4aae2 in ?? ()
(gdb) where
#0 0xedc4aae2 in ?? ()
#1 0xffff6aa0 in ?? ()
#2 0xedbe808d in ?? ()
#3 0x00000010 in ?? ()
#4 0xe3237da8 in ?? ()
#5 0x00000000 in ?? ()
(gdb)

The first thing is probably to turn index checking back on: remove the line
@cython.boundscheck(False)
from _propagate.pyx.

After that, I would suggest adding assertions to heap.pxi after each malloc/realloc to make sure the returned pointers are not NULL. (I'll add those checks to the current code as well; they should have been there already.)

The next step is to print out
i2 = i1+delta_i[idx]
j2 = j1+delta_j[idx]
and see if the crash happens at a consistent location (I doubt it, based on your report).

Note that you might have to remove the "with nogil" to add prints and assertions to the .pyx code.

Another possibility is to use Electric Fence (linux.die.net/man/3/efence) or one of the other debugging mallocs.

I have tested the memory allocation by adding the assertion statements to heap.pxi and found that the error still occurs without triggering any of the assertions. So I believe the various allocations are working correctly. It might instead be that we are accessing an array out of bounds.

However, I am more curious about the role that Java plays in general. From the wiki, I see that Java is used to run the Bio-Formats library (cellprofiler.org/wiki/index.php/Installing_Java), but I don't see what that has to do with the propagate algorithm in the first place.

Could you help me understand the code a little better, so I can make a better guess as to why the Java crash tends to happen in _propagate.so?

Chuck

I believe the JRE information is being reported by the crash logger, but is not relevant to the problem at hand.

I think trying to run it under gdb again might be worthwhile. What format are your images? We might be able to disable the JRE temporarily (by setting JAVA_HOME to something nonsensical), as it's only used for loading a limited number of image types.