Subject: Re: [jna-users] Linux SIGSEGV under JNA but not C?
From: Mark Click (mark@gmail.com)
Date: Jun 9, 2009 9:54:23 am
List: net.java.dev.jna.users

On Mon, Jun 8, 2009 at 3:33 PM, Timothy Wall <twal@dev.java.net> wrote:

On Jun 8, 2009, at 6:07 PM, Mark Click wrote:

On Mon, Jun 8, 2009 at 2:14 PM, Timothy Wall <twal@dev.java.net> wrote:

On Jun 8, 2009, at 4:47 PM, Mark Click wrote:

A further note: Our grad student just wrote some JNI around the library call and it successfully runs from a Java test program without crashing. What could be different between JNI and JNA here?

dynamic load of the library and entry point?

Take a look at native/dispatch.c:dispatch, which formats the incoming (Java) arguments, passes them to ffi_call, which dispatches to the native library function.

Found this method, but what should we be looking for here? Sorry, I'm just unsure of what to be suspicious of.

It's possible, although unlikely, that mapping WString as an argument results in an improper argument type passed to ffi_prep_cif before ffi_call. The type should be ffi_type_pointer. You can easily check this by changing the argument to Pointer and passing in null as a value, instead of WString and some string value.

We just tried this and mapping to a Pointer and passing Pointer.NULL still produces the same crash.

Even if there were something wrong with the mapping, you'd more likely get a fault immediately, not ten levels deep in the call hierarchy.

I agree, the crash a few levels down is really odd. We used JNI to get past the first crash, then most of the next calls seem to work in JNA, but then when attempting to actually use the library (after all the initialization calls) it crashes in libSWImodel.so again. Trace:

Stack: [0xa2a11000,0xa2a62000], sp=0xa2a600a4, free space=316k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C [libSWImodel.so.3+0x4992f]
C [libSWImodel.so.3+0x4a334]
C [libSWIdecoder.so.3+0xd9e5]
C [libSWIdecoder.so.3+0x7254] HmmDecoderUpdateSet+0x94
C [libSWIsr.so.3+0x395be]
C [libSWIsr.so.3+0x39f14]
C [libSWIsr.so.3+0xda74]
C [libSWIsr.so.3+0xea4f]
C [libSWIsr.so.3+0x14904] SWIsrRecognizeCompute+0x754
C [libSWIrec.so+0x3fc23] SWIrecRecognizerCompute+0x3a3
C [jna180501683722456509.tmp+0x10ec7] ffi_call_SYSV+0x17
C [jna180501683722456509.tmp+0x10b74] ffi_call+0xb4
C [jna180501683722456509.tmp+0x3671]
C [jna180501683722456509.tmp+0x3eb3] Java_com_sun_jna_Function_invokeInt+0x43
j com.sun.jna.Function.invokeInt(I[Ljava/lang/Object;)I+0
j com.sun.jna.Function.invoke([Ljava/lang/Object;Ljava/lang/Class;)Ljava/lang/Object;+309
j com.sun.jna.Function.invoke(Ljava/lang/Class;[Ljava/lang/Object;Ljava/util/Map;)Ljava/lang/Object;+194
j com.sun.jna.Library$Handler.invoke(Ljava/lang/Object;Ljava/lang/reflect/Method;[Ljava/lang/Object;)Ljava/lang/Object;+344
j $Proxy0.SWIrecRecognizerCompute(Lcom/sun/jna/Pointer;ILcom/sun/jna/ptr/IntByReference;Lcom/sun/jna/ptr/IntByReference;Lcom/sun/jna/ptr/PointerByReference;)I+37

We see that it's again referencing something to do with the HMM model, so we're getting suspicious it's something with how that particular library handles memory.

- Mark

- Mark

On Mon, Jun 8, 2009 at 9:23 AM, Mark Click <mark@gmail.com> wrote: Thanks! That would seem to fit the behavior. Sadly, we just tried -Xss100m and it still crashed in the same spot.

- Mark

On Sun, Jun 7, 2009 at 5:31 PM, Timothy Wall <twal@dev.java.net> wrote: See if increasing the Java stack allocation changes the location of the segfault (java -Xss<size>). It's possible the C program is running out of stack, especially if it's doing large alloca calls (as opposed to lots of recursion or nested calls).
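One way to compare the stack budget of the two setups is to print the soft stack limit from the standalone C test program. This is only a sketch: `stack_soft_limit` is a hypothetical helper name, and it assumes the standalone program's main thread is sized by `RLIMIT_STACK` while a Java thread's stack is sized by the JVM (for example through -Xss), so the number printed here describes the C side only.

```c
#include <sys/resource.h>

/* Hypothetical helper: reports the soft stack limit of this process.
   Returns 0 and stores the limit in bytes, 1 if the limit is unlimited,
   or -1 if getrlimit() fails. */
int stack_soft_limit(unsigned long long *bytes_out)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_STACK, &lim) != 0)
        return -1;
    if (lim.rlim_cur == RLIM_INFINITY)
        return 1;
    *bytes_out = (unsigned long long)lim.rlim_cur;
    return 0;
}
```

If the value is much larger than the stack range shown in the JVM crash dump, that at least tells you the two environments differ in how much stack the native code can use.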

On Jun 7, 2009, at 6:55 PM, Mark Click wrote:

We've had a grad student working on this, here's his troubleshooting report:

" Hey guys. I've not been idle since Friday. I've exhaustively tested (down to straight memory comparisons) a variety of things and eliminated a lot of suspects in the whole Nuance-through-JNA-in-linux problem. Below is my `report'. The shortest of the short is that: the very act of being called from JNA is causing the segfault. It is not what is passed between them (as Mark said, even passing null does it-though I was worried it was expecting static memory of a certain size (1024 wchar_t) and that's why it was segfaulting).

I started out working under the assumption that the string etc. that was being passed was causing problems and wrote that shared library to test it. I tried sending static char[] from java (along with various other things--pointers etc.) to see if that was the problem. Even a straight memory check (memcmp()) between what would be processed by the C program and what the java sent turned out negative for any differences. So I basically just did a straight call from MY shared C library to SWIrecInit, and then called that via JNA from the small java program I wrote (I think this is what paul was suggesting). And guess what? seg fault!

So it doesn't even matter what is sent. It is the very act of being called from JNA that gives the segfault. And they're ALWAYS in the same place--even when I just hardcode that wchar_t string into the C library I wrote (i.e. NOTHING is passed between the JNA and the C) it segfaults in same spot. So I guess it IS something in the environment that is not being passed, or JNA is holding something as its own memory that the C program expects to have access to and is accessing manually (read: hard-coded?). (btw: when I just run my C library by itself as a program (making the SWIrecInit call from inside), it runs fine with no segfaults--so it's fine when called from another C library (as the demo program shows us anyways))

Anyways, I'm not sure if it's JNA or the C program that's at fault (I have a feeling that it's their C library, hard-coded to physically move chunks of memory that it might not have access to), but there is definitely something that JNA does, in the environment in which it runs the C code, that messes stuff up, and it only happens on linux. So I guess maybe it's as that other contact you spoke about suggested: we need to see what system calls are being made. Though, before that, now that we have a C program, we can do all sorts of prints and such to try to print out all the environment info that is (or could be) related to the problem, and compare between the actual C library running by itself and it being called (from my C library) through JNA.

My original suspicion was that the string being passed was causing the problem (that it was static, and that it was affecting stuff after it, as if they were just manually iterating a pointer through that wchar_t string and getting at something they expected to be stored after it). Through a variety of tests I now know that it is not the format of the string or the size of the array that is causing the problem. However, if they try to access memory manually past the end of the string, it may be bound up in the JNA-called environment, but not necessarily so in the C environment. Or maybe JNA has it marked differently and won't let it read it. At any rate, we know that it's probably that ObjectRead we see called immediately before the segfault happens in the stack that is causing the problem, even if it only manifests when it tries to make the next library call (because something has been written over, or a pointer has moved, or it manually just adds an offset to wherever it was reading/writing from to get to the next point?).

Synopsis of Things We're Sure Of:

1) The literal format of the wchar_t string sent between is not causing the segfault.
--I tried passing a string via JNA to a C shared library I wrote, and comparing it to an expected string declared natively in the C code, and they turn out the same down to the bytes (memcmp). Actually it was a bit more involved, as I passed a char[] of the same size (1024 sizeof(wchar_t)) from the java program, which because of different sizes made me write every OTHER char and then pass that. Unless java was padding around the array (which I might not put past it...?) the sizes etc. are the same down to the bytes.
--I also tried passing WString and Pointer (which didn't work out). I tried the char[] being static (whatever that is in java...static final or something), which I assumed meant it would not move in memory, though I have no idea what java is doing. In all these cases, the C program would segfault at exactly the same point.

2) The C program works fine when called from C.
--wrote a simple program other than the demo to test this. Works fine. I even compiled it into a shared library and called that through ANOTHER C program. Again, things work fine.

3) It does NOT when called via JNA
--compiled the simple program into a shared library. Calling the function from JNA (either passing the wchar_t string and using it, or just using one created natively in the C code, or even with NULL!) causes a seg fault in exactly the same place.

4) This is something unique to linux (does not happen in windows (under eclipse?)). Or, perhaps it is related to eclipse? If eclipse holds environmental variables maybe the C program will access them, and it does not happen in just the open linux environment (why this would be so I have no idea but perhaps it is worth a try as it would not take much time?). Have you tested it outside of eclipse on windows?
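The byte-level comparison described in item 1 could look roughly like the following. This is not the grad student's actual test code, which is not shown in the thread; `same_wide_bytes` is a hypothetical name, and the sketch assumes both buffers are NUL-terminated wide strings.

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: returns 1 if the string received from the caller
   is byte-for-byte identical to the natively declared expected string
   (terminator included), and 0 otherwise. */
int same_wide_bytes(const wchar_t *received, const wchar_t *expected)
{
    if (received == NULL || expected == NULL)
        return received == expected;

    /* Compare lengths first so memcmp never reads past either buffer. */
    size_t n = wcslen(expected);
    if (wcslen(received) != n)
        return 0;

    return memcmp(received, expected, (n + 1) * sizeof(wchar_t)) == 0;
}
```

Called from the shared library's exported function with the argument received through JNA and a wide string literal declared in the C source, a result of 1 would support the conclusion that the string contents are not the problem.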

I'd like ideas of what types of environmental things might be useful to print out from the C program, if any of you have ideas. I'll try them on Monday. Or, any other ideas. "

To answer that last question, everything works just fine outside of Eclipse on Windows, so that doesn't seem to be a factor. - Mark

On Fri, Jun 5, 2009 at 5:02 PM, Timothy Wall <twal@dev.java.net> wrote: It's certainly possible that a low memory condition is causing a native malloc() to return a NULL value. However, C code can typically malloc memory until the system slows to a crawl first, unless a single request is abnormally large.

On Jun 5, 2009, at 7:37 PM, Mark Click wrote:

Random question: The successful initialization of these libraries causes a few hundred MB of memory to be consumed (on the Windows machine). Is it possible that in reading these HMM data files (probably some kind of deserialization, I don't know), we're running out of memory and it's showing up as a segfault in the native code? We tried -Xmx1024m with no result, but I'm still curious.

- Mark

On Fri, Jun 5, 2009 at 3:45 PM, Mark Click <mark@gmail.com> wrote: Yes, it does.

On Fri, Jun 5, 2009 at 3:29 PM, Timothy Wall <twal@dev.java.net> wrote: When you do the LD_PRELOAD for libjsig, does the message in the crash dump re: libjsig go away?

On Jun 5, 2009, at 6:25 PM, Mark Click wrote:

On Fri, Jun 5, 2009 at 3:21 PM, Timothy Wall <twal@dev.java.net> wrote:

On Jun 5, 2009, at 4:54 PM, Mark Click wrote:

At any rate, I doubt that's the issue. Looking at the stack trace, readObject is apparently pulling in some bogus data (does the library have a backing db or file-based preferences/settings?).

Indeed, the library has a large number of data files it uses. It appears to be crashing just after loading some Hidden Markov Model data files, which do exist in the filesystem and came straight from the library installer. Notably, those files are successfully loaded by the same call from the C demo, so I'm not sure in what other way they could be bogus.

Unless it's not loading the same files when run under a VM. Try dumping the environment variables for both setups (use getenv, not the shell's dump).
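A minimal sketch of dumping variables with getenv from inside the native code, so the values are the ones the process actually sees at the time of the call. The helper names are hypothetical, and apart from LD_LIBRARY_PATH and LD_PRELOAD (both mentioned in this thread) the variable list is a guess at what might matter.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: prints one variable as this process sees it.
   Returns 1 if the variable is set, 0 if it is unset. */
int dump_env_var(FILE *out, const char *name)
{
    const char *value = getenv(name);

    if (value == NULL) {
        fprintf(out, "%s is unset\n", name);
        return 0;
    }
    fprintf(out, "%s=%s\n", name, value);
    return 1;
}

/* Prints a fixed list of suspects and returns how many of them are set.
   Only LD_LIBRARY_PATH and LD_PRELOAD come from the thread; the rest
   are assumptions. */
int dump_suspect_env(FILE *out)
{
    static const char *const names[] = {
        "LD_LIBRARY_PATH", "LD_PRELOAD", "PATH", "LANG", "LC_ALL"
    };
    int set_count = 0;

    for (size_t i = 0; i < sizeof names / sizeof names[0]; i++)
        set_count += dump_env_var(out, names[i]);
    return set_count;
}
```

Calling `dump_suspect_env(stderr)` once from the standalone C program and once from the wrapper library loaded through JNA gives two listings that can be compared line by line.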

We tried that and all the environment variables are present and have correct values.

- Mark

On Fri, Jun 5, 2009 at 11:31 AM, Timothy Wall <twal@dev.java.net> wrote: Rewrite your C test program to dynamically load the 3rd party library and its entry point (using dlopen/dlsym), then call it via function pointer.
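A rough sketch of that dlopen/dlsym test, assuming the entry point has the shape of the SWIrecInit prototype quoted elsewhere in this thread with the result treated as an int. The helper names and the use of RTLD_LAZY are assumptions; the flags JNA itself uses are not stated here.

```c
#include <dlfcn.h>
#include <stdio.h>
#include <wchar.h>

/* Assumed shape of the entry point: int SWIrecInit(const wchar_t *uri). */
typedef int (*init_fn)(const wchar_t *uri);

/* Hypothetical helper: opens libpath (NULL means the running program)
   and looks up symbol. Returns NULL on any failure, otherwise the symbol
   address, with the library handle stored in *handle_out. */
void *resolve_entry(const char *libpath, const char *symbol, void **handle_out)
{
    void *handle = dlopen(libpath, RTLD_LAZY);
    if (handle == NULL) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return NULL;
    }

    dlerror(); /* clear any stale error before calling dlsym */
    void *sym = dlsym(handle, symbol);
    const char *err = dlerror();
    if (err != NULL || sym == NULL) {
        fprintf(stderr, "dlsym failed: %s\n", err ? err : "symbol is NULL");
        dlclose(handle);
        return NULL;
    }

    *handle_out = handle;
    return sym;
}

/* Hypothetical wrapper: calls the entry point through a function pointer.
   Returns -1 if the library or symbol cannot be resolved. */
int call_init(const char *libpath, const wchar_t *uri)
{
    void *handle = NULL;
    void *sym = resolve_entry(libpath, "SWIrecInit", &handle);
    if (sym == NULL)
        return -1;

    init_fn init = (init_fn)sym; /* conversion permitted by POSIX for dlsym results */
    int rc = init(uri);
    dlclose(handle);
    return rc;
}
```

On older glibc versions this needs to be linked with -ldl. If the call crashes in the same place when made this way from plain C, dynamic loading is the variable worth chasing; if it works, the difference lies elsewhere.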

If that doesn't provide any additional information, determine what other environmental differences there are between the VM-initiated call and the C-initiated call (I can't think of any at the moment, other than CPU and memory usage).

On Jun 5, 2009, at 2:11 PM, Mark Click wrote:

On Fri, Jun 5, 2009 at 4:12 AM, Timothy Wall <twal@dev.java.net> wrote:

On Jun 4, 2009, at 11:50 PM, Mark Click wrote:

Timothy, thanks for the thoughtful response! Let me address each point:

According to Native.isProtected(), I'm not in protected mode.

You should probably do the libjsig setup anyway, since it appears your native code is using signals, which can interfere with the JVM's use of signals.

Good idea! But I tried that and no improvement, unfortunately.

It sure does look like it's dereferencing a null pointer, but like I said the exact same library works fine when called from C with the same parameters. Your suggestion about LD_LIBRARY_PATH is a good one, but the var looks correct whenever I check it, and the libraries only exist in one directory. I agree it's probably not a mapping problem, but the C version is:

SWIrecFuncResult SWIrecInit(const wchar_t *uri);

where SWIrecFuncResult is effectively an int.

The stack trace looks like the native code is attempting to deserialize an object. Perhaps it's relying on some on-disk (db?) data which is set up differently (or not at all) on the linux system.

Agreed, but the libraries are designed for linux. (And we are using the recommended distro of RedHat.)

Something else that might be relevant is that the libs and JRE are 32bit, but the OS is 64bit. I'm not sure I'd expect a problem there, but I figured I should mention it.

Could be an issue if the native code is getting system resources based on a 64-bit setup rather than a 32-bit one. You can always run on a 32-bit OS or virtual machine to make sure.

We tried a 32-bit system this morning, exact same crash.

Thanks again for the reply! Any other ideas you might have are welcome.

- Mark

On Thu, Jun 4, 2009 at 6:47 PM, Timothy Wall <twal@dev.java.net> wrote: If you're running with jna.protected=true, you'll need to preload libjsig.so, e.g.

LD_PRELOAD=/path/to/libjsig.so java {java program args}

You have a single entry point, SWIrecInit(), which presumably expects a wchar_t* argument. On linux, wchar_t is 4 bytes by default, although you should check your library's documentation since it's possible to force a different size. If it's expecting a char* argument and the windows version expects wchar_t*, then you'll need to optionally use a type mapper on the windows version.
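A quick way to confirm the element width on a given build is to ask the compiler and print a wide string one element at a time. The function names below are hypothetical; the sketch only reports what the compiler used to build it believes, which may differ from how the third-party library was built.

```c
#include <stdio.h>
#include <wchar.h>

/* Width of wchar_t in bytes for this compiler and flag combination. */
size_t wchar_width(void)
{
    return sizeof(wchar_t);
}

/* Hypothetical helper: prints each element of s as zero-padded hex,
   one group per element, and returns the number of elements printed. */
int print_wide_units(FILE *out, const wchar_t *s)
{
    int count = 0;

    for (; *s != L'\0'; s++, count++)
        fprintf(out, "%0*lx ", (int)(2 * sizeof(wchar_t)), (unsigned long)*s);
    fputc('\n', out);
    return count;
}
```

Running `print_wide_units` on the pointer received from the Java side shows at a glance whether each character occupies the width the library expects.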

Apparently your third-party library could do more thorough error checking, since it looks like it's dereferencing a NULL pointer.

Given the simple entry point, the problem is more likely environmental than anything to do with the JNA mapping itself (assuming the mapping is correct -- you didn't provide a C prototype). LD_LIBRARY_PATH will likely be different between a C program and a java program (java adjusts the value and may fork a new process to ensure the path includes its own paths).

On Jun 4, 2009, at 8:00 PM, Mark Click wrote:

Hello!

I have a puzzling problem I'm hoping someone can help with! I've searched the archives and the googles but no luck.

I'm working with a set of commercial libraries in both Windows and Linux. I have a C test program and Java/JNA test program. Both work in Windows, but only the C test program works in Linux. The JNA code works _great_ in Windows, but it SIGSEGVs in Linux, in the native code. Everything else about the two installations is otherwise identical as far as I can tell, and we've tried a fair number of obvious things (permissions, paths, etc.) that usually cause cross-platform problems. It's so puzzling that the C test program works in Linux, but JNA accessing the same libraries doesn't.

If you want more detail, I've attached a dump of the crash. If you have any ideas, no matter how far fetched, I'd love to hear them!

<dump.txt>