JVM lockup- how do debug further

Hi folks,

We're seeing a jvm lockup- meaning, the jvm stops executing java code, doesn't respond to thread dump requests, truss of the jvm process (below) suggests some broken locking internal to the jvm- the question is, how do we debug this further, e.g., to figure out what java code was last executed, or what jvm subsystem owned the locks that jvm lwps are sleeping against?

The application is a custom multithreaded java server application running on jdk 1.4.2.08 with the DST patch, on sparc solaris 10. We're running one instance of the application apiece on each of 3 v440s.

The application services RMI and TCP socket clients, and talks to an Oracle 10g database. The subsystem which supports TCP socket clients is built on java.nio, which, because of bug reports against nio on 1.4.x, we suspect for causing this problem. We also have never seen the problem on an RMI-only workload, but because of how the application works we can't run a TCP-only workload, and we're otherwise not sure how to isolate the issue to the TCP/nio- or other- subsystem.

The lockups occur at random at a rate of about one per day, at all times of day and night, with nothing discernably unique in the request pattern or in the workload prior to the lockup. We don't have control over most of the source of the app, so adding logging isn't an option. Attempts to reproduce the problem outside of the production environment have so far been unsuccessful.

Java is:

user@host: /usr/local/java14_home/bin/java -version

java version "1.4.2_08"

Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_08-b03)

Java HotSpot(TM) Client VM (build 1.4.2_08-b03, mixed mode)

Solaris is:

user@host: uname -a

SunOS host 5.10 Generic_118833-22 sun4u sparc SUNW,Sun-Fire-V440

Truss looks like:

user@host: truss -p `cat /xxx.pid`

/37:lwp_cond_wait(0x017BADB8, 0x017BADA0, 0x00000000, 0)

(sleeping...)

/47:lwp_mutex_timedlock(0x00039680, 0x00000000) (sleeping...)

/54:lwp_mutex_timedlock(0x00039680, 0x00000000) (sleeping...)

/15:lwp_cond_wait(0x00039BA0, 0x00039B88, 0x00000000, 0)

(sleeping...)

/40:lwp_mutex_timedlock(0x00039680, 0x00000000) (sleeping...)

/6:lwp_cond_wait(0x00038EB8, 0x00038EA0, 0xF94FFDD8, 0)

(sleeping...)

/16:lwp_cond_wait(0x00039C08, 0x00039BF0, 0x00000000, 0)

(sleeping...)

/31:lwp_mutex_timedlock(0x00039680, 0x00000000) (sleeping...)

/26:pollsys(0xACB7F7A8, 0, 0xACB7F810, 0x00000000) (sleeping...)

/13:lwp_cond_wait(0x00039098, 0x00039080, 0x00000000, 0)

(sleeping...)

/105:lwp_cond_wait(0x0162F820, 0x0162F808, 0x00000000, 0)

(sleeping...)

/5:lwp_cond_wait(0x000BE138, 0x000BE120, 0x00000000, 0)

(sleeping...)

/23:lwp_mutex_timedlock(0x00039680, 0x00000000) (sleeping...)

/4:lwp_cond_wait(0x000BE138, 0x000BE120, 0x00000000, 0)

(sleeping...)

/58:lwp_cond_wait(0x01A8F760, 0x01A8F748, 0x00000000, 0)

(sleeping...)

/30:lwp_mutex_timedlock(0x00039680, 0x00000000) (sleeping...)

/143398:lwp_cond_wait(0x0051C3F8, 0x0051C3E0, 0x00000000, 0)

(sleeping...)

/149245:lwp_cond_wait(0x017E8C48, 0x017E8C30, 0x00000000, 0)

(sleeping...)

/83:lwp_mutex_timedlock(0x00039680, 0x00000000) (sleeping...)

/11:accept(6, 0xB0B7FDB8, 0xB0B7FDB4, SOV_DEFAULT) (sleeping...)

/149244:lwp_cond_wait(0x00FE8E30, 0x00FE8E18, 0x00000000, 0)

(sleeping...)

/149231:lwp_cond_wait(0x00F95888, 0x00F95870, 0x00000000, 0)

(sleeping...)

/95:lwp_mutex_timedlock(0x00039680, 0x00000000) (sleeping...)

/149360:lwp_cond_wait(0x00039768, 0x00039750, 0x00000000, 0)

(sleeping...)

/124531:lwp_mutex_timedlock(0x00039680, 0x00000000)

(sleeping...)

/63:lwp_mutex_timedlock(0x00039680, 0x00000000) (sleeping...)

/149239:lwp_cond_wait(0x00EBD458, 0x00EBD440, 0x00000000, 0)

(sleeping...)

/77955: lwp_mutex_timedlock(0x00039680, 0x00000000) (sleeping...)

/111:lwp_cond_wait(0x0151D300, 0x0151D2E8, 0x00000000, 0)

(sleeping...)

/124:lwp_cond_wait(0x0088A898, 0x0088A880, 0x00000000, 0)

(sleeping...)

/126:lwp_cond_wait(0x01517DD8, 0x01517DC0, 0xA737F7A8, 0)

(sleeping...)

/84571: lwp_cond_wait(0x007343E0, 0x007343C8, 0x00000000, 0)

(sleeping...)

/84:lwp_mutex_timedlock(0x00039680, 0x00000000) (sleeping...)

/117:lwp_mutex_timedlock(0x00039680, 0x00000000) (sleeping...)

/60:lwp_cond_wait(0x01A8F988, 0x01A8F970, 0x00000000, 0)

(sleeping...)

/68:lwp_cond_wait(0x013567B0, 0x01356798, 0x00000000, 0)

(sleeping...)

/85780: lwp_mutex_timedlock(0x00039680, 0x00000000) (sleeping...)

/108:lwp_cond_wait(0x01B9AD48, 0x01B9AD30, 0x00000000, 0)

(sleeping...)

/149358:lwp_mutex_timedlock(0x00039680, 0x00000000)

(sleeping...)

/77957: lwp_mutex_timedlock(0x00039680, 0x00000000) (sleeping...)

/149347:lwp_mutex_timedlock(0x00039680, 0x00000000)

(sleeping...)

/143389:lwp_cond_wait(0x00EBD420, 0x00EBD408, 0x00000000, 0)

(sleeping...)

/67:lwp_cond_wait(0x0120D520, 0x0120D508, 0x00000000, 0)

(sleeping...)

/149346:pollsys(0xA8B7EF08, 1, 0xA8B7EE98, 0x00000000)

(sleeping...)

/62:lwp_cond_wait(0x0120E1F0, 0x0120E1D8, 0x00000000, 0)

(sleeping...)

/66:lwp_cond_wait(0x0120CCE0, 0x0120CCC8, 0x00000000, 0)

(sleeping...)

/115:lwp_cond_wait(0x01245830, 0x01245818, 0xA7D7F7A8, 0)

(sleeping...)

/14148: lwp_mutex_timedlock(0x00039680, 0x00000000) (sleeping...)

/123502:lwp_mutex_timedlock(0x00039680, 0x00000000)

(sleeping...)

/123:lwp_cond_wait(0x0088A728, 0x0088A710, 0xA767F7A8, 0)

(sleeping...)

/115951:lwp_cond_wait(0x0053F758, 0x0053F740, 0x00000000, 0)

(sleeping...)

/143342:lwp_cond_wait(0x013480E0, 0x013480C8, 0x00000000, 0)

(sleeping...)

/149343:lwp_mutex_timedlock(0x00039680, 0x00000000)

(sleeping...)

/73774: lwp_mutex_timedlock(0x00039680, 0x00000000) (sleeping...)

/149234:lwp_cond_wait(0x017D1848, 0x017D1830, 0x00000000, 0)

(sleeping...)

/143390:lwp_cond_wait(0x017E67F8, 0x017E67E0, 0x00000000, 0)

(sleeping...)

/149357:lwp_mutex_timedlock(0x00039680, 0x00000000)

(sleeping...)

/125:lwp_mutex_timedlock(0x00039680, 0x00000000) (sleeping...)

/112:lwp_cond_wait(0x012440D0, 0x012440B8, 0x00000000, 0)

(sleeping...)

/149198:lwp_mutex_timedlock(0x00039680, 0x00000000)

(sleeping...)

/149241:lwp_cond_wait(0x0120AE60, 0x0120AE48, 0x00000000, 0)

(sleeping...)

/140951:lwp_cond_wait(0x00A8A788, 0x00A8A770, 0x00000000, 0)

(sleeping...)

/149330:lwp_mutex_timedlock(0x00039680, 0x00000000)

(sleeping...)

/109:lwp_cond_wait(0x01B9AE10, 0x01B9ADF8, 0x00000000, 0)

(sleeping...)

/64:lwp_cond_wait(0x0120B098, 0x0120B080, 0x00000000, 0)

(sleeping...)

/120:lwp_cond_wait(0x01596790, 0x01596778, 0xA797F7A8, 0)

(sleeping...)

/149359:lwp_mutex_timedlock(0x00039680, 0x00000000)

(sleeping...)

/149342:lwp_mutex_timedlock(0x00039680, 0x00000000)

(sleeping...)

/110:lwp_cond_wait(0x010B27A0, 0x010B2788, 0x00000000, 0)

(sleeping...)

/96:lwp_mutex_timedlock(0x00039680, 0x00000000) (sleeping...)

/75:lwp_mutex_timedlock(0x00039680, 0x00000000) (sleeping...)

/69:lwp_mutex_timedlock(0x00039680, 0x00000000) (sleeping...)

/65:lwp_cond_wait(0x01355540, 0x01355528, 0x00000000, 0)

(sleeping...)

/113:lwp_cond_wait(0x01244A30, 0x01244A18, 0x00000000, 0)

(sleeping...)

/129:lwp_mutex_timedlock(0x00039680, 0x00000000) (sleeping...)

/127:lwp_mutex_timedlock(0x00039680, 0x00000000) (sleeping...)

/143395:lwp_cond_wait(0x01ED19E0, 0x01ED19C8, 0x00000000, 0)

(sleeping...)

/71288: lwp_mutex_timedlock(0x00039680, 0x00000000) (sleeping...)

/59:lwp_cond_wait(0x0198D908, 0x0198D8F0, 0x00000000, 0)

(sleeping...)

What do we do next?

Thanks-

Jonah

[8191 byte] By [jonahbentona] at [2008-2-27]