Stalling timer -- looking for debug suggestions

I have a rather complex system that is having intermittent issues with a timer stalling.

The stalls occur at random and seem to last about 3-5 minutes. After several days of testing (using logs and garbage collector output), it appears to me that during these stalls a significant portion of the JVM this jar is running in is hanging. At least, when I try to run other Java applications during one of these stalls, they also hang.

This error is very difficult to detect, because the counter may run for several hours before stalling, and the stall may only last for a few minutes. Also, since the stall appears (so far) to happen outside my code I haven't yet been able to find a way to specifically signal that a stall has happened in the log file. I end up watching the program operation in the background instead.

Oddly enough, I am still able to access the gui component during these pauses, and can even successfully execute updates against the database.

Now, two things to state really quickly here. One, I'm not expecting someone to give me "the answer" to what's wrong, just to prescribe some additional debugging steps I might take to find out what's going on. Two, there is quite a lot of code involved, so I'm posting some pseudocode of what the program is doing in general (so I don't have to post several files). If the information is insufficient, I'll try to be more specific in a subsequent reply on this thread.

class MyMainClass extends TimerTask {

    // Private attribute of my window object
    // Private attribute of an OS interface that uses JNI

    public void run() {
        System.out.println("Starting Run Pass");

        // Run the JNI OS specific checks to find out what updates are needed
        // Run some checks against the database
        // Update the gui, if necessary

        System.out.println("Ending Run Pass");
    }

    public static void main(String[] args) {
        // Init the initial database connection
        // Init new instance of the gui object (passing it db connection)

        java.util.Timer timer = new java.util.Timer();

        // Set up some gui listeners (closed window, etc)
        // Shutdown hooks (cleans up library loads, etc)

        MyMainClass classInstance = new MyMainClass();

        // Run the task immediately, then once every 1000 ms
        timer.schedule(classInstance, 0, 1000);
    }
}

Now, when it actually stalls, the log file always seems to end on an "Ending Run Pass" output. When the stall ends, it immediately prints "Starting Run Pass". So I'm fairly certain that the stall isn't directly related to code occurring between those two debug statements (though I have also been looking into garbage collection, in case too many objects are being created and destroyed).
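Since the stall seems to sit between "Ending Run Pass" and the next "Starting Run Pass", one way to get it flagged in the log is to time that gap. The following is only a minimal sketch: the class name `StallDetector`, its method names, and the threshold are hypothetical and not part of the code described above.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: measures the gap between the end of one pass and the start of the next. */
class StallDetector {
    private final long thresholdNanos;
    private long lastEndNanos;
    private boolean passHasEnded = false; // false until the first pass finishes

    StallDetector(long thresholdMillis) {
        this.thresholdNanos = TimeUnit.MILLISECONDS.toNanos(thresholdMillis);
    }

    /**
     * Call at the top of run(), passing System.nanoTime().
     * Returns the gap in milliseconds if it exceeded the threshold, otherwise -1.
     */
    synchronized long passStarted(long nowNanos) {
        if (!passHasEnded) {
            return -1; // nothing to compare against yet
        }
        long gapNanos = nowNanos - lastEndNanos;
        if (gapNanos > thresholdNanos) {
            long gapMillis = TimeUnit.NANOSECONDS.toMillis(gapNanos);
            System.out.println("STALL: " + gapMillis + " ms between passes");
            return gapMillis;
        }
        return -1;
    }

    /** Call at the bottom of run(), passing System.nanoTime(). */
    synchronized void passEnded(long nowNanos) {
        lastEndNanos = nowNanos;
        passHasEnded = true;
    }
}
```

In `run()` this would be called as `detector.passStarted(System.nanoTime())` right after the "Starting Run Pass" line and `detector.passEnded(System.nanoTime())` right before "Ending Run Pass". It only reports the stall after the fact, but it gives a searchable marker and a measured duration.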

Does anyone have any suggestions for other tests I could try to run to find out what is causing this stall?

Thanks,

marloke

Message was edited by:

marloke

(I noticed that I left a sentence unfinished)

[3746 byte] By [marlokea] at [2007-11-15]
# 1

I did consulting for many years. The "performance analysis and tuning" gigs were things I tried to stay away from unless I was desperate, very desperate.

It always required weeks of grueling code analysis and testing many, many scenarios. It ALWAYS resulted in showing the client that his design was seriously flawed and receiving many insults.

My suggestion is to log everything in every way all the time until you narrow down the area to search (like a binary search.)

cooper6a at 2007-7-29 > top of java,Core,Core APIs...
# 2

> I did consulting for many years. The "performance

> analysis and tuning" gigs were things I tried to stay

> away from unless I was desperate, very desperate.

>

> It always required weeks of grueling code analysis

> and testing many, many scenarios. It ALWAYS resulted

> in showing the client that his design was seriously

> flawed and receiving many insults.

I always just hooked up a profiler and (quickly and easily) could immediately see bottlenecks, memory leaks, trouble spots, etc. Although at some point a code review is probably going to happen, and that can take time, especially if there are millions of lines of code.

> My suggestion is to log everything in every way all

> the time until you narrow down the area to search

> (like a binary search.)

Not a bad suggestion, but not a great one. In large systems with lots of code, you're going to end up with gigs of log files to search through. That's going to take lots of time as well.

SoulTech2012a at 2007-7-29
# 3

Well, I thank you for that suggestion. It's my own code, so I doubt I'll get offended if I find (as I suspect) that I've done something incorrect somewhere.

Other than wrapping every single line of code in a log entry (which I'm pretty close to at this point), and examining verbose garbage collection output, is there something else I should be looking into that I'm missing?

Thanks,

marloke

marlokea at 2007-7-29
# 4

> i always just hooked up a profiler and (quickly and

> easily) could immediately see bottlenecks, memory

> leaks, trouble spots, etc. although at some point a

> code review is probably going to happen and that can

> take time, especially if there's millions of lines or

> code

Is there any particular profiler you'd recommend? (Preferably open source and fairly quick to set up.)

marlokea at 2007-7-29
# 5

Do you really need this task to run every second? And does it take less than a second to execute? I would examine both these assumptions. And probably if the task decides it has to do something, I would cancel the task while it is doing it and reinstate the task when done.

ejpa at 2007-7-29
# 6

> Is there any particular profiler you'd recommend?

OptimizeIt. Although it's not free. You'll have to use NetBeans IDE if you want a free one. But there could be a new one out there I'm not aware of or more options.

SoulTech2012a at 2007-7-29
# 7

> OptimizeIt. Although it's not free. You'll have to

> use NetBeans IDE if you want a free one. But there

> could be a new one out there I'm not aware of or more

> options.

Ah thanks. I'll keep that one on my short list if the one I just started using doesn't have enough useful information.

For the moment Profiler4j seems to be giving me quite a bit of useful information, though I'm still not certain where the application focus was when the "freeze" occurred again (all threads appear to be waiting for something). I'll try watching a few more cycles, and then I'll see what features OptimizeIt has (if it can tell me exactly where the JVM focus is at a given moment, it might be worth looking into).

marlokea at 2007-7-29
# 8

> Do you really need this task to run every second? And

> does it take less than a second to execute? I would

> examine both these assumptions. And probably if the

> task decides it has to do something, I would cancel

> the task while it is doing it and reinstate the task

> when done.

Actually I don't particularly care if it runs every second. The only reason the timer is running every second is to update a clock display for the user. If it manages to get off by a few seconds, it doesn't matter much because every few minutes I verify the clock against the database.

On your average second though, it only performs a few checks and updates the counter.

I'm more concerned about several minutes' worth of delay (which could significantly harm the process). Right now I'm tracking the memory usage to see if it's related to garbage collection, but I haven't definitively found that to be the case yet.
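To check the garbage collection theory from inside the program, the accumulated GC time can be read through the standard `java.lang.management` beans and logged next to the pass markers. This is a rough sketch only; the class name `GcSnapshot` and the output format are made up for illustration.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper: one-line summary of GC time and heap use for the log. */
class GcSnapshot {

    /** Approximate total milliseconds spent in GC, summed over all collectors, since JVM start. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if the collector does not report a time
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Returns e.g. "gcMillis=123 usedKb=45678" for appending to a log line. */
    static String describe() {
        Runtime rt = Runtime.getRuntime();
        long usedKb = (rt.totalMemory() - rt.freeMemory()) / 1024;
        return "gcMillis=" + totalGcMillis() + " usedKb=" + usedKb;
    }
}
```

Printing `GcSnapshot.describe()` with both "Starting Run Pass" and "Ending Run Pass" gives a before/after pair around each gap. If `gcMillis` barely moves across a multi-minute stall, garbage collection is probably not where the time is going, though the reported figures are approximate.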

marlokea at 2007-7-29
# 9

> (all threads appear to be waiting for something)

You still haven't described the dynamic (a.k.a. runtime) relationships, logical or illogical, among your threads and resources. Only a very minor portion of concurrency errors (deadlock, livelock, starvation, et al.) can be detected by a static tool like FindBugs. And have you looked at a thread dump yet?
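A thread dump can be taken from outside the process (`jstack <pid>`, `kill -QUIT <pid>` on Unix, or Ctrl+Break in a Windows console), or produced from inside it. The following is a minimal sketch of the in-process variant using only standard JDK calls; the class name `ThreadDumper` is hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.Map;

/** Hypothetical helper: builds a textual thread dump and checks for monitor deadlocks. */
class ThreadDumper {

    /** Returns the name, state and stack trace of every live thread. */
    static String dumpAllThreads() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            sb.append('"').append(t.getName()).append("\" state=").append(t.getState()).append('\n');
            for (StackTraceElement frame : e.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    /** Number of threads the JVM reports as deadlocked on object monitors (0 if none). */
    static int deadlockedThreadCount() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findMonitorDeadlockedThreads(); // null when no deadlock is found
        return ids == null ? 0 : ids.length;
    }
}
```

Calling `dumpAllThreads()` from a GUI action during a stall would show what the timer thread is blocked on. Note that a thread sitting inside a JNI call is reported as RUNNABLE, so a RUNNABLE timer thread with a native method at the top of its stack would point at the native side.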

hiwaa at 2007-7-29
# 10

Clearly you need two scheduled tasks: one to update the clock every second, and another to do whatever the other one does, probably every minute or so. If this one is long-running when it finds some work it should cancel and reinstate itself around the work as I said above.
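As a sketch of that split, assuming a `ScheduledExecutorService` is acceptable in place of `java.util.Timer`: a `Timer` runs all of its tasks on one background thread, so a slow check would still hold up the clock, whereas a two-thread pool keeps them apart. The class name `SplitSchedule`, the counters, and the periods are placeholders for the real clock update and database/JNI work.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: a cheap clock tick and a slower check on separate schedules. */
class SplitSchedule {
    final AtomicInteger clockTicks = new AtomicInteger();
    final AtomicInteger checks = new AtomicInteger();
    private final ScheduledExecutorService pool = Executors.newScheduledThreadPool(2);

    void start(long clockPeriodMs, long checkDelayMs) {
        // Cheap task: stands in for updating the clock display.
        pool.scheduleAtFixedRate(() -> clockTicks.incrementAndGet(),
                0, clockPeriodMs, TimeUnit.MILLISECONDS);
        // Potentially slow task: stands in for the JNI and database checks.
        // With fixed delay, the next run is scheduled only after the previous one
        // finishes, which has the same effect as cancelling and reinstating the task.
        pool.scheduleWithFixedDelay(() -> checks.incrementAndGet(),
                0, checkDelayMs, TimeUnit.MILLISECONDS);
    }

    void stop() {
        pool.shutdownNow();
    }

    /** Runs both tasks for durationMs and returns {clockTicks, checks}. */
    static int[] runBriefly(long clockPeriodMs, long checkDelayMs, long durationMs) {
        SplitSchedule s = new SplitSchedule();
        s.start(clockPeriodMs, checkDelayMs);
        try {
            Thread.sleep(durationMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            s.stop();
        }
        return new int[] { s.clockTicks.get(), s.checks.get() };
    }
}
```

In the real program the periods would be something like `start(1000, 60000)`. An exception thrown from either task silently cancels its future runs, so the task bodies should catch and log their own failures.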

ejpa at 2007-7-29