Hi Tony,
thanks for your interest. Below is the example GC log.
We always set start and max heap size for each box in cluster explicitly. For example, the box bellow is SunFire V440, 4xCPU, 16GB mem, Sol8 108528-27. Start heap is 4G and max is 8GB.
My theory is that app is filling up YG faster than is the average lifetime of short-lived objects. YGC happens before objects become a garbage. Therefore most (or all) objects are promoted to Tenured generation every time. High ParScavenge time is then time spend allocating objects in CMS. That's the theory ;-)
I plan to upgrade customer's test cluster to 1.5.0_08 and do some tests in tweaking GC setting. The fist one will be to double the YG size.
Regards,
BranoZ
This is how it started this time. 50s is quite extreme example:
21998.046: [ParNew 2765647K->2264533K(4190272K), 0.1930396 secs]
22001.982: [ParNew 2780757K->2268339K(4190272K), 0.0863709 secs]
22004.203: [GC 2534762K(4190272K), 0.3614055 secs]
22006.736: [ParNew 2783998K->2283574K(4190272K), 0.1430293 secs]
22011.286: [ParNew 2794833K->2302833K(4190272K), 0.2536093 secs]
22015.136: [GC 2748687K(4326184K), 2.0708461 secs]
22017.641: [ParNew 2799778K->2303906K(4326184K), 0.1923121 secs]
22022.036: [ParNew 2820130K->2317597K(4326184K), 50.2958672 secs]
22076.085: [ParNew 2833821K->2321863K(4326184K), 50.1316508 secs]
22130.317: [ParNew 2838087K->2324692K(4326184K), 50.4934666 secs]
22185.006: [ParNew 2840916K->2333814K(4326184K), 50.6857353 secs]
22240.106: [ParNew 2850038K->2359014K(4326184K), 50.3882597 secs]
22292.626: [GC 2618318K(4326184K), 0.3510936 secs]
22295.186: [ParNew 2874430K->2384467K(4326184K), 49.8472176 secs]
22349.375: [ParNew 2891323K->2396182K(4326184K), 50.1110282 secs]
22403.254: [ParNew 2828910K->2316157K(4374736K), 50.4352997 secs]
22453.691: [GC 2316157K(4374736K), 0.0074683 secs]
22458.150: [ParNew 2832381K->2338460K(4374736K), 50.2660043 secs]
22512.901: [ParNew 2854684K->2364763K(4374736K), 50.3661762 secs]
22567.657: [ParNew 2880987K->2382561K(4374736K), 50.5471023 secs]
22622.293: [ParNew 2898785K->2385447K(4374736K), 50.3793910 secs]
22676.788: [ParNew 2901671K->2388276K(4374736K), 49.9774859 secs]
22728.806: [GC 2650214K(4374736K), 0.3244817 secs]
22731.177: [ParNew 2903593K->2390040K(4374736K), 50.0629669 secs]
22785.342: [ParNew 2898932K->2384656K(4374736K), 49.9562792 secs]
22839.320: [ParNew 2788365K->2276543K(4374736K), 3.0625406 secs]
22842.385: [GC 2279399K(4374736K), 0.0079225 secs]
22846.460: [ParNew 2792767K->2281581K(4374736K), 2.9592045 secs]
22853.282: [ParNew 2797805K->2284871K(4374736K), 3.0843182 secs]
22860.492: [ParNew 2801095K->2287568K(4374736K), 2.9546525 secs]
22867.708: [ParNew 2803792K->2308495K(4374736K), 3.1377750 secs]
..and this is from the same log now. It didn't drop below 3s. In HPJtune you can clearly see the terace like graph. Sometimes ParNew makes several switches to different levels (up or down) between two restarts.
140128.154: [ParNew 2565163K->2058622K(5711608K), 3.1290494 secs]
140143.496: [ParNew 2574846K->2072318K(5711608K), 3.2323037 secs]
140284.063: [ParNew 2588542K->2105889K(5711608K), 3.2494839 secs]
140287.315: [GC 2111125K(5711608K), 0.0083492 secs]
140335.187: [GC 2375049K(5711608K), 2.0569228 secs]
140342.039: [ParNew 1669011K->1155124K(5711608K), 3.1821818 secs]
140355.287: [ParNew 1138700K->624353K(5711608K), 3.0024189 secs]
140363.757: [ParNew 1140577K->658753K(5711608K), 3.1779254 secs]
140372.735: [ParNew 1174977K->690875K(5711608K), 3.2534315 secs]
140383.115: [ParNew 1207099K->694233K(5711608K), 3.1505525 secs]
140405.618: [ParNew 1210457K->698244K(5711608K), 2.9981982 secs]
140417.257: [ParNew 1214468K->703503K(5711608K), 3.2022825 secs]
140427.270: [ParNew 1219727K->737817K(5711608K), 3.1956219 secs]
140437.566: [ParNew 1254041K->768785K(5711608K), 3.3210778 secs]
140447.477: [ParNew 1285009K->772447K(5711608K), 3.1232145 secs]
140456.386: [ParNew 1288671K->775552K(5711608K), 3.1227987 secs]
140465.315: [ParNew 1291776K->779204K(5711608K), 3.0075379 secs]
140474.177: [ParNew 1295428K->782192K(5711608K), 3.0002545 secs]
140483.006: [ParNew 1298416K->785233K(5711608K), 3.1808492 secs]
140490.427: [ParNew 1301457K->789180K(5711608K), 2.9973702 secs]
140498.923: [ParNew 1305404K->801785K(5711608K), 3.0341292 secs]
140507.208: [ParNew 1318009K->832232K(5711608K), 3.4033004 secs]
140518.370: [ParNew 1348456K->854803K(5711608K), 3.1722517 secs]
140529.946: [ParNew 1371027K->856848K(5711608K), 2.9901605 secs]
140539.483: [ParNew 1373072K->860134K(5711608K), 3.0173354 secs]
140549.256: [ParNew 1376358K->900292K(5711608K), 3.1819922 secs]
140559.872: [ParNew 1416516K->924135K(5711608K), 3.1670494 secs]
140571.467: [ParNew 1440359K->926205K(5711608K), 3.1295034 secs]
140589.407: [ParNew 1442429K->932570K(5711608K), 3.0187656 secs]
140599.497: [ParNew 1448794K->966578K(5711608K), 3.1905812 secs]
140608.176: [ParNew 1482802K->1000361K(5711608K), 3.3525339 secs]
140617.893: [ParNew 1516585K->1005821K(5711608K), 3.1527976 secs]
140627.198: [ParNew 1522045K->1033823K(5711608K), 3.2412903 secs]
140635.415: [ParNew 1550047K->1061601K(5711608K), 3.2880912 secs]
140650.037: [ParNew 1577825K->1093346K(5711608K), 3.1435323 secs]
140683.236: [ParNew 1609570K->1141199K(5711608K), 3.5401014 secs]
140713.168: [ParNew 1657423K->1194808K(5711608K), 3.3053307 secs]
140722.668: [ParNew 1711032K->1210168K(5711608K), 3.1834288 secs]
140761.051: [ParNew 1726392K->1274228K(5711608K), 3.3631744 secs]
140841.084: [ParNew 1790452K->1340929K(5711608K), 3.4981368 secs]
140881.543: [ParNew 1857153K->1375003K(5711608K), 3.1548356 secs]
140911.008: [ParNew 1891227K->1469072K(5711608K), 3.5447226 secs]
140983.170: [ParNew 1985296K->1543070K(5711608K), 4.1797274 secs]
141012.498: [ParNew 2059294K->1581317K(5711608K), 3.2417649 secs]
141068.673: [ParNew 2097541K->1625653K(5711608K), 3.3013306 secs]
141128.463: [ParNew 2141877K->1671267K(5711608K), 3.3046175 secs]
141178.443: [ParNew 2187491K->1710921K(5711608K), 3.2494128 secs]
141237.067: [ParNew 2227145K->1752891K(5711608K), 3.1965221 secs]
141268.104: [ParNew 2269115K->1799532K(5711608K), 3.3200524 secs]
141331.488: [ParNew 2315756K->1820042K(5711608K), 3.1471623 secs]
141405.321: [ParNew 2336266K->1896598K(5711608K), 4.2776222 secs]
141438.659: [ParNew 2412822K->1930054K(5711608K), 3.1951150 secs]
141621.927: [ParNew 2446278K->1986950K(5711608K), 3.2985364 secs]
142399.516: [ParNew 2503174K->2007449K(5711608K), 3.0788738 secs]
142408.417: [ParNew 2523673K->2034172K(5711608K), 3.4990466 secs]
142418.043: [ParNew 2550396K->2036805K(5711608K), 3.1383824 secs]
142506.132: [ParNew 2553029K->2072454K(5711608K), 3.1656547 secs]
142601.308: [ParNew 2588678K->2102850K(5711608K), 3.1486802 secs]
142604.460: [GC 2109933K(5711608K), 0.0082203 secs]
142679.497: [GC 2366531K(5711608K), 1.3149546 secs]
142754.615: [ParNew 932987K->453486K(5711608K), 3.1585706 secs]
143298.743: [ParNew 969710K->482957K(5711608K), 3.2853786 secs]
143308.759: [ParNew 999181K->487386K(5711608K), 3.1410462 secs]
143318.435: [ParNew 1003610K->489341K(5711608K), 3.1152526 secs]
143326.843: [ParNew 1005565K->516596K(5711608K), 3.1784453 secs]
143335.992: [ParNew 1032820K->520940K(5711608K), 3.1349650 secs]
143345.601: [ParNew 1037164K->522456K(5711608K), 2.9855287 secs]
143355.095: [ParNew 1038680K->523980K(5711608K), 2.9844280 secs]
143364.532: [ParNew 1040204K->525973K(5711608K), 2.9859568 secs]
143373.994: [ParNew 1042197K->528024K(5711608K), 2.9859666 secs]
143380.585: [ParNew 1044248K->553935K(5711608K), 3.0960279 secs]
143387.158: [ParNew 1070159K->560324K(5711608K), 3.2001745 secs]
143393.919: [ParNew 1076548K->565960K(5711608K), 3.0185440 secs]
143401.046: [ParNew 1082184K->591880K(5711608K), 3.2659240 secs]
143409.763: [ParNew 1108104K->596278K(5711608K), 3.0180708 secs]
143418.044: [ParNew 1112502K->630460K(5711608K), 3.3750751 secs]
143426.782: [ParNew 1146684K->641672K(5711608K), 3.2145800 secs]
143438.725: [ParNew 1157896K->673599K(5711608K), 3.2033111 secs]
> > My theory is that app is filling up YG faster than
> is the average lifetime of short-lived objects.
>
> I don't think this is the case. If you actually look
> at the GC log the young collections reclaim roughly
> the same amount of space, before and after the times
> spiking. I take it there's no phase change in your
That's what I wasn't sure about. If the reclaimed bytes in ParNew include promoted objects or not.
> application (i.e., the application starting to do
> something radically different) around that time, or
> anything like that?
Yes. The application creates bursts of activity that last for couple of minutes (large DB commits). But after that, ParNew does not return to previous level. No all bursts trigger this problem. Smaller jobs are passed without any effect.
> By looking at the GC log I can see that the problem
> occurs right after a marking / sweeping cycle. This
> is very similar to a bug we are currently looking at.
> The ID is 6433335 if you care to look it up. The
Yes. The bug pretty much describes my problem. Thanks for pointing me to the ID.
> young GCs just hit a pathological performance issue,
> after more free space is made available.
You are right, I had the feeling that it happens around the resize of main heap space. Our NewSize is fixed, though.
> The bug seems to exist in both _08, as well as 6.0.
> So, I don't think upgrading will help. The workaround
> we propose is to try disabling parallel young GCs,
> -XX:-UseParNewGC. This will increase young GC times,
> but it might prevent the GC from hitting the
> worst-case that you're seeing.
Good to know that somebody is looking at it already. Thanks for the hint, I will definitely try that.
So far I haven't been able to consistently reproduce it on test cluster (still on _06). Only one node has jumped from 0.05s to 0.25s level. That's barely recognizable.
Regards,
BranoZ