wait(0) - Timed_Waiting vs Waiting: Troubleshooting Practice


The Problem

Troubleshooting

The first step is to reproduce this problem:

  • verify the problem by running the ant beast command
  • try to reproduce in IDE and debug mode
    • Add @com.carrotsearch.randomizedtesting.annotations.Repeat(iterations = 50) in PeerSyncTest
    • Add breakpoint at throw new RuntimeException(e); in VersionBucket.awaitNanos

Troubleshooting: another clue

Another clue from the threaddump would also help to troubleshoot the issue:

  • we are using wait(timeout) in the code, but the thread was in WAITING status, not TIMED_WAITING status
  • From TIMED_WAITING vs WAITING

    • WAITING: waiting indefinitely:
      • Object#wait() with no timeout
      • Thread#join() with no timeout
      • LockSupport#park()
    • TIMED_WAITING: waiting for up to a specified waiting time
      • Thread#sleep()
      • Object#wait() with timeout
      • Thread#join() with timeout
      • LockSupport#parkNanos()
      • LockSupport#parkUntil()
      [beaster] "qtp261698385-112" #112 prio=5 os_prio=31 cpu=799.10ms elapsed=123.36s tid=0x00007ff5cd66d800 nid=0x1ea03 in Object.wait()  [0x000070001417a000]
      [beaster]    java.lang.Thread.State: WAITING (on object monitor)
      [beaster]   at java.lang.Object.wait(java.base@11.0.2/Native Method)
      [beaster]   - waiting on <0x00000007ed6ae158> (a org.apache.solr.update.VersionBucket)
      [beaster]   at org.apache.solr.update.VersionBucket.awaitNanos(VersionBucket.java:66)
      [beaster]   at org.apache.solr.update.processor.DistributedUpdateProcessor.doWaitForDependentUpdates(DistributedUpdateProcessor.java:593)
      [beaster]   at org.apache.solr.update.processor.DistributedUpdateProcessor.lambda$waitForDependentUpdates$1(DistributedUpdateProcessor.java:536)
      [beaster]   at org.apache.solr.update.processor.DistributedUpdateProcessor$$Lambda$524/0x0000000800768040.apply(Unknown Source)
      [beaster]   at org.apache.solr.update.VersionBucket.runWithLock(VersionBucket.java:50)

The root cause

  • The root cause is in the following code: the precondition is nanosTimeout>0, but after convert it to millseconds, it may be 0
  • wait 0 or -1 would wait for ever.
  • The fix would be simple: only wait if millseconds>0.

Why didn’t capture this problem before?

  • The problem doesn’t always happen: it may happen after run same test about 10 times.

How to prevent it happen again?

What Knowledge we learned?

  • wait
    • negative value: throw IllegalArgumentException
    • 0: wait forever until it’s awakened
  • condition.awaitNanos
    • 0 or negative value: return immediately
  • when we convert numbers between different units, be careful:
    • overflow
    • underflow: 0 in this case
  • be careful of special cases :0

Misc

Threaddump Analysis

Another mistake made

  • When add data/fields, we need understand how this object is used, how many instances in one core/node/clusters
  • Increase 100 bytes in one object may be insignificant, but it would be a problem if there are 10K or more instances.

Labels

adsense (5) Algorithm (69) Algorithm Series (35) Android (7) ANT (6) bat (8) Big Data (7) Blogger (14) Bugs (6) Cache (5) Chrome (19) Code Example (29) Code Quality (7) Coding Skills (5) Database (7) Debug (16) Design (5) Dev Tips (63) Eclipse (32) Git (5) Google (33) Guava (7) How to (9) Http Client (8) IDE (7) Interview (88) J2EE (13) J2SE (49) Java (186) JavaScript (27) JSON (7) Learning code (9) Lesson Learned (6) Linux (26) Lucene-Solr (112) Mac (10) Maven (8) Network (9) Nutch2 (18) Performance (9) PowerShell (11) Problem Solving (11) Programmer Skills (6) regex (5) Scala (6) Security (9) Soft Skills (38) Spring (22) System Design (11) Testing (7) Text Mining (14) Tips (17) Tools (24) Troubleshooting (29) UIMA (9) Web Development (19) Windows (21) xml (5)