Synchronization under the hood, and why Java 5 improves it (ctd)

(Continued from our discussion of Java synchronization under the hood.)

To provide low-level synchronization, most modern CPUs offer an instruction (or various instructions) which essentially read and write the same memory location as a single, uninterruptible operation. In other words, you first read the value of the memory location. Then, you calculate what you want the updated value to be. This is essentially the moment at which another process could "sneak in" in the meantime. Then, you invoke the atomic read-write operation, which sets the value to what you want it to be and tells you what the value was at that instant, before the new one was written. If this previous value isn't what you expect it to be (i.e. not equal to the value you read a moment ago), then you know another process has "stepped in" and you need to repeat the operation.

A generic term used for the atomic read-write operation is Compare-And-Swap (CAS) or, in the case of the variant we'll assume here, Compare-And-Set (CAS). In the Compare-And-Set variant, we tell the processor what we think the value of the given memory location should be, along with the new value to write if and only if the previous value was indeed what we said we expected it to be. The result of the instruction is a boolean indicating whether or not the processor could indeed write our new value (i.e. whether our expected value was indeed the value prior to writing). To illustrate this, let's assume that our block of "lock housekeeping" data consists of three 32-bit words in memory and looks something like this1:

WORD 0 : Owning thread ID
WORD 1 : Lock count
WORD 2 : Operating system lock object ID

Now, our logic for accessing the lock data can go something like this:

while (not done) {
  read previous thread ID;
  if (previous thread ID == my thread ID) {
    lock count++;
    done = true;
  } else if (previous thread ID == 0) {
    // No previous thread owner, so atomically set us as the owner
    done = CAS(WORD-0, 0, my thread ID);
    if (done) {
      lock count = 1;
    }
  } else {
    // Another thread already has the lock; somehow wait for it to be released
  }
}

The important steps are the initial read of the previous thread ID and the CAS. What we essentially do is: (a) read the previous thread owner ID from Word 0 and check that it is zero, meaning no other thread currently has the lock; (b) use the CAS instruction to say "write my thread ID to Word 0 if and only if the previous value was zero, and tell me if it was written"; (c) if the CAS instruction tells us that it did go ahead and write it, because the previous value was still zero, then we have the lock; (d) if not, that means another thread "snuck in" and set its thread ID, and we have to loop round again to read the new value of the thread ID. If the previous thread ID is not zero but is not our thread ID, then somebody else already has the lock and we have to wait for it to release it.
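The acquire logic above can be rendered in Java using the atomic classes that Java 5 introduces (as we'll see on the next page, these expose CAS directly). This is a simplified sketch, not how any real JVM implements its locks: the class and field names are purely illustrative, with an AtomicInteger standing in for "Word 0" and a plain int for the lock count.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class SpinLock {
    static final int NO_OWNER = 0;
    // "Word 0": ID of the owning thread, or NO_OWNER if the lock is free
    final AtomicInteger ownerThreadId = new AtomicInteger(NO_OWNER);
    // "Word 1": re-entry count; only ever touched by the lock's owner
    int lockCount;

    /** Try once to acquire or re-enter the lock; returns true on success. */
    boolean tryAcquire(int myThreadId) {
        int previousOwner = ownerThreadId.get();
        if (previousOwner == myThreadId) {
            lockCount++;                       // we already own it: re-enter
            return true;
        }
        if (previousOwner == NO_OWNER
                && ownerThreadId.compareAndSet(NO_OWNER, myThreadId)) {
            lockCount = 1;                     // CAS succeeded: lock is ours
            return true;
        }
        return false;                          // another thread owns the lock
    }

    /** Leave the lock once; frees it when the count reaches zero. */
    void release(int myThreadId) {
        if (ownerThreadId.get() == myThreadId && --lockCount == 0) {
            ownerThreadId.set(NO_OWNER);       // hand the lock back
        }
    }
}
```

A caller that gets false back from tryAcquire would then fall into the "somehow wait" branch discussed below.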

To wait for the lock to be released, one option is to sleep for a while and then try again. Often, we can actually ask the operating system to help us with this task. On many systems, we can create an operating system monitor object and ask the OS to tell us when another thread (which in our case would be the thread that released the synchronization lock at the end of its synchronized block) sends a "notify" message to it. We'd only want to create at most one OS monitor object per Java object, even though multiple threads might be trying to wait for the lock. And to keep resources down, we'd probably only want to create the OS monitor object when it was actually needed. Another CAS operation can be used for this, so that a simple implementation of "wait for the lock" could go something like this:

while (not created_os_monitor) {
  os_monitor_object = WORD 2;
  if (os_monitor_object == NULL) {
    os_monitor_object = new OSMonitor();
    created_os_monitor = CAS(WORD-2, NULL, os_monitor_object);
    if (!created_os_monitor) {
      // If the CAS fails, somebody else has snuck in and created the monitor,
      // so delete the one we've just created.
      delete os_monitor_object;
    }
  } else {
    // The monitor already exists (possibly created by a thread that snuck in)
    created_os_monitor = true;
  }
}
ask OS to wait for a "notify" message to os_monitor_object;

Again, the CAS operation sits inside a loop. In the unlikely event that the CAS fails because another thread has snuck in, it returns false and we go round the loop again and pick up the reference to the OSMonitor object created by the other thread. If this happened, there'd be a slight inefficiency because our thread would create an OSMonitor which it would then immediately discard. But we live with this inefficiency because we think it's unlikely to occur, and because the important condition of keeping to a maximum of one OSMonitor object per Java object would still hold. The code to deal with leaving the synchronized block would, as well as setting the owning thread ID to zero when the lock count reached zero, have to check for the presence of an OS monitor object and, if one existed, ask the OS to send a "notify" message to any waiter.
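This "create at most once via CAS" pattern also has a direct Java equivalent using AtomicReference (again as introduced in Java 5). The sketch below is illustrative only: OSMonitor here is a stand-in class, since Java code can't manipulate real OS monitor handles directly.

```java
import java.util.concurrent.atomic.AtomicReference;

public class LazyMonitor {
    // Stand-in for the OS monitor object of the main text (purely illustrative).
    static class OSMonitor { }

    // "Word 2": null until some thread installs a monitor via CAS
    final AtomicReference<OSMonitor> monitorRef = new AtomicReference<>();

    /** Return the monitor, creating it at most once across all threads. */
    OSMonitor getOrCreateMonitor() {
        OSMonitor existing = monitorRef.get();
        if (existing != null) {
            return existing;                      // already created: reuse it
        }
        OSMonitor created = new OSMonitor();
        if (monitorRef.compareAndSet(null, created)) {
            return created;                       // we won: ours is installed
        }
        // Another thread snuck in between our get() and our CAS; discard ours
        // (the garbage collector reclaims it) and use the one it installed.
        return monitorRef.get();
    }
}
```

Note that in Java the "delete" of the discarded monitor happens implicitly: the losing thread simply drops its reference and lets the garbage collector reclaim the object.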

Note that in these cases, we know that going round the loop again immediately is OK because we know that the next time round we're very likely to succeed – or at least, very likely not to loop again. There was probably only one other thread that snuck in, and since the value has changed, it has finished its "sneaking in". We're generally not going to sit in the loop burning CPU: if the CAS isn't successful the first time, then it either will be the second time, or else we'll have to take some other action anyway.

Contrast this with what happens if we actually do have to wait for the lock. If the JVM knew for certain that the thread that had the lock was currently running on another processor and about to release the lock, then the same "loop round and try again" strategy would probably be the most efficient. But it can't generally make this kind of assumption – or at least, not without wasting time deciding2. So if we just spin in a loop waiting for the lock, there is a risk that we'll burn quite a lot of CPU while waiting for the other thread to release the synchronization lock.

This is why the most general case is to use the operating system facilities to wait for a "notify". That way, we don't burn CPU. But on the downside, if the operation being performed by the other thread is trivial, chances are we'll wait much longer than strictly necessary. Depending on how the operating system's thread scheduling works, it is likely to mean suspending our thread until at least the next interrupt (and in fact, even if the thread doesn't need to be suspended, the thread can still be penalised)3. So even though the Java code inside the competing synchronized block is simply reading or updating a single integer variable, which could involve nothing more than a couple of machine code instructions, our thread's going to have to wait several milliseconds for it. And all this is without considering that, whether we have to wait or not, we have to synchronize local copies of variables with main memory at the point of acquiring and releasing the lock. In particular, we have no way of saying to the JVM "I'm only going to change the variable count, so this is the only one you need to refresh to/from your caches, and you can still re-order access to other variables for the sake of optimisation".

On the next page, we see how Java 5 improves this situation by exposing CAS operations to the Java programmer.
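As a taster, here is a minimal sketch of the read/compute/CAS/retry cycle described at the top of this page, written with java.util.concurrent.atomic.AtomicInteger (the class Java 5 provides for exactly this purpose); the method name incrementAndGetWithCas is our own illustrative choice.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CasIncrement {
    // Atomically increment a value using the read/compute/CAS/retry cycle.
    static int incrementAndGetWithCas(AtomicInteger value) {
        while (true) {
            int previous = value.get();       // (1) read the current value
            int updated = previous + 1;       // (2) compute the new value
            // (3) write only if nobody "snuck in" since step (1); else retry
            if (value.compareAndSet(previous, updated)) {
                return updated;
            }
        }
    }

    public static void main(String[] args) {
        AtomicInteger counter = new AtomicInteger(0);
        System.out.println(incrementAndGetWithCas(counter)); // prints 1
        System.out.println(incrementAndGetWithCas(counter)); // prints 2
    }
}
```

Notice that there is no lock anywhere: if two threads race, one CAS fails and that thread simply goes round the loop once more, just as described above.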

1. Note that this is a purely hypothetical structure and probably different from the way any specific JVM implements the lock housekeeping data. For example, Hotspot actually combines flags for locking and flags for garbage collection into the data accessed via CAS and, rather than using CAS to swap a thread ID, actually swaps the pointer to lock data. Our description of how synchronization works still generally holds, however.
2. I'm actually painting a slightly pessimistic picture here. In some cases, a JVM can use certain heuristics to make quick decisions between spinning round the loop and actually waiting (suspending the thread). For example, a lock implementation could decide to "spin up to 3 times then wait". Depending on the OS, it may be able to "spin if and only if the other thread is actually running". And it may be able to check if the other thread is waiting for I/O (a slow operation, in which case there's little point in spinning). Improving synchronization algorithms has been a key area of research over the last few years, and some progress has been made in fine-tuning these kinds of heuristics. Nonetheless, the point is that, if the JVM has to make a decision, it always wastes a little bit of time doing so, and always risks making an inappropriate one.
3. Thread scheduling generally works by running in a software interrupt. Every interrupt period – defined by the processor and typically around 10 or 15 milliseconds – the thread scheduling code looks at what processes are running and "re-jigs" them to share out the available CPUs over time. If a thread enters the "wait" state, it won't have an opportunity to be considered for running again until at least the next interrupt. So on such a system, calling wait means that in the worst case we'll wait for nearly 10 milliseconds, and in the average case around 5. Note too that when a thread enters the wait state, the scheduler generally has to make an approximation of how much actual CPU time the thread used during that interrupt period before waiting. A thread that uses a tiny amount of CPU and then waits will get "overcharged" for CPU time (on Windows, for example, a thread is "charged" one third of an interrupt period for calling the wait function).


Editorial page content written by Neil Coffey. Copyright © Javamex UK 2021. All rights reserved.