I remember my first encounter with transactional memory. It was a chapter by Simon Peyton Jones in Beautiful Code (2007), and it truly was beautiful: an elegant addition to the Haskell programming language, relying heavily on that language's strong type system.
And then I saw what transactional memory looks like in C code. Boy, that hurt. All the elegance went out the window; visually, the code brought back dim memories of embedded SQL in C back in the 80s. I decided to forget all about transactional memory for now, and try again in a decade or two.
Well, things are definitely changing now. Andi Kleen's paper Scaling Existing Lock-based Applications with Lock Elision shows how things can remain beautiful and elegant with a little bit of hardware support.
Specifically, Intel's latest processors introduce a new group of instructions, called TSX (Transactional Synchronization Extensions). Effectively, these instructions put the processor into a special "collision detection" mode, in which it reports when two different threads (or cores), both in this mode, access the same memory location in a conflicting manner. Any combination of accesses other than two reads counts as a conflict.
So, how can this mechanism be used to improve parallelism? You start your code with a special "begin TSX mode" statement (_xbegin() in Intel's C libraries). The code executes normally until _xend() is reached. However, if any conflict with a different thread is detected, all changes your code has made to memory are reverted, and execution is thrown back to the _xbegin() location, as if _xbegin() had just been invoked. Only this time, the function's return value indicates failure. At this stage, you can simply obtain a lock and run your code using normal lock semantics.
If you make sure that any successful _xbegin() is followed by a read access to the lock, the immediate result is that as soon as any thread obtains the lock, all other threads fail their transactions and are thrown back to the beginning, where they too would attempt to obtain the lock. (That's because the read makes the lock itself part of your transaction, so if somebody else writes to it -- in other words, obtains it -- your transaction is no good.)
How is this helpful? If you replace all lock acquisition/release blocks with transactional attempts like these, lock-protected accesses to memory will not, normally, obtain the lock. When two threads access the same data structure in a non-mutually-destructive way (neither writes to any memory that the other reads or writes), both succeed -- and there's never any waiting for a lock. If there is a conflict, the threads obtain locks as they normally would.
The implication: whereas current practice dictates that complex data structures use fine-grained locks, it now makes more sense to use coarse-grained locks. Just put one lock over that whole big hashtable or tree, for example. The lock acquisition isn't really performed, though, unless there's an actual conflict. The result is that existing, simply-written code, when using a standard library with lock-elision code in its lock acquisition and release functions, would suddenly scale almost linearly with the number of processor cores.
Of course, this is an over-simplification, and the actual best practice depends on the nature of your data structure and its data-access patterns. Still, it seems to me that in the most common cases, the multi-threaded programmer's life is about to become significantly simpler.