Information about http://www.cs.rochester.edu/~vmarathe/research/papers/2007_PPoPP_MSR.pdf

Featherweight Transactions: …

Tags: composite collection, concurrent computations, concurrent transactions, data parallelism, featherweight, fine grain, good starting point, ior, larus, memory data, microsoft research, priate, programmer, programming abstraction, programming abstractions, subject descriptors, threads, tim harris, transactional memory, university of rochester,
Pages: 2
Language: english
Created: Mon Jan 22 16:54:54 2007
Display cached document
Page 1
image
Page 2
image
                                   Featherweight Transactions:
                               Decoupling Threads and Atomic Blocks

                            Virendra J. Marathe                                      Tim Harris        James R. Larus
                          University of Rochester                                         Microsoft Research
                        vmarathe@cs.rochester.edu                           tharris@microsoft.com     larus@microsoft.com




Categories and Subject Descriptors [D.1.3 Concurrent Pro-                through the retry language construct [1]. However, while inves-
gramming]: Programming Abstractions                                      tigating Chaff we realized that existing abstractions are not appro-
                                                                         priate to express the parallelization of applications that contain a
General Terms       Algorithms, Languages                                large amount of fine-grain data parallelism, but in which concur-
Keywords atomicity, transactional memory, data parallel pro-             rent transactions interact in non-trivial ways. The main difficulty is
gramming, work groups                                                    correctly coordinating the transactions. The retry construct is a
                                                                         good starting point that offers a mechanism to convey information
1. Introduction                                                          between concurrent transactions; however, a programmer must ex-
                                                                         plicitly and carefully tie different concurrent computations (trans-
Transactional memory is a powerful programming abstraction that          actions) together to coherently get the application's desired behav-
enables a programmer to turn a complex, composite collection of          ior. We believe that the abstractions proposed in this paper signif-
statements into an atomic operation. Previous work usually ex-           icantly simplify this difficult aspect of implementing some impor-
presses this abstraction as an atomic block, which offers mutual         tant parallel applications.
exclusion for code running on threads [3]. The implicit connection           Transactions are usually tied to a thread, a rather heavy-weight
between transaction memory and threads has the unfortunate effect        parallel programming abstraction. An executing transaction typi-
of limiting the use of transactions because threads are expensive to     cally owns the host thread. No other transaction (with the exception
create and use in most systems.                                          of nested transactions) can be executed by the thread simultane-
    This paper introduces an alternative form of transactional mem-      ously because the owner transaction uses the host thread's runtime
ory that supports much finer grain transactions. Featherweight           stack. Applications that might be best expressed as a large number
transactions are atomic operations that execute to completion (com-      (potentially in the order of millions) of fine-grain transactions are
mit, abort, or retry). They mesh very nicely with a data parallel pro-   simply infeasible in this environment.
gramming style, in which each data parallel operation executes in a          A key insight is that completed (committed, aborted, or retried)
featherweight transaction. Section 2 describes our implementation.       transactions do not require a runtime stack. They require a stack
    Executing each data parallel operation in a separate memory          when they execute, but when terminated their stack frames can
transaction introduces asynchrony and composable synchroniza-            be reused for another transaction. To eliminate the stack frames
tion into data parallel programming. An operation can use the            preceding a transaction, we restrict the programming model by
retry construct to abort its computation and wait for values to          requiring that an entire method body be enclosed in a transaction.
change. When the values change, the operation re-executes. A col-        Transactions can now be multiplexed onto a thread provided each
lection of asynchronously executing operations can be treated prof-      is guaranteed to run to completion on the thread.
itably as a single data parallel operation.                                  A transaction performs a unit of work atomically. In data par-
    To illustrate this, we built a highly-parallel implementation of     allel programming, each data parallel operation can be executed in
the Chaff [4] SAT solver. Section 3 describes the parallel version       a transaction. We define a new abstraction called an atomic work
of Chaff constructed using our new techniques. Chaff is typical          item (henceforth called "work item") that associates a transaction
of an important group of applications such as theorem provers            with a particular data item in a data parallel aggregate. In our im-
and constraint optimization systems. These applications naturally        plementation, a work item is an instance of class WorkItem and
exhibit large degrees of data-level parallelism that is difficult to     is instantiated by passing a function delegate (an abstraction for
exploit using existing data parallel paradigms.                          type-safe function pointers) and the data as the WorkItem's con-
                                                                         structor parameters. The work item's data object is essentially the
2. Atomic Work Items                                                     sole parameter to its delegate function. The runtime system exe-
Memory transactions provide a concise mechanism for mutual ex-           cutes a work item by invoking its delegate function and passing its
clusion and have been extended to offer condition synchronization        data object as a parameter.
                                                                             In Chaff, it is necessary to repeatedly executed work items when
                                                                         they are successful. To permit repeated execution of work items,
                                                                         we introduce the notion of daemon workers, which repeatedly
                                                                         execute work items after they commit (a work item is re-executed
                                                                         when it aborts due to data conflicts or blocks because of retry).
                                                                         In our implementation, a daemon is an instance of class Txn-
                                                                         Worker. The programmer must suspend a daemon using a special
Copyright is held by the author/owner(s).                                TxnWorker.Suspend() method call. This prevents subsequent
PPoPP'07 March 14­17, 2007, San Jose, California, USA.                   execution of the suspended daemon's work items.
ACM 978-1-59593-602-8/07/0003.
    Another useful abstraction is a group of work items, which we        involved. The BCP component is highly parallel (thousands or mil-
call a work group (referred to as TxnGrps in our implementation).        lions of clauses), but the computations are fine grained. Directly
Work groups correspond to aggregates in data parallel program-           expressing them may lead to an unmanagably large number of
ming. Work groups provide a programmer with useful collective            threads (computational units). In addition, writing such an appli-
operations, such as starting execution of all work items in a group,     cation is difficult because of the difficulty of explicitly controlling
waiting for all members of a group reach a quiescent state (us-          co-ordination among these computational units. We believe that
ing say the TxnGrp.WaitForAll() method), suspend all work                our atomic work item abstractions mitigate these difficulties.
items in a specific group, perform group level joins, splits and re-         Using our abstraction, parallelizing ZChaff is simple: dedicate
ductions, etc. We believe that there is great potential for far richer   a distinct daemon worker for each clause in the formula. The
semantics of work groups, which we leave for future work.                computation for each work item starts by reading the variables
    A key problem was the semantics of exception handling in             in its clause. If there exists a literal assignment that may lead to
work groups. In prior work, exceptions reaching boundaries of            an implied literal assignment, make that literal assignment and
an atomic block aborts the work done within the block and is             commit. If there is no such literal assignment, then simply retry
rethrown to the enclosing context. In the context of atomic work         (and wait for the clause's variables to change). A co-ordinator
items, an exception generated by a work item is considered to be         thread manages explicit literal assignments in the formula. After
an exception generated by the enclosing group. Thus, when a work         making the literal assignment, the main thread waits for completion
item throws an exception, the entire group's activity is suspended       of BCP activity by calling WaitForAll() on the work group.
and the exception is percolated to the thread that waits for the             If a clause evaluates to false during BCP, an exception is
group to reach a quiescent state. Notice that multiple work items        raised by the relevant work item, which in turn suspends execution
may simultaneously generate exceptions in a group. All but one           of the entire work group. The WaitForAll() method called by
exception is suppressed. It may be valuable to permit dispatch of        the main thread returns this exception. On receiving an exception,
multiple exceptions from a work group, but we leave that design          the main thread generates a conflict clause, adds it to the existing
for future work.                                                         list of clauses, rolls back literal assignments up to a point at which
    There are several other important operations on these abstrac-       there is no conflicting literal assignment, and resumes with more
tions that are useful for the underlying runtime system as well as       explicit literal assignments. Conflict clauses are of great value in
for user programmers, but we will not discuss them due to space          pruning large search spaces in SAT solvers. The input formula is
restrictions. We have implemented our abstractions in the Bartok         designated unsatisfiable if rollback happens all the way beyond the
STM system [2].                                                          very first explicit literal assignment made by the main thread. If
                                                                         the algorithm cannot make any new explicit literal assignments
3. Parallelizing ZChaff                                                  and there are no outstanding conflicts, the formula is designated
                                                                         satisfiable.
Since the boolean satisfiability problem (SAT) is NP-complete,               We are implementing our runtime system and parallel ZChaff.
there is no efficient solution to this problem. Existing solvers rely
on heuristics to make literal assignment decisions. However, most,
if not all, SAT solvers rely on the standard boolean constraint          References
propagation (BCP) algorithm to propagate implied literal assign-         [1] T. Harris, S. Marlow, S. Peyton-Jones, and M. Herlihy. Composable
ments once an explicit literal assignment (suggested by the deci-            Memory Transactions. In Proceedings of the 10th ACM SIGPLAN
sion heuristic) is made. It is also widely known that BCP is the             Symposium on Principles and Practice of Parallel Programming,
most time consuming (roughly about 80% of the execution time)                pages 48­60, 2005.
operation in any SAT solver. We focus on this BCP component of           [2] T. Harris, M. Plesko, A. Shinnar, and D. Tarditi. Optimizing Memory
ZChaff (a C# implementation of Chaff) in our parallelization.                Transactions. In Proceedings of the ACM SIGPLAN Conference on
     Our implementation of ZChaff processes formulas in the CNF              Programming Language Design and Implementation, pages 14­25,
SAT form. In the sequential version, whenever an explicit literal            2006.
assignment is made, say l, it is posted in a global implication          [3] J. Larus and R. Rajwar. Transactional Memory. Morgan & Claypool,
queue. The BCP algorithm thereafter gets the implication queue's             2007.
first literal entry and looks up the clauses containing the negation     [4] M. W. Moskewicz, C. F. Madigan, Y. Zhao, L. Zhang, and S. Malik.
of that literal (¬l in our example). Since l is assigned the value           Chaff: Engineering an Efficient SAT Solver. In Proceedings on the
true its negation ¬l is false. ZChaff then determines if any                 38th Design and Automation Conference, pages 530­535, 2001.
clause containing ¬l contains a single unassigned literal and if all
other literals have the value false. If so, the unassigned literal is
implied to be true and is in turn posted in the implication queue.
After processing all clauses corresponding to ¬l, the algorithm
checks if a new implication queue entry was added and processes
it in a similar fashion.
     A coarse-grain method of parallelizing ZChaff is to fork off two
threads at the point at which an explicit literal assignment is made;
one thread takes the literal and the other takes its negation. In ex-
isting implementations, this approach has led to performance im-
provements that vary widely based on the input formula. An alter-
nate, fine-grain parallelization approach focuses on the BCP com-
ponent of SAT solvers wherein "computational units" are dedicated
to process distinct sets of clauses in the SAT formula. An explicit
literal assignment triggers activity in these computational units that
collectively perform the BCP task.
     Fine-grain parallelization has definite benefits, provided the
concurrency achieved is sufficient to offset the co-ordination cost