Information about http://www.cs.wisc.edu/condor/doc/condorg-hpdc10.pdf

Condor-G: A Computation Management Agent for Multi-Institutional…

Tags: argonne il, argonne national laboratory, computational tasks, computer science division, department of computer science, domain resources, dramatic increase, fault tolerance, grids, james frey, madison wi, management agent, mcs, miron livny, personal domain, resource selection, science mathematics, storage resources, todd tannenbaum, wisc,
Pages: 9
Language: english
Created: Tue Jan 1 00:00:00 13
Display cached document
Page 1
image
Page 2
image
Page 3
image
Page 4
image
Page 5
image
Page 6
image
Page 7
image
Page 8
image
Page 9
image
    Condor-G: A Computation Management Agent for Multi-Institutional Grids

 James Frey, Todd Tannenbaum, Miron Livny                            Ian Foster, Steven Tuecke
       Department of Computer Science                        Mathematics and Computer Science Division
            University of Wisconsin                                Argonne National Laboratory
              Madison, WI 53706                                          Argonne, IL 60439
   { jfrey, tannenba, miron }@cs.wisc.edu                         { foster, tuecke }@mcs.anl.gov


                       Abstract                                      dynamically, in the course of their everyday
In recent years, there has been a dramatic increase in the           activities.
amount of available computing and storage resources.           · They do not want to be bothered with the location
Yet few have been able to exploit these resources in an              of these resources, the mechanisms that are
aggregated form. We present the Condor-G system, which               required to use them, with keeping track of the
leverages software from Globus and Condor to allow                   status of computational tasks operating on these
users to harness multi-domain resources as if they all               resources, or with reacting to failure.
belong to one personal domain. We describe the structure       · They do care about how long their tasks are likely
of Condor-G and how it handles job management,                       to run and how much these tasks will cost.
resource selection, security, and fault tolerance.               In this article, we present an innovative distributed
                                                             computing framework that addresses these three issues.
                                                             The Condor-G system leverages the significant advances
                                                             that have been achieved in recent years in two distinct
1. Introduction
                                                             areas: (1) security, resource discovery, and resource
                                                             access in multi-domain environments, as supported within
    In recent years the scientific community has
                                                             the Globus Toolkit [12], and (2) management of
experienced a dramatic pluralization of computing and
                                                             computation and harnessing of resources within a single
storage resources. The national high-end computing
                                                             administrative domain, specifically within the Condor
centers have been joined by an ever-increasing number of
                                                             system [20, 22]. In brief, we combine the inter-domain
powerful regional and local computing environments. The
                                                             resource management protocols of the Globus Toolkit and
aggregated capacity of these new computing resources is
                                                             the intra-domain resource management methods of
enormous. Yet, to date, few scientists and engineers have
                                                             Condor to allow the user to harness multi-domain
managed to exploit the aggregate power of this seemingly
                                                             resources as if they all belong to one personal domain.
infinite Grid of resources. While in principle most users
                                                             The user defines the tasks to be executed; Condor-G
could access resources at multiple locations, in practice
                                                             handles all aspects of discovering and acquiring
few reach beyond their home institution, whose resources
                                                             appropriate resources, regardless of their location;
are often far from sufficient for increasingly demanding
                                                             initiating, monitoring, and managing execution on those
computational tasks such as simulation, large scale
                                                             resources; detecting and responding to failure; and
optimization, Monte Carlo computing, image processing,
                                                             notifying the user of termination. The result is a powerful
and rendering. The problem is the significant "potential
                                                             tool for managing a variety of parallel computations in
barrier" associated with the diverse mechanisms, policies,
                                                             Grid environments.
failure modes, performance uncertainties, etc., that
                                                                 Condor-G's utility has been demonstrated via record-
inevitably arise when we cross the boundaries of
                                                             setting computations. For example, in one recent
administrative domains.
                                                             computation a Condor-G agent managed a mix of desktop
    Overcoming this potential barrier requires new
                                                             workstations, commodity clusters, and supercomputer
methods and mechanisms that meet the following three
                                                             processors at ten sites to solve a previously open problem
key user requirements for computing in a "Grid" that
                                                             in numerical optimization. In this computation, over
comprises resources at multiple locations:
                                                             95,000 CPU hours were delivered over a period of less
  · They want to be able to discover, acquire, and
                                                             than seven days, with an average of 653 processors being
        reliably   manage      computational    resources
                                                             active at any one time. In another case, resources at three
sites were used to simulate and reconstruct 50,000 high-                 protocols for resource discovery and management.
energy physics events, consuming 1200 CPU hours in less                  These protocols support secure discovery of remote
than a day and a half.                                                   resource configuration and state, and secure
    In the rest of this article, we describe the specific                allocation of remote computational resources and
problem we seek to solve with Condor-G, the Condor-G                     management of computation on those resources.
architecture, and the results obtained to date.                          We use the protocols defined by the Globus Toolkit
                                                                         [12], a de facto standard for Grid computing.
2. Large-scale sharing of computational                            · Computation management issues are addressed via
resources                                                                the introduction of a robust, multi-functional user
                                                                         computation management agent responsible for
    We consider a Grid environment in which an                           resource discovery, job submission, job
individual user may, in principle, have access to                        management, and error recovery. This Condor-G
computational resources at many sites. Answering why the                 component is taken from the Condor system [20].
user has access to these resources is not our concern. It          · Remote execution environment issues are addressed
may be because the user is a member of some scientific                   via the use of mobile sandboxing technology that
collaboration, or because the resources in question belong               allows a user to create a tailored execution
to a colleague, or because the user has entered into some                environment on a remote node. This Condor-G
contractual relationship with a resource provider [14]. The              component is also taken from the Condor system.
point is that the user is authorized to use resources at those      This separation of concerns between remote resource
sites to perform a computation. The question that we             access and computation management has some significant
address is how to build and manage a multi-site                  benefits. First, it is significantly less demanding to require
computation that uses those resources.                           that a remote resource speak some simple protocols rather
    Performing a computation on resources that belong to         than to require it to support a more complex distributed
different sites can be difficult in practice for the following   computing environment. This is particularly important
reasons:                                                         given that the deployment of production Grids [4, 18, 27]
   · Different sites may feature different authentication        has made it increasingly common that remote resources
       and authorization mechanisms, schedulers,                 speak these protocols. Second, as we explain below,
       hardware architectures, operating systems, file           careful design of remote access protocols can significantly
       systems, etc.                                             simplify computation management.
   · The user has little knowledge of the characteristics
       of resources at remote sites, and no easy means of        3. Grid protocol overview
       obtaining this information.
   · Due to the distributed nature of the multi-site                In this section, we briefly review the Grid protocols
       computing environment, computers, networks, and           that we exploit in the Condor-G system: GRAM, GASS,
       subcomputations can fail in various ways.                 MDS-2, and GSI. The Globus Toolkit provides open
   · Keeping track of the status of different elements of        source implementations of each.
       a computation involves tedious bookkeeping,
       especially in the event of failure and dependencies       3.1. Grid security infrastructure
       among subcomputations.
    Furthermore, the user is typically not in a position to         The Globus Toolkit's Grid Security Infrastructure
require uniform software systems on the remote sites. For        (GSI) [13] provides essential building blocks for other
example, if all sites to which a user had access ran DCE         Grid protocols and for Condor-G. This authentication and
and DFS, with appropriate cross-realm Kerberos                   authorization system makes it possible to authenticate a
authentication arrangements, the task of creating a multi-       user just once, using public key infrastructure (PKI)
site computation would be significantly easier. But it is        mechanisms to verify a user-supplied "Grid credential."
not practical in the general case to assume such                 GSI then handles the mapping of the Grid credential to the
uniformity.                                                      diverse local credentials and authentication/authorization
    The Condor-G system addresses these issues via a             mechanisms that apply at each site. Hence, users need not
separation of concerns between the three problems of             re-authenticate themselves each time they (or a program
remote resource access, computation management, and              acting on their behalf, such as a Condor-G computation
remote execution environments:                                   management service) access a new remote resource.
   · Remote resource access issues are addressed by                 GSI's PKI mechanisms require access to a private key
       requiring that remote resources speak standard            that they use to sign requests. While in principle a user's
private key could be cached for use by user programs, this    how much standard output and error data has been
approach exposes this critical resource to considerable       received, thus permitting a client to request resending of
risk. Instead, GSI employs the user's private key to create   this data after a crash of client or server.
a proxy credential, which serves as a new private-public
key pair that allows a proxy (such as the Condor-G agent)     3.3. MDS protocols and implementation
to make remote requests on behalf of the user. This proxy
credential is analogous in many respects to a Kerberos           The Globus Toolkit's MDS-2 provides basic
ticket [26] or Andrew File System token.                      mechanisms for discovering and disseminating
                                                              information about the structure and state of Grid resources
3.2. GRAM protocol and implementation                         [9]. The basic ideas are simple. A resource uses the Grid
                                                              Resource Registration Protocol (GRRP) to notify other
    The Grid Resource Allocation and Management               entities that it is part of the Grid. Those entities can then
(GRAM) protocol [10] supports remote submission of a          use the Grid Resource Information Protocol (GRIP) to
computational request ("run program P") to a remote           obtain information about resource status. These two
computational resource, and subsequent monitoring and         protocols allow us to construct a range of interesting
control of the resulting computation. Three aspects of the    structures, including various types of directories that
protocol are particularly important for our purposes:         support discovery of interesting resources. GSI
security, two-phase commit, and fault tolerance. The latter   authentication is used as a basis for access control.
two mechanisms were developed in collaboration with the
UW team and are not yet part of the GRAM version              3.4. GASS
included in the Globus Toolkit. They will be in the
GRAM-2 protocol revision scheduled for later in 2001.            The Globus Toolkit's Global Access to Secondary
    GSI security mechanisms are used in all operations to     Storage (GASS) service [7] provides mechanisms for
authenticate the requestor and for authorization.             transferring data between a remote HTTP, FTP, or GASS
Authentication is performed using the supplied proxy          server. In the current context, we use these mechanisms to
credential, hence providing for single sign-on.               stage executables and input files to a remote computer. As
Authorization implements local policy and may involve         usual, GSI mechanisms are used for authentication.
mapping the user's "Grid id" into a local subject name;
however, this mapping is transparent to the user. Work in     4. Computation management: the Condor-G
progress will also allow authorization decisions to be
made on the basis of capabilities supplied with the
                                                              agent
request.
    Two-phase commit is important as a means of                 Next, we describe the Condor-G                computation
achieving "exactly once" execution semantics. Each            management service (or Condor-G agent).
request from a client is accompanied by a unique
sequence number, which is also included in the associated     4.1. User interface
response. If no response is received after a certain amount
of time, the client can repeat the request. The repeated         The Condor-G agent allows the user to treat the Grid as
sequence number allows the resource to distinguish            an entirely local resource, with an API and command line
between a lost request and a lost response. Once the client   tools that allow the user to perform the following job
has received a response, it then sends a "commit" message     management operations:
to signal that job execution can commence.                      · submit jobs, indicating an executable name,
   Resource-side fault tolerance support addresses the fact          input/output files and arguments;
that a single "resource" may often contain multiple             · query a job's status, or cancel the job;
processors (e.g., a cluster or Condor pool) with                · be informed of job termination or problems, via
specialized "interface" machines running the GRAM                    callbacks or asynchronous mechanisms such as
server(s) that maintain the mapping from submitting client           email;
to local process. Consequently, failure of an interface         · obtain access to detailed logs, providing a complete
machine may result in the remote client losing contact               history of their jobs' execution.
with what is otherwise a correctly queued or executing           There is nothing new or special about the semantics of
job. Hence, our GRAM implementation logs details of all       these capabilities, as one of the main objectives of
active jobs to stable storage at the client side, allowing    Condor-G is to preserve the look and feel of a local
this information to be retrieved if a GRAM server crashes     resource manager. The innovation in Condor-G is that
and is restarted. This information can include details of     these capabilities are provided by a personal desktop
         Job Submission Machine                                               Job Execution Site

                                                                                        Globus
                                                                                      GateKeeper
                Condor-G
                                        End User




                                                                                                        Fo
                                                                                 rk
                Scheduler




                                                                                                          rk
                                                                               Fo
                                        Requests

                                      Persistant                 Globus                                  Globus
                                     Job Queue                 JobManager                              JobManager




                                                                     Submit




                                                                                                                Submit
                Fork




                                                                              Site Job Scheduler
                                                                       (PBS, Condor, LSF, LoadLeveler, NQE, etc.)
               Condor-G
              GridManager

                       GASS
                       Server
                                                                  Job X                                        Job Y




                       Figure 1. Remote execution by Condor-G on Globus-managed resources

agent and supported in a Grid environment, while                     and GRAM callbacks and status calls, while
guaranteeing fault tolerance and exactly-once execution          · authenticating all requests via GSI mechanisms.
semantics. By providing the user with a familiar and              The Condor-G agent also handles resubmission of
reliable single access point to all the resources he/she is   failed jobs, communications with the user concerning
authorized to use, Condor-G empowers end-users to             unusual and erroneous conditions (e.g., credential expiry,
improve the productivity of their computations by             discussed below), and the recording of computation on
providing a unified view of dispersed resources.              stable storage to support restart in the event of its failure.
                                                                  We     have     structured     the    Condor-G        agent
4.2. Supporting remote execution                              implementation as depicted in Figure 1. The Scheduler
                                                              responds to a user request to submit jobs destined to run
    Behind the scenes, the Condor-G agent executes user       on Grid resources by creating a new GridManager
computations on remote resources on the user's behalf. It     daemon to submit and manage those jobs. One
does this by using the Grid protocols described above to      GridManager process handles all jobs for a single user
interact with machines on the Grid and mechanisms             and terminates once all jobs are complete. Each
provided by Condor to maintain a persistent view of the       GridManager job submission request (via the modified
state of the computation. In particular, it:                  two-phase commit GRAM protocol) results in the creation
   · stages a job' standard I/O and executable using
                    s                                         of one Globus JobManager daemon. This daemon
       GASS,                                                  connects to the GridManager using GASS in order to
   · submits a job to a remote machine using the revised      transfer the job's executable and standard input files, and
       GRAM job request protocol, and                         subsequently to provide real-time streaming of standard
   · subsequently monitors job status and recovers from       output and error. Next, the JobManager submits the jobs
       remote failures using the revised GRAM protocol        to the execution site's local scheduling system. Updates
on job status are sent by the JobManager back to the             credential expiration. The Condor-G agent addresses this
GridManager, which then updates the Scheduler, where             requirement by periodically analyzing the credentials for
the job status is stored persistently as we describe below.      all users with currently queued jobs. (GSI provides query
When the job is started, a process environment variable          functions that support this analysis.) If a user's credentials
points to a file containing the address/port (URL) of the        have expired or are about to expire, the agent places the
listening GASS server in the GridManager process. If the         job in a hold state in its queue and sends the user an e-
address of the GASS server should change, perhaps                mail message explaining that their job cannot run again
because the submission machine was restarted, the                until their credentials are refreshed by using a simple tool.
GridManager requests the JobManager to update the file           Condor-G also allows credential alarms to be set. For
with the new address. This allows the job to continue file       instance, it can be configured to e-mail a reminder when
I/O after a crash recovery.                                      less than a specified time remains before a credential
    Condor-G is built to tolerate four types of failure: crash   expires.
of the Globus JobManager, crash of the machine that                  Credentials may have been forwarded to a remote
manages the remote resource (the machine that hosts the          location, in which case the remote credentials need to be
GateKeeper and JobManager), crash of the machine on              refreshed as well. At the start of a job, the Condor-G agent
which the GridManager is executing (or crash of the the          forwards the user' proxy certificate from the submission
                                                                                     s
GridManager alone), and failures in the network                  machine to the remote GRAM server. When an expired
connecting the two machines.                                     proxy is refreshed, Condor-G not only needs to refresh the
    The GridManager detects remote failures by                   certificate on the local (submit) side of the connection, but
periodically probing the JobManagers of all the jobs it          it also needs to re-forward the refreshed proxy to the
manages. If a JobManager fails to respond, the                   remote GRAM server.
GridManager then probes the GateKeeper for that                      To reduce user hassle in dealing with expired
machine. If the GateKeeper responds, then the                    credentials, Condor-G could be enhanced to work with a
GridManager knows that the individual JobManager                 system like MyProxy [23]. MyProxy lets a user store a
crashed. Otherwise, either the whole resource                    long-lived proxy credential (e.g. a week) on a secure
management machine crashed or there is a network failure         server. Remote services acting on behalf of the user can
(the GridManager cannot distinguish these two cases). If         then obtain short-lived proxies (e.g. 12 hours) from the
only the JobManager crashed, the GridManager attempts            server. Condor-G could use these short-lived proxies to
to start a new JobManager to resume watching the job.            authenticate with and forward to remote resources and
Otherwise, the GridManager waits until it can reestablish        refresh them automatically from the MyProxy server when
contact with the remote machine. When it does, it attempts       they expire. This limits the exposure of the long-lived
to reconnect to the JobManager. This can fail for two            proxy (only the MyProxy server and Condor-G have
reasons: the JobManager crashed (because the whole               access to it).
machine crashed), or the JobManager exited normally
(because the job completed during a network failure). In         4.4. Resource discovery and scheduling
either case, the GridManager starts a new JobManager,
which will resume watching the job or tell the                       We have not yet addressed the critical question of how
GridManager that the job has completed.                          the Condor-G agent determines where to execute user
    To protect against local failure, all relevant state for     jobs. A number of strategies are possible.
each submitted job is stored persistently in the scheduler's         A simple approach, which we used in the initial
job queue. This persistent information allows the                Condor-G implementation, is to employ a user-supplied
GridManager to recover from a local crash. When                  list of GRAM servers. This approach is a good starting
restarted, the GridManager reads the information and             point for further development.
reconnects to any of the JobManagers that were running at            A more sophisticated approach is to construct a
the time of the crash. If a JobManager fails to respond, the     personal resource broker that runs as part of the Condor-
GridManager starts a new JobManager to watch that job.           G agent and combines information about user
                                                                 authorization, application requirements and resource
4.3. Credential management                                       status (obtained from MDS) to build a list of candidate
                                                                 resources. These resources will be queried to determine
   A GSI proxy credential used by the Condor-G agent to          their current status, and jobs will be submitted to
authenticate with remote resources on the user's behalf is       appropriate resources depending on the results of these
given a finite lifetime so as to limit the negative              queries. Available resources can be ranked by user
consequences of its capture by an adversary. A long-lived        preferences such as allocation cost and expected start or
Condor-G computation must be able to deal with                   completion time. One promising approach to constructing
                Job Submission Machine                                              Job Execution Site

                                 Persistant
                                Job Queue

                                                                                        Globus Daemons
          End User                                                                               +
          Requests
                                                                                       Local Site Scheduler
                                               Condor-G
                                              GridManager
             Condor-G                                                                     [See Figure 1]
             Scheduler          Fork            GASS
                                                Server
                 Fork




                                              Condor-G               Resource
                                                                                             Job
                                              Collector             Information
                                                                                              Condor
                                                                                             Daemons
                Condor                                        Transfer Job X
               Shadow




                                                                                                 Fork
              Process for
                 Job X
                                                                Redirected                    Job X
                                                                System Call             Condor System Call
                                                                   Data
                                                                                       Trapping & Checkpoint
                                                                                              Library



                                       Figure 2. Remote job execution via GlideIn

such a resource broker is to use the Condor Matchmaking        management tool for Grid computations. However, we
framework [25] to implement the brokering algorithm.           still have not addressed issues relating to what happens
Such an approach is described by Vazhkudai et al. [28].        when a job executes on a remote platform where required
They gather information from MDS servers about Grid            files are not available and local policy may not permit
storage resources, format that information and user            access to local file systems. Local policy may also impose
storage requests into ClassAds, and then use the               restrictions on the running time of the job, which may
Matchmaker to make brokering decisions. A similar              prove inadequate for the job to complete. These additional
approach could be taken for computational resources for        system and site policy heterogeneities can represent
use with Condor-G.                                             substantial barriers.
   In the case of high throughput computations, a simple           We address these concerns via what we call mobile
but effective technique is to "flood" candidate resources      sandboxing. In brief, we use the mechanisms described
with requests to execute jobs. These can be the actual jobs    above to start on a remote computer not a user job, but a
submitted by the user or Condor "GlideIns" as discussed        daemon process that performs the following functions:
below. Monitoring of actual queuing and execution times           · It uses standard Condor mechanisms to advertise its
allows for the tuning of where to submit subsequent jobs               availability to a Condor Collector process, which is
and to migrate queued jobs.                                            queried by the Scheduler to learn about available
                                                                       resources. Condor-G uses standard Condor
5. GlideIn mechanism                                                   mechanisms to match locally queued jobs with the
                                                                       resources advertised by these daemons and to
   The techniques described above allow a user to                      remotely execute them on these resources [25].
construct, submit, and monitor the execution of a task            · It runs each user task received in a "sandbox,"
graph, with failures and credential expirations handled                using system call trapping technologies provided
seamlessly and appropriately. The result is a powerful                 by the Condor system [20] to redirect system calls
       issued by the task back to the originating system. In   sophisticated branch and bound algorithm. This
       the process, this both increases portability and        computation used an average of 653 CPUs during that
       protects the local system.                              week, with a maximum of 1007 in use at any one time.
   · It periodically checkpoints the job to another            Each worker in this Master-Worker application was
       location (e.g., the originating location or a local     implemented as an independent Condor job that used
       checkpoint server) and migrates the job to another      Remote I/O services to communicate with the Master.
       location if requested to do so (for example, when a        A group at Caltech that is part of the CMS Energy
       resource is required for another purpose or the         Physics collaboration has been using Condor-G to
       remote allocation expires) [21].                        perform large-scale         distributed   simulation and
    These various functions are precisely those provided       reconstruction of high-energy physics events. A two-node
by the daemon process that is run on any computer              Directed Acyclic Graph (DAG) of jobs submitted to a
participating in a Condor pool. The difference is that in      Condor-G agent at Caltech triggers 100 simulation jobs on
Condor-G, these daemon processes are started not by the        the Condor pool at the University of Wisconsin. Each of
user, but by using the GRAM remote job submission              these jobs generates 500 events. The execution of these
protocol. In effect, the Condor-G GlideIn mechanism uses       jobs is also controlled by a DAG that makes sure that
Grid protocols to dynamically create a personal Condor         local disk buffers do not overflow and that all events
pool out of Grid resources by "gliding-in" Condor              produced are transferred via GridFTP to a data repository
daemons to the remote resource. Daemons shut down              at NCSA. Once all simulation jobs terminate and all data
gracefully when their local allocation expires or when they    is shipped to the repository, the Condor-G agent at
do not receive any jobs to execute after a (configurable)      Caltech submits a subsequent reconstruction job to the
amount of time, thus guarding against runaway daemons.         PBS system that manages the reconstruction cluster at
Our implementation of this "GlideIn" capability submits        NCSA.
an initial GlideIn executable (a portable shell script),          Condor-G has also been used in the GridGaussian
which in turn uses GSI-authenticated GridFTP to retrieve       project at NCSA to prototype a portal for running
the Condor executables from a central repository, hence        Gaussian98 jobs on Grid resources. This Portal uses
avoiding a need for individual users to store binaries for     GlideIns to optimize access to remote resources and
all potential architectures on their local machines.           employs a shared Mass Storage System (MSS) to store
    Another advantage of using GlideIns is that they allow     input and output data. Users of the portal have two
the Condor-G agent to delay the binding of an application      requirements for managing the output of their Gaussian
to a resource until the instant when the remote resource       jobs. First, the output should be reliably stored at MSS
manager decides to allocate the resource(s) to the user. By    when the job completes. Second, the users should be able
doing so, the agent minimizes queuing delays by                to view the output as it is produced. These requirements
preventing a job from waiting at one remote resource           are addressed by a utility program called G-Cat that
while another resource capable of serving the job is           monitors the output file and sends updates to MSS as
available. By submitting GlideIns to all remote resources      partial file chunks. G-Cat hides network performance
capable of serving a job, Condor-G can guarantee optimal       variations from Gaussian by using local scratch storage as
queuing times to its users. One can view the GlideIn as an     a buffer for Gaussian's output, rather than sending the
empty shell script submitted to a queuing system that can      output directly over the network. Users can view the
be populated once it is allocated the requested resources.     output as it is received at MSS using a standard FTP client
                                                               or by running a script that retrieves the file chunks from
6. Experiences                                                 MSS and assembles them for viewing.

    Three very different examples illustrate the range and     7. Related work
scale of application that we have already encountered for
Condor-G technology.                                              The management of batch jobs within a single
    An early version of Condor-G was used by a team of         distributed system or domain has been addressed by many
four mathematicians from Argonne National Laboratory,          research and commercial systems, notably Condor [20],
Northwestern University, and University of Iowa to             DQS [17], LSF [29], LoadLeveler [16], and PBS [15].
harness the power of over 2,500 CPUs at 10 sites (eight        Some of these systems were extended with restrictive and
Condor pools, one Cluster managed by PBS, and one              ad hoc capabilities for routing jobs submitted in one
supercomputer managed by LSF) to solve a very large            domain to a queue in a different domain. In all cases, both
optimization problem [3]. In less than a week the team         domains must run the same resource management
logged over 95,000 CPU hours to solve more than 540            software. With the exception of Condor, they all use a
billion Linear Assignment Problems controlled by a             resource allocation framework that is based on a system-
wide collection of queues­each representing a different       9. References
class of service.
    Condor flocking [11] supports multi-domain                [1] Abramson, D., Giddy, J., and Kotler, L., "High
computation management by using multiple Condor flocks        Performance Parametric Modeling with Nimrod/G: Killer
to exchange load. The major difference between Condor         Application for the Global Grid?", IPDPS'2000, IEEE Press,
flocking and Condor-G is that Condor-G allows inter-          2000.
domain operation on remote resources that require
                                                              [2] Abramson, D., Sosic, R., Giddy, J., and Hall, B., "Nimrod:
authentication, and uses standard protocols that provide
                                                              A Tool for Performing Parameterized Simulations Using
access to resources controlled by other resource              Distributed Workstations", Proc. 4th IEEE Symp. on High
management systems, rather than the special-purpose           Performance Distributed Computing, 1995.
sharing mechanisms of Condor.
    Recently, various research and commercial groups          [3] Anstreicher, K., Brixius, N., Goux, J.-P., and Linderoth, J.,
have developed software tools that support the harnessing     "Solving Large Quadratic Assignment Problems on
of idle computers for specific computations, via the use of   Computational Grids", Mathematical Programming, 2000.
simple remote execution agents (workers) that, once
                                                              [4] Beiriger, J., Johnson, W., Bivens, H., Humphreys, S., and
installed on a computer, can download problems (or, in
                                                              Rhea, R., "Constructing the ASCI Grid", Proc. 9th IEEE
some cases, Java applications) from a central location and    Symposium on High Performance Distributed Computing, IEEE
run them when local resources are available (i.e.             Press, 2000.
SETI@home [19], Entropia, and Parabon). These tools
assume a homogeneous environment where all resource           [5] Berman, F., "High-Performance Schedulers", Foster, I. and
management services are provided by their own system.         Kesselman, C. eds., The Grid: Blueprint for a New Computing
Furthermore, a single master (i.e., a single submission       Infrastructure, Morgan Kaufmann, 1999, pp. 279-309.
point) controls the distribution of work amongst all
                                                              [6] Berman, F., Wolski, R., Figueira, S., Schopf, J., and Shao,
available worker agents. Application-level scheduling
                                                              G.,   "Application-Level    Scheduling       on     Distributed
techniques [5, 6] provide "personalized" policies for         Heterogeneous Networks", Proc. Supercomputing '96, 1996.
acquiring and managing collections of heterogeneous
resources. These systems employ resource management           [7] Bester, J., Foster, I., Kesselman, C., Tedesco, J., and
services provided by batch systems to make the resources      Tuecke, S., "GASS: A Data Movement and Access Service for
available to the application and to place elements of the     Wide Area Computing Systems", Sixth Workshop on I/O in
application on these resources. An application-level          Parallel and Distributed Systems, May 5, 1999.
scheduler for high-throughput scheduling that takes data
                                                              [8] Casanova, H., Obertelli, G., Berman, F., and Wolski, R.,
locality information into account in interesting ways has     "The AppLeS Parameter Sweep Template: User-Level
been constructed [8]. Condor-G mechanisms complement          Middleware for the Grid", Proc. SC'2000, 2000.
this work by addressing issues of uniform remote access,
failure, credential expiry, etc. Condor-G could potentially   [9] Czajkowski, K., Fitzgerald, S., Foster, I., and Kesselman,
be used as a backend for an application-level scheduling      C., "Grid Information Services for Distributed Resource
system.                                                       Sharing", 2001.
    Nimrod [2] provides a user interface for describing
                                                              [10] Czajkowski, K., Foster, I., Karonis, N., Kesselman, C.,
"parameter sweep" problems, with the resulting
                                                              Martin, S., Smith, W., Tuecke, S., "A Resource Management
independent jobs being submitted to a resource                Architecture for Metacomputing Systems", Proc. IPPS/SPDP
management system; Nimrod-G [1] generalizes Nimrod to         '98 Workshop on Job Scheduling Strategies for Parallel
use Globus mechanisms to support access to remote             Processing, 1998.
resources. Condor-G addresses issues of failure, credential
expiry, and interjob dependencies that are not addressed      [11] Epema, D.H.J., Livny, M., Dantzig, R.v., Evers, X., and
by Nimrod or Nimrod-G.                                        Pruyne, J., "A Worldwide Flock of Condors: Load Sharing
                                                              among Workstation Clusters", Future Generation Computer
                                                              Systems, 12, 1996.
8. Acknowledgment
                                                              [12] Foster, I. and Kesselman, C., "Globus: A Toolkit-Based
  This research was supported by the NASA Information         Grid Architecture", Foster, I. and Kesselman, C. eds., The Grid:
Power Grid program.                                           Blueprint for a New Computing Infrastructure, Morgan
                                                              Kaufmann, 1999, pp. 259-278.
[13] Foster, I., Kesselman, C., Tsudik, G., and Tuecke, S., "A    [22] Livny, M., "High-Throughput Resource Management",
Security Architecture for Computational Grids", ACM               Foster, I. and Kesselman, C. eds., The Grid: Blueprint for a New
Conference on Computers and Security, 1998, pp. 83-91.            Computing Infrastructure, Morgan Kaufmann, 1999, pp. 311-
                                                                  337.
[14] Foster, I., Kesselman, C., and Tuecke, S., "The Anatomy
of the Grid: Enabling Scalable Virtual Organizations", Intl. J.   [23] Novotny, J., Tuecke, S., and Welch, V., "An Online
Supercomputer       Applications,    (to    appear),    2001.     Credential Repository for the Grid: MyProxy", to appear in
http://www.globus.org/research/papers/anatomy.pdf.                HPDC10.

[15] Henderson, R. and Tweten, D., "Portable Batch System:        [24] Papakhian, M., "Comparing Job-Management Systems:
External Reference Specification", 1996.                          The User's Perspective", IEEE Computational Science &
                                                                  Engineering, (April-June) 1998. See also http://pbs.mrj.com.
[16] IBM, "Using and Administering IBM LoadLeveler,
Release 3.0", IBM CorporationSC23-3989, 1996.                     [25] Raman, R., Livny, M., and Solomon, M., "Resource
                                                                  Management through Multilateral Matchmaking", Proceedings
[17] Institute, S.C.R., "DQS 3.1.3 User Guide", Florida State     of the Ninth IEEE Symposium on High Performance Distributed
University, Tallahassee, 1996.                                    Computing (HPDC9), Pittsburgh, Pennsylvania, August 2000,
                                                                  pp. 290-291.
[18] Johnston, W.E., Gannon, D., and Nitzberg, B., "Grids as
Production Computing Environments: The Engineering Aspects        [26] Steiner, J., Neuman, B.C., and Schiller, J., "Kerberos: An
of NASA's Information Power Grid", Proc. 8th IEEE                 Authentication System for Open Network Systems", Proc.
Symposium on High Performance Distributed Computing, IEEE         Usenix Conference, 1988, pp. 191-202.
Press, 1999.
                                                                  [27] Stevens, R., Woodward, P., DeFanti, T., and Catlett, C.,
[19] Korpela, E., Werthimer, D., Anderson, D., Cobb, J., and      "From the I-WAY to the National Technology Grid",
Lebofsky, M., "SETI@home: Massively Distributed Computing         Communications of the ACM, 40(11), 1997, pp. 50-61.
for SETI", Computing in Science and Engineering, 3(1), 2001.
                                                                  [28] Vazhkudai, S., Tuecke, S., and Foster, I., "Replica
[20] Litzkow, M., Livny, M., and Mutka, M., "Condor - A           Selection in the Globus Data Grid", Proc. Of the First
Hunter of Idle Workstations", Proc. 8th Intl Conf. on             IEEE/ACM International Conference on Cluster Computing and
Distributed Computing Systems, 1988, pp. 104-111.                 the Grid (CCGRID 2001), IEEE Computer Society Press, May
                                                                  2001, pp. 106-113.
[21] Litzkow, M., Tannenbaum, T., Basney, J., and Livny., M.,
"Checkpoint and Migration of UNIX Processes in the Condor         [29] Zhou, S., "LSF: Load Sharing in Large-Scale
Distributed Processing System", University of Wisconsin-          Heterogeneous Distributed Systems", Proc. Workshop on
Madison Computer Sciences, Technical Report 1346, 1997.           Cluster Computing, 1992.