Tags: argonne il, argonne national laboratory, computational tasks, computer science division, department of computer science, domain resources, dramatic increase, fault tolerance, grids, james frey, madison wi, management agent, mcs, miron livny, personal domain, resource selection, science mathematics, storage resources, todd tannenbaum, wisc,
Condor-G: A Computation Management Agent for Multi-Institutional Grids
James Frey, Todd Tannenbaum, Miron Livny Ian Foster, Steven Tuecke
Department of Computer Science Mathematics and Computer Science Division
University of Wisconsin Argonne National Laboratory
Madison, WI 53706 Argonne, IL 60439
{ jfrey, tannenba, miron }@cs.wisc.edu { foster, tuecke }@mcs.anl.gov
Abstract dynamically, in the course of their everyday
In recent years, there has been a dramatic increase in the activities.
amount of available computing and storage resources. · They do not want to be bothered with the location
Yet few have been able to exploit these resources in an of these resources, the mechanisms that are
aggregated form. We present the Condor-G system, which required to use them, with keeping track of the
leverages software from Globus and Condor to allow status of computational tasks operating on these
users to harness multi-domain resources as if they all resources, or with reacting to failure.
belong to one personal domain. We describe the structure · They do care about how long their tasks are likely
of Condor-G and how it handles job management, to run and how much these tasks will cost.
resource selection, security, and fault tolerance. In this article, we present an innovative distributed
computing framework that addresses these three issues.
The Condor-G system leverages the significant advances
that have been achieved in recent years in two distinct
1. Introduction
areas: (1) security, resource discovery, and resource
access in multi-domain environments, as supported within
In recent years the scientific community has
the Globus Toolkit [12], and (2) management of
experienced a dramatic pluralization of computing and
computation and harnessing of resources within a single
storage resources. The national high-end computing
administrative domain, specifically within the Condor
centers have been joined by an ever-increasing number of
system [20, 22]. In brief, we combine the inter-domain
powerful regional and local computing environments. The
resource management protocols of the Globus Toolkit and
aggregated capacity of these new computing resources is
the intra-domain resource management methods of
enormous. Yet, to date, few scientists and engineers have
Condor to allow the user to harness multi-domain
managed to exploit the aggregate power of this seemingly
resources as if they all belong to one personal domain.
infinite Grid of resources. While in principle most users
The user defines the tasks to be executed; Condor-G
could access resources at multiple locations, in practice
handles all aspects of discovering and acquiring
few reach beyond their home institution, whose resources
appropriate resources, regardless of their location;
are often far from sufficient for increasingly demanding
initiating, monitoring, and managing execution on those
computational tasks such as simulation, large scale
resources; detecting and responding to failure; and
optimization, Monte Carlo computing, image processing,
notifying the user of termination. The result is a powerful
and rendering. The problem is the significant "potential
tool for managing a variety of parallel computations in
barrier" associated with the diverse mechanisms, policies,
Grid environments.
failure modes, performance uncertainties, etc., that
Condor-G's utility has been demonstrated via record-
inevitably arise when we cross the boundaries of
setting computations. For example, in one recent
administrative domains.
computation a Condor-G agent managed a mix of desktop
Overcoming this potential barrier requires new
workstations, commodity clusters, and supercomputer
methods and mechanisms that meet the following three
processors at ten sites to solve a previously open problem
key user requirements for computing in a "Grid" that
in numerical optimization. In this computation, over
comprises resources at multiple locations:
95,000 CPU hours were delivered over a period of less
· They want to be able to discover, acquire, and
than seven days, with an average of 653 processors being
reliably manage computational resources
active at any one time. In another case, resources at three
sites were used to simulate and reconstruct 50,000 high- protocols for resource discovery and management.
energy physics events, consuming 1200 CPU hours in less These protocols support secure discovery of remote
than a day and a half. resource configuration and state, and secure
In the rest of this article, we describe the specific allocation of remote computational resources and
problem we seek to solve with Condor-G, the Condor-G management of computation on those resources.
architecture, and the results obtained to date. We use the protocols defined by the Globus Toolkit
[12], a de facto standard for Grid computing.
2. Large-scale sharing of computational · Computation management issues are addressed via
resources the introduction of a robust, multi-functional user
computation management agent responsible for
We consider a Grid environment in which an resource discovery, job submission, job
individual user may, in principle, have access to management, and error recovery. This Condor-G
computational resources at many sites. Answering why the component is taken from the Condor system [20].
user has access to these resources is not our concern. It · Remote execution environment issues are addressed
may be because the user is a member of some scientific via the use of mobile sandboxing technology that
collaboration, or because the resources in question belong allows a user to create a tailored execution
to a colleague, or because the user has entered into some environment on a remote node. This Condor-G
contractual relationship with a resource provider [14]. The component is also taken from the Condor system.
point is that the user is authorized to use resources at those This separation of concerns between remote resource
sites to perform a computation. The question that we access and computation management has some significant
address is how to build and manage a multi-site benefits. First, it is significantly less demanding to require
computation that uses those resources. that a remote resource speak some simple protocols rather
Performing a computation on resources that belong to than to require it to support a more complex distributed
different sites can be difficult in practice for the following computing environment. This is particularly important
reasons: given that the deployment of production Grids [4, 18, 27]
· Different sites may feature different authentication has made it increasingly common that remote resources
and authorization mechanisms, schedulers, speak these protocols. Second, as we explain below,
hardware architectures, operating systems, file careful design of remote access protocols can significantly
systems, etc. simplify computation management.
· The user has little knowledge of the characteristics
of resources at remote sites, and no easy means of 3. Grid protocol overview
obtaining this information.
· Due to the distributed nature of the multi-site In this section, we briefly review the Grid protocols
computing environment, computers, networks, and that we exploit in the Condor-G system: GRAM, GASS,
subcomputations can fail in various ways. MDS-2, and GSI. The Globus Toolkit provides open
· Keeping track of the status of different elements of source implementations of each.
a computation involves tedious bookkeeping,
especially in the event of failure and dependencies 3.1. Grid security infrastructure
among subcomputations.
Furthermore, the user is typically not in a position to The Globus Toolkit's Grid Security Infrastructure
require uniform software systems on the remote sites. For (GSI) [13] provides essential building blocks for other
example, if all sites to which a user had access ran DCE Grid protocols and for Condor-G. This authentication and
and DFS, with appropriate cross-realm Kerberos authorization system makes it possible to authenticate a
authentication arrangements, the task of creating a multi- user just once, using public key infrastructure (PKI)
site computation would be significantly easier. But it is mechanisms to verify a user-supplied "Grid credential."
not practical in the general case to assume such GSI then handles the mapping of the Grid credential to the
uniformity. diverse local credentials and authentication/authorization
The Condor-G system addresses these issues via a mechanisms that apply at each site. Hence, users need not
separation of concerns between the three problems of re-authenticate themselves each time they (or a program
remote resource access, computation management, and acting on their behalf, such as a Condor-G computation
remote execution environments: management service) access a new remote resource.
· Remote resource access issues are addressed by GSI's PKI mechanisms require access to a private key
requiring that remote resources speak standard that they use to sign requests. While in principle a user's
private key could be cached for use by user programs, this how much standard output and error data has been
approach exposes this critical resource to considerable received, thus permitting a client to request resending of
risk. Instead, GSI employs the user's private key to create this data after a crash of client or server.
a proxy credential, which serves as a new private-public
key pair that allows a proxy (such as the Condor-G agent) 3.3. MDS protocols and implementation
to make remote requests on behalf of the user. This proxy
credential is analogous in many respects to a Kerberos The Globus Toolkit's MDS-2 provides basic
ticket [26] or Andrew File System token. mechanisms for discovering and disseminating
information about the structure and state of Grid resources
3.2. GRAM protocol and implementation [9]. The basic ideas are simple. A resource uses the Grid
Resource Registration Protocol (GRRP) to notify other
The Grid Resource Allocation and Management entities that it is part of the Grid. Those entities can then
(GRAM) protocol [10] supports remote submission of a use the Grid Resource Information Protocol (GRIP) to
computational request ("run program P") to a remote obtain information about resource status. These two
computational resource, and subsequent monitoring and protocols allow us to construct a range of interesting
control of the resulting computation. Three aspects of the structures, including various types of directories that
protocol are particularly important for our purposes: support discovery of interesting resources. GSI
security, two-phase commit, and fault tolerance. The latter authentication is used as a basis for access control.
two mechanisms were developed in collaboration with the
UW team and are not yet part of the GRAM version 3.4. GASS
included in the Globus Toolkit. They will be in the
GRAM-2 protocol revision scheduled for later in 2001. The Globus Toolkit's Global Access to Secondary
GSI security mechanisms are used in all operations to Storage (GASS) service [7] provides mechanisms for
authenticate the requestor and for authorization. transferring data between a remote HTTP, FTP, or GASS
Authentication is performed using the supplied proxy server. In the current context, we use these mechanisms to
credential, hence providing for single sign-on. stage executables and input files to a remote computer. As
Authorization implements local policy and may involve usual, GSI mechanisms are used for authentication.
mapping the user's "Grid id" into a local subject name;
however, this mapping is transparent to the user. Work in 4. Computation management: the Condor-G
progress will also allow authorization decisions to be
made on the basis of capabilities supplied with the
agent
request.
Two-phase commit is important as a means of Next, we describe the Condor-G computation
achieving "exactly once" execution semantics. Each management service (or Condor-G agent).
request from a client is accompanied by a unique
sequence number, which is also included in the associated 4.1. User interface
response. If no response is received after a certain amount
of time, the client can repeat the request. The repeated The Condor-G agent allows the user to treat the Grid as
sequence number allows the resource to distinguish an entirely local resource, with an API and command line
between a lost request and a lost response. Once the client tools that allow the user to perform the following job
has received a response, it then sends a "commit" message management operations:
to signal that job execution can commence. · submit jobs, indicating an executable name,
Resource-side fault tolerance support addresses the fact input/output files and arguments;
that a single "resource" may often contain multiple · query a job's status, or cancel the job;
processors (e.g., a cluster or Condor pool) with · be informed of job termination or problems, via
specialized "interface" machines running the GRAM callbacks or asynchronous mechanisms such as
server(s) that maintain the mapping from submitting client email;
to local process. Consequently, failure of an interface · obtain access to detailed logs, providing a complete
machine may result in the remote client losing contact history of their jobs' execution.
with what is otherwise a correctly queued or executing There is nothing new or special about the semantics of
job. Hence, our GRAM implementation logs details of all these capabilities, as one of the main objectives of
active jobs to stable storage at the client side, allowing Condor-G is to preserve the look and feel of a local
this information to be retrieved if a GRAM server crashes resource manager. The innovation in Condor-G is that
and is restarted. This information can include details of these capabilities are provided by a personal desktop
Job Submission Machine Job Execution Site
Globus
GateKeeper
Condor-G
End User
Fo
rk
Scheduler
rk
Fo
Requests
Persistant Globus Globus
Job Queue JobManager JobManager
Submit
Submit
Fork
Site Job Scheduler
(PBS, Condor, LSF, LoadLeveler, NQE, etc.)
Condor-G
GridManager
GASS
Server
Job X Job Y
Figure 1. Remote execution by Condor-G on Globus-managed resources
agent and supported in a Grid environment, while and GRAM callbacks and status calls, while
guaranteeing fault tolerance and exactly-once execution · authenticating all requests via GSI mechanisms.
semantics. By providing the user with a familiar and The Condor-G agent also handles resubmission of
reliable single access point to all the resources he/she is failed jobs, communications with the user concerning
authorized to use, Condor-G empowers end-users to unusual and erroneous conditions (e.g., credential expiry,
improve the productivity of their computations by discussed below), and the recording of computation on
providing a unified view of dispersed resources. stable storage to support restart in the event of its failure.
We have structured the Condor-G agent
4.2. Supporting remote execution implementation as depicted in Figure 1. The Scheduler
responds to a user request to submit jobs destined to run
Behind the scenes, the Condor-G agent executes user on Grid resources by creating a new GridManager
computations on remote resources on the user's behalf. It daemon to submit and manage those jobs. One
does this by using the Grid protocols described above to GridManager process handles all jobs for a single user
interact with machines on the Grid and mechanisms and terminates once all jobs are complete. Each
provided by Condor to maintain a persistent view of the GridManager job submission request (via the modified
state of the computation. In particular, it: two-phase commit GRAM protocol) results in the creation
· stages a job' standard I/O and executable using
s of one Globus JobManager daemon. This daemon
GASS, connects to the GridManager using GASS in order to
· submits a job to a remote machine using the revised transfer the job's executable and standard input files, and
GRAM job request protocol, and subsequently to provide real-time streaming of standard
· subsequently monitors job status and recovers from output and error. Next, the JobManager submits the jobs
remote failures using the revised GRAM protocol to the execution site's local scheduling system. Updates
on job status are sent by the JobManager back to the credential expiration. The Condor-G agent addresses this
GridManager, which then updates the Scheduler, where requirement by periodically analyzing the credentials for
the job status is stored persistently as we describe below. all users with currently queued jobs. (GSI provides query
When the job is started, a process environment variable functions that support this analysis.) If a user's credentials
points to a file containing the address/port (URL) of the have expired or are about to expire, the agent places the
listening GASS server in the GridManager process. If the job in a hold state in its queue and sends the user an e-
address of the GASS server should change, perhaps mail message explaining that their job cannot run again
because the submission machine was restarted, the until their credentials are refreshed by using a simple tool.
GridManager requests the JobManager to update the file Condor-G also allows credential alarms to be set. For
with the new address. This allows the job to continue file instance, it can be configured to e-mail a reminder when
I/O after a crash recovery. less than a specified time remains before a credential
Condor-G is built to tolerate four types of failure: crash expires.
of the Globus JobManager, crash of the machine that Credentials may have been forwarded to a remote
manages the remote resource (the machine that hosts the location, in which case the remote credentials need to be
GateKeeper and JobManager), crash of the machine on refreshed as well. At the start of a job, the Condor-G agent
which the GridManager is executing (or crash of the the forwards the user' proxy certificate from the submission
s
GridManager alone), and failures in the network machine to the remote GRAM server. When an expired
connecting the two machines. proxy is refreshed, Condor-G not only needs to refresh the
The GridManager detects remote failures by certificate on the local (submit) side of the connection, but
periodically probing the JobManagers of all the jobs it it also needs to re-forward the refreshed proxy to the
manages. If a JobManager fails to respond, the remote GRAM server.
GridManager then probes the GateKeeper for that To reduce user hassle in dealing with expired
machine. If the GateKeeper responds, then the credentials, Condor-G could be enhanced to work with a
GridManager knows that the individual JobManager system like MyProxy [23]. MyProxy lets a user store a
crashed. Otherwise, either the whole resource long-lived proxy credential (e.g. a week) on a secure
management machine crashed or there is a network failure server. Remote services acting on behalf of the user can
(the GridManager cannot distinguish these two cases). If then obtain short-lived proxies (e.g. 12 hours) from the
only the JobManager crashed, the GridManager attempts server. Condor-G could use these short-lived proxies to
to start a new JobManager to resume watching the job. authenticate with and forward to remote resources and
Otherwise, the GridManager waits until it can reestablish refresh them automatically from the MyProxy server when
contact with the remote machine. When it does, it attempts they expire. This limits the exposure of the long-lived
to reconnect to the JobManager. This can fail for two proxy (only the MyProxy server and Condor-G have
reasons: the JobManager crashed (because the whole access to it).
machine crashed), or the JobManager exited normally
(because the job completed during a network failure). In 4.4. Resource discovery and scheduling
either case, the GridManager starts a new JobManager,
which will resume watching the job or tell the We have not yet addressed the critical question of how
GridManager that the job has completed. the Condor-G agent determines where to execute user
To protect against local failure, all relevant state for jobs. A number of strategies are possible.
each submitted job is stored persistently in the scheduler's A simple approach, which we used in the initial
job queue. This persistent information allows the Condor-G implementation, is to employ a user-supplied
GridManager to recover from a local crash. When list of GRAM servers. This approach is a good starting
restarted, the GridManager reads the information and point for further development.
reconnects to any of the JobManagers that were running at A more sophisticated approach is to construct a
the time of the crash. If a JobManager fails to respond, the personal resource broker that runs as part of the Condor-
GridManager starts a new JobManager to watch that job. G agent and combines information about user
authorization, application requirements and resource
4.3. Credential management status (obtained from MDS) to build a list of candidate
resources. These resources will be queried to determine
A GSI proxy credential used by the Condor-G agent to their current status, and jobs will be submitted to
authenticate with remote resources on the user's behalf is appropriate resources depending on the results of these
given a finite lifetime so as to limit the negative queries. Available resources can be ranked by user
consequences of its capture by an adversary. A long-lived preferences such as allocation cost and expected start or
Condor-G computation must be able to deal with completion time. One promising approach to constructing
Job Submission Machine Job Execution Site
Persistant
Job Queue
Globus Daemons
End User +
Requests
Local Site Scheduler
Condor-G
GridManager
Condor-G [See Figure 1]
Scheduler Fork GASS
Server
Fork
Condor-G Resource
Job
Collector Information
Condor
Daemons
Condor Transfer Job X
Shadow
Fork
Process for
Job X
Redirected Job X
System Call Condor System Call
Data
Trapping & Checkpoint
Library
Figure 2. Remote job execution via GlideIn
such a resource broker is to use the Condor Matchmaking management tool for Grid computations. However, we
framework [25] to implement the brokering algorithm. still have not addressed issues relating to what happens
Such an approach is described by Vazhkudai et al. [28]. when a job executes on a remote platform where required
They gather information from MDS servers about Grid files are not available and local policy may not permit
storage resources, format that information and user access to local file systems. Local policy may also impose
storage requests into ClassAds, and then use the restrictions on the running time of the job, which may
Matchmaker to make brokering decisions. A similar prove inadequate for the job to complete. These additional
approach could be taken for computational resources for system and site policy heterogeneities can represent
use with Condor-G. substantial barriers.
In the case of high throughput computations, a simple We address these concerns via what we call mobile
but effective technique is to "flood" candidate resources sandboxing. In brief, we use the mechanisms described
with requests to execute jobs. These can be the actual jobs above to start on a remote computer not a user job, but a
submitted by the user or Condor "GlideIns" as discussed daemon process that performs the following functions:
below. Monitoring of actual queuing and execution times · It uses standard Condor mechanisms to advertise its
allows for the tuning of where to submit subsequent jobs availability to a Condor Collector process, which is
and to migrate queued jobs. queried by the Scheduler to learn about available
resources. Condor-G uses standard Condor
5. GlideIn mechanism mechanisms to match locally queued jobs with the
resources advertised by these daemons and to
The techniques described above allow a user to remotely execute them on these resources [25].
construct, submit, and monitor the execution of a task · It runs each user task received in a "sandbox,"
graph, with failures and credential expirations handled using system call trapping technologies provided
seamlessly and appropriately. The result is a powerful by the Condor system [20] to redirect system calls
issued by the task back to the originating system. In sophisticated branch and bound algorithm. This
the process, this both increases portability and computation used an average of 653 CPUs during that
protects the local system. week, with a maximum of 1007 in use at any one time.
· It periodically checkpoints the job to another Each worker in this Master-Worker application was
location (e.g., the originating location or a local implemented as an independent Condor job that used
checkpoint server) and migrates the job to another Remote I/O services to communicate with the Master.
location if requested to do so (for example, when a A group at Caltech that is part of the CMS Energy
resource is required for another purpose or the Physics collaboration has been using Condor-G to
remote allocation expires) [21]. perform large-scale distributed simulation and
These various functions are precisely those provided reconstruction of high-energy physics events. A two-node
by the daemon process that is run on any computer Directed Acyclic Graph (DAG) of jobs submitted to a
participating in a Condor pool. The difference is that in Condor-G agent at Caltech triggers 100 simulation jobs on
Condor-G, these daemon processes are started not by the the Condor pool at the University of Wisconsin. Each of
user, but by using the GRAM remote job submission these jobs generates 500 events. The execution of these
protocol. In effect, the Condor-G GlideIn mechanism uses jobs is also controlled by a DAG that makes sure that
Grid protocols to dynamically create a personal Condor local disk buffers do not overflow and that all events
pool out of Grid resources by "gliding-in" Condor produced are transferred via GridFTP to a data repository
daemons to the remote resource. Daemons shut down at NCSA. Once all simulation jobs terminate and all data
gracefully when their local allocation expires or when they is shipped to the repository, the Condor-G agent at
do not receive any jobs to execute after a (configurable) Caltech submits a subsequent reconstruction job to the
amount of time, thus guarding against runaway daemons. PBS system that manages the reconstruction cluster at
Our implementation of this "GlideIn" capability submits NCSA.
an initial GlideIn executable (a portable shell script), Condor-G has also been used in the GridGaussian
which in turn uses GSI-authenticated GridFTP to retrieve project at NCSA to prototype a portal for running
the Condor executables from a central repository, hence Gaussian98 jobs on Grid resources. This Portal uses
avoiding a need for individual users to store binaries for GlideIns to optimize access to remote resources and
all potential architectures on their local machines. employs a shared Mass Storage System (MSS) to store
Another advantage of using GlideIns is that they allow input and output data. Users of the portal have two
the Condor-G agent to delay the binding of an application requirements for managing the output of their Gaussian
to a resource until the instant when the remote resource jobs. First, the output should be reliably stored at MSS
manager decides to allocate the resource(s) to the user. By when the job completes. Second, the users should be able
doing so, the agent minimizes queuing delays by to view the output as it is produced. These requirements
preventing a job from waiting at one remote resource are addressed by a utility program called G-Cat that
while another resource capable of serving the job is monitors the output file and sends updates to MSS as
available. By submitting GlideIns to all remote resources partial file chunks. G-Cat hides network performance
capable of serving a job, Condor-G can guarantee optimal variations from Gaussian by using local scratch storage as
queuing times to its users. One can view the GlideIn as an a buffer for Gaussian's output, rather than sending the
empty shell script submitted to a queuing system that can output directly over the network. Users can view the
be populated once it is allocated the requested resources. output as it is received at MSS using a standard FTP client
or by running a script that retrieves the file chunks from
6. Experiences MSS and assembles them for viewing.
Three very different examples illustrate the range and 7. Related work
scale of application that we have already encountered for
Condor-G technology. The management of batch jobs within a single
An early version of Condor-G was used by a team of distributed system or domain has been addressed by many
four mathematicians from Argonne National Laboratory, research and commercial systems, notably Condor [20],
Northwestern University, and University of Iowa to DQS [17], LSF [29], LoadLeveler [16], and PBS [15].
harness the power of over 2,500 CPUs at 10 sites (eight Some of these systems were extended with restrictive and
Condor pools, one Cluster managed by PBS, and one ad hoc capabilities for routing jobs submitted in one
supercomputer managed by LSF) to solve a very large domain to a queue in a different domain. In all cases, both
optimization problem [3]. In less than a week the team domains must run the same resource management
logged over 95,000 CPU hours to solve more than 540 software. With the exception of Condor, they all use a
billion Linear Assignment Problems controlled by a resource allocation framework that is based on a system-
wide collection of queueseach representing a different 9. References
class of service.
Condor flocking [11] supports multi-domain [1] Abramson, D., Giddy, J., and Kotler, L., "High
computation management by using multiple Condor flocks Performance Parametric Modeling with Nimrod/G: Killer
to exchange load. The major difference between Condor Application for the Global Grid?", IPDPS'2000, IEEE Press,
flocking and Condor-G is that Condor-G allows inter- 2000.
domain operation on remote resources that require
[2] Abramson, D., Sosic, R., Giddy, J., and Hall, B., "Nimrod:
authentication, and uses standard protocols that provide
A Tool for Performing Parameterized Simulations Using
access to resources controlled by other resource Distributed Workstations", Proc. 4th IEEE Symp. on High
management systems, rather than the special-purpose Performance Distributed Computing, 1995.
sharing mechanisms of Condor.
Recently, various research and commercial groups [3] Anstreicher, K., Brixius, N., Goux, J.-P., and Linderoth, J.,
have developed software tools that support the harnessing "Solving Large Quadratic Assignment Problems on
of idle computers for specific computations, via the use of Computational Grids", Mathematical Programming, 2000.
simple remote execution agents (workers) that, once
[4] Beiriger, J., Johnson, W., Bivens, H., Humphreys, S., and
installed on a computer, can download problems (or, in
Rhea, R., "Constructing the ASCI Grid", Proc. 9th IEEE
some cases, Java applications) from a central location and Symposium on High Performance Distributed Computing, IEEE
run them when local resources are available (i.e. Press, 2000.
SETI@home [19], Entropia, and Parabon). These tools
assume a homogeneous environment where all resource [5] Berman, F., "High-Performance Schedulers", Foster, I. and
management services are provided by their own system. Kesselman, C. eds., The Grid: Blueprint for a New Computing
Furthermore, a single master (i.e., a single submission Infrastructure, Morgan Kaufmann, 1999, pp. 279-309.
point) controls the distribution of work amongst all
[6] Berman, F., Wolski, R., Figueira, S., Schopf, J., and Shao,
available worker agents. Application-level scheduling
G., "Application-Level Scheduling on Distributed
techniques [5, 6] provide "personalized" policies for Heterogeneous Networks", Proc. Supercomputing '96, 1996.
acquiring and managing collections of heterogeneous
resources. These systems employ resource management [7] Bester, J., Foster, I., Kesselman, C., Tedesco, J., and
services provided by batch systems to make the resources Tuecke, S., "GASS: A Data Movement and Access Service for
available to the application and to place elements of the Wide Area Computing Systems", Sixth Workshop on I/O in
application on these resources. An application-level Parallel and Distributed Systems, May 5, 1999.
scheduler for high-throughput scheduling that takes data
[8] Casanova, H., Obertelli, G., Berman, F., and Wolski, R.,
locality information into account in interesting ways has "The AppLeS Parameter Sweep Template: User-Level
been constructed [8]. Condor-G mechanisms complement Middleware for the Grid", Proc. SC'2000, 2000.
this work by addressing issues of uniform remote access,
failure, credential expiry, etc. Condor-G could potentially [9] Czajkowski, K., Fitzgerald, S., Foster, I., and Kesselman,
be used as a backend for an application-level scheduling C., "Grid Information Services for Distributed Resource
system. Sharing", 2001.
Nimrod [2] provides a user interface for describing
[10] Czajkowski, K., Foster, I., Karonis, N., Kesselman, C.,
"parameter sweep" problems, with the resulting
Martin, S., Smith, W., Tuecke, S., "A Resource Management
independent jobs being submitted to a resource Architecture for Metacomputing Systems", Proc. IPPS/SPDP
management system; Nimrod-G [1] generalizes Nimrod to '98 Workshop on Job Scheduling Strategies for Parallel
use Globus mechanisms to support access to remote Processing, 1998.
resources. Condor-G addresses issues of failure, credential
expiry, and interjob dependencies that are not addressed [11] Epema, D.H.J., Livny, M., Dantzig, R.v., Evers, X., and
by Nimrod or Nimrod-G. Pruyne, J., "A Worldwide Flock of Condors: Load Sharing
among Workstation Clusters", Future Generation Computer
Systems, 12, 1996.
8. Acknowledgment
[12] Foster, I. and Kesselman, C., "Globus: A Toolkit-Based
This research was supported by the NASA Information Grid Architecture", Foster, I. and Kesselman, C. eds., The Grid:
Power Grid program. Blueprint for a New Computing Infrastructure, Morgan
Kaufmann, 1999, pp. 259-278.
[13] Foster, I., Kesselman, C., Tsudik, G., and Tuecke, S., "A [22] Livny, M., "High-Throughput Resource Management",
Security Architecture for Computational Grids", ACM Foster, I. and Kesselman, C. eds., The Grid: Blueprint for a New
Conference on Computers and Security, 1998, pp. 83-91. Computing Infrastructure, Morgan Kaufmann, 1999, pp. 311-
337.
[14] Foster, I., Kesselman, C., and Tuecke, S., "The Anatomy
of the Grid: Enabling Scalable Virtual Organizations", Intl. J. [23] Novotny, J., Tuecke, S., and Welch, V., "An Online
Supercomputer Applications, (to appear), 2001. Credential Repository for the Grid: MyProxy", to appear in
http://www.globus.org/research/papers/anatomy.pdf. HPDC10.
[15] Henderson, R. and Tweten, D., "Portable Batch System: [24] Papakhian, M., "Comparing Job-Management Systems:
External Reference Specification", 1996. The User's Perspective", IEEE Computational Science &
Engineering, (April-June) 1998. See also http://pbs.mrj.com.
[16] IBM, "Using and Administering IBM LoadLeveler,
Release 3.0", IBM CorporationSC23-3989, 1996. [25] Raman, R., Livny, M., and Solomon, M., "Resource
Management through Multilateral Matchmaking", Proceedings
[17] Institute, S.C.R., "DQS 3.1.3 User Guide", Florida State of the Ninth IEEE Symposium on High Performance Distributed
University, Tallahassee, 1996. Computing (HPDC9), Pittsburgh, Pennsylvania, August 2000,
pp. 290-291.
[18] Johnston, W.E., Gannon, D., and Nitzberg, B., "Grids as
Production Computing Environments: The Engineering Aspects [26] Steiner, J., Neuman, B.C., and Schiller, J., "Kerberos: An
of NASA's Information Power Grid", Proc. 8th IEEE Authentication System for Open Network Systems", Proc.
Symposium on High Performance Distributed Computing, IEEE Usenix Conference, 1988, pp. 191-202.
Press, 1999.
[27] Stevens, R., Woodward, P., DeFanti, T., and Catlett, C.,
[19] Korpela, E., Werthimer, D., Anderson, D., Cobb, J., and "From the I-WAY to the National Technology Grid",
Lebofsky, M., "SETI@home: Massively Distributed Computing Communications of the ACM, 40(11), 1997, pp. 50-61.
for SETI", Computing in Science and Engineering, 3(1), 2001.
[28] Vazhkudai, S., Tuecke, S., and Foster, I., "Replica
[20] Litzkow, M., Livny, M., and Mutka, M., "Condor - A Selection in the Globus Data Grid", Proc. Of the First
Hunter of Idle Workstations", Proc. 8th Intl Conf. on IEEE/ACM International Conference on Cluster Computing and
Distributed Computing Systems, 1988, pp. 104-111. the Grid (CCGRID 2001), IEEE Computer Society Press, May
2001, pp. 106-113.
[21] Litzkow, M., Tannenbaum, T., Basney, J., and Livny., M.,
"Checkpoint and Migration of UNIX Processes in the Condor [29] Zhou, S., "LSF: Load Sharing in Large-Scale
Distributed Processing System", University of Wisconsin- Heterogeneous Distributed Systems", Proc. Workshop on
Madison Computer Sciences, Technical Report 1346, 1997. Cluster Computing, 1992.