Universal worker service
Version 0.1

IVOA WG Internal Draft 2005 January 24

Working Group: http://www.ivoa.net/twiki/bin/view/IVOA/IvoaGridAndWebServices
Author(s): Guy Rixon

Abstract

An interface definition is presented for controlling long-running activities ('jobs') via an asynchronous SOAP service. The interface follows the WS-ResourceFramework (draft) standard of OASIS and includes ideas from AstroGrid's Common Execution Architecture. If implemented in full, the interface defines a universal worker service.

Status of this Document

This is an IVOA Working Draft for review by IVOA members and other interested parties. It is a draft document and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use IVOA Working Drafts as reference materials or to cite them as other than "work in progress". A list of current IVOA Recommendations and other technical documents can be found at http://www.ivoa.net/Documents/.

Acknowledgments

This document is a synthesis of ideas put forward in the grid community, in AstroGrid and in the IVOA working-group. The core semantics are copied directly from AstroGrid's CEA, originally designed by Paul Harrison. The software agents in the eSTAR system were also an influence. The special semantics for resuming jobs after service restart came from a community discussion in AstroGrid's consortium meeting of December 2004.

Abstract
Status
Acknowledgments
Contents
1. Introduction
2. Related technology
3. Semantics
4. WSDL contract
5. Examples of message patterns
6. Interaction with VOStores
References

1. Introduction

Simple web services carry out activities synchronously: i.e. the client requests that an activity be undertaken, and the service does not reply until the activity is complete (or has failed, or the request is rejected). In the case of HTTP services, the message exchange in the transport protocol is not completed until the service is ready to report on the activity. This mode of operation is derived from the world-wide web and assumes web-like behaviour: interactions with the service are broken into small units and each interaction completes quickly. Typically, the work has to be broken into units that will complete within the HTTP timeout period of a few minutes.

VO services that do advanced work do not always fit the synchronous model. It is not always sensible to break the work into units that fit the synchronous interface. Three use cases come to mind that involve long-running activities.

A service that does complex queries on a relational database. The activity on the DB cannot be broken up to suit the service interface.
A service that processes collections of image data on a cluster, or on a grid. The processing can be split into small units (the client could submit each image-tile separately to the service), but doing so complicates matters for the client and may defeat any attempt to optimise the job.
A service that queues requests until resources are available.

In each case, the activity may be long compared to:

an HTTP message-timeout;
the up-time of the client (especially if the client is on a lap-top computer);
the up-time of the network;
the up-time of the service (if the server host restarts and the service restarts its job from a check-point).

Long-running activities are implicitly expensive in computing resources and it is undesirable to lose part-completed jobs when some part of the infrastructure fails temporarily. It is useful for a client to get a cost/duration estimate of the activity before committing the service to carry out the work.

Agents in the VO need an asynchronous service interface to long-running activities that has the following characteristics.

A client may create a job by a request to a service. The service replies to this request 'immediately' (within seconds), accepting or rejecting the job and, on acceptance, stating an identifier for the job.
The client may request an estimate of the duration of the job.
The client may start the job by a request to the service. The service replies immediately, not waiting for the job to complete.
The client may poll the status of the job, or may request asynchronous notification of progress.
The client may abort the job by contacting the service.
The service check-points its jobs. If it restarts, the client may request it to carry on from the last check-point. The service does not restart from a check-point without approval from the client (see discussion below).
The results of a job are stored (in a VOStore) for the client to use later. The service can complete a job and deliver the results without help from the client. Furthermore, the service need not keep the results of a job in local storage pending recovery by the client.

The definition of the job to be carried out may be expressed to the service in two ways.

The SOAP operations that create jobs may take a number of arguments that combine to define the job. The schema of these arguments is specific to a particular service.
The job-creating operations may take a single argument which is an XML document in a job-control language, that language being a standard. The job-control language defines all aspects of the job.

If services use a standard job-control language, then their operations to create jobs can have the same signature: the nature of the job does not affect the signature. Such services can then have identical WSDL contracts. These services might be viewed as Universal Worker Services. The definition of such a class of services gives many opportunities to reuse parts of implementations and decreases the cost of implementing the asynchronous interface.

2. Related technology

Computational grids are based on asynchronous service-interfaces, grid jobs being the extreme case of long-running activities. The service interface described in this document follows the semantics of grid services as defined by the (draft) standard WS-ResourceFramework [WS-RF] from OASIS. It is likely that a service to this IVOA standard could be constructed using a grid toolkit from outside the VO movement. It is possible that a workflow system intended for the grid could be made to call services to this IVOA standard.

The concept of creating a job as an ephemeral resource inside a service, and then returning to that resource to manage the job, is the 'factory pattern' of the WS-ResourceFramework standards. This pattern does not prescribe a standard operation for creating jobs and WS-RF does not specify a job-description language.

AstroGrid has, from the start, attempted to implement asynchronous service interfaces. The Common Execution Architecture [CEA] uses asynchronous service interfaces that allow basic management of jobs. CEA defines a WSDL contract for a universal worker service (although not under that name) and also provides a job-description language. This document proposes a form of interface which is semantically very close to CEA but which differs in syntax. That is, a current CEA client could not consume a service written to this document, but it would be very easy to evolve a CEA system to use this new standard.

CEA does not currently conform to WS-RF syntax.

3. Semantics

3.1 Job identifier

A job SHALL be represented as a WS-Resource [WS-RF]. When a job is created, the service SHALL create an identifier for it according to the rules of WS-RF and shall return that identifier to the client. This means that the resource identifier seen by the client SHALL be a WS-Resource Qualified Endpoint Reference consisting of a WS-Addressing structure stating the service endpoint that contains a ReferenceProperties element identifying the resource in the context of the service. This is an example of such an identifier:

<wsa:From>
  <wsa:Address>http://my.server/my.service</wsa:Address>
  <wsa:ReferenceProperties>myResourceId</wsa:ReferenceProperties>
</wsa:From>

where the prefix wsa is bound to the namespace for WS-Addressing. The example above would be used in a message from the service to the client. To make a request in respect of the job, the client should use the identifier

<wsa:To>
  <wsa:Address>http://my.server/my.service</wsa:Address>
  <wsa:ReferenceProperties>myResourceId</wsa:ReferenceProperties>
</wsa:To>

changing the outer element but keeping the contents the same. Both wsa:From and wsa:To are of xsi:type="wsa:EndpointReferenceType". WS-Addressing defines several other elements of xsi:type="wsa:EndpointReferenceType". In IVO SOAP services, the service MUST use wsa:From and the client MUST use wsa:To.
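The conversion from the identifier received from the service to the one sent in requests might be sketched as follows. This is an illustration in Python, not part of the interface: the function name is hypothetical, and the namespace URI bound to wsa is an assumption, since this draft does not state which version of WS-Addressing applies.

```python
import copy
import xml.etree.ElementTree as ET

# Assumption: this draft does not name the WS-Addressing namespace to use.
WSA_NS = "http://schemas.xmlsoap.org/ws/2004/08/addressing"
ET.register_namespace("wsa", WSA_NS)


def to_request_identifier(from_element):
    """Return a wsa:To element with the same contents as a wsa:From element."""
    if from_element.tag != f"{{{WSA_NS}}}From":
        raise ValueError("expected a wsa:From element")
    to_element = copy.deepcopy(from_element)
    to_element.tag = f"{{{WSA_NS}}}To"  # only the outer element changes
    return to_element


# Identifier as received from the service (same content as the example above).
received = ET.fromstring(
    f'<wsa:From xmlns:wsa="{WSA_NS}">'
    "<wsa:Address>http://my.server/my.service</wsa:Address>"
    "<wsa:ReferenceProperties>myResourceId</wsa:ReferenceProperties>"
    "</wsa:From>"
)
request_id = to_request_identifier(received)
print(ET.tostring(request_id, encoding="unicode"))
```

The received element is copied rather than renamed in place, so the client keeps the original structure for comparison with later messages from the service.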

The resource identifier in the ReferenceProperties MUST be unique for all time in the context of the issuing service, such that requests carrying ancient job-identifiers can never be confused with recent jobs. However, the value of this element MAY be duplicated in different services and clients MUST NOT depend on the value being globally unique. Given that the Address element is unique to one service, a client MAY assume that the overall structure is globally unique.

When a service accepts a job, that service might delegate the job to another service with the same asynchronous interface. E.g., one service may act as a load-balancer for a server farm. In this case, the first service MAY put its own endpoint in the identifier structure or MAY put the endpoint of the service to which the work was delegated. In the latter case, the services are inviting the client to send subsequent requests concerning the job directly to the second service, but the client need not act on this redirection. Both the service making the delegation and the service accepting the delegation MUST handle requests concerning the job in the same way. This generally means that the delegating service must be prepared to forward SOAP messages.
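One possible way for a service to issue such identifiers is sketched below in Python. The use of random UUIDs is an assumption of this sketch, not a requirement of this document; they make reuse vanishingly unlikely rather than strictly impossible, but need no counter that must survive a service restart. The function names are hypothetical.

```python
import uuid

SERVICE_ADDRESS = "http://my.server/my.service"  # example endpoint from the text


def new_resource_id():
    """Return a resource identifier that is, for practical purposes, never reused."""
    # uuid4 is random, so uniqueness does not depend on state kept by the service.
    return uuid.uuid4().hex


def job_identifier(resource_id):
    """Pair the service address with the resource id.

    The resource id alone is unique only within one service; the pair is what
    a client may treat as globally unique.
    """
    return (SERVICE_ADDRESS, resource_id)
```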

Job identifiers SHALL be carried in SOAP headers. When a client makes a request in respect of an existing job, it SHALL include the job identifier in the SOAP header, using the format described above. When a service replies to such a request, and when it sends asynchronous notification concerning a job, then the service SHALL include the job identifier in the SOAP header.

3.2 Job creation

To create a new job in a service, a client SHALL call a SOAP operation that follows the factory pattern of WS-RF. Such an operation SHALL either accept or reject the job immediately, depending on the validity of the request. Validity checks might include a syntax check of the job description or a test of access rights.

If the job is rejected, the service SHALL return a fault that states why the job was rejected.

If the job is accepted, the service shall return a WS-Resource identifier for the job as specified above. WS-RF does not specify exactly how the identifier should be transported to the client. In an IVO SOAP service, the identifier SHALL be included in the SOAP header of the response to the job-creation request.

When a service accepts a job, that service SHALL make a record of that job (see the provision for persistence of jobs, above).

Authors of services are encouraged to use the standard createJob operation, with the syntax defined below in this document, to implement the factory pattern. The request message for this operation takes as argument a single XML-document containing a job description. The response message for this operation is empty of arguments if the job is accepted. However, service authors are free to specify other job-creation operations with different syntax, provided that the operations conform to the requirements in this section.
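The accept-or-reject behaviour of a job-creating operation might be sketched as follows. This is a Python illustration of the semantics only, not of the SOAP binding; the class names, the validity checks and the in-memory job record are all assumptions of the sketch.

```python
import uuid


class JobRejected(Exception):
    """Stands in for the SOAP fault returned when a job is rejected."""


class JobFactory:
    """Hypothetical factory: validates a request, then records the job."""

    def __init__(self, address, authorised_users):
        self.address = address
        self.authorised_users = set(authorised_users)
        self.jobs = {}  # record of accepted jobs, keyed by resource id

    def create_job(self, job_description, user):
        # Validity checks are made immediately, before any work is done.
        if user not in self.authorised_users:
            raise JobRejected("user is not permitted to run jobs here")
        if not job_description.strip():
            raise JobRejected("job description is empty")
        resource_id = uuid.uuid4().hex
        self.jobs[resource_id] = {"description": job_description, "state": "PENDING"}
        # In the SOAP binding this pair would travel in the response header.
        return (self.address, resource_id)
```

Creating a job does not start it: the new record is in the PENDING state until the client asks for the job to start.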

3.3 Job lifetime and destruction

When a WS-Resource representing a job is destroyed, the service loses all memory of it. Hence, jobs are not immediately destroyed after the work of the job is completed. Instead, a service retains the resource after the job is finished. A resource is destroyed when its set lifetime expires, when the client invokes the Destroy operation (specified by WS-ResourceLifetime) on the resource, or when its volatile service (see below) restarts.

If the WS-Resource for a job is destroyed while that job is in progress, the service SHOULD abort the job. The standard Destroy operation on the resource is the abort command for the job. If the service cannot abort the job (e.g. the work remains active on some underlying engine such as an RDBMS) then the service MAY keep the resource alive (i.e. the client can still refer to it) or MAY behave as if the job has actually aborted (i.e. the service returns a fault for requests concerning the job).

Services may be volatile or non-volatile. Non-volatile services remember job details across service restarts and can resume jobs; volatile services forget all job details when they restart. IVO services SHOULD be non-volatile, but volatile implementations are allowed for simplicity.

The default lifetime of the resource is set by the author of the service. The lifetime SHALL be counted from the point when the job is accepted. The service SHALL support the SetTerminationTime operation (defined by WS-ResourceLifetime) to change this time, but is not required to accept every request to change the lifetime; e.g. the service MAY reject requests to set very-long or infinite lifetimes.

The lifetime SHOULD be as long as possible: days rather than hours or minutes. The underlying assumption is that the service can free any expensive resources (e.g. send bulk-data results to a VOStore) when the work of the job ends and retain in the resource only a fragment of metadata describing the job's outcome.
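The lifetime rules above might be sketched as follows in Python. The default and maximum lifetimes are hypothetical values chosen for illustration, and the method names merely echo the WS-ResourceLifetime operation; this is not an implementation of that standard.

```python
from datetime import datetime, timedelta

DEFAULT_LIFETIME = timedelta(days=7)   # hypothetical default set by the service author
MAX_LIFETIME = timedelta(days=30)      # hypothetical limit on requested lifetimes


class JobResource:
    def __init__(self, accepted_at):
        # The lifetime is counted from the moment the job is accepted.
        self.accepted_at = accepted_at
        self.termination_time = accepted_at + DEFAULT_LIFETIME

    def set_termination_time(self, requested):
        """Mimic SetTerminationTime; None stands for an infinite lifetime."""
        if requested is None or requested - self.accepted_at > MAX_LIFETIME:
            raise ValueError("requested lifetime is too long for this service")
        self.termination_time = requested

    def is_expired(self, now):
        """True once the resource should be destroyed."""
        return now >= self.termination_time
```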

3.4 Reporting the state of a job

The service SHALL provide metadata describing the state of a job via the mechanism of WS-ResourceProperties (WS-RP). This means that the service SHALL support the operations GetResourceProperty and GetMultipleResourceProperties (defined in WS-RP) that return metadata describing the WS-Resource for the job. The service SHOULD reject the SetResourceProperty operation.

The resource-properties document for a service defines the metadata returned by the above operations and is part of the service's WSDL contract. IVOA asynchronous services SHALL use a standard resource-properties document that describes a job.

The standard resource-properties for the job resource include the following:

State of the job: PENDING, QUEUED, EXECUTING, RESTARTED, ENDED. PENDING means that the job has been accepted but not yet started by the client. QUEUED means that the client has asked the service to start the job but the service has not yet begun the work. ENDED covers both completed and failed jobs. RESTARTED means that the service has restarted, has recovered the job state from a checkpoint and is waiting for permission to resume the job (see below).
Error report for the job (empty if the job has no errors).
Estimate of the total time needed to execute the job. This includes the time already spent. The service makes an initial estimate when the job is accepted and may adjust the estimate during the job.
Estimate of the cost (e.g. CPU, network bandwidth, storage or monetary cost) to complete the job. The service makes an initial estimate when the job is accepted and may adjust the estimate during the job.
Estimate of fractional completion.

The service SHALL include all these properties in reports, but may put null values (XML attribute xsi:nil="true" on an empty element) for the last three. The service MUST put proper values for the state and error report. See the WSDL contract, below, for details.
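The shape of the job-state metadata might be modelled as follows. This Python sketch is an in-memory analogue only: the field names and units are assumptions, and the actual resource-properties document belongs to the WSDL contract, which this draft does not yet define. None plays the role of an element marked xsi:nil="true".

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class JobState(Enum):
    PENDING = "PENDING"
    QUEUED = "QUEUED"
    EXECUTING = "EXECUTING"
    RESTARTED = "RESTARTED"
    ENDED = "ENDED"


@dataclass
class JobProperties:
    state: JobState                               # always a proper value
    error_report: str = ""                        # empty if the job has no errors
    estimated_duration_s: Optional[float] = None  # total time, including time spent
    estimated_cost: Optional[float] = None        # units are service-specific
    fraction_complete: Optional[float] = None     # between 0.0 and 1.0


# Only the three estimates may be reported as nil.
NILLABLE = ("estimated_duration_s", "estimated_cost", "fraction_complete")


def nil_properties(props):
    """Return the names of the properties that would be reported as nil."""
    return [name for name in NILLABLE if getattr(props, name) is None]
```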

A service MAY support asynchronous notification of changes in state of a job. This means that state reports are sent to the client by the service without being invoked in a polling operation. If a service does not support asynchronous notification, then the client MUST be prepared to poll to determine the job state.

If the service supports asynchronous notification, then it MUST do so according to the OASIS (draft) standards WS-BaseNotification, WS-Topics and WS-BrokeredNotification. These standards work with the metadata of the WS-ResourceProperties standard.
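For a client that polls, the loop might be sketched as follows in Python. The get_state callable is a hypothetical stand-in for a GetResourceProperty call that returns the job state as a string; the interval and the limit on polls are arbitrary choices of this sketch.

```python
import time


def poll_job(get_state, interval_s=5.0, max_polls=100, sleep=time.sleep):
    """Poll until the job needs the client's attention, then return its state.

    Polling stops at ENDED (the job completed or failed) and at RESTARTED
    (the service is waiting for permission to resume the job).
    """
    for _ in range(max_polls):
        state = get_state()
        if state in ("ENDED", "RESTARTED"):
            return state
        sleep(interval_s)
    raise TimeoutError("job did not reach ENDED or RESTARTED in time")
```

Stopping at RESTARTED as well as ENDED lets a polling client notice a service restart without relying on notifications.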

3.5 Estimating the cost of a job

The service MAY support estimates of the duration and/or computing costs of a job. There is no special operation for obtaining these estimates. Instead, the client MAY read the estimates from the job state, as described above, at any time between job creation and job destruction.

3.6 Resuming a job

If a service restarts during a job, the service MUST NOT resume the job without instructions from the client. While the service was down, the client might have reassigned the job to another service. In this case, it would be disastrous if the original service spontaneously started writing results to the same output location. Instead, a restarting service SHALL put all its jobs into the RESTARTED state (see the section on state reporting, above).

A client may call the resumeJob operation on the service in respect of a job. The client shall pass the resource identifier of the job in the SOAP header. If this operation is called for a job in the RESTARTED state, the service shall resume execution of the job from the last check-point. If the operation is called for a job in another state, then the service shall return a fault.
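The resume rule might be sketched as a small state machine in Python. The class and method names are hypothetical. The sketch assumes that only jobs that were QUEUED or EXECUTING when the service stopped move to RESTARTED, and that PENDING and ENDED jobs keep their state; the text above does not spell this out.

```python
class JobFault(Exception):
    """Stands in for the SOAP fault returned by the service."""


class Service:
    def __init__(self, jobs):
        self.jobs = dict(jobs)  # resource id -> state name

    def restart(self):
        """Recover job state after a restart without resuming any work."""
        for resource_id, state in self.jobs.items():
            if state in ("QUEUED", "EXECUTING"):  # assumption, see lead-in
                self.jobs[resource_id] = "RESTARTED"

    def resume_job(self, resource_id):
        """Resume from the last check-point, but only from the RESTARTED state."""
        if self.jobs.get(resource_id) != "RESTARTED":
            raise JobFault("job is not in the RESTARTED state")
        self.jobs[resource_id] = "EXECUTING"
```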

4. WSDL contract

The WSDL contract will be added in a later draft of this document.

5. Examples of message patterns

5.1 Job runs to completion: polling

5.2 Job runs to completion: asynchronous notification

Sequence diagram: job with asynchronous notification

Notes:

The Subscription object uses the same WsResourceId as the Service object.

5.3 Resuming a job

Sequence diagram: job resumed after service restarts

Notes:

Resuming a job implies restarting the notification subscriptions of that job if the service supports notifications.
Resuming a job is still possible if the service doesn't support notifications. The service simply waits for the client to notice the restart by polling.

5.4 Checking the cost of a job

The client creates the same job on two services with the same basic capability. The client gets the job state metadata from both services and checks the two estimates of the job duration. The second service estimates a shorter run-time, so the client starts that job and destroys the job on the other service.

Sequence diagram: checking cost of a job
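The client's decision in this example might be sketched as follows in Python. The function name and the mapping from service name to estimated duration are assumptions of the sketch; a value of None represents a service that reports no estimate.

```python
def choose_service(estimates):
    """Pick the service with the shortest estimated duration.

    estimates maps a service name to an estimated duration in seconds,
    or to None if that service gave no estimate.
    Returns the chosen service and the services whose jobs should be destroyed.
    """
    known = {name: est for name, est in estimates.items() if est is not None}
    if not known:
        raise ValueError("no service supplied an estimate")
    chosen = min(known, key=known.get)
    to_destroy = [name for name in estimates if name != chosen]
    return chosen, to_destroy


chosen, to_destroy = choose_service({"service-1": 3600.0, "service-2": 1200.0})
print(chosen, to_destroy)
```

The client would then start the job on the chosen service and invoke Destroy on the job resources held by the others.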

6. Interaction with VOStores

A service with a synchronous interface typically returns the results of its job to the client as part of the message reporting the final status of the job. An asynchronous service cannot do this. Instead, the service can put the results in a VOStore, from which the client can later recover them. In general, bulk data should not be sent as part of a SOAP message, since this causes problems with some SOAP engines. It is better that all bulk data be read from or written to VOStores using a file-transfer protocol.

A service conforming to this document SHALL support reading data-sets from a VOStore and writing data-sets to a VOStore. The instructions to do so SHALL be part of the job-control language and their format is not specified here.

References

[WS-RF]: Czajkowski, K., Ferguson, D. F., Foster, I., Frey, J., Graham, S., Sedukhin, I., Snelling, D., Tuecke, S., Vambenepe, W. The WS-Resource Framework. IBM developerWorks library.
[CEA]: Harrison, P. A proposal for a common execution architecture. IVOA note.
