Work around abusive users

Beginning

This work was proposed by Mark Holliman, and we had a first discussion during the Heidelberg IVOA Interop in May 2013 (GWS Session 2). The aim is to define best practices against abusive use of VO services.

Inputs from Mark Holliman

Application Considerations to prevent Abusive User Behaviour

This page is meant to be a sounding board and general discussion area for the issue of preventing users from knowingly or unknowingly abusing services through VO enabled applications. The topic is of interest both to application developers and data providers. The goal is to make VO enabled applications less likely to enable problematic behaviour by default and to develop possible solutions for data providers that can alleviate or prevent service disruptions.

Common examples of "abusive behaviour" include Denial of Service attacks (where a user overloads a service with requests, effectively crashing it), and ...

Incident Examples and Proposed Solutions

  • Topcat Multicone DoS on WFAU services
    • Description: In January and March 2013 there were incidents where users unknowingly crashed the cone search services for WFAU's archives. In both cases the users were running multiple instances of Topcat, performing multi-cone queries with the multi-threaded setting activated. In the worst case the user was effectively running 600 concurrent connections to the cone search services continuously for 72 hours, with each query requesting 1 square degree of tabular results. The web container running the services was overloaded and eventually crashed, bringing down access for all users.
    • Lessons Learned: A bug was identified in Topcat whereby the "Stop" button didn't actually stop subsequent queries from being submitted. The user thought they had stopped their requests multiple times, when in fact the queries continued in the background alongside all subsequent queries they were submitting. This bug is now fixed.
    • Proposed Solutions:
      1. VO Applications should use low default settings for options like multithreading. When possible, they should also provide warning messages when a user selects a setting that could be considered unreasonable.
      2. VO Services could return 'HTTP 307 Temporary Redirect' responses to clients that are overloading their services. The temporary redirect would point to a page with a time-delayed redirect, where the delay is set by the server in order to postpone the query response. As more requests come in, the wait time would grow. Client applications must be able to understand the redirect response and act accordingly.
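The first proposal and the Topcat lesson above can be sketched on the client side as follows. This is only an illustration, not Topcat's implementation: the names `run_multicone`, `DEFAULT_MAX_WORKERS` and `WARN_ABOVE_WORKERS`, and the numeric values, are assumptions made for the example. It shows a low default for parallelism, a warning for high settings, and a stop flag that is checked before every query so that stopping really prevents subsequent submissions.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Assumed values for illustration only; not taken from any VO client.
DEFAULT_MAX_WORKERS = 2
WARN_ABOVE_WORKERS = 8


def run_multicone(positions, query_fn, stop_event, max_workers=DEFAULT_MAX_WORKERS):
    """Run query_fn once per position with bounded concurrency.

    Returns one result per position; positions skipped after a stop
    request yield None.
    """
    if max_workers > WARN_ABOVE_WORKERS:
        print(f"warning: {max_workers} parallel queries may overload the service")

    def guarded(position):
        # The flag is checked immediately before each query is sent, so a
        # stop request prevents every query that has not started yet.
        if stop_event.is_set():
            return None
        return query_fn(position)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(guarded, positions))
```

A "Stop" button would simply call `stop_event.set()`; queries already in flight finish, but no new ones are submitted.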

Inputs from CDS

The VizieR solution for abusive users is still to set a delay for HTTP queries identified as "abusive". The solution is manual: the IP address is added to a "black" list, and queries from listed addresses are subjected to a "sleep 5" before being executed. However, as the capacity of VizieR to handle queries has increased during the last year, this action is rarely used.
We should also consider that abusive access to a service may originate somewhere inside a workflow (a simple pipeline or a more complex execution plan). In this case the user will probably not know that they are responsible. From their point of view, the query will simply appear slow (if a delay is used) or fail (if a redirection is used).
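The delay list described above could look roughly like the following sketch. It is not the VizieR code: `handle_query`, `delayed_ips` and the injectable `sleep` parameter are hypothetical names chosen for the example; only the 5 second delay comes from the text.

```python
import time

DELAY_SECONDS = 5  # mirrors the "sleep 5" described above


def handle_query(client_ip, run_query, delayed_ips, sleep=time.sleep):
    """Execute run_query(), sleeping first if client_ip is on the delay list.

    delayed_ips is a manually maintained set of IP address strings.
    The sleep function is injectable so the behaviour can be tested
    without actually waiting.
    """
    if client_ip in delayed_ips:
        sleep(DELAY_SECONDS)
    return run_query()
```

Note that the client receives no indication of why the query is slow, which is exactly the workflow concern raised above.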

Inputs from ThomasBoch

In the CDS cross-match service, we return HTTP status 503 (Service Unavailable) if the service is too busy to process the request. We should also consider HTTP status 429 (Too Many Requests, see http://tools.ietf.org/html/rfc6585#section-4 ).
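One possible way to distinguish the two status codes is sketched below: 429 when a single client exceeds its own limit, 503 when the service as a whole is saturated. This is an assumption about how a service might choose between them, not a description of the CDS cross-match service; the function name, parameters and default retry time are hypothetical.

```python
def overload_response(active_requests, capacity,
                      client_requests, per_client_limit, retry_after=30):
    """Return (status, headers) if the request should be refused, else None.

    429 blames the individual client; 503 signals that the whole
    service is busy. Both may carry a Retry-After header.
    """
    headers = {"Retry-After": str(retry_after)}
    if client_requests >= per_client_limit:
        return 429, headers
    if active_requests >= capacity:
        return 503, headers
    return None
```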

Inputs from CADC

In the past at the CADC we had similar problems, where our processing users were making a very large number of requests to our VOSpace virtual storage system. Sometimes a user would inadvertently, perhaps because of a bug in their code or because of a resource-intensive script, make so many requests to VOSpace that the system became very slow for other users. Sometimes the overload would exhaust our low-level resources and make the system unusable.

Initially, we implemented a "throttling" scheme, where users were grouped and limited to a certain number of simultaneous requests. When users reached their limit, they were throttled: we would issue a 503 "Service Unavailable" response with a value in the "Retry-After" header, suggesting a time after which the user should retry their request. Our clients were written to recognize these HTTP responses and delay their actions appropriately.
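On the client side, honouring "Retry-After" means converting the header value into a wait time. The sketch below is not the CADC client code; `retry_delay` and its default of 10 seconds are assumptions for illustration. It handles both forms the header may take in HTTP: a number of seconds or an HTTP date.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_delay(header_value, now=None, default=10.0):
    """Convert a Retry-After header value into seconds to wait.

    Falls back to `default` (an assumed value) when the header is
    missing or cannot be parsed.
    """
    if header_value is None:
        return default
    value = header_value.strip()
    if value.isdigit():
        return float(value)  # delta-seconds form
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    # A date in the past means the client may retry immediately.
    return max(0.0, (when - now).total_seconds())
```

A client would sleep for the returned number of seconds after receiving a 503 (or 429) before resubmitting the request.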

However, we found it hard to set the correct limits and to identify the groups, which caused throttling to start too early or too late. We then realized that it is the limits of our low-level resources that we must understand and protect. These are resources such as database connections, threads, and file descriptors. So, using pools to control access to these resources, we issued the 503 responses when the pools were exhausted. We found this to be the true indicator of the load we could handle, and it required no maintenance or updating as our user base grew. Only the pool sizes need to be adjusted when our low-level resource capabilities change.

The other advantage of using pools to control resources is that the users' connections wait in line for a pool resource. The service operator can set how long a connection waits before the service gives up and issues the 503 response.
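The pool-based approach can be sketched as follows. This is a minimal illustration and not the CADC implementation: `ResourcePool`, `PoolExhausted`, `handle_request` and the retry value are hypothetical names and numbers. A request waits up to a configurable time for a free slot, and a 503 with "Retry-After" is issued only if no slot becomes available.

```python
import threading
from contextlib import contextmanager


class PoolExhausted(Exception):
    """Raised when no pool slot became free within the wait time."""


class ResourcePool:
    """Limits concurrent use of a low-level resource (e.g. DB connections)."""

    def __init__(self, size, wait_seconds):
        self._slots = threading.BoundedSemaphore(size)
        self._wait_seconds = wait_seconds

    @contextmanager
    def slot(self):
        # Callers queue here until a slot is released or the wait expires.
        if not self._slots.acquire(timeout=self._wait_seconds):
            raise PoolExhausted()
        try:
            yield
        finally:
            self._slots.release()


def handle_request(pool, work, retry_after=30):
    """Return (status, headers, body); 503 if the pool stayed exhausted."""
    try:
        with pool.slot():
            return 200, {}, work()
    except PoolExhausted:
        return 503, {"Retry-After": str(retry_after)}, None
```

With this design only `size` and `wait_seconds` need tuning when the underlying resource changes; no per-user or per-group limits have to be maintained.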

So far this seems to be working well, and we are now able to scale with our growing user base.

Inputs from ???

Feel free to complete

Topic revision: r6 - 2013-09-28 - AndreSchaaff
 