Friday, 23 March 2007

Web Services and securing access to large datasets

One of the challenges for the DEWS project has been the transfer of large datasets using web services. This is a well-known problem in the web services world, but in our case it is complicated by how we secure the transfer. The conflict comes here:

  • For DEWS we want to use standards where possible. For securing web services this has been interpreted as using SOAP with WS-Security, and we would like to use digital signatures where possible. [Message confidentiality may be required but is likely to be impracticable for the transfer of large datasets for performance reasons. We want to be able to authenticate clients downloading data, but we accept that we can't stop an attacker listening in as data is accessed. This may sound weak, but if clients are authenticated an attacker at least can't initiate retrieval of a dataset of their choosing; they can only listen out for an authorised user's request.]
  • ... but we're using Geoserver (an implementation of OGC services) to serve some of our datasets. It doesn't use SOAP (it uses a more RESTful style of interface) and it's not secured.
To address this we've created a Gatekeeper web service. It acts as a proxy for Geoserver, protecting it by applying security constraints to requests and passing only the valid ones through. We've built the Gatekeeper as a standard WebSphere-based web service. It uses WS-Security to give us digital signatures and can be run over HTTPS to give confidentiality where necessary.

Transfer of a large binary dataset is fine if you leave out the Gatekeeper and our adoption of WS-Security(!): Geoserver receives an HTTP request by GET or POST and returns the binary data back along the same channel, setting the content type so that the client will understand it.

Now, with the Gatekeeper and a SOAP interface placed between Geoserver and the client, it's not as straightforward. Tooling like WebSphere generates client and server stubs that expect SOAP in, SOAP out, not SOAP in, binary data out as we would want.

Unfortunately we've needed to make decisions on this problem quickly. The options we've considered are:

  • SOAP in binary data out: adapt interface to return binary data back instead of a SOAP response.
  • Use SOAP attachments
  • Do the transfer out of band of the web service: the SOAP response returns a data download URI for later retrieval by the client.

Taking them in turn...

SOAP in Binary Data out

Tooling for SOAP-based web services creates interfaces that expect SOAP, not binary data. To go with this option we would need some way to modify the server output and the client's handling of the response. The generated code is effectively a black box for WebSphere, and I suspect the same is true of alternatives such as Axis. Even if we got it working, it's not a standards-based solution.

SOAP Attachments

This keeps us standards based, but which technology do we adopt? MTOM seems the one to go for, but no one on the technical team has much experience with it or with any of the alternatives for SOAP attachments. Or should it be something else?
More importantly, can it cope with 120 MB datasets(!)? Probably not, but maybe we could divide the dataset into smaller chunks. Also, how do we handle error recovery? Is there a built-in mechanism? Both of these imply some intelligence on the part of the client and so a more heavyweight solution on the client side. This is something we want to avoid. The client should be as simple as possible so that potential users of the system have minimum overhead to get set up and connected.

We definitely need to know more about the limitations of this solution when applied to large datasets.

Out of Band Transfer

Handle the transfer of the binary data outside the bounds of the web service by returning to the client a URI to the dataset.

The advantages are that it's a simple solution and there are existing tools to handle download and error recovery. There are immediate security concerns, though. An attacker could snoop the URI of the dataset returned by the Gatekeeper. This first problem can easily be addressed by encrypting the channel using HTTPS. That still leaves the main problem with this idea: how does the server authenticate the client making a request to the URI? If it can't do this, anyone can download the data.

Added to this is the fact that the preferred method of retrieval from Geoserver in our case is GET rather than POST. GET is preferred because it will be more straightforward for the client. This decision rules out one possible solution, which is to use XML security to sign the POST message. Besides, if we went with this, why not dispense with the Gatekeeper's SOAP interface and use a REST-style one? ... but we've already said that we want to stick to standards where possible, and WS-Security is what we've picked here.

That leaves the problem of trying to secure a GET request. One alternative we considered was authentication by client IP address, but this quickly runs into problems when you consider that some clients can't guarantee a consistent IP address or operate behind a site proxy.

Another approach is for the client to sign some component of the GET request:
  1. Client makes authorisation request to Gatekeeper over SOAP interface
  2. Gatekeeper checks and returns the URI for download of the data and a unique security token
  3. Client makes a GET request to the URI sending arguments including the security token, a signature of the token and the Distinguished Name of the X.509 certificate to be used to verify the signature. (We should perhaps also include the name of the signature algorithm.)
  4. The server checks that the security token is valid
  5. The server checks the DN against the certificates held in a key store.
  6. If recognised it verifies the signature using the public key from the certificate -> if verified, it knows the request has come from a valid source.
All GET arguments are URL-safe base64 encoded.
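
The steps above can be sketched in Python roughly as follows. This is a minimal sketch, not the test script itself: the argument names (token, sig, dn, alg), the function names and the key_store mapping are all made up for illustration, and an HMAC stands in for the X.509 signature so that the example runs with the standard library alone. A real implementation would sign with the client's private key and verify with the public key from the certificate.

```python
import base64
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlsplit


def b64(data):
    """URL-safe base64 encode bytes and return text."""
    return base64.urlsafe_b64encode(data).decode("ascii")


def build_download_url(uri, token, signature, dn, algorithm):
    """Step 3: the client appends the encoded arguments to the download URI."""
    args = {
        "token": b64(token),
        "sig": b64(signature),
        "dn": b64(dn.encode("utf-8")),
        "alg": b64(algorithm.encode("ascii")),
    }
    return uri + "?" + urlencode(args)


def check_request(url, valid_tokens, key_store, verify):
    """Steps 4 to 6: return True only if every server-side check passes."""
    query = parse_qs(urlsplit(url).query)
    try:
        args = {k: base64.urlsafe_b64decode(v[0]) for k, v in query.items()}
        token, sig = args["token"], args["sig"]
        dn = args["dn"].decode("utf-8")
    except (KeyError, ValueError):
        return False                    # missing or badly encoded argument
    if token not in valid_tokens:       # step 4: is the token one we issued?
        return False
    key = key_store.get(dn)             # step 5: do we recognise the DN?
    if key is None:
        return False
    return verify(key, token, sig)      # step 6: does the signature check out?


# Stand-in signer and verifier (ASSUMPTION: HMAC-SHA256 with a shared key,
# used here only so that the sketch is runnable without an X.509 library).
def demo_sign(key, token):
    return hmac.new(key, token, hashlib.sha256).digest()


def demo_verify(key, token, sig):
    return hmac.compare_digest(demo_sign(key, token), sig)
```

Passing the verifier in as a function keeps the request handling separate from the choice of signature algorithm, which is the part that would change in production.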

I've tried this out with a simple Python test script; the production code will be Java. If we implement the server side with a servlet, I'm told we can add extra security constraints, such as a time limit on its duration, and we can log when a download starts and stops.
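
One way to read the time limit is as an expiry on the security token itself. The sketch below takes that reading and also makes tokens single-use; both are assumptions on my part rather than settled design, and the TokenStore class and its method names are hypothetical.

```python
import secrets
import time


class TokenStore:
    """Issues download tokens that can be redeemed once, before they expire."""

    def __init__(self, ttl=300.0, clock=time.monotonic):
        self._ttl = ttl          # lifetime of a token in seconds (assumed value)
        self._clock = clock      # injectable so that expiry can be tested
        self._expiry = {}        # token -> time at which it stops being valid

    def issue(self):
        """Create a new unguessable token and record when it expires."""
        token = secrets.token_urlsafe(32)
        self._expiry[token] = self._clock() + self._ttl
        return token

    def redeem(self, token):
        """Return True once per issued token, and only before it expires."""
        expires = self._expiry.pop(token, None)
        return expires is not None and self._clock() < expires
```

The Gatekeeper would call issue() when it returns the download URI, and the download servlet would call redeem() before streaming any data.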

This is not a standards-based solution; it's a custom one. An initial look at Google Authentication, though, has given me reason to believe it uses something similar.

Besides this, all the security steps up to this point (attribute retrieval, role mapping and authorization) are achieved over SOAP-based services secured with WS-Security. Only the final step is different.

The client side will require some coding in order to generate the digital signature but this should be straightforward using the relevant Java OpenSSL library.

This is the chosen solution unless SOAP with attachments proves much, much easier.

Monday, 5 March 2007

Eggifying the Python OpenSSL Package M2Crypto

This has been on a todo list for a long time. A priority for the NDG (NERC DataGrid) Security package is to make it as easy to deploy as possible. For this, I'm creating Python Eggs for separate client and server packages.

Third-party packages are set up as dependencies, some being easier to incorporate than others. For M2Crypto, a Python wrapper to OpenSSL, the one problem is getting setup to recognise alternative installation locations of OpenSSL to link against.

There is already an --openssl command line option to specify an alternate path, but it is not integrated with the rest of distutils. In particular, you can't specify it as a config file option, and you can't use the include_dirs or library_dirs options as alternative means of passing in the correct OpenSSL paths either. Being able to use config file options will help greatly in setting up M2Crypto as an Egg dependency.

One straightforward way to solve this would seem to be to write your own build_ext class. Each command given on the command line for setup has an equivalent class implemented for it in distutils. You can add new ones or override existing ones.

import os
from distutils.command import build_ext

class _build_ext(build_ext.build_ext):
    '''Specialization of build_ext to enable swig_opts to
    inherit any include_dirs settings made at the command
    line or in a setup.cfg file'''

    user_options = build_ext.build_ext.user_options + \
    [('openssl=', 'o', 'Prefix for openssl installation location')]

    def initialize_options(self):
        '''Overload to enable custom openssl settings to
        be picked up'''
        # Let the base class set up its own defaults first
        build_ext.build_ext.initialize_options(self)

        # openssl is the attribute corresponding to openssl
        # directory prefix command line option
        if os.name == 'nt':
            self.libraries = ['ssleay32', 'libeay32']
            self.openssl = 'c:\\pkg'
        else:
            self.libraries = ['ssl', 'crypto']
            self.openssl = '/usr'
        # ...

    def finalize_options(self):
        '''Overloaded build_ext implementation to append
        custom openssl include file and library linking
        settings'''
        build_ext.build_ext.finalize_options(self)

        # 'include' and 'lib' are assumed to be the conventional
        # subdirectories under the openssl prefix
        opensslIncludeDir = os.path.join(self.openssl, 'include')
        opensslLibraryDir = os.path.join(self.openssl, 'lib')
        self.swig_opts = ['-I%s' % i \
            for i in self.include_dirs + [opensslIncludeDir]]
        self.include_dirs += [opensslIncludeDir]
        self.library_dirs += [opensslLibraryDir]

The class variable user_options is extended to include the openssl option. When this option is parsed, distutils expects to store the setting in an attribute named openssl, which is set up in the customised initialize_options method. The openssl option will be displayed in the help message for build_ext and can be set on the command line with --openssl/-o or in a config file under build_ext as openssl=...

initialize_options and finalize_options are the key methods which are customised to tweak the link settings for openssl. Making the settings here means any changes set on the command line or in a config file are correctly picked up and set before compilation.
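
As an illustration, a setup.cfg fragment along these lines should be picked up by the custom command. The /opt/openssl prefix is a hypothetical example path, and the section name assumes the option is read from the section named after the build_ext command, as distutils does for its other command options:

```ini
# setup.cfg (sketch; /opt/openssl is a hypothetical install prefix)
[build_ext]
openssl = /opt/openssl
```

The equivalent on the command line would be python setup.py build_ext --openssl=/opt/openssl.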

setup picks up the custom _build_ext class via the cmdclass keyword, which takes a dictionary mapping the command name to the equivalent class:

setup(name = 'M2Crypto',
      # ...
      cmdclass={'build_ext': _build_ext})

Distutils customisation looks much more straightforward now but this is after lots of digging through documentation and code :)