Data Feeds

Over the last year or so I've heard quite a few references to data feeds, and there is some interest in the user community in using them to distribute 2011 Census data. Would you like to know more about the principles and practice of data feeds? If you have experience of using them, can you give us some examples of good and bad practice, issues and risks?

Comments

Data Feeds

As a company which uses and redistributes census data we will almost certainly download census data once only to hold on our own servers, taking into consideration that census data is large volume and, once released, not updated until the next census.

From this perspective data feeds of census data may not be of much value to us, although the prospect of closer linkage of data to metadata is potentially attractive.

We believe that whatever other formats or options exist, the availability of the data in .csv format is essential as a basic, bog-standard delivery option. It's not glamorous, new or high-tech, but it is very easy to process .csv reliably without additional development on the part of users. CSV is probably the only sufficiently widely accepted standard which ticks all the boxes for this purpose.
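To illustrate how little is needed to process CSV, here is a minimal sketch using only the Python standard library. The column names and figures are invented for the example and do not reflect any actual census table layout.

```python
import csv
import io

# Hypothetical extract: column names and counts are made up for illustration.
SAMPLE = """area_code,table,count
A1,usual_residents,312
A2,usual_residents,287
"""

def total_count(csv_text):
    """Sum the 'count' column of a CSV extract."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sum(int(row["count"]) for row in reader)

print(total_count(SAMPLE))
```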

Having said this, we may also be interested in data feeds, but they seem more appropriate for lower-volume, more frequently updated data such as economic time series, employment data, etc. Surely the census isn't the natural place to start with this technology?

Data feeds

The idea of using data feeds is one that has been gaining traction for a while; I'm co-ordinating an ESRC Data Feed Research Network that is looking at implications of the idea.

The advantages of a feed are that (with data always drawn from a central source) updates can be quickly propagated, and that data and metadata can be closely linked, presumably delivered using XML. An advantage in closely linking metadata is that it makes it much easier for automated sub-setting and aggregation routines to work. The main disadvantage is simply that it is not what people are familiar with, although typical csv 'standard tables' could easily be derived from a feed.
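As a rough sketch of deriving a CSV 'standard table' from an XML feed that carries its metadata alongside the data, something like the following might work. The element and attribute names (`feed`, `variable`, `observation`) are assumptions made up for this example, not an existing census feed schema.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical feed: the schema and values are invented for illustration.
FEED = """<feed>
  <variable id="v1" label="Usual residents" unit="persons"/>
  <observation area="A1" variable="v1" value="312"/>
  <observation area="A2" variable="v1" value="287"/>
</feed>"""

def feed_to_csv(xml_text):
    """Flatten observations to CSV, using the linked metadata for labels."""
    root = ET.fromstring(xml_text)
    # Metadata travels with the data, so labels can be looked up by variable id.
    labels = {v.get("id"): v.get("label") for v in root.findall("variable")}
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["area", "variable", "value"])
    for obs in root.findall("observation"):
        writer.writerow([obs.get("area"), labels[obs.get("variable")], obs.get("value")])
    return out.getvalue()

print(feed_to_csv(FEED))
```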

Such a feed would need to be rather more complex than an RSS feed, because of the volumes of data and number of possible views (i.e. from a single count, to a large number of counts for a given area, to a single item repeated for 200000+ Output Areas, etc.).

A more useful comparison IMO is with the Open Geospatial Consortium WMS and WFS servers. A census data feed will need to accept a 'getCensusData' request in which the user specifies what data they want; the server processes this and sends back the appropriate data. This could be preceded by some form of querying request that asks the server what data is available.
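The request pattern described above can be sketched as two functions: one that reports what is available and one that returns only the requested cells. This is an in-memory illustration only; `get_capabilities`, `get_census_data` and the data values are hypothetical names and numbers, not part of any existing service.

```python
# Hypothetical in-memory store: (area, variable) -> count. All values are made up.
DATA = {
    ("A1", "usual_residents"): 312,
    ("A1", "households"): 130,
    ("A2", "usual_residents"): 287,
    ("A2", "households"): 121,
}

def get_capabilities():
    """Describe what the server holds (the preliminary 'what is available' query)."""
    return {
        "areas": sorted({area for area, _ in DATA}),
        "variables": sorted({variable for _, variable in DATA}),
    }

def get_census_data(areas, variables):
    """Return only the cells the user asked for."""
    return {
        (area, variable): DATA[(area, variable)]
        for area in areas
        for variable in variables
        if (area, variable) in DATA
    }
```

The same call covers the different views mentioned earlier: `get_census_data(["A1"], ["households"])` returns a single count, while `get_census_data(get_capabilities()["areas"], ["households"])` returns one item repeated for every area.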

This could be done using a backend database that contained a set of pre-prepared aggregations for all areas, but it's worth noting that the same approach would be used for a flexible output system. If the requested data are not already in the system, then they could be added, assuming they pass some form of (semi-automated?) disclosure control test.
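A minimal sketch of that flow, assuming a simple minimum-count threshold stands in for the disclosure control test. The threshold, the function names and the data are placeholders for illustration; a real disclosure control method would be considerably more involved.

```python
MIN_CELL_COUNT = 10  # placeholder threshold, NOT a real disclosure control rule

# Hypothetical store of pre-prepared aggregations: (area, variable) -> count.
PREPARED = {("A1", "usual_residents"): 312}

def passes_disclosure_check(value):
    """Stand-in test: reject counts small enough to risk identifying people."""
    return value >= MIN_CELL_COUNT

def fetch(key, compute):
    """Serve a pre-prepared value, or compute, check and store a new one."""
    if key in PREPARED:
        return PREPARED[key]
    value = compute(key)  # 'compute' is whatever derives the new aggregation
    if not passes_disclosure_check(value):
        return None  # withheld: the value is never added to the system
    PREPARED[key] = value
    return value
```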

Apologies

Sorry, you've already got a feed for your blog.

To answer the question more directly:

I would guess that the problem with feeds is they aren't that well suited for the quantity of data you will be creating, unless you're feeding some summary of the data as a whole.

I think if you really want dynamic data provision, you're better off providing a public API, with some authentication (maybe openid?).

Feeds

There are two basic feed formats: RSS and Atom. RSS is probably more widely used, but some say the format itself is inferior to the Atom format.

Ideally you would support both, giving your users the option of subscribing to either.
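Supporting both formats need not mean maintaining two sets of content. A rough sketch, assuming the posts are held as a simple list and serialised twice with the Python standard library. The post title, URL and date are placeholders, and the output is a bare-bones illustration that has not been run through a feed validator.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"

# Hypothetical posts: title, URL and date are placeholders.
POSTS = [
    {"title": "Data Feeds", "url": "https://example.org/blog/data-feeds",
     "updated": "2000-01-01T00:00:00Z"},
]

def to_rss(site_title, site_url, posts):
    """Serialise the posts as a minimal RSS 2.0 channel."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = site_title
    ET.SubElement(channel, "link").text = site_url
    ET.SubElement(channel, "description").text = site_title
    for post in posts:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = post["title"]
        ET.SubElement(item, "link").text = post["url"]
    return ET.tostring(rss, encoding="unicode")

def to_atom(site_title, site_url, posts):
    """Serialise the same posts as a minimal Atom feed (expects at least one post)."""
    feed = ET.Element("feed", xmlns=ATOM_NS)
    ET.SubElement(feed, "title").text = site_title
    ET.SubElement(feed, "id").text = site_url
    ET.SubElement(feed, "updated").text = max(p["updated"] for p in posts)
    for post in posts:
        entry = ET.SubElement(feed, "entry")
        ET.SubElement(entry, "title").text = post["title"]
        ET.SubElement(entry, "id").text = post["url"]
        ET.SubElement(entry, "updated").text = post["updated"]
        ET.SubElement(entry, "link", href=post["url"])
    return ET.tostring(feed, encoding="unicode")
```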

The most obvious way to use feeds is to make your blog a feed, so that users will be automatically notified every time you add a new post to your blog, without having to remember to check your website regularly. You could do something similar for changes to your wiki (though that's more complicated). Given your website is public, that won't have any security problems.

A more complex feed reading system might allow users to dynamically access the actual data you collect, but I think you're best off exploring an API system for that, given the quantity and complexity of what you will be collecting.

Data Feeds

I think the idea of a data feed is to provide data as required. This reduces the need for having multiple versions of datasets duplicated here and there. If done right, a user knows which version of the data they are using and can access that same version at source even if a newer version is available, for example one with some errors fixed. Good feeds depend on good metadata. Are there plans to feed user-derived data as well as the provided data generated from source?
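The versioning idea could be sketched roughly as follows, assuming each release is kept at source under its own version label. The version numbers, values and the `get_value` function are hypothetical.

```python
# Hypothetical releases held at source; numbers are made up for illustration.
RELEASES = {
    "1.0": {"A1": 312},
    "1.1": {"A1": 313},  # a later release with an error corrected
}
LATEST = "1.1"

def get_value(area, version=None):
    """Return a value together with the version it came from.

    Passing an explicit version pins the request to that release,
    even when a newer one exists.
    """
    version = version or LATEST
    return {"version": version, "value": RELEASES[version][area]}
```

Because the version is returned with every response, a user can record it and later request exactly the same figures with `get_value("A1", version="1.0")`.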
