ONS will need to re-purpose 2001 standard output from the published tabular form into a format for web dissemination and testing for 2011. This will involve some form of merging multiple tables into condensed, rationalised cubes, probably based on SDMX. Census partners and users will no doubt be working along similar or parallel lines to develop applications and dissemination objectives that are broadly consistent. This section can be used to document and share experiences to date and lessons learned, and to discuss current and planned lines of future development where similar tools or techniques might help us to help each other. See also: 2001 - 2011 comparison data: Updating 2001 definitions
Converting 2001 tables to 2011 SDMX cube format
page revision: 7, last edited: 27 Oct 2009 12:24
To be added
Luckily we had the table frameworks as tab delimited text files, so it was possible to open these in Excel.
We then assigned category names to each group of column and row headers.
For a table of Age by General Health by Sex, the first row category would be General Health, the second row category would be Sex, and the column heading category would be Age.
eg
Then, for each cell id, we extracted the row and column headings together with their categories, and inserted these into a table.
ie
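A minimal sketch of this extraction step, in Python rather than Excel. The framework layout (leading row-header columns, then one column per column heading, with cell ids in the body), the sample labels, and the cell ids are all assumptions made for illustration; the real frameworks may be laid out differently.

```python
import csv
import io

# Hypothetical tab-delimited framework for "Age by General Health by Sex":
# two row-header columns, then one column per age band; body cells hold cell ids.
FRAMEWORK = (
    "\t\t0 to 4\t5 to 9\n"
    "Good health\tMale\t0001\t0002\n"
    "Good health\tFemale\t0003\t0004\n"
)

ROW_CATEGORIES = ["General Health", "Sex"]  # one name per row-header column
COLUMN_CATEGORY = "Age"                     # name for the column headings


def extract_cells(text, row_categories, column_category):
    """Return (cell_id, category, label) records, one per cell per category."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    n = len(row_categories)
    column_labels = rows[0][n:]
    records = []
    for row in rows[1:]:
        row_labels = row[:n]
        for column_label, cell_id in zip(column_labels, row[n:]):
            for category, label in zip(row_categories, row_labels):
                records.append((cell_id, category, label))
            records.append((cell_id, column_category, column_label))
    return records


records = extract_cells(FRAMEWORK, ROW_CATEGORIES, COLUMN_CATEGORY)
for record in records[:3]:
    print(record)
```

Each cell id ends up with one record per category, which is the long form that the later harmonisation and codelist steps work from.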
We did this for each of the tables. We then needed to harmonise the column and row headings, as the syntax used varied across the tables: eg Aged 0 to 4 years old was expressed as 0-4, 0 to 4, 0-4 years, 0 to 4 years, aged 0-4 and so on.
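The harmonisation of age bands could be sketched along these lines. The choice of "0 to 4" as the canonical form and the function name harmonise_age_label are assumptions for illustration; only the variant spellings come from the tables described above.

```python
import re

# Matches an optional "aged"/"age" prefix, two numbers joined by "-" or "to",
# and optional trailing "years"/"year" and "old".
_AGE_RANGE = re.compile(
    r"^(?:aged?\s+)?(\d+)\s*(?:-|to)\s*(\d+)(?:\s+years?)?(?:\s+old)?$",
    re.IGNORECASE,
)


def harmonise_age_label(label):
    """Map variant age-band spellings onto one form (assumed here: '0 to 4')."""
    cleaned = " ".join(label.split())  # collapse stray whitespace
    match = _AGE_RANGE.match(cleaned)
    if match is None:
        return cleaned  # not an age band we recognise; leave it for manual review
    low, high = match.groups()
    return f"{int(low)} to {int(high)}"


variants = ["0-4", "0 to 4", "0-4 years", "0 to 4 years", "aged 0-4",
            "Aged 0 to 4 years old"]
harmonised = {harmonise_age_label(v) for v in variants}
print(harmonised)
```

Labels that do not match the pattern are passed through unchanged, so anything unusual still shows up in the resulting list and can be tidied by hand.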
This gave us a distinct list of codes for each category, and we then used these lists as the codelists for SDMX.
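As a rough sketch, the distinct lists can be derived from (cell id, category, label) records once the labels have been harmonised. The record shape and the sequential numeric code values are assumptions for illustration, not the codes actually used in the project.

```python
from collections import defaultdict


def build_codelists(records):
    """Group harmonised labels by category, keeping first-seen order.

    Returns {category: [(code, label), ...]} where the code is simply the
    label's position in the list (an assumed, illustrative coding scheme).
    """
    labels_by_category = defaultdict(list)
    for _cell_id, category, label in records:
        if label not in labels_by_category[category]:
            labels_by_category[category].append(label)
    return {
        category: [(str(i), label) for i, label in enumerate(labels, start=1)]
        for category, labels in labels_by_category.items()
    }


sample_records = [
    ("0001", "Sex", "Male"),
    ("0001", "Age", "0 to 4"),
    ("0002", "Sex", "Male"),
    ("0002", "Age", "5 to 9"),
    ("0003", "Sex", "Female"),
    ("0003", "Age", "0 to 4"),
]
codelists = build_codelists(sample_records)
print(codelists)
```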
Thinking about this further, the best way to get this information would be to use the original questions asked on the census forms and take the possible answers as the categories; you would then get the official text for each code as well.
The questions, as shown in the explanatory volume, also show the structure for the hierarchical codelists.
Like the jargon? I made it myself.
The first things that we (mostly Rob) did in the CAIRD Project were to look at the potential XML schemas (DDI and SDMX) in order to familiarise ourselves with them and check on the kinds of information that could be encoded and the ways in which this is achieved within the schemas. Once we'd assured ourselves that the schemas could accommodate the kinds of information that we knew were entangled within the existing 2001 outputs, we set to work trying to extract the information in structured and usable forms as described by Rob above.
The initial stages of this involved a mixture of interactive tidying up and programmatic parsing of textual table frameworks that we had already produced as an intermediate stage in the creation of the HTML cell selection frameworks used in our Casweb interface. Rob and Richard Wiseman created versions of the frameworks in which all the row and column headers were straightened out and filled in with values. This involved a lot of rather fiddly work, especially for some of the more complex compound tables in our sample, which had to be split into several different tables.
Once the table frameworks had been restructured in this way, the codelists and their constituent codes were compared against a set of cleaned up standard codelists that Rob built up as he went along, in order to make sure that all the text labels for codes were consistent (the 0-4, 0 to 4, 0-4 years, 0 to 4 years, aged 0-4 problem described by Rob previously). Rob was then able to process the table frameworks to extract the cell IDs with their associated codelist/code pairs as contained in the final table in his description.
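The "filled in with values" step for row headers might look something like the sketch below. It assumes that a blank row-header cell means "same as the row above", which is a common convention in printed tables but is an assumption about these frameworks; the function name and sample rows are hypothetical, and compound tables would still need splitting by hand first.

```python
def fill_down_headers(rows, header_columns):
    """Fill blank row-header cells with the value from the row above.

    Once an outer header changes on a row, blank inner headers on that row
    are left blank rather than inherited from the previous group.
    """
    filled = []
    previous = [""] * header_columns
    for row in rows:
        headers = []
        outer_changed = False
        for i in range(header_columns):
            cell = row[i].strip()
            if cell:
                outer_changed = True
            elif not outer_changed:
                cell = previous[i]  # inherit from the row above
            headers.append(cell)
        previous = headers
        filled.append(headers + list(row[header_columns:]))
    return filled


sparse_rows = [
    ["Good health", "Male", "0001"],
    ["", "Female", "0003"],
    ["Fairly good health", "Male", "0005"],
    ["", "Female", "0007"],
]
filled_rows = fill_down_headers(sparse_rows, header_columns=2)
for row in filled_rows:
    print(row)
```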