Right, let me warn you in advance that what follows is going to be deeply geeky and, while I’ll try my best to put this in layman’s terms, it may only make full sense if you possess at least a fair understanding of how databases work.
This evening, in common with a number of other bloggers (judging by the comments), I acquired from Nick Robinson’s blog on the BBC website a copy of the internal emails relating to the two CDs of child benefit data that have been lost in transit between HMRC and the NAO. To save time I’ve uploaded the emails (pdf) to MoT, where they can be downloaded from this link:
informationrelatingtochildbenefitdata.pdf
I’ve since read the emails (naturally) with an eye to trying to understand the technical aspects of this whole farrago, and what I’ve found raises even more questions as to what’s going on at HMRC.
For our starting point, let’s take these two emails:
We’re close to the start of the whole mess, with HMRC sending some sample data between sections and then on to NAO for analysis in advance of providing a full set of data, to allow NAO to figure out how the information is recorded and what information they actually want.
Now let’s get a bit technical.
The live system, from which the data on the two lost CDs was extracted, runs on a mainframe computer and this, therefore, limits considerably the range of possible systems that could be in use to either a completely bespoke system or – much more likely – a bespoke front end system linked to a corporate-grade relational database management system, something of the order of IBM’s Informix or DB2 platforms, Oracle, Sybase or CA-IDMS.
We’re talking big league systems here and that, in turn, places definable limitations on the technical aspects of moving that amount of information (25 million records) from A to B (or rather from HMRC to NAO).
In practice, you have two possible options: the information could be sent in a proprietary binary database format, in which case there has to be a compatible database platform at the NAO’s end of the transaction in order for them to get at the data, or you have to send the information over in a common text-based format, such as a comma-separated values (CSV) file, an SQL ‘dump’ (a text file containing the string of SQL commands required to recreate the database on another compatible database server) or, if you’re really up to date, an XML document.
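To make the difference concrete, here’s what one and the same (entirely made-up) record might look like in each of the three text formats – illustrative fragments only, not HMRC’s actual record layout:

CSV: 1,Joe,Bloggs,AA1 1AA
SQL dump: INSERT INTO claimants VALUES (1, 'Joe', 'Bloggs', 'AA1 1AA');
XML: <claimant id="1"><firstname>Joe</firstname><lastname>Bloggs</lastname><postcode>AA1 1AA</postcode></claimant>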
The reference to the NAO having to ‘decipher’ the information militates very strongly against the possibility of the transfer being carried out via a proprietary binary database format – if that were the method in use then the NAO would simply import the data files into its database server and set its own techs to work on extracting the relevant information via the database platform. ‘Decipher’, in this context, makes sense only if a text-based format is being used, and in particular one that does not contain information about the structure of the data. That almost certainly rules out XML, and suggests that an SQL dump is unlikely to have been used either, as the NAO would not have had to decipher the data itself, only the SQL commands that create the database structure needed to house it.
(The comments also strongly suggest that the level of technical understanding amongst those making the key decisions here is nowhere near what it should be.)
Moving on, HMRC received a reply on the same day that sheds rather more light on the technical situation.
The key point is (c), where the NAO asks whether the data file will ‘have initial headings’ – that clearly rules out the possibility of the data being in the form of an SQL dump (and XML as well), or the use of a binary format. What is being referred to here is a text file, most likely a comma-separated values file.
I suppose I should, at this point, explain what a comma-separated values (CSV) file is and how it ‘works’.
A CSV file is a text file containing a number of ‘records’, each of which is made up of a set number of ‘fields’, each of which is a single piece of information. Each field, and each record, is separated from the next using a specific character to indicate the end of a field and the end of a record. Let’s use a simple example of what a record looks like.
First Name, Last Name, House Number, Street, Town, Post Code
Each item of data (each field, i.e. First Name, Last Name) is separated from the next using a comma, and there’s a carriage return/line feed (the character(s) you get when you press the enter key while typing) at the end of each record.
When you import a CSV file into a programme that handles this format – a spreadsheet like Excel or a database programme like Access – these two special characters (commas and carriage returns/line feeds) are used by the programme’s import routines to split the data up into discrete records, and to split the data in each record up into its constituent fields.
By long-established convention, the very first record in a data file in this format can be used to send information which identifies the nature of the contents of each field, so if you have a simple address file that looks like this:
Record Number, First Name, Last Name, House Number, Street, Town, Post Code
1, Joe, Bloggs, 10, Any Road, Any Town, AA1 1AA
2, Fred, Smith, 11, Another Road, Another Town, XX1 1XX
The first line (record) gives you the headings that tell you what type of information should be in each field of the records that follow – and if you don’t send that information then the end user will have to add it themselves in order to make sense of the data, hence the question about initial headings. You would only ask this question if what you were being sent was a text file in CSV or a similar format – in the other possible formats, the information about what each field contains is ‘built’ into the file and need only be read by the relevant software.
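To make that concrete, here’s a minimal sketch in Python – my choice of tool purely for illustration, with the file name taken from the made-up address example above – showing how a program uses the heading row to label every field automatically:

import csv

# Read the example address file; DictReader uses the first record
# (the heading row) to label every field in the records that follow.
# skipinitialspace tolerates the spaces after each comma.
with open("addresses.csv", newline="") as f:
    for row in csv.DictReader(f, skipinitialspace=True):
        # Each row is now a dictionary keyed by the headings, e.g.
        # {'Record Number': '1', 'First Name': 'Joe', ...}
        print(row["First Name"], row["Post Code"])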
Why is this relevant?
Well, because if a CSV file or a similar text-based file was being used then, provided one knows which special characters are being used to delimit fields and records, the various tasks that the NAO is talking about in this email – i.e. segregating certain information and dropping other items (e.g. bank information) – could be undertaken on a reasonably well-specified desktop PC by anyone who is fairly competent in working with common spreadsheet or database applications, like Microsoft Excel or Access, both of which can import this format.
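As an illustration of just how small a job that is – and this is only a sketch, with invented file and field names rather than anything resembling HMRC’s actual record layout – dropping the sensitive fields from a CSV file takes a few lines of Python:

import csv

# Copy a CSV file across, keeping only the fields the recipient says
# it needs and dropping everything else (bank details and so on).
# Field and file names here are invented purely for illustration.
wanted = ["Child Benefit Number", "Last Name", "Post Code"]

with open("full_dump.csv", newline="") as src, \
     open("sanitised.csv", "w", newline="") as dst:
    reader = csv.DictReader(src, skipinitialspace=True)
    writer = csv.DictWriter(dst, fieldnames=wanted, extrasaction="ignore")
    writer.writeheader()
    for record in reader:
        writer.writerow(record)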
The biggest obstacle facing anyone handed this task is no more than the sheer amount of data that would need to be imported/processed – a single file containing 25 million records would be too big for Excel or Access (using its standard database engine) to swallow in a single sitting.
However, if we assume the data filled the two CDs completely (1.3Gb) and a typical zip compression rate of around 80-85% for text – giving around 6.5-8Gb of data in total to be processed – then using Access as a front end to a Microsoft SQL database, or MySQL (which is free), would be an option, as would using a relatively cheap piece of text-processing software to split the file into ‘bitesize’ chunks of a size that could then be imported into Excel or Access. And that’s assuming you stick with Microsoft, as this is one of those jobs that the various limitations of Windows and M$’s other applications make tricky but which would actually be relatively easy to pull off on a Linux box.
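In fact you wouldn’t even need to buy the text-processing software: splitting a huge CSV file into chunks is a dozen lines of Python. A sketch, assuming a heading row is present, that no field contains an embedded line break, and a chunk size picked to suit whatever your spreadsheet or database will swallow:

# Split one enormous CSV file into numbered chunk files small enough
# for a desktop application, repeating the heading row in each chunk.
CHUNK = 500_000  # records per output file - an assumed, tunable figure

with open("full_dump.csv") as src:
    header = src.readline()
    out, part, count = None, 0, CHUNK
    for line in src:
        if count >= CHUNK:  # current chunk is full; start a new file
            if out:
                out.close()
            part += 1
            out = open(f"chunk_{part:03}.csv", "w")
            out.write(header)
            count = 0
        out.write(line)
        count += 1
    if out:
        out.close()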
This is a job that could be done on a decent desktop PC by a competent junior IT/application support technician in a couple of days (allowing for processing time) at very little cost in real terms, and – coming from the other direction – unless we’re dealing with a completely bespoke system, it’s also a job that EDS could have done as an overnight batch run for very little cost. If the live data is stored using a mainstream corporate database platform, then I’d estimate no more than half a day’s coding/testing time to write the batch job and query(ies) necessary to extract the precise (and sanitised) data that the NAO indicated it required.
All of which brings me to the critical email.
This is the now infamous email in which HMRC bleats about the additional cost of providing the precise information that the NAO requires (never mind the cost to the NAO of it doing its own data extraction on such a large dump file) and I’m just not convinced.
There is something seriously fucked-up in an IT services contract in which the provision of audit information is a chargeable extra – the correct response from HMRC, when EDS responds to such a request with ‘it’ll cost you £x if we do this’, is ‘it’ll cost you the contract if you don’t’ – and, based on what I can glean from these emails, this should not be an expensive task unless you’re dealing with a completely bespoke system and need a custom programme written specifically to do the scan and extract.
From what I can see, part of the investigation into the circumstances in which this data was lost needs to look very carefully at HMRC’s contract with EDS – audit information should never be a chargeable extra – and at whether HMRC actually enquired as to the cost of sanitising this data. The Telegraph has gone on record as suggesting that the actual cost to HMRC would have been in the region of £5,000:
The Daily Telegraph – which has revealed how the government had ignored repeated warnings over the safety of personal data – has established that HM Revenue and Customs officials could have avoided the scandal if they had paid £5,000 for computer data supplier EDS to edit the data on the discs.
Five grand still seems a bit on the high side, but compared to the potential costs if the CDs did get into the wild it’s a pittance.
Fuck up from start to finish doesn’t come close.
If I can be flippant for a second…
Two words: Comic Sans.
That’s all.
It just makes you want to weep. You feel (somewhat) sorry for Darling: the amalgamation of Customs and Excise with the Inland Revenue was done by Brown and was a grade-A cock-up from start to finish. A quick review: HMRC responsible, due to structural failings, for almost half the nation’s personal details going missing along with some bank records; the FSA responsible, with the Bank of England, for failing to prevent the first run on a bank in 160+ years – something that probably wouldn’t have happened under the old system.
The country doesn’t need ‘Vision’, or ‘Reform’, or ‘Modernisation’ primarily. It needs competence. Something that appears to be sorely lacking.
“Two words: Comic Sans.”
Is it sad that this is what scared me most too?
Well, at least it’s been exported from a mainframe system. My experience, at least with local government IT departments, has been worse –
A few years ago, I bumped into some council chaps who apparently maintained their local CSA info in an Access 2 database. It was a bit embarrassing actually – the techie who admitted it could have been killed by the looks he got from his coworkers (line manager types).
The thing that strikes me is the arse-covering tone of the correspondence. In true NL fashion, no-one wants a) to be responsible for anything, or b) to pay for it.
I too noticed the use of Comic Sans.
A small point, Unity, is that the threat of “do this or lose the contract” may be empty if HMRC is run the same way as the Audit Commission – that is to say that contractors are in all the key positions in the department, and if the Audit Commission were to try that on, the company providing the contractors could simply leave, crippling the department.
This is, of course, also a worrying situation to be in for the government, but I’m just saying that such threats in this context may not have carried any weight.
I’ve not seen any serious discussion of Data Protection Act issues anywhere yet. My amateur stab at it is:
By default a government department cannot transfer personal data to another department (because s63(2) of the DPA makes each govt dept a separate “data controller”). But transfers are allowed if required by law, and I guess the National Audit Office has legislation allowing it to require data transfers. But, because the NAO explicitly said it did “not need address, bank or parent details”, I suspect that removed the legal authority for HMRC to transfer that data to the NAO. So by giving the NAO that data regardless, HMRC has probably infringed its duties under the DPA, as it had no lawful right to transfer that data.
However, government departments cannot be prosecuted (they’re exempted by s63(5) of the DPA), though individual civil servants acting without consent can be – and in this case it seems the junior civil servant did have consent.
So I’d guess HMRC will get a severe ticking off by the Information Commissioner, but there will be no prosecutions.
What’s odd is that it’s EDS, who I thought lost the main HMRC contract to CapGemini a couple of years ago. Still, they’re different shades of shit, really.
Oh, and someone really ought to point out to the Tories that they started all this outsourcing lunacy, so they should really make it clear whether they want to insource it all again or if they’re just being opportunist.
It was the Tories that started outsourcing the Inland Revenue’s (as was) computing, thus destroying quite a comprehensive system – one that meant you didn’t have to enter your main employment income every year – as well as losing the IT-competent staff. EDS are at the root of a lot of the IT mess the Govt has, even if contracts get changed. It is fundamental that there are Civil Servants in every department who can both understand the software and access our data; otherwise the government will eventually be seen as both incompetent and careless, which is what is happening now. It really should be simple to take only specified fields of a random sample of 1500 records from a database.
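For what it’s worth, it is simple: pulling a fixed-size random sample out of a file of any size is a textbook one-pass job (‘reservoir sampling’). A minimal sketch in Python – the file name is invented, and it assumes the unwanted fields have already been stripped out:

import csv
import random

# Reservoir sampling: keep a uniform random sample of SAMPLE records
# from a CSV file of any size, reading the file only once.
SAMPLE = 1500

with open("child_benefit.csv", newline="") as f:
    reader = csv.reader(f, skipinitialspace=True)
    next(reader)  # skip the heading row
    sample = []
    for i, record in enumerate(reader):
        if i < SAMPLE:
            sample.append(record)
        else:
            j = random.randrange(i + 1)
            if j < SAMPLE:
                sample[j] = record

# 'sample' now holds (up to) 1,500 records drawn uniformly at random.
print(len(sample))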
What is scary is that there are CD/DVD writers everywhere – and presumably USB ports too – on govt computers. When PCs were first used in offices, people used to put locks on the floppy drive so that all of 1MB couldn’t go missing. How come locks haven’t been put on devices capable of writing 5GB or so?
But every cloud has a silver lining. As there are so many excess NI numbers in circulation – ISTR about 10 million were identified in 1998 – if half the country is on those CDs, perhaps some kind fraudster could help the govt and identify all the NI duplicates at the same time…:) Maybe the taxpayer could save some housing benefit.
Delighted to read here confirmation of my thoughts the moment I read the media reports. I used Silicon Office relational databases from 1982 to 1994 to run my company, and could extract any data I wanted in any format and on any basis I wanted at the press of a few keys. I could set parameters for even the most complex requirements within minutes, and see results minutes later, even with 10MHz operating speeds and 25Mb of system memory.
The acting head of HMRC told a Commons Select Committee that it would have cost not
This is a letter I sent to The Times on the 19th, not published. Now sent to many papers.
24 Dec 07
The Editor,
Sir,
There are many curious aspects of the lost Revenue and Customs CDs containing personal information on twenty-five million people:
As the National Audit Office required only one thousand randomly chosen names on which to carry out spot checks, why didn’t Revenue and Customs extract and send just those, to minimise any problems if they went astray? The Government have, after all, been doing much the same thing for Premium Bond holders since the valve-powered computers of the 1950s.
Dave Hartnett, acting head of HM Revenue and Customs, told the Public Accounts Select Committee that it would have cost in the region of