Tuesday, 17 July 2007

Untangling Two Different Copies of the Same Excel Spreadsheet

The other week I was on the JISC infoNet stand at a regional conference and one of the delegates who came over to see what was new, said, “What happened to all those technical articles you used to write for NILTA News? Ages since we had one of them!”

Well, yes, I have to admit that I haven’t done an awful lot of programming over the last few years and any knowledge of database language that I have is probably a little outdated now. Database products seem to be updated regularly! However, not long after that conversation, I had a phone call from an ex-colleague who was now working for a local government organisation. She had been put in charge of some records that were stored on an Excel spreadsheet.

The problem was that two clerks had copies of this spreadsheet, held separately on their PC hard drives and at some point both of them had made update amendments to the data and she wanted some way of comparing the two files and sorting out the most up-to-date information.

I was able to help by suggesting a way forward, but unfortunately in a case like this there is no way of automatically determining which of two different records within a larger file holds the correct information. The solution I suggested would eliminate all those records that were the same in both files and allow her to concentrate a manual checking of the data on those that didn’t match.

By far the easiest way of matching records is to have both sets within the same spreadsheet, but I wanted to ensure that neither of the two copies were compromised. Therefore step one was to make a copy of spreadsheet “A”, leave a blank column to the right of all the data, then paste in all the data from spreadsheet “B”.

We now had two sets of data side by side – but of course not necessarily with data records in the same order as perhaps some rows had been deleted, some had been added, maybe in both but probably only in one of the sheets.

So we couldn’t be certain that any record in sheet “A” would be on the same row as a corresponding record in sheet “B”. We needed a way of highlighting which records existed in both sheets with all the same data in each of the fields. They were the only records that she could trust and she would then have to take the rest and make enquiries to ascertain which record was current.

To make things easier we created a new field for both sheet “A” records and sheet “B” records that held all the data from every field. This was fairly easy using the “&” sign to concatenate one field with another.

Using &" "& allows us to include a text space character between the two – not necessary for our purposes but it looks better! The formula =A4&B4&C4 would have produced the result “FredBloggs52 Sonder Avenue”.

The next step is to insert another field into sheet “A” and use that to compare each row to any row from sheet “B”. Remember that the corresponding record to row 4 in sheet “A” could be absolutely anywhere in sheet “B” so comparing row 4 in sheet "A" only to row 4 in the sheet "B" would likely cause disaster!

What we did was to compare the new field that held every bit of data in sheet “A” to the corresponding new field in sheet “B” but checking the entire column in sheet “B” by using a range. J4:J1025 would check each row from 4 to 1025 for a match. The syntax for the check is – persons of a nervous disposition look away now…


This is the formula as given (if you are able to find it!) using Excel’s Help facility. It’s not the best solution for reasons I’ll explain later, but before I add to this already complicated formula, let's break it down step by step to see what it is doing.

To understand any formula, start within the brackets and work outwards. So, bit by bit…

“MATCH” is the innermost command and means what it says. In the inner brackets are the bits that tell it what to match.

“D4” means compare or match the contents of cell D4

“$J$4:$J$1025” means compare against the range J4 to J1025. The $ signs are there so that as we copy the formula down, although D4 will change to D5, D6 and so on, the range will stay constant. We don’t want to compare D1025 to a range of J1025:J2042, we want the range to stay constant.

“FALSE” …I’m not really sure what this does at this point! By default if the formula finds a match it will show the row number from the range in which it found the match. From the RANGE – not the spreadsheet! In our example with the range starting in row 4, if it returned 26 we would find the matching record on row 29 of the spreadsheet! If it doesn’t find a match it displays a “Value Not Found” error.

That’s fine but it’s not elegant. Far better to display the word “TRUE” if it finds a match and “FALSE” if it doesn’t. So that is what the ISNA(…) bit does. Except that it does it in reverse – it displays “TRUE” if a match is NOT found and “FALSE” if it does find a match. That is the Microsoft solution. You could head your column “Missing from sheet B” and there’s a job done. You need now to compare sheet B to sheet A of course, because new rows may have been added so now we add a new field to the far right and enter the formula:


which does the same thing only the other way around! It matches row by row sheet “B” to a range in sheet “A”.

I said the solution wasn’t elegant. It’s still not. That’s simply because we have that double negative. We are looking for matching fields but the results of the formula shows “FALSE” if we find one!

So let’s add a simple command and another set of brackets to turn it round.


Now the “TRUE” and “FALSE” are the more logical way round and we can head the column “Match Found”.

There’s still a lot to do here because the results only tell us whether the whole data was found or not. The matching record could exist but may have some amended data – a different address field for instance. It is obvious to a human that Fred Bloggs living at the same address as Frederick Bloggs in the other sheet should be a match. It will not be a match to a computer!

We have an entry for Marie Stokes in each sheet. They may or may not be the same person as the entries have different addresses. Even if it is the same person there is no way of telling which address is correct! So from each side we know that all the records that show “TRUE” are common to both spreadsheets but now need to manually examine all other rows to see if the same name exists with a different address or whether the name appears only in one sheet or the other.

So now we can sort the records based on column D and then copy all the records labelled "True", from Spreadsheet "A" only, into a new worksheet. We can then delete those records from Spreadsheet "A" leaving only those records which are unique to Spreadsheet "A".

Next we can sort records based on column J and delete all the records labelled "True" from Spreadsheet "B" as, being previously matched in Spreadsheet "A", they have already been copied to the new worksheet.

This leaves only those records in either sheet that need to be manually examined for any similarities such as the Fred/Frederick Bloggs. Where there are two different addresses for what seems to be the same person, you will have to do some research as it is impossible to say which is the most up-to-date address, or even whether it is the same person.

These records can be added to the new worksheet as and when they are verified. The new worksheet now becomes the master copy and should be labelled and treated as such. Spreadsheets "A" and "B" should now be made read-only and archived or destroyed in accordance with your Records Retention Policy. You do have one...don't you?

Moral of the tale - don't ever have more than one updateable copy of anything, whether it is a database, spreadsheet or even a set of minutes. Use Records Management techniques to version control documents and if multiple copies are needed in electronic format designate a master copy and make all others read-only.

JISC infoNet has a number of online resources available on the subject of Records Management at http://www.jiscinfonet.ac.uk/records-management

No comments: