I love playing those spot the difference games. You know exactly how many differences there are, you know that it’s only a matter of time before you find them all, and you know that even if you run out of time or patience you can cheat and find the last one you were looking for. It’s mindless entertainment.
At work, however, when I am looking at two documents that are almost exactly the same, the last thing I want to do is play “spot the difference”. I don’t know exactly how many differences there are, I don’t have the time to find them all, and there is no cheat page to help me find the last one. It’s anything but entertainment, and unfortunately I have to do it a lot. Good thing there are tools to help me do it! If only they were perfect…
The need to see the differences between almost identical files comes up most often with source code, but it will become more common for other kinds of managed artifacts, such as requirements, documentation snippets and test case descriptions, as more teams adopt tools that properly version them. When do you use a differencing tool?
- When you want to know why a change was made; and
- When you want to know what else was changed by the developer/analyst at the same time.
A difference tool, in conjunction with other information, helps you better understand the evolution of your artifact.
Note that the tool by itself provides only a superficial understanding of the motivations and effects of the changes. Ideally, when a user changes something, the change is linked to a change order, which is itself linked to a feature request or a defect report. This provides the additional context that is often lacking when the user checks in a change. Version control tools give developers the opportunity to annotate a change as they check in the file, but often they don’t, or if they do, they don’t provide enough explanation for you to truly understand their motivations, design choices, and other background information. It is important that you be able to trace a change back to all the process artifacts that led to it so you get the full picture. Here is a hypothetical example:
Starting with the task that resulted in version 1 of the file being created in the first place, you can see that a couple of defects were posted, the first one to add some logging and a link to a help page, and the second one a null pointer exception with steps to reproduce. These are very simple examples, but those defects, tasks, change orders and feature requests often hold a lot of useful data that help the user really understand why the particular changes were made.
The other context that is lacking in certain version control tools is the ability to lump together changes in more than one file, where all changes taken together constitute the entirety of the fix (in the case of a defect) or the design (in the case of a new feature). These are called change sets or change packages. We’ll talk about that some other day.
There’s no need to talk about typical differencing tools: you know what they do and how they work. The typical tool is ideal if you are dealing with multi-line artifacts where the number of changed lines is small compared to the total number of lines in the file. It is not as helpful if the lines are so long, or the changes so numerous, that a high proportion of the lines differ. It is completely useless if the file is binary. Some differencing tools know how to interpret files in various formats and can show the differences between the visual representations of each file. Obvious examples of binary-formatted files include Word and Excel documents, at least those from older versions of the products. More current versions store their content as XML, which is text, although the XML is packaged inside a ZIP archive and has to be unpacked before a text-based tool can compare it. XML is interesting because moving a block of XML from one location to another may or may not actually change the meaning of the file, depending on the schema (the instructions for interpreting the XML). In fact, if the schema has been changed to specify that a particular tag should be dropped from the file, and it is dropped, then you could plausibly claim that the file did not really change! Its meaning is the same from the perspective of the schema.
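To make the XML point concrete, here is a minimal sketch of a simpler case than the schema-driven one: two snippets whose text differs but whose meaning does not, because attribute order carries no meaning in XML. The snippets and the helper names `textual_changes` and `same_meaning` are made up for illustration. The sketch assumes Python 3.8 or later, where the standard library offers `xml.etree.ElementTree.canonicalize`.

```python
import difflib
from xml.etree.ElementTree import canonicalize

# Two hypothetical snippets: same element, attributes in a different order,
# and extra whitespace inside the tag.
old_xml = '<item id="18244" status="open"/>'
new_xml = '<item status="open"   id="18244" />'


def textual_changes(a, b):
    """Line-based comparison, roughly what a typical differencing tool does."""
    diff = list(difflib.unified_diff(a.splitlines(), b.splitlines(), lineterm=""))
    body = diff[2:]  # skip the '---' / '+++' file header lines
    return [line for line in body if line.startswith(("+", "-"))]


def same_meaning(a, b):
    """Compare canonical (C14N 2.0) forms instead of the raw text."""
    return canonicalize(a) == canonicalize(b)


if __name__ == "__main__":
    print(textual_changes(old_xml, new_xml))  # the line-based view: one line replaced
    print(same_meaning(old_xml, new_xml))     # the canonical view: no change
```

Canonicalization only covers what XML itself treats as insignificant. Deciding whether a moved or dropped block matters still requires knowledge of the schema, which this sketch does not attempt.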
This brings up a more interesting problem, which is the interpretation of changes. A raw Visio or CAD document may indicate that there is some sort of change, but what really matters to the user is the fact that two components in the drawing are now 0.1 inch closer to each other in the second document. I don’t know of tools that do this sort of “high level” differencing, but clearly there is a need for it. The same goes for a tool that will difference two video clips, or two sound clips. The possibilities are endless. The biggest challenge for this type of differencing is in the presentation. How exactly do you indicate that a chord in one music clip is a minor chord while in the other clip it is a diminished seventh? Or that the second JPEG image has an extra face in the group shot that wasn’t there in the first? You can imagine how to present the change in the most useful way for any given situation, but to code the general case is a daunting challenge.
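For the drawing example, one narrow case is easy to sketch. Assume, hypothetically, that component positions have already been extracted from each version of the drawing into a simple name-to-coordinates mapping. The component names, the coordinates and the function `distance_changes` below are invented for illustration. Extracting positions from a real Visio or CAD file is the hard part and is not shown.

```python
import math
from itertools import combinations

# Hypothetical component positions, in inches, for two versions of a drawing.
version_1 = {"pump": (1.0, 2.0), "valve": (4.0, 6.0)}
version_2 = {"pump": (1.0, 2.0), "valve": (3.94, 5.92)}


def distance_changes(old, new, tolerance=1e-6):
    """Describe how the distance between each pair of shared components changed."""
    changes = []
    shared = sorted(old.keys() & new.keys())
    for a, b in combinations(shared, 2):
        before = math.dist(old[a], old[b])
        after = math.dist(new[a], new[b])
        delta = after - before
        if abs(delta) > tolerance:
            direction = "closer" if delta < 0 else "farther apart"
            changes.append(f"{a} and {b} are now {abs(delta):.2f} inch {direction}")
    return changes


if __name__ == "__main__":
    for line in distance_changes(version_1, version_2):
        print(line)
```

This reports the change in the user's terms rather than as altered bytes or coordinates, but only for one kind of change in one kind of document, which is exactly why the general case is so daunting.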
Along those lines, artifact differencing is not limited to just text: managed artifacts have relationships that can change from one version to the next. This is a similar visualization challenge, though perhaps a little easier to do. The challenge is this: How do you tell the user that the “Satisfied By” relationship had 3 items in the first version (numbered 18244, 19382 and 20467), and 2 items in the second (numbered 19382 and 19440)? The numbers are not meaningful, so you want to show some text that represents the other items, but which text do you show? How do you highlight that the order changed but otherwise the list is the same?
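One possible answer, sketched under assumptions: treat the relationship as an ordered list of item IDs, report removals and additions by title rather than by number, and report a pure reordering separately. The item IDs are the ones from the example above. The titles and the function `describe_relationship_change` are hypothetical, and picking the title as the text to show is only one choice among several.

```python
# Hypothetical titles for the related items; only the IDs come from the example.
titles = {
    18244: "Log failed logins",
    19382: "Link to help page",
    19440: "Handle missing user record",
    20467: "Retry on timeout",
}


def describe_relationship_change(name, old_ids, new_ids, titles):
    """Summarize how an ordered relationship list changed between two versions."""
    old_set, new_set = set(old_ids), set(new_ids)
    lines = []
    for item in old_ids:
        if item not in new_set:
            lines.append(f'{name}: removed "{titles.get(item, item)}" ({item})')
    for item in new_ids:
        if item not in old_set:
            lines.append(f'{name}: added "{titles.get(item, item)}" ({item})')
    # Same members in a different order: report it as a reordering only.
    if old_set == new_set and list(old_ids) != list(new_ids):
        lines.append(f"{name}: same items, order changed")
    return lines


if __name__ == "__main__":
    for line in describe_relationship_change(
        "Satisfied By", [18244, 19382, 20467], [19382, 19440], titles
    ):
        print(line)
```

For the example in the text, this reports two removals (18244 and 20467) and one addition (19440). If the two versions held the same items in a different order, it would report only the reordering.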
The old days of
>>> Line 256: Added
… line 1
… line 2
… line 3
<<< Line 312: Removed
… line 1
… line 2
are over. We don’t have time to think about what that means and how those changes fit into the overall context. That’s what computers are for. The next generation of differencing tools will do more than a simple line-by-line comparison of text files. They will
- interpret the differences and present them in terms that are more meaningful to a human;
- incorporate additional information to help explain why the changes were made; and
- present differences in non-primary text fields and relationship fields in ways that are meaningful to a human.
Now if only you could make some money with such a program. Ay, there’s the rub. But it’s not impossible! Stay tuned.

