Occasionally I have to shift around my financial records from the early stone age – 1997 and before. My wife gets on my case to get rid of the paperwork but I never seem to get around to it. It doesn’t take up much room so why bother? In the back of my mind I can imagine a social anthropologist finding the files in an attic some 200 years from now and marveling at the quaint practices of the late 20th century, such as the transmission of information on paper, the fact that I had to bring my car to a human to repair it, and the bizarre notion that I told the government how much money I made and hence how much tax I owed, rather than the government telling me. So if that’s my claim on posterity, why would I want to throw it all away?
Most organizations are not particularly concerned about posterity and yet they seem equally reluctant to throw out any of their data. Purging costs time and money, and it introduces risk. What happens if you remove data that you actually need at some point? Of course you will have made backups, but recovering the data would be so difficult that, in all but the most dire circumstances, it is gone for good. Here are the pros and cons of purging data:
Pro
- removes data that would otherwise show up in search results as noise
- requires less disk space
- puts less demand on database server CPU
- uses less bandwidth shifting around useless data
Con
- may purge data you actually need
- takes time to plan and execute the purge
- no customer is paying you for it
The pros are easy to dismiss: we invoke Moore’s law and say that the incremental cost of disks, bandwidth and CPU is so small that purging is not worth the effort. And even if your organization cared about a loss of efficiency due to an overabundance of data – which it doesn’t – it figures that some sort of data filter will do the trick. On the con side, the risk of losing data is small and probably doesn’t matter very much. The bottom line is, frankly, the bottom line. No money, no purgey.
Does it matter if your records go so far back that you couldn’t possibly use the information? For some data, it actually does matter. Take feature requests, for example. Feature requests may be posted by your customer support team on behalf of customers. They may also be posted by your product management team, or by other stakeholders. Over time, some feature requests become obsolete because technology has evolved to the point where the feature is no longer needed. Also, over time you will accumulate clusters of feature requests that address various aspects of the same problem. Occasionally it is a good idea to consolidate all the old information into a single feature request. But is it necessary to actually remove those old feature requests from the system? No. You can put them into a Retired state and have a rule that feature requests in this state are simply ignored by all operations in the system. Ignorance is bliss.
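Here is a minimal sketch of the Retired idea in Python. The names (`FeatureRequest`, `State`, `active`, `consolidate`, `merged_into`) are my own inventions for illustration and do not come from any particular tracking system. The point is only that retiring is a state change plus a filter, not a delete.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class State(Enum):
    OPEN = "open"
    RETIRED = "retired"


@dataclass
class FeatureRequest:
    id: int
    title: str
    state: State = State.OPEN
    merged_into: Optional[int] = None  # hypothetical pointer to the surviving request


def active(requests):
    """Return the requests that regular operations should see."""
    return [r for r in requests if r.state is not State.RETIRED]


def consolidate(requests, old_ids, survivor_id):
    """Retire the old requests and point them at the consolidated one.

    Nothing is removed from the list; retired requests are only hidden
    from anything that goes through active().
    """
    for r in requests:
        if r.id in old_ids:
            r.state = State.RETIRED
            r.merged_into = survivor_id
```

If every search, report and dashboard goes through something like `active()`, the retired requests stop showing up as noise while the records themselves stay put.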
Another example is the record of historical changes to a file in your source code repository. Do you really care that 18 years ago somebody named Leonard Zither – who of course no longer works for your company – used an encryption algorithm that was adequate at the time but is no longer suitable? Mr. Zither is now comfortably enjoying his retirement in Phoenix and will probably not take your call if you want clarification on the choices he made. So the thing to do is to mark all lines in the code that are older than, say, 10 years as “old”, meaning that no further information is shown about why they are there. Also, versions of the files that are older than 10 years can be removed from your view of the evolution history of the file. I don’t think you want to disable all possible access to the old data. Just disable it for regular user access.
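To make the 10-year rule concrete, here is a rough sketch. It assumes each revision carries an author and a date; `Revision`, `is_old`, `visible_history` and `blame_label` are hypothetical names, not the API of any real version control tool. The cutoff is approximated in days, which is close enough for a rule of thumb like this one.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Revision:
    author: str
    date: datetime
    message: str


def is_old(date, now, years=10):
    """True if the date falls before the cutoff (years approximated as 365.25 days)."""
    return date < now - timedelta(days=365.25 * years)


def visible_history(revisions, now, years=10):
    """The history a regular user sees: old revisions are hidden, not deleted."""
    return [r for r in revisions if not is_old(r.date, now, years)]


def blame_label(revision, now, years=10):
    """Label for a line of code: the author, or just "old" past the cutoff."""
    return "old" if is_old(revision.date, now, years) else revision.author
```

The full list of revisions is still there for anyone with a reason to dig; only the regular view is filtered.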
There is a class of data which I would call “temporary data”, which has little value once it has been created and consumed. This includes intermediate output files, repository snapshots of checkpointed data, and bits of one-off code and data that nobody uses any more. I see no particular need to hang onto that kind of data. In fact, I’d make that a stronger statement and recommend that it be removed as soon as you know that you no longer need it. The longer you wait, the harder it becomes to decide whether it is worth retaining.
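One way to remove temporary data as soon as it has been consumed is to never let it outlive the code that needs it. The sketch below uses Python’s standard `tempfile.TemporaryDirectory`; the function `build_report` and the file name `stage1.txt` are made up for the example.

```python
import tempfile
from pathlib import Path


def build_report(rows):
    """Write an intermediate file, consume it, and let it vanish on exit."""
    with tempfile.TemporaryDirectory() as scratch:
        intermediate = Path(scratch) / "stage1.txt"
        # Produce the intermediate output...
        intermediate.write_text("\n".join(sorted(rows)))
        # ...and consume it while it still exists.
        result = intermediate.read_text().upper()
    # The scratch directory and everything in it are deleted at this point.
    return result, intermediate
```

The returned path is only there so you can confirm that the file is gone; nobody has to decide later whether it was worth keeping.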
There is a final class of data which grows even faster than Moore’s law can accommodate and is of marginal utility, so it is worth considering whether it should be purged. This class includes emails, IM conversations, tweets and log files. These are firmly under the purview of the IT department and do not really belong in the information repository on which your company is built. They are not the “crown jewels” of your organization. Let your IT department handle them, as it already does.
So the answer seems to be that the right time to purge data from your corporate repository is… never! In fact, there is some benefit to having even extremely old data in your repository. Some enterprising author may want to dig up information about the earliest days of development of your product, which after 20 years has become a huge international hit. The data in your repository may some day be the digital equivalent of Thomas Edison’s notebooks or Michelangelo’s rough drafts, which are enormously important historical records. There are notable exceptions, including temporary files and old communication logs. But for the corporate repository, which holds the most important work artifacts in your system, I say you should keep it all.
