Wouldn’t it be nice if we lived in a world in which there were no conflicts? I say it’s worth working towards that lofty goal. Today I’ll “act locally” and talk about how to avoid conflicts when merging branched files back into the main line of development. Today’s discussion focusses mostly on source code.
In my previous post I talked about how difficult it is to merge a developer’s changes into the development line when somebody else checked in his or her code first and there are conflicts between the two sets of changes. The situation can get quite tedious for certain high-traffic files. Sometimes it gets so bad that developers will do anything to avoid these files. Or, when their changes involve multiple files, they check in all the other changes first (including perhaps kluges that will have to be removed later), then wait for the file to become available, then get in as quickly as they can to make their changes. Everybody wants to get their job done, and frankly most of the time they don’t give a hoot whether they will inconvenience other developers. It can get ugly working on a project that has these problems.
Which leads to the question, “are there ways to avoid conflicts?” It turns out that yes, there are, but there is no perfect solution. The obvious one is to provide an explicit locking mechanism in the repository. This is called pessimistic locking. It’s pessimistic because it operates under the assumption that somebody is going to change the same file so it’s best to prevent that from happening. It’s a defeatist attitude. If the artifact is locked by another user you can view the current version but you can’t check it out. Only when the current user has checked in a new version (or unlocked it, having realized he didn’t have to make changes) can you lock it yourself. When you lock it, you may lock only the version that the previous locker checked in. In source code control systems, often this causes problems if the previous locker checked in a whole bunch of files at the same time: You must grab the current version of those other files too, so that your code will compile. But this is a minor inconvenience. The worst problem is simply the fact that you must wait for the artifact to become available before you can check it out. Sometimes, you really have nothing better to do. You just wait.
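To make the lock-then-wait behaviour concrete, here is a minimal sketch of pessimistic locking as a toy in-memory repository. The class and method names (`PessimisticRepo`, `check_out`, `check_in`, `unlock`) are made up for illustration and don’t correspond to any particular SCM product.

```python
class LockedError(Exception):
    """Raised when a file is locked by somebody else."""


class PessimisticRepo:
    """Toy in-memory repository with exclusive per-file locks (hypothetical)."""

    def __init__(self):
        self.versions = {}  # path -> list of checked-in contents
        self.locks = {}     # path -> user currently holding the lock

    def add(self, path, content):
        self.versions[path] = [content]

    def view(self, path):
        # Anyone may read the current version, locked or not.
        return self.versions[path][-1]

    def check_out(self, path, user):
        holder = self.locks.get(path)
        if holder is not None and holder != user:
            raise LockedError(f"{path} is locked by {holder}")
        self.locks[path] = user
        return self.versions[path][-1]

    def check_in(self, path, user, content):
        if self.locks.get(path) != user:
            raise LockedError(f"{user} does not hold the lock on {path}")
        self.versions[path].append(content)
        del self.locks[path]

    def unlock(self, path, user):
        # Release the lock without checking in a new version.
        if self.locks.get(path) == user:
            del self.locks[path]
```

In this sketch a second user who calls `check_out` on a locked file simply gets a `LockedError`, which is the programmatic equivalent of “you just wait.”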
Sometimes “playing nice” works against you. Let’s say you have a task that will likely involve checking out a dozen files. Rather than checking out all of them immediately, you’re a good guy, so you check out only the first four files. Later on, you check out four more, but in the meantime let’s say that three of them have changed. When this happens, you often must “resynchronize” your entire workspace to the current versions so that your code will compile. Then when you check out the last four, you may have to resynchronize again. Suddenly being a good guy is not so attractive.
The other kind of locking is called, reasonably enough, optimistic locking. It’s optimistic because the assumption is that nobody else will check out the same file. Under the optimistic policy you can check out any version of any file, and only when you check it back in do you have to merge your code with the version in the development line. This is nicer in some respects because you can get started on your work immediately (no waiting for somebody else to check in a file) and you can work for a longer period of time on your own local copy of the code. Your starting point on your local machine is good enough to get going. You don’t have to occasionally grab the entire repository in order to ensure that it compiles. The disadvantage is that when you attempt to check in your changes, you must merge them with everybody else’s changes. This is challenging and it may invalidate all your unit testing. If you have to redo your testing, it may take so long that you have to do another merge before you can finally check in your changes. Total yuck.
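The optimistic policy can be sketched the same way. In this toy version, which again uses invented names (`OptimisticRepo`, `MergeRequired`), check-out never blocks, and the repository only complains at check-in time if the file has moved on since you took your copy.

```python
class MergeRequired(Exception):
    """Raised when the file changed after the caller checked it out."""


class OptimisticRepo:
    """Toy in-memory repository with no locks (hypothetical)."""

    def __init__(self):
        self.versions = {}  # path -> list of checked-in contents

    def add(self, path, content):
        self.versions[path] = [content]

    def check_out(self, path):
        history = self.versions[path]
        # Return the version number the copy is based on, plus its content.
        return len(history), history[-1]

    def check_in(self, path, base_version, content):
        history = self.versions[path]
        if base_version != len(history):
            # Somebody else got in first: the caller must merge and retry.
            raise MergeRequired(
                f"{path} is at version {len(history)}, "
                f"but your copy is based on version {base_version}"
            )
        history.append(content)
        return len(history)
```

The sketch leaves out the merge itself, which is where the real pain is. It only shows the point at which the pain arrives: at check-in, after the work (and the testing) is already done.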
The fact that neither optimistic nor pessimistic locking is best suited for all scenarios has led some Brazilian researchers to propose a way to determine which is best on a per-file basis [1]. Most SCM systems enforce the locking as a policy, so this is not an option. But let’s just say you can do what they’re proposing. What they suggest is that when the cost of merging tends to be high but the risk of conflict is low, then you should use a pessimistic policy. When the cost of merging tends to be low but the risk of conflict is high, then you should use an optimistic policy. When both merge effort and risk are low then it doesn’t matter which policy you use. When both the merge effort and risk are high, then neither policy will do a good job. In that last case the recommendation is to consider refactoring the code. Seems like a good approach, frankly. We often face the situation at my job where we add new features to the product even when we know that we should be refactoring code, simply because developers can’t prove that the code needs to be refactored.
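The four cases above boil down to a small decision table. Here is a sketch of it; the function name and the reduction of “merge effort” and “conflict risk” to simple high/low flags are my own simplifications, not the metrics defined in the paper.

```python
def choose_policy(merge_effort_high: bool, conflict_risk_high: bool) -> str:
    """Pick a per-file concurrency policy from two high/low judgments.

    How you decide what counts as "high" is left open here; that part is
    an assumption of this sketch.
    """
    if merge_effort_high and conflict_risk_high:
        return "refactor"      # neither policy does a good job
    if merge_effort_high:
        return "pessimistic"   # merges are costly, conflicts are rare
    if conflict_risk_high:
        return "optimistic"    # conflicts are likely, merges are cheap
    return "either"            # both low: the policy doesn't matter
```

For example, `choose_policy(True, False)` returns `"pessimistic"`, and a file that scores high on both counts comes back as `"refactor"`.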
I would recommend another way to determine which files should be refactored: Simply look at the current version number of each file (that is, how many times it has been checked in); the files with the highest version numbers are the potential candidates. Most files in the repository are version 1. They simply get checked in and are never changed. But a few of them have a great many changes. The distribution of versions might look like this:
On the left is the distribution of version numbers plotted on a linear scale. The big peak on the left corresponds to all the files at version 1, and the little spike on the right corresponds to all files having version numbers off the scale on the right. This is an exponential distribution, which means that if you take the logarithm of the number of files at each version and plot it against the version number (the right diagram) you will get roughly a straight line. The long “tail” of the distribution to the right (the part that doesn’t fit well to the straight line, as denoted by the “refactoring threshold”) corresponds to the files that should be considered for refactoring.
I confirmed on one of the components at work that the distribution does indeed fit nicely to an exponential curve, and sure enough the files to the right of the refactoring threshold were the “usual suspects” – files that everyone modifies all the time. And yes, they need to be refactored.
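If you want to try this on your own repository, here is a rough sketch of one way to compute such a threshold. It fits a straight line to the logarithm of the file count per version number, then takes as the threshold the version number at which the fitted line drops below one expected file. That cutoff rule, the `min_count` filter, and the function names are my assumptions for this sketch; it is not necessarily how the threshold in the diagram was drawn.

```python
import math
from collections import Counter


def refactoring_threshold(version_numbers, min_count=2):
    """Version number at which an exponential fit expects fewer than one file.

    Fits log(count) = intercept + slope * version by least squares, using
    only version numbers shared by at least `min_count` files so that
    isolated outliers in the tail do not skew the line.
    """
    counts = Counter(version_numbers)
    xs = sorted(v for v, c in counts.items() if c >= min_count)
    if len(xs) < 2:
        raise ValueError("need at least two populated version numbers to fit")
    ys = [math.log(counts[x]) for x in xs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    if slope >= 0:
        raise ValueError("file counts do not decay with version number")
    intercept = mean_y - slope * mean_x
    # log(count) == 0 means one expected file; solve for the version number.
    return -intercept / slope


def refactoring_candidates(file_versions):
    """Paths whose version number lies beyond the fitted threshold."""
    threshold = refactoring_threshold(file_versions.values())
    return sorted(p for p, v in file_versions.items() if v > threshold)
```

Here `file_versions` is assumed to be a dictionary mapping each file path to its current version number, which you would have to extract from your own SCM system.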
[1] João Gustavo Prudêncio, Leonardo Murta, Cláudia Werner, “On the Selection of Concurrency Control Policies for Configuration Management,” XXIII Brazilian Symposium on Software Engineering (SBES), pp. 155-164, 2009.
