As a software developer, you deal in numbers. Software is just a collection of binary digits ordered in the right sequence. You like to quantify things and rank things. That’s why many of you hold performance in such high regard: because you can quantify it!
You can’t quantify the maintainability of code easily, however. I believe that’s why many programmers aren’t as interested in code quality: It can be debated ad infinitum, whereas performance numbers are hard and sexy.
Managers, too, want the numbers. I’ve met many a manager who sought to distill the essence of code quality into a single number. Even a non-technical manager can read a performance measurement and know whether or not it meets business needs. But, they don’t have a single number at their disposal that tells them whether or not their teams are doing the right thing.
I’m often asked, “How do I know if my team is really doing TDD well?” There are many numbers that a development team can easily proffer that would seem to answer this question: unit test count, code complexity, code coverage, and so on. They can interpret these metrics over time. They can also crunch numbers and relate the metrics to one another for more insights.
For example, an increasing code coverage metric would seem to be a good thing, but it’s possible to quickly increase code coverage just by copy/pasting a few bad integration tests and making a few changes. This is known as “gaming a metric,” something that suggests developers are being nefarious. But, that’s pointing fingers in the wrong direction. When management looks to define goals by the numbers, they get exactly what they ask for: numbers.
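To make the gaming concrete, here is a minimal sketch, assuming JUnit 4 and a made-up PriceCalculator class: a test that executes every branch of the production code, so a coverage tool marks those lines as covered, yet asserts nothing, so it can never fail when the logic breaks.

    import org.junit.Test;

    // A hypothetical illustration of a "gamed" coverage number. The test drives
    // the production code through all of its branches, so a coverage tool reports
    // those lines as covered, but there are no assertions, so a regression in the
    // pricing logic would still pass. PriceCalculator is a stand-in for any
    // production class, not code from a real project.
    public class GamedCoverageTest {

        static class PriceCalculator {
            double priceFor(int quantity) {
                if (quantity > 100) return quantity * 0.90;  // bulk discount
                if (quantity > 10)  return quantity * 0.95;  // small discount
                return quantity * 1.00;                      // list price
            }
        }

        @Test
        public void exercisesEveryBranchButVerifiesNothing() {
            PriceCalculator calculator = new PriceCalculator();
            calculator.priceFor(5);    // covers the list-price branch
            calculator.priceFor(50);   // covers the small-discount branch
            calculator.priceFor(500);  // covers the bulk-discount branch
            // No assertions: the coverage number rises, confidence does not.
        }
    }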
Unfortunately, gaming code coverage leads to high-cost test maintenance. So, perhaps you can temper the code coverage metric with code complexity. Or, you can combine it with the ratio of the number of tests to overall code size.
Still, the combined metrics will always miss one of the critical elements: How effectively does the code communicate? Even code that scores well on all the typical metrics (low complexity, high cohesion, low coupling, high code coverage, and so on) can be abstruse. And it’s this quality of code—its readability—that is one of the most significant factors in the cost of its maintenance over time.
A developer experienced in clean code concepts can look quickly at code and point out its problems with respect to expressiveness. You can look at the typical code base and point out dozens of areas for improvement, often with your eyes half-open and within a single method. But, you don’t have an effective way of distilling this instant knowledge into a single number.
Michael Feathers suggests that there is a concept of “design sense”: a potential for immediate, even subconscious recognition of code quality. Malcolm Gladwell’s book Blink discusses the value of this concept in other professions. Certainly, a rapid assessment of code quality can be taught, and probably in reasonably short order, but that implies the assessor can read code and wants to look at it.
That leaves your managers with a quandary: They are not technical, but they need to know whether or not their technicians are doing the right things. A number, maybe even a set of numbers, would be nice.
Measuring Agility
In an agile software development shop, your goal is to continually produce software that meets the customer’s needs. You look to ship quality software every two weeks.
The best metric would seem to be what Ron Jeffries calls “Running Tested Features,” or RTF. It measures agility directly, at the level of its definition: your ability to continue shipping quality product to the customer. RTF represents how many features shipped each iteration. The fact that the features are tested ensures some level of quality.
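As a rough sketch of what tallying RTF might look like at the end of an iteration (the Feature type and its fields here are invented for illustration; Jeffries doesn’t prescribe any particular implementation):

    import java.util.List;

    // A rough sketch of a Running Tested Features tally. Feature, shipped, and
    // testsPass are invented names for illustration: RTF is simply the count of
    // features that have shipped and whose tests pass.
    public class RunningTestedFeatures {

        static class Feature {
            final String name;
            final boolean shipped;
            final boolean testsPass;

            Feature(String name, boolean shipped, boolean testsPass) {
                this.name = name;
                this.shipped = shipped;
                this.testsPass = testsPass;
            }
        }

        static int count(List<Feature> features) {
            int rtf = 0;
            for (Feature feature : features)
                if (feature.shipped && feature.testsPass)
                    rtf++;
            return rtf;
        }

        public static void main(String[] args) {
            List<Feature> thisIteration = List.of(
                    new Feature("login", true, true),
                    new Feature("reporting", true, false),  // shipped, but tests fail
                    new Feature("export", false, true));    // tested, not yet shipped
            System.out.println("RTF: " + count(thisIteration));  // prints RTF: 1
        }
    }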
Unfortunately, an RTF measurement is only as good as your last iteration. You could hit a brick wall and be completely unable to deliver “running, tested features” in the very next iteration, and for several iterations following. I’ve seen this happen before on agile software projects, and I’m sure I’ll see it happen again.
On these failed projects, was there a single metric that could have acted as a crystal ball, something that could have indicated serious problems with the project? In at least one case, the answer was yes, but we didn’t know what that metric was until it was too late. The appropriate metric would have been something like “dependency of tests upon production code.” Just glancing at the code, we knew it had problems, but we also didn’t believe they would have such a devastating effect.
The failure of a single metric doesn’t mean you shouldn’t capture metrics at all. RTF is a valuable indicator that can tell you many things about the current health of a project. But it has limitations, like any other metric (or combination of metrics), including its inability to predict the future.
One of the main points of agile is that you should be looking at concerns and challenges each and every iteration. Retrospectives, ignored by many agile teams, are a wonderful and ultimately essential tool. A retrospective is an opportunity to take a deep, introspective look at how you work as a team and at the quality of your output. Metrics themselves are ripe for retrospective discussion and adaptation. The shortness of agile iterations allows you to introspect many, many times, correcting small problems before they grow into larger ones.
Agile Metric Principles
So, what makes for a good set of metrics? Based on my experience, I’ve formulated a set of principles for establishing and using metrics in an agile environment. I don’t believe this is an exhaustive or final list, and it may even be too long, but it can always be iterated!
- Don’t produce metrics that no one wants. Too many metrics are overwhelming, and the sheer volume can bury important problems. Minimize the overall number of metrics, and emphasize the ones that tell a story. Try eliminating a questionable metric and see if anyone complains.
- Be honest about how management uses metrics. As soon as developers find out that metrics are used as the basis for punitive measures, even the most honest developer will find a way to game them.
- Don’t use metrics to compare teams. Metrics are best viewed as probes and indicators of problem areas that may warrant further investigation.
- Don’t introduce metrics that require significant work to produce. If an automated build or tool can’t produce a metric, it’s probably not worth the effort.
- Take team maturity into account when selecting metrics. Metrics should change over time. For a team just learning TDD, a simple count of unit tests emphasizes the basic behavior of getting the team to write more tests. The team should quickly mature beyond this phase, and within two iterations should be looking at a more meaningful metric (such as code coverage). And code coverage, too, should be de-emphasized as its numbers become less interesting. Be willing to change metrics.
- Ensure that metrics don’t demoralize the team. A metric that shows little hope of improvement can crush the spirits of a team. For example, in the early phases of trying to cover a large, legacy codebase with tests, a coverage metric is going to show very, very low percentages.
- A single metric on its own has minimal use. As mentioned earlier, a management team shouldn’t emphasize fixing a single metric, such as code coverage, as the goal. Instead, management should emphasize fixing the real problem (perhaps the answer is, “get them to really do TDD”).
- Use metrics as a basis for discussion, not as a final decision point.
About the Author
Jeff Langr is a veteran software developer with over a quarter century of professional experience. He’s authored two books and over 50 published articles on software development, including Agile Java: Crafting Code With Test-Driven Development (Prentice Hall, 2005). You can find out more about Jeff at his site, http://langrsoft.com, or you can contact him via email at jeff at langrsoft dot com.