Software Quality Metrics
In-Process Quality Metrics for Software Testing
Until recently, most software quality metrics in many development organizations were of an in-process nature. That is, they were designed to track defect occurrences during formal machine testing. These are listed here only briefly because of their historical importance and because we will be replacing them with upstream quality measures and metrics that will supersede them.
Defect rate during formal system testing is usually highly correlated with the future defect rate in the field because higher-than-expected testing defect rates usually indicate high software complexity or special development problems. Although it may be counterintuitive, experience shows that higher defect rates in testing indicate higher defect rates later in use. If these appear, the development manager has a set of "damage control" scenarios he or she may apply to correct the quality problem in testing before it becomes a problem in the field.
Overall defect density during testing is only a gross indicator; the pattern of defect arrival or "mean time between defects" is a more sensitive metric. Naturally the development organization cannot fix all of the problems arriving today or this week, so a tertiary measure of defect backlog becomes important. If the defect backlog is large at the end of a development cycle, a lot of prerelease fixes have to be integrated into the system. This metric thus becomes not only a quality and workload statement but also a predictor of problems in the very near future.
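The arrival pattern and the backlog described above are simple to compute from a defect log. The following is a minimal sketch, assuming defect arrival dates and weekly arrival and closure counts are available; the function names and the sample data are hypothetical and chosen only for illustration.

```python
from datetime import date

def mean_days_between_defects(arrival_dates):
    """Mean gap, in days, between consecutive defect arrivals."""
    ordered = sorted(arrival_dates)
    if len(ordered) < 2:
        raise ValueError("need at least two arrivals to measure a gap")
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)

def backlog_by_period(arrived, closed, starting_backlog=0):
    """Running count of open defects at the end of each period."""
    backlog = starting_backlog
    history = []
    for new, fixed in zip(arrived, closed):
        backlog += new - fixed
        history.append(backlog)
    return history

# Hypothetical test-phase data.
arrivals = [date(2024, 3, 1), date(2024, 3, 2), date(2024, 3, 5), date(2024, 3, 11)]
mtbd = mean_days_between_defects(arrivals)             # gaps of 1, 3, and 6 days
weekly_backlog = backlog_by_period([12, 9, 7, 4], [5, 8, 9, 6])
```

A lengthening gap between arrivals suggests the product is stabilizing, while a backlog that is still large in the last period signals a wave of late fixes to integrate.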
Phase-based defect removal uses the defect removal effectiveness (DRE) metric. This is simply the defects removed during each development phase divided by the defects latent in the product, times 100 to get the result as a percentage. Because the number of latent defects is not yet known, it is estimated to be the defects removed during the phase plus the defects found later. This metric is best used before code integration (already well downstream in the development process, unfortunately) and for each succeeding phase. This simple metric has become a widely used tool for large application development. However, its ad hoc downstream nature naturally leads to the most important in-process metric as we include in-service defect fixes and maintenance in the development process.
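As an illustration of the DRE calculation, the sketch below estimates the latent defects for a phase as the defects removed in that phase plus those found afterward. It makes the simplifying assumption that every defect found later was already present during the earlier phase (defect injection in later phases is ignored); the phase names and counts are hypothetical.

```python
def defect_removal_effectiveness(removed_in_phase, found_later):
    """DRE as a percentage: removed / (removed + found later) * 100."""
    latent = removed_in_phase + found_later
    if latent == 0:
        raise ValueError("no defects recorded; DRE is undefined")
    return 100.0 * removed_in_phase / latent

def dre_by_phase(removed, escaped_to_field):
    """DRE for each phase, given defects removed per phase in process order."""
    phases = list(removed)
    result = {}
    for index, phase in enumerate(phases):
        # Defects found later = removed in all following phases + field defects.
        later = sum(removed[p] for p in phases[index + 1:]) + escaped_to_field
        result[phase] = defect_removal_effectiveness(removed[phase], later)
    return result

# Hypothetical counts, listed in development order.
removed = {"design review": 30, "code inspection": 45, "unit test": 20, "system test": 15}
dre = dre_by_phase(removed, escaped_to_field=10)
```

With these numbers the design review removes 30 of an estimated 120 latent defects (25 percent), while system test removes 15 of the remaining 25 (60 percent).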
Four new metrics can be introduced to measure quality after the software has been delivered:
- Fix backlog and backlog management index (BMI)
- Fix response time and fix responsiveness
- Percentage of delinquent fixes
- Fix quality (that is, did it really get fixed?)
These metrics are not rocket science. The monthly BMI, expressed as a percentage, is 100 times the number of problems closed during the month divided by the number of problem arrivals during the month; a value above 100 means the backlog shrank. Fix responsiveness is the mean time of all problems from their arrival to their close. If the turnaround time for a given problem exceeds the required or standard response time, the fix is declared delinquent. The percentage of delinquent fixes is 100 times the number of fixes that missed the response-time criterion divided by the total number of fixes delivered in the period. Fix quality traditionally is measured negatively, as lack of quality: it is the number of defective fixes, that is, fixes that did not work properly in all situations or, worse, that caused yet other problems. The real quality goal is, of course, zero defective fixes.
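The post-delivery metrics can be sketched in a few lines. This example assumes the BMI convention of problems closed divided by problem arrivals (so that values above 100 indicate a shrinking backlog) and takes the delinquency percentage over all fixes delivered in the period. The function names, the seven-day response limit, and the sample data are hypothetical.

```python
def backlog_management_index(closed, arrivals):
    """Monthly BMI as a percentage: closed / arrivals * 100."""
    if arrivals == 0:
        raise ValueError("no arrivals this month; BMI is undefined")
    return 100.0 * closed / arrivals

def fix_responsiveness(turnaround_days):
    """Mean time from problem arrival to close."""
    return sum(turnaround_days) / len(turnaround_days)

def percent_delinquent(turnaround_days, limit_days):
    """Share of delivered fixes whose turnaround exceeded the limit."""
    late = sum(1 for days in turnaround_days if days > limit_days)
    return 100.0 * late / len(turnaround_days)

# Hypothetical month: 50 problems arrived, 45 were closed.
bmi = backlog_management_index(closed=45, arrivals=50)

# Hypothetical turnaround times (days) for six delivered fixes.
turnarounds = [2, 5, 9, 3, 12, 4]
mean_turnaround = fix_responsiveness(turnarounds)
delinquent_pct = percent_delinquent(turnarounds, limit_days=7)
```

Here the BMI of 90 shows the backlog growing, and two of the six fixes (9 and 12 days) are delinquent against the assumed seven-day limit.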
Software Complexity Metrics
Computer systems are more complex than other large-scale engineered systems, and it is the software that makes them more complex. A number of approaches have been taken to calculate, or at least estimate, the degree of complexity. The simplest basis is LOC, a count of the executable statements in a computer program. This metric began in the days of assembly language programming, and it is still used today for programs written in high-level programming languages. Most procedural third-generation memory-to-memory languages such as FORTRAN, COBOL, and ALGOL typically produce six executable machine-language (ML) statements per high-level language statement. Register-to-register languages such as C, C++, and Java produce about three.
Recent studies show a curvilinear relationship between defect rate and executable LOC. Defect density, or defects per KLOC, appears to decrease with program size and then increase again as program modules become very large (see Figure 3.1). Curiously, this result suggests that there may be an optimum program size leading to a lowest defect rate, depending, of course, on programming language, project size, product type, and computing environment.8 Experience seems to indicate that small programs have a defect rate of about 1.5 defects per KLOC. Programs larger than 1,000 lines of code have a similar defect rate. Programs of about 500 LOC have defect rates near 0.5 defects per KLOC. This is almost certainly an effect of complexity, because small, "tight" programs are usually intrinsically complex. Programs larger than 1,000 LOC exhibit the complexity of size because they have so many pieces. This situation will improve with object-oriented programming, for which there is greater latitude, or at least greater convenience of choice, in component size than for procedural programming. As you might expect, interface coding, although the most defect-prone of all programming, has defect rates that are constant with program size.
FIGURE 3.1 The Relationship Between LOC and Defect Density
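In 1977 Professor Maurice H. Halstead distinguished software science from computer science by describing programming as a process of collecting and arranging software tokens, which are either operands or operators. His measures are as follows:
- $n_1$ = number of distinct operators in a program
- $n_2$ = number of distinct operands in a program
- $N_1$ = number of operator occurrences
- $N_2$ = number of operand occurrences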
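He then based a set of derivative measures on these primitive measures to express the total token vocabulary, overall program length, potential minimum volume for a programmed algorithm, actual program volume in bits, program level as a complexity metric, and program difficulty, among others. For example:
- Vocabulary: $n = n_1 + n_2$
- Length: $N = N_1 + N_2$, estimated as $n_1 \log_2 n_1 + n_2 \log_2 n_2$
- Volume: $V = N \log_2 n$
- Level: $L = V^* / V$, estimated as $(2 / n_1)(n_2 / N_2)$
- Difficulty: $D = V / V^*$ (the inverse of level)
- Effort: $E = V / L$
- Faults: $B = V / S^*$
V* is the minimum volume represented by a built-in function that can perform the task of the entire program. S* is the mean number of mental discriminations or decisions between errors, a value estimated as 3,000 by Halstead.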
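To make the derived measures concrete, the sketch below computes them from the four primitive counts using the standard software-science definitions: $n_1$ and $n_2$ are the numbers of distinct operators and operands, and $N_1$ and $N_2$ are their total occurrences. Because V* is rarely known, the level is approximated as $(2/n_1)(n_2/N_2)$. The function name and the sample token counts are hypothetical.

```python
import math

def halstead_measures(n1, n2, N1, N2, s_star=3000):
    """Derived Halstead measures from the four primitive token counts."""
    vocabulary = n1 + n2
    length = N1 + N2
    estimated_length = n1 * math.log2(n1) + n2 * math.log2(n2)
    volume = length * math.log2(vocabulary)   # program volume in bits
    level = (2 / n1) * (n2 / N2)              # estimate of V*/V
    difficulty = 1 / level
    effort = volume / level
    faults = volume / s_star                  # S* = 3,000 per Halstead
    return {
        "vocabulary": vocabulary,
        "length": length,
        "estimated_length": estimated_length,
        "volume": volume,
        "level": level,
        "difficulty": difficulty,
        "effort": effort,
        "faults": faults,
    }

# Hypothetical token counts for a small routine.
m = halstead_measures(n1=10, n2=6, N1=30, N2=18)
```

For these counts the vocabulary is 16 tokens and the length is 48, giving a volume of 192 bits and a predicted fault count well below one. Note that the fault estimate depends only on volume, because S* is a constant.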
When these metrics first were announced, some software development old-timers thought that Halstead had violated Aristotle's first law of scientific inquiry: "Don't employ more rigor than the subject matter can bear." But none could gainsay the accuracy of his predictions or the quality of his results. In fact, the latter established software metrics as an issue of importance for computer scientists and established Professor Halstead as the founder of this field of inquiry. The major criticism of his approach is that his most accurate metric, program length, is dependent on N1 and N2, which are not known with sufficient accuracy until the program is almost done. Halstead's formulas fall short as direct quantitative measures because they fail to predict program size and quality sufficiently upstream in the development process. Also, his choice of S* as a constant presumes an unknown model of human memory and cognition and unfortunately is also a constant that doesn't depend on program volume. Thus, the number of faults depends only on program size, which later experience has not supported. The results shown in Figure 3.1 indicate that the number of defects is not constant with program size but may rather take on an optimum value for programs having about 500 LOC. Perhaps Halstead's elaborate quantifications do not fully represent his incredible intuition, which was gained from long experience, after all.
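About the same time Halstead founded software science, McCabe proposed a topological or graph-theory measure of cyclomatic complexity as a measure of the number of linearly independent paths that make up a computer program. To compute the cyclomatic complexity of a program that has been graphed or flow-charted, the formula used is
$$M = V(G) = e - n + 2p$$
where V(G) is the cyclomatic number of the graph G, e is the number of edges, n is the number of nodes, and p is the number of unconnected parts of the graph.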
More simply, it turns out that M is equal to the number of binary decisions in the program plus 1. An n-way case statement would be counted as $n - 1$ binary decisions. The advantage of this measure is that it is additive for program components or modules. Usage recommends that no single module have a value of M greater than 10. However, because on average every fifth or sixth program instruction executed is a branch, M strongly correlates with program size or LOC. Like the other early quality measures that focus on programs per se or even their modules, this one masks the true source of architectural complexity: interconnections between modules. Later researchers have proposed structure metrics to compensate for this deficiency by quantifying program module interactions. For example, fan-in and fan-out metrics, which are analogous to the number of inputs to and outputs from hardware circuit modules, are an attempt to fill this gap. Similar metrics include number of subroutine calls and/or macro inclusions per module, and number of design changes to a module, among others. Kan reports extensive experimental testing of these metrics and also reports that, other than module length, the most important predictors of defect rates are number of design changes and complexity level, however it is computed.9
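Both routes to M can be checked against each other on a small example. The sketch below assumes the usual graph form of McCabe's measure, M = e − n + 2p (edges, nodes, and unconnected parts of the flow graph), alongside the decision-counting shortcut in which an n-way case statement contributes n − 1 binary decisions. The function names and the sample flow graph are hypothetical.

```python
def cyclomatic_from_graph(edges, components=1):
    """M = e - n + 2p for a control-flow graph given as (source, target) edges."""
    nodes = {vertex for edge in edges for vertex in edge}
    return len(edges) - len(nodes) + 2 * components

def cyclomatic_from_decisions(binary_decisions, case_branch_counts=()):
    """M = binary decisions + 1, counting an n-way case as n - 1 decisions."""
    case_decisions = sum(n - 1 for n in case_branch_counts)
    return binary_decisions + case_decisions + 1

# Hypothetical routine: an if/else followed by a while loop.
flow_graph = [
    ("if", "then"), ("if", "else"),
    ("then", "while"), ("else", "while"),
    ("while", "body"), ("body", "while"),
    ("while", "exit"),
]
m_graph = cyclomatic_from_graph(flow_graph)        # 7 edges, 6 nodes, 1 part
m_decisions = cyclomatic_from_decisions(2)         # the if and the while
m_with_case = cyclomatic_from_decisions(1, [4])    # one if plus a 4-way case
```

The graph has 7 edges and 6 nodes, so M = 7 − 6 + 2 = 3, which matches the two binary decisions plus 1.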
Function Point Metrics
Quality metrics based either directly or indirectly on counting lines of code in a program or its modules are unsatisfactory. These metrics are merely surrogate indicators of the number of opportunities to make an error, but from the perspective of the program as coded. More recently the function point has been proposed as a meaningful cluster of measurable code from the user's rather than the programmer's perspective. Function points can also be surrogates for error opportunity, but they can be more. They represent the user's needs and anticipated or a priori application of the program rather than just the programmer's a posteriori completion of it. A very large program may have millions of LOC, but an application with 1,000 function points would be a very large application or system indeed. A function may be defined as a collection of executable statements that performs a task, together with declarations of formal parameters and local variables manipulated by those statements. A typical function point metric developed by Albrecht10 at IBM is a weighted sum of five components that characterize an application:
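- Number of external inputs × 4
- Number of external outputs × 5
- Number of logical internal files × 10
- Number of external interface files × 7
- Number of external inquiries × 4
The multipliers are the average weighting factors $w_{ij}$, which may vary with program size and complexity. $x_{ij}$ is the number of each component type in the application.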
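The function count FC is the double sum:
$$FC = \sum_{i=1}^{5} \sum_{j=1}^{3} w_{ij} \, x_{ij}$$
where i indexes the five component types and j indexes the three complexity levels (low, average, and high).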
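The second step employs a scale of 0 to 5 to assess the impact of 14 general system characteristics in terms of their likely effect on the application:
- Data communications
- Distributed functions
- Performance
- Heavily used configuration
- Transaction rate
- Online data entry
- End-user efficiency
- Online update
- Complex processing
- Reusability
- Installation ease
- Operational ease
- Multiple sites
- Facilitation of change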
The scores for these characteristics $c_i$ are then summed based on the following formula to find a value adjustment factor (VAF):
$$VAF = 0.65 + 0.01 \sum_{i=1}^{14} c_i$$
Finally, the number of function points is obtained by multiplying the number of function counts by the value adjustment factor:
$$FP = FC \times VAF$$
This is actually a highly simplified version of a commonly used method that is documented in the International Function Point User's Group Standard (IFPUG, 1999).11
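The three steps can be sketched as follows. This is a simplified illustration, not the IFPUG counting procedure: it applies only the commonly cited average weights for the five component types (4, 5, 10, 7, and 4), collapsing the complexity index j, and it uses the usual form VAF = 0.65 + 0.01 times the sum of the 14 characteristic scores. The names and the sample application are hypothetical.

```python
AVERAGE_WEIGHTS = {
    "external_inputs": 4,
    "external_outputs": 5,
    "logical_internal_files": 10,
    "external_interface_files": 7,
    "external_inquiries": 4,
}

def function_count(component_counts, weights=AVERAGE_WEIGHTS):
    """Weighted sum of the five component counts (average weights only)."""
    return sum(weights[name] * count for name, count in component_counts.items())

def value_adjustment_factor(scores):
    """VAF from the 14 general system characteristic scores, each 0 to 5."""
    if len(scores) != 14 or any(not 0 <= s <= 5 for s in scores):
        raise ValueError("expected 14 scores, each between 0 and 5")
    return 0.65 + 0.01 * sum(scores)

def function_points(component_counts, scores):
    """FP = FC * VAF."""
    return function_count(component_counts) * value_adjustment_factor(scores)

# Hypothetical application.
counts = {
    "external_inputs": 20,
    "external_outputs": 15,
    "logical_internal_files": 8,
    "external_interface_files": 4,
    "external_inquiries": 10,
}
scores = [3] * 14                      # every characteristic rated "average"
fc = function_count(counts)            # 80 + 75 + 80 + 28 + 40
fp = function_points(counts, scores)
```

Because each score lies between 0 and 5, the VAF ranges from 0.65 to 1.35, so the adjustment can move the raw function count by at most 35 percent in either direction.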
Although function point extrinsic counting metrics and methods are considered more robust than intrinsic LOC counting methods, they have the appearance of being somewhat subjective and experimental in nature. As used over time by organizations that develop very large software systems (having 1,000 or more function points), they show an amazingly high degree of repeatability and utility. This is probably because they enforce a disciplined learning process on a software development organization as much as any scientific credibility they may possess.