Software cost overruns, schedule delays, and poor quality have been endemic in the software industry for more than 50 years.
The system is stable; let’s just document the known problems.
—Quality control manager of a tier-one application vendor
A large body of literature has appeared over the past three or four decades on how developers can measure various aspects of software development and use, from the productivity of the programmers coding it to the satisfaction of the ultimate end users applying it to their business problems. Some metrics are broader than others. In any scientific measurement effort, you must balance the sensitivity and the selectivity of the measures employed. Here we are primarily concerned with the quality of the software end product as seen from the end user’s point of view. Although much of the software metrics technology used in the past was applied downstream, the overall trend in the field is to push measurement methods and models back upstream to the design phase and even to measurement of the architecture itself. The issue in measuring software performance and quality is clearly its complexity as compared even to the computer hardware on which it runs. Managing complexity and finding significant surrogate indicators of program complexity must go beyond merely estimating the number of lines of code the program is expected to require.
Measuring Software Quality
Historically software quality metrics have been the measurement of exactly their opposite—that is, the frequency of software defects or bugs. The inference was, of course, that quality in software was the absence of bugs. So, for example, measures of error density per thousand lines of code discovered per year or per release were used. Lower values of these measures implied higher build or release quality. For example, a density of two bugs per 1,000 lines of code (LOC) discovered per year was considered pretty good, but this is a very long way from today’s Six Sigma goals. We will start this article by reviewing some of the leading historical quality models and metrics to establish the state of the art in software metrics today and to develop a baseline on which we can build a true set of upstream quality metrics for robust software architecture. Perhaps at this point we should attempt to settle on a definition of software architecture as well. Most of the leading writers on this topic do not define their subject term, assuming that the reader will construct an intuitive working definition on the metaphor of computer architecture or even its earlier archetype, building architecture. And, of course, almost everyone does! There is no universally accepted definition of software architecture, but one that seems very promising has been proposed by Shaw and Garlan:
Abstractly, software architecture involves the description of elements from which systems
are built, interactions among those elements, patterns that guide their composition, and
constraints on those patterns. In general, a particular system is defined in terms of a collection
of components, and interactions among those components.1
This definition follows a straightforward inductive path from that of building architecture, through system architecture, through computer architecture, to software architecture. As you will see, the key word in this definition—for software, at least—is patterns. Having chosen a definition for software architecture, we are free to talk about measuring the quality of that architecture and ultimately its implementations in the form of running computer programs. But first, we will review some classical software quality metrics to see what we must surrender to establish a new metric order for software.
Classic Software Quality Metrics
Software quality is a multidimensional concept. The multiple professional views of product quality may be very different from popular or nonspecialist views. Moreover, they have levels of abstraction beyond even the viewpoints of the developer or user. Crosby, among many others, has defined software quality as conformance to specification.2 However, very few end users will agree that a program that perfectly implements a flawed specification is a quality product. Of course, when we talk about software architecture, we are talking about a design stage well upstream from the program’s specification. Years ago Juran3 proposed a generic definition of quality. He said products must possess multiple elements of fitness for use. Two of his parameters of interest for software products were quality of design and quality of conformance. These separate design from implementation and may even accommodate the differing viewpoints of developer and user in each area.
Two leading firms that have placed a great deal of importance on software quality are IBM and Hewlett-Packard. IBM measures user satisfaction in eight dimensions for quality as well as overall user satisfaction: capability or functionality, usability, performance, reliability, installability, maintainability, documentation, and availability (see Table 3.1). Some of these factors conflict with each other, and some support each other. For example, usability and performance may conflict, as may reliability and capability or performance and capability. IBM has user evaluations down to a science. We recently participated in an IBM Middleware product study of only the usability dimension. It was five pages of questions plus a two-hour interview with a specialist consultant. Similarly, Hewlett-Packard uses five Juran quality parameters: functionality, usability, reliability, performance, and serviceability. Other computer and software vendor firms may use more or fewer quality parameters and may even weight them differently for different kinds of software or for the same software in different vertical markets. Some firms focus on process quality rather than product quality. Although it is true that a flawed process is unlikely to produce a quality product, our focus here is entirely on software product quality, from architectural conception to end use.
TABLE 3.1 IBM’s Measures of User Satisfaction
Total Quality Management
The Naval Air Systems Command coined the term Total Quality Management (TQM) in 1985 to describe its approach to quality improvement, patterned after the Japanese-style management approach to quality improvement. Since then, TQM has taken on many meanings across the world. TQM methodology is based on the teachings of such quality gurus as Philip B. Crosby, W. Edwards Deming, Armand V. Feigenbaum, Kaoru Ishikawa, and Joseph M. Juran. Simply put, it is a management approach to long-term success that is attained through a focus on customer satisfaction. This approach requires the creation of a quality culture in the organization to improve processes, products, and services. In the 1980s and ’90s, many quality gurus published specific methods for achieving TQM, and the method was applied in government, industry, and even research universities. The Malcolm Baldrige Award in the United States and the ISO 9000 standards are legacies of the TQM movement, as is the Software Engineering Institute’s (SEI’s) Capability Maturity Model (CMM), in which organizational maturity level 5 represents the highest level of quality capability.4 In 2000, the SW-CMM was upgraded to Capability Maturity Model Integration (CMMI).
The implementation of TQM has many varieties, but the four essential characteristics of the TQM approach are as follows:
- Customer focus: The objective is to achieve total customer satisfaction—to “delight the customer.” Customer focus includes studying customer needs and wants, gathering customer requirements, and measuring customer satisfaction.
- Process improvement: The objective is to reduce process variation and to achieve continuous process improvement of both business and product development processes.
- Quality culture: The objective is to create an organization-wide quality culture, including leadership, management commitment, total staff participation, and employee empowerment.
- Measurement and analysis: The objective is to drive continuous improvement in all quality parameters by a goal-oriented measurement system.
Total Quality Management made an enormous contribution to the development of enterprise applications software in the 1990s. Its introduction as an information technology initiative followed its successful application in manufacturing and service industries. It came to IT just in time for the redevelopment of all existing enterprise software for Y2K. The efforts of one of the authors to introduce TQM in the internal administrative services sector of research universities encountered token resistance from faculty oversight committees, who objected to the term "total" on the curious dogmatic grounds that nothing is really "total" in practice. As CIO at the University of Pennsylvania, he attempted to explain to a faculty IT oversight committee that the name was merely a label for a commonly practiced worldwide methodology, but this didn't help much. However, he persevered with a new information architecture, followed by (totally!) reengineering all administrative processes using TQM "delight-the-customer" measures. He also designed a (totally) new information system to meet the university's needs in the post-Y2K world (which began in 1996 in higher education, when the class of 2000 enrolled and their student loans were set up).5
Generic Software Quality Measures
In 1993 the IEEE published a standard for software quality metrics methodology that has since defined and led development in the field. Here we begin by summarizing this standard. It was intended as a more systematic approach for establishing quality requirements and identifying, implementing, analyzing, and validating software quality metrics for software system development. It spans the development cycle in five steps, as shown in Table 3.2.
TABLE 3.2 IEEE Software Quality Metrics Methodology
A typical “catalog” of metrics in current use will be discussed later. At this point we merely want to present a gestalt for the IEEE recommended methodology. In the first step it is important to establish direct metrics with values as numerical targets to be met in the final product. The factors to be measured may vary from product to product, but it is critical to rank the factors by priority and assign a direct metric value as a quantitative requirement for that factor. There is no mystery at this point, because Voice of the Customer (VOC) and Quality Function Deployment (QFD) are the means available not only to determine the metrics and their target values, but also to prioritize them.
The second step is to identify the software quality metrics by decomposing each factor into subfactors and those further into the metrics. For example, a direct final metric for the factor reliability could be faults per 1,000 lines of code (KLOC) with a target value—say, one fault per 1,000 lines of code (LOC). (This level of quality is just 4.59 Sigma; Six Sigma quality would be 3.4 faults per 1,000 KLOC or one million lines of code.) For each validated metric at the metric level, a value should be assigned that will be achieved during development. Table 3.3 gives the IEEE’s suggested paradigm for a description of the metrics set.6
TABLE 3.3 IEEE Metric Set Description Paradigm7
| Field | Description |
|---|---|
| Name | Name of the metric |
| Metric | Mathematical function to compute the metric |
| Cost | Cost of using the metric |
| Benefit | Benefit of using the metric |
| Impact | Can the metric be used to alter or stop the project? |
| Target value | Numerical value to be achieved to meet the requirement |
| Factors | Factors related to the metric |
| Tools | Tools to gather data, calculate the metric, and analyze the results |
| Application | How the metric is to be used |
| Data items | Input values needed to compute the metric |
| Computation | Steps involved in the computation |
| Interpretation | How to interpret the results of the computation |
| Considerations | Metric assumptions and appropriateness |
| Training | Training required to apply the metric |
| Example | An example of applying the metric |
| History | Projects that have used this metric and its validation history |
| References | List of projects used, project details, and so on |
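As an aside, the sigma equivalences quoted earlier (one fault per KLOC is roughly 4.59 sigma; Six Sigma is 3.4 faults per million LOC) can be verified with a short calculation. The sketch below treats each line of code as a defect opportunity and applies the conventional 1.5-sigma long-term shift:

```python
from statistics import NormalDist

def sigma_level(defects: float, opportunities: float) -> float:
    """Long-term sigma level (with the conventional 1.5-sigma shift)
    for a given defect count over a number of opportunities."""
    dpmo = defects / opportunities * 1_000_000  # defects per million opportunities
    return NormalDist().inv_cdf(1 - dpmo / 1_000_000) + 1.5

# One fault per 1,000 LOC, treating each line as an opportunity:
print(round(sigma_level(1, 1_000), 2))        # ~4.59 sigma
# Six Sigma: 3.4 faults per one million LOC:
print(round(sigma_level(3.4, 1_000_000), 2))  # ~6.0 sigma
```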
To implement the metrics in the metric set chosen for the project under design, the data to be collected must be determined, and assumptions about the flow of data must be clarified. Any tools to be employed are defined, and any organizations to be involved are described, as are any necessary training. It is also wise at this point to test the metrics on some known software to refine their use, sensitivity, accuracy, and the cost of employing them.
Analyzing the metrics can help you identify any components of the developing system that appear to have unacceptable quality or that present development bottlenecks. Any components whose measured values deviate from their target values are noncompliant.
Validation of the metrics is a continuous process spanning multiple projects. If the metrics employed are to be useful, they must accurately indicate whether quality requirements have been achieved or are likely to be achieved during development. Furthermore, a metric must be revalidated every time it is used. Confidence in a metric will improve over time as further usage experience is gained.
In-Process Quality Metrics for Software Testing
Until recently, most software quality metrics in many development organizations were of an in-process nature. That is, they were designed to track defect occurrences during formal machine testing. These are listed here only briefly because of their historical importance and because we will be replacing them with upstream quality measures and metrics that will supersede them.
Defect rate during formal system testing is usually highly correlated with the future defect rate in the field because higher-than-expected testing defect rates usually indicate high software complexity or special development problems. Although it may be counterintuitive, experience shows that higher defect rates in testing indicate higher defect rates later in use. If these appear, the development manager has a set of “damage control” scenarios he or she may apply to correct the quality problem in testing before it becomes a problem in the field.
Overall defect density during testing is only a gross indicator; the pattern of defect arrival or “mean time between defects” is a more sensitive metric. Naturally the development organization cannot fix all of the problems arriving today or this week, so a tertiary measure of defect backlog becomes important. If the defect backlog is large at the end of a development cycle, a lot of prerelease fixes have to be integrated into the system. This metric thus becomes not only a quality and workload statement but also a predictor of problems in the very near future.
Phase-based defect removal uses the defect removal effectiveness (DRE) metric. This is simply the defects removed during each development phase divided by the defects latent in the product, times 100 to get the result as a percentage. Because the number of latent defects is not yet known, it is estimated to be the defects removed during the phase plus the defects found later. This metric is best used before code integration (already well downstream in the development process, unfortunately) and for each succeeding phase. This simple metric has become a widely used tool for large application development. However, its ad hoc downstream nature naturally leads to the most important in-process metric as we include in-service defect fixes and maintenance in the development process.
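The DRE calculation described above is simple enough to sketch directly; the example numbers here are hypothetical:

```python
def defect_removal_effectiveness(removed_in_phase: int, found_later: int) -> float:
    """DRE (%): defects removed during a phase divided by the estimated
    latent defects (those removed in the phase plus those found later)."""
    latent = removed_in_phase + found_later
    return 100.0 * removed_in_phase / latent

# Say 90 defects are removed in design review and 10 more are
# traced back to design in later phases:
print(defect_removal_effectiveness(90, 10))  # 90.0 (%)
```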
Four new metrics can be introduced to measure quality after the software has been delivered:
- Fix backlog and the backlog management index (BMI)
- Fix response time and fix responsiveness
- Percentage of delinquent fixes
- Fix quality (that is, did it really get fixed?)
These metrics are not rocket science. The monthly BMI as a percentage is simply 100 times the number of problems closed during the month divided by the number of problem arrivals during the month; a BMI above 100 means the backlog is shrinking. Fix responsiveness is the mean time of all problems from their arrival to their close. If for a given problem the turnaround time exceeds the required or standard response time, it is declared delinquent. The percentage of delinquent fixes is 100 times the number of fixes that missed the standard response time divided by the total number of fixes delivered in the period. Fix quality traditionally is measured negatively as lack of quality. It's the number of defective fixes—fixes that did not work properly in all situations or, worse, that caused yet other problems. The real quality goal is, of course, zero defective fixes.
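Following Kan's definitions, these maintenance metrics can be sketched in a few lines; the problem counts and turnaround times below are hypothetical:

```python
from statistics import mean

def backlog_management_index(closed: int, arrivals: int) -> float:
    """BMI (%): problems closed in the month over problem arrivals.
    A BMI above 100 means the backlog shrank that month."""
    return 100.0 * closed / arrivals

def fix_responsiveness(turnaround_days: list) -> float:
    """Mean time from problem arrival to close, in days."""
    return mean(turnaround_days)

def percent_delinquent(turnaround_days: list, limit_days: float) -> float:
    """Percentage of fixes that exceeded the required response time."""
    late = sum(1 for t in turnaround_days if t > limit_days)
    return 100.0 * late / len(turnaround_days)

times = [2.0, 5.0, 9.0, 14.0]              # days to close four problems
print(backlog_management_index(45, 50))    # 90.0 -> backlog grew this month
print(fix_responsiveness(times))           # 7.5 days on average
print(percent_delinquent(times, 10.0))     # 25.0 (1 of 4 was delinquent)
```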
Software Complexity Metrics
Computer systems are more complex than other large-scale engineered systems, and it is the software that makes them so. A number of approaches have been taken to calculate, or at least estimate, the degree of complexity. The simplest basis is LOC, a count of the executable statements in a computer program. This metric began in the days of assembly language programming, and it is still used today for programs written in high-level programming languages. Most procedural third-generation memory-to-memory languages such as FORTRAN, COBOL, and ALGOL typically produce six executable machine-language (ML) statements per high-level language statement. Register-to-register languages such as C, C++, and Java produce about three. Recent studies show a curvilinear relationship between defect rate and executable LOC. Defect density, or defects per KLOC, appears to decrease with program size and then increase again as program modules become very large (see Figure 3.1). Curiously, this result suggests that there may be an optimum program size leading to a lowest defect rate—depending, of course, on programming language, project size, product type, and computing environment.8 Experience seems to indicate that small programs have a defect rate of about 1.5 defects per KLOC. Programs larger than 1,000 lines of code have a similar defect rate. Programs of about 500 LOC have defect rates near 0.5. This is almost certainly an effect of complexity, because small, "tight" programs are usually intrinsically complex. Programs larger than 1,000 LOC exhibit the complexity of size because they have so many pieces. This situation will improve with Object-Oriented Programming, for which there is greater latitude, or at least greater convenience of choice, in component size than for procedural programming. As you might expect, interface coding, although the most defect-prone of all programming, has defect rates that are constant with program size.
FIGURE 3.1 The Relationship Between LOC and Defect Density
In 1977 Professor Maurice H. Halstead distinguished software science from computer science by describing programming as a process of collecting and arranging software tokens, which are either operands or operators. His measures are as follows:
- n1 = the number of distinct operators in the program
- n2 = the number of distinct operands in the program
- N1 = the total number of operator occurrences
- N2 = the total number of operand occurrences
He then based a set of derivative measures on these primitive measures to express the total token vocabulary, overall program length, potential minimum volume for a programmed algorithm, actual program volume in bits, program level as a complexity metric, and program difficulty, among others. For example, the vocabulary is n = n1 + n2, the observed program length is N = N1 + N2, and the program volume is V = N log2 n.
V* is the minimum volume represented by a built-in function that can perform the task of the entire program. S* is the mean number of mental discriminations or decisions between errors—a value estimated as 3,000 by Halstead.
When these metrics first were announced, some software development old-timers thought that Halstead had violated Aristotle’s first law of scientific inquiry: “Don’t employ more rigor than the subject matter can bear.” But none could gainsay the accuracy of his predictions or the quality of his results. In fact, the latter established software metrics as an issue of importance for computer scientists and established Professor Halstead as the founder of this field of inquiry. The major criticism of his approach is that his most accurate metric, program length, is dependent on N1 and N2, which are not known with sufficient accuracy until the program is almost done. Halstead’s formulas fall short as direct quantitative measures because they fail to predict program size and quality sufficiently upstream in the development process. Also, his choice of S* as a constant presumes an unknown model of human memory and cognition and unfortunately is also a constant that doesn’t depend on program volume. Thus, the number of faults depends only on program size, which later experience has not supported. The results shown in Figure 3.1 indicate that the number of defects is not constant with program size but may rather take on an optimum value for programs having about 500 LOC. Perhaps Halstead’s elaborate quantifications do not fully represent his incredible intuition, which was gained from long experience, after all.
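Halstead's derived measures are easy to compute once the four primitive counts are in hand. The sketch below uses his standard formulas, including the fault prediction based on his constant S* = 3,000; the token counts in the example are hypothetical:

```python
from math import log2

def halstead(n1: int, n2: int, N1: int, N2: int) -> dict:
    """Halstead's derived measures from the four primitive counts:
    n1/n2 = distinct operators/operands, N1/N2 = total occurrences."""
    n = n1 + n2                             # token vocabulary
    N = N1 + N2                             # observed program length
    N_hat = n1 * log2(n1) + n2 * log2(n2)   # estimated program length
    V = N * log2(n)                         # program volume in bits
    D = (n1 / 2) * (N2 / n2)                # program difficulty
    B = V / 3000                            # predicted faults, using S* = 3,000
    return {"vocabulary": n, "length": N, "est_length": N_hat,
            "volume": V, "difficulty": D, "est_faults": B}

m = halstead(n1=10, n2=20, N1=60, N2=40)
print(round(m["volume"]))  # ~491 bits for this small example
```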
About the same time Halstead founded software science, McCabe proposed a topological or graph-theory measure of cyclomatic complexity as a measure of the number of linearly independent paths that make up a computer program. For a program whose control flow has been graphed or flow-charted, the cyclomatic complexity is M = E - N + 2P, where E is the number of edges in the graph, N is the number of nodes, and P is the number of connected components (1 for a single program).
More simply, it turns out that M is equal to the number of binary decisions in the program plus 1. An n-way case statement would be counted as n - 1 binary decisions. The advantage of this measure is that it is additive for program components or modules. Common practice recommends that no single module have a value of M greater than 10. However, because on average every fifth or sixth program instruction executed is a branch, M strongly correlates with program size or LOC. As with the other early quality measures that focus on programs per se or even their modules, these mask the true source of architectural complexity—interconnections between modules. Later researchers have proposed structure metrics to compensate for this deficiency by quantifying program module interactions. For example, fan-in and fan-out metrics, which are analogous to the number of inputs to and outputs from hardware circuit modules, are an attempt to fill this gap. Similar metrics include number of subroutine calls and/or macro inclusions per module, and number of design changes to a module, among others. Kan reports extensive experimental testing of these metrics and also reports that, other than module length, the most important predictors of defect rates are number of design changes and complexity level, however it is computed.9
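The decision-counting shortcut can be sketched directly; the module in the example, with four binary decisions and one five-way case statement, is hypothetical:

```python
def cyclomatic_complexity(binary_decisions: int, case_arms=()) -> int:
    """McCabe's M via the decision-counting shortcut:
    M = binary decisions + 1, where each n-way case statement
    contributes n - 1 binary decisions."""
    decisions = binary_decisions + sum(n - 1 for n in case_arms)
    return decisions + 1

# A module with 4 if/while tests and one 5-way case statement:
print(cyclomatic_complexity(4, [5]))  # 4 + (5 - 1) + 1 = 9, under the limit of 10
```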
Function Point Metrics
Quality metrics based either directly or indirectly on counting lines of code in a program or its modules are unsatisfactory. These metrics are merely surrogate indicators of the number of opportunities to make an error, but from the perspective of the program as coded. More recently the function point has been proposed as a meaningful cluster of measurable code from the user’s rather than the programmer’s perspective. Function points can also be surrogates for error opportunity, but they can be more. They represent the user’s needs and anticipated or a priori application of the program rather than just the programmer’s a posteriori completion of it. A very large program may have millions of LOC, but an application with 1,000 function points would be a very large application or system indeed. A function may be defined as a collection of executable statements that performs a task, together with declarations of formal parameters and local variables manipulated by those statements. A typical function point metric developed by Albrecht10 at IBM is a weighted sum of five components that characterize an application:
- Number of external inputs (average weight 4)
- Number of external outputs (average weight 5)
- Number of external inquiries (average weight 4)
- Number of internal logical files (average weight 10)
- Number of external interface files (average weight 7)
These represent the average weighting factors wij, which may vary with program size and complexity. xij is the number of each component type in the application.
The function count FC is the double sum FC = Σi Σj (wij × xij), taken over the component types i and the weighting levels j (simple, average, complex).
The second step employs a scale of 0 to 5 to assess the impact of 14 general system characteristics (such as data communications, distributed processing, transaction rate, online update, reusability, and installation ease) in terms of their likely effect on the application.
The scores for these characteristics ci are then summed to find the value adjustment factor (VAF): VAF = 0.65 + 0.01 × Σ ci.
Finally, the number of function points is obtained by multiplying the function count by the value adjustment factor: FP = FC × VAF.
This is actually a highly simplified version of a commonly used method that is documented in the International Function Point User’s Group Standard (IFPUG, 1999).11
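The simplified method can be sketched as follows. The weights here are the commonly cited Albrecht average weights, and the component counts and characteristic scores in the example are hypothetical; a real count would apply low/average/high weights per the IFPUG standard:

```python
# Average Albrecht weights for the five component types.
WEIGHTS = {"external_inputs": 4, "external_outputs": 5,
           "external_inquiries": 4, "internal_files": 10,
           "external_interfaces": 7}

def function_points(counts: dict, gsc_scores: list) -> float:
    """FP = FC * VAF, with FC the weighted component sum and
    VAF = 0.65 + 0.01 * (sum of the 14 general system
    characteristic scores, each rated 0 to 5)."""
    fc = sum(WEIGHTS[k] * counts.get(k, 0) for k in WEIGHTS)
    assert len(gsc_scores) == 14 and all(0 <= c <= 5 for c in gsc_scores)
    vaf = 0.65 + 0.01 * sum(gsc_scores)
    return fc * vaf

counts = {"external_inputs": 20, "external_outputs": 10,
          "external_inquiries": 5, "internal_files": 8,
          "external_interfaces": 2}
print(function_points(counts, [3] * 14))  # FC = 244, VAF = 1.07 -> 261.08
```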
Although function point extrinsic counting metrics and methods are considered more robust than intrinsic LOC counting methods, they have the appearance of being somewhat subjective and experimental in nature. As used over time by organizations that develop very large software systems (having 1,000 or more function points), they show an amazingly high degree of repeatability and utility. This is probably because they enforce a disciplined learning process on a software development organization as much as any scientific credibility they may possess.
Availability and Customer Satisfaction Metrics
To the end user of an application, the only measures of quality are in the performance, reliability, and stability of the application or system in everyday use. This is “where the rubber meets the road,” as users often say. Developer quality metrics and their assessment are often referred to as “where the rubber meets the sky.” This article is dedicated to the proposition that we can arrive at a priori user-defined metrics that can be used to guide and assess development at all stages, from functional specification through installation and use. These metrics also can meet the road a posteriori to guide modification and enhancement of the software to meet the user’s changing needs. Caution is advised here, because software problems are not, for the most part, valid defects, but rather are due to individual user and organizational learning curves. The latter class of problem calls places an enormous burden on user support during the early days of a new release. The catch here is that neither alpha testing (initial testing of a new release by the developer) nor beta testing (initial testing of a new release by advanced or experienced users) of a new release with current users identifies these problems. The purpose of a new release is to add functionality and performance to attract new users, who initially are bound to be disappointed, perhaps unfairly, with the software’s quality. The DFTS approach we advocate in this article is intended to handle both valid and perceived software problems.
Typically, customer satisfaction is measured on a five-point scale:11
- Very satisfied
- Satisfied
- Neutral
- Dissatisfied
- Very dissatisfied
Results are obtained for a number of specific dimensions through customer surveys. For example, IBM uses the CUPRIMDA categories—capability, usability, performance, reliability, installability, maintainability, documentation, and availability. Hewlett-Packard uses FURPS categories—functionality, usability, reliability, performance, and serviceability. In addition to calculating percentages for various satisfaction or dissatisfaction categories, some vendors use the net satisfaction index (NSI) to enable comparisons across product lines. The NSI has the following weighting factors:
- Completely satisfied = 100%
- Satisfied = 75%
- Neutral = 50%
- Dissatisfied = 25%
- Completely dissatisfied = 0%
NSI then ranges from 0% (all customers are completely dissatisfied) to 100% (all customers are completely satisfied). Although it is widely used, the NSI tends to obscure difficulties with certain problem products. In this case the developer is better served by a histogram showing satisfaction rates for each product individually.
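The NSI is just a weighted average of the survey response counts; the survey numbers in the example are hypothetical:

```python
NSI_WEIGHTS = {"completely_satisfied": 1.00, "satisfied": 0.75,
               "neutral": 0.50, "dissatisfied": 0.25,
               "completely_dissatisfied": 0.00}

def net_satisfaction_index(responses: dict) -> float:
    """NSI (%): weighted average of customer survey response counts,
    from 0 (all completely dissatisfied) to 100 (all completely satisfied)."""
    total = sum(responses.values())
    score = sum(NSI_WEIGHTS[k] * v for k, v in responses.items())
    return 100.0 * score / total

survey = {"completely_satisfied": 30, "satisfied": 40, "neutral": 20,
          "dissatisfied": 8, "completely_dissatisfied": 2}
print(net_satisfaction_index(survey))  # 72.0
```

As the text notes, a single index like this can hide a troubled product in an otherwise healthy portfolio, which is why a per-product histogram is often more informative.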
|Sidebar 3.1: A Software Urban Legend|
|Professor Maurice H. Halstead was a pioneer in the development of ALGOL 58, automatic programming technology, and ALGOL-derived languages for military systems programming at the Naval Electronics Laboratory (NEL) at San Diego. Later, as a professor at Purdue (1967-1979), he took an interest in measuring software complexity and improving software quality. The legend we report, which was circulated widely in the early 1960s, dates from his years at NEL, where he was one of the developers of NELIAC (Naval Electronics Laboratory International ALGOL Compiler). As the story goes, Halstead was offered a position at Lockheed Missile and Space Systems. He would lead a large military programming development effort for the Air Force in the JOVIAL programming language, which he also helped develop. With messianic confidence, Halstead said he could do it with 12 programmers in one year if he could pick the 12. Department of Defense contracts are awarded at cost plus 10%, and Lockheed had planned for a staff of 1,000 programmers, who would complete the work in 18 months. The revenue on 10% of the burdened cost of 1,000 highly paid professionals in the U.S. aerospace industry is a lot of money. Unfortunately, 10% of the cost of 12 even very highly paid software engineers is not, so Lockheed could not accept Halstead's proposition. This story was widely told and its message applied for the next 20 years by developers of compilers and operating systems with great advantage, but it has never appeared in print as far as we know. Halstead did leave the NEL to join Lockheed about this time. They benefited from his considerable software development expertise until he went to Purdue.|
Current Metrics and Models Technology
The best treatment of current software metrics and models is Software Measurement: A Visualization Toolkit for Project Control and Process Measurement,12 by Simmons, Ellis, Fujihara, and Kuo. It comes with a CD-ROM that contains the Project Attribute Monitoring and Prediction Associate (PAMPA) measurement and analysis software tools. The book begins with Halstead’s software science from 1977 and then brings the field up to date to 1997, technologically updating the metrics and models by including later research and experience. The updated metrics are grouped by size, effort, development time, productivity, quality, reliability, verification, and usability.
Size metrics begin with Halstead's volume, now measured in source lines of code (SLOC), and add structure measures such as the number of unconditional branches, the depth of control loop nesting, and module fan-in and fan-out. The newly added rework attributes describe the size of additions, deletions, and changes made between versions. Combined, they measure the turmoil in the developing product. The authors have also added a new measure of code functionality smaller than the program or module, called a chunk. It is a single integral piece of code, such as a function, subroutine, script, macro, procedure, object, or method. Volume measures are now made on functionally distinct chunks, rather than larger-scale aggregates such as programs or components. Tools are provided that allow the designer to aggregate chunks into larger units and even predict the number of function points or object points. Furthermore, because most software products are not developed from scratch but rather reuse existing code chunks with known quality characteristics, the toolkit allows the prediction of equivalent volume using one of four different algorithms (or all four, if desired) taken from recent software science literature. A new volume measure called unique SLOC has been added. It evaluates new LOC on a per-chunk basis and can calculate unique SLOC for a developing version of the product.
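A minimal sketch of the turmoil idea, counting added and deleted source lines between two versions of a chunk with a standard line diff (the chunk contents here are hypothetical, and this is our illustration rather than the PAMPA tools' actual algorithm):

```python
from difflib import ndiff

def turmoil(old_lines: list, new_lines: list) -> dict:
    """Rework measures between two versions of a chunk: counts of
    added and deleted source lines; their sum is the 'turmoil'."""
    added = deleted = 0
    for line in ndiff(old_lines, new_lines):
        if line.startswith("+ "):       # line present only in the new version
            added += 1
        elif line.startswith("- "):     # line present only in the old version
            deleted += 1
    return {"added": added, "deleted": deleted, "turmoil": added + deleted}

v1 = ["def f(a):", "    return a"]
v2 = ["def f(a):", "    return a + 1"]
print(turmoil(v1, v2))  # one line deleted, one added: turmoil of 2
```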
Naturally, volume measures are the major input for effort metrics. Recent research adds five categories of 17 different dominators, which can have serious effort-magnifying effects. The categories into which dominators fall are project, product, organization, suppliers, and customers. For example, potential dominators in the product category include the amount of documentation needed, programming language, complexity, and type of application. In the organization category, they include the number of people, communications, and personnel turnover. The customer category includes user interface complexity and requirements volatility; indeed, the dominators are all basically negative. Their name signifies that their presence may have an effort-expansion effect as large as a factor of 10, but when their influence is favorable, they generally have a much smaller positive effect. A range of effort prediction and cost forecasting algorithms based on a variety of theoretical, historical/experiential, statistical, and even composite models are provided.
The third measure category is development time, which is derived from effort, which is derived from size or volume. The only independent new variable here is schedule. Given the resources available to the project manager, the toolkit calculates overall minimum development time and then allows the user to vary or reallocate resources to do more tasks in parallel. However, the system very realistically warns of cost runaways if the user tries to reduce development time by more than 15% of the forecast minimum.
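The 15% compression threshold can be expressed as a one-line check. This sketch is our own restatement of the warning rule described above, with hypothetical function and parameter names:

```python
def schedule_risk(requested_time, minimum_time):
    """Warn when the requested schedule compresses development time by
    more than 15% below the computed minimum (threshold from the text)."""
    compression = (minimum_time - requested_time) / minimum_time
    if compression > 0.15:
        return "cost runaway risk"
    return "acceptable"
```

For example, asking for 8 months against a computed 10-month minimum is a 20% compression and would trigger the warning, while 9 months (10% compression) would not.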
Because effort is essentially volume divided by productivity, you can see that productivity is inversely related to effort. A new set of cost drivers enters as independent variables, unfortunately having mostly negative influences. When cost drivers begin to vary significantly from nominal values, you should take action to bring them back into acceptable ranges. A productivity forecast provides the natural objective function with which to do this.
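The volume/effort/productivity relationship, and the idea of flagging cost drivers that drift from nominal, can be sketched as follows. The 25% tolerance and the driver names are assumptions for illustration only:

```python
def productivity(volume, effort):
    """Productivity as volume per unit effort; equivalently,
    effort = volume / productivity."""
    return volume / effort

def out_of_range(drivers, nominal, tolerance=0.25):
    """Flag cost drivers deviating more than `tolerance` (as a fraction)
    from their nominal values -- the threshold is an illustrative choice."""
    return [name for name, value in drivers.items()
            if abs(value - nominal[name]) / nominal[name] > tolerance]
```

A productivity forecast computed this way gives the project manager a single objective function to optimize while nudging out-of-range drivers back toward nominal.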
The quality metrics advocated in Simmons et al. are dependent on the last three metric sets: reliability, verification, and usability. Usability is a product’s fitness for use. This metric depends on the product’s intended features, their verified functionality, and their reliability in use. Simply stated, this metric means that all promises were fulfilled, no negative consequences were encountered, and the customer was delighted. This deceptively simple trio masks evaluation of multiple subjective psychometric evaluations plus a few performance-based factors such as learnability, relearnability, and efficiency. Much has been written about measures of these factors. To sell software, vendors develop and add more features. New features contain unique SLOCs, and new code means new opportunities to introduce bugs. As might be expected, a large measure of customer dissatisfaction is the result of new features that don’t work, whether due to actual defects or merely user expectations. The only thing in the world increasing faster than computer performance is end-user expectations: A product whose features cannot be validated, or that is delivered late or at a higher-than-expected price, has a quality problem. Feature validation demands that features be clearly described without possible misunderstanding and that metrics for their measurement be identified.
The last point in the quality triangle is reliability, which may be characterized by defect potential, defect removal efficiency, and delivered defects. The largest opportunity for software defects to occur is in the interfaces between modules, programs, and components, and with databases. Although the number of interfaces in an application is proportional to the program’s size, it varies by application type, programming language, style, and many other factors. One estimate indicates that 70% or more of software reliability problems are in interfaces. Aside from the occurrence of errors or defects, and their number (if any), the major metric for quality is the mean time between occurrences. Whether you record time to failure, time intervals between failures, cumulative failures in a given time period, or failures experienced in a given time interval, the basic metric of reliability is time.
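The most common of these time-based measures, mean time between failures, is a simple computation over recorded failure timestamps. A minimal sketch, assuming timestamps in hours:

```python
def mean_time_between_failures(failure_times):
    """MTBF from a chronologically sorted list of failure timestamps:
    the mean of the intervals between consecutive failures."""
    intervals = [b - a for a, b in zip(failure_times, failure_times[1:])]
    return sum(intervals) / len(intervals)
```

Failures observed at hours 0, 10, 30, and 60 yield intervals of 10, 20, and 30 hours, for an MTBF of 20 hours.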
New Metrics for Architectural Design and Assessment
A new science of software architecture metrics is slowly emerging, remarkably in the absence of any generally accepted definition of software architecture. Software engineers have been coasting on the metaphor of building architecture for a long time. Some clarity is developing, but it is scarcely more than an extension of the metaphor. For example, an early (1994) intuitive definition states the following:
There is currently no single, universally accepted definition of software architecture, but typically a system’s architectural design is concerned with describing its decomposition into components and their interconnections.15
Actually, this is not a bad start. When one of the authors became chief architect of a large-scale computer design effort at Univac in 1966, this was the operative definition of computer (hardware) architecture. It was sufficient only because of the tradition of hardware systems design, which had led to the large-scale multiprocessor computer system.16
But to be more precise for software that comes later to this tradition, software architecture is “the structure of the components of a program/system, their interrelationships, and principles and guidelines governing their design and evolution over time.”17 While descriptive, these definitions still do not give us enough leverage to begin defining metrics for the architectural assessment of software. We would like to again quote Shaw and Garlan’s more recent definition:
Abstractly, software architecture involves the description of elements from which systems are built, interactions among those elements, patterns that guide their composition, and constraints on those patterns. In general, a particular system is defined in terms of a collection of components and the interactions among those components.18
A software design pattern is a general repeatable solution to a commonly occurring problem in software design. It is not a finished design that can be transformed directly into program code. Rather, it is a description of or template for how to solve a problem that can be used in many different situations. Object-oriented design patterns typically show relationships and interactions between classes or objects, without specifying the final application classes or objects that are involved. Algorithms are not thought of as design patterns, because they solve computational problems rather than design problems.
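To make the "template, not finished design" distinction concrete, here is a minimal sketch of a creational pattern (Factory Method) in Python. The Transport/Logistics class names are hypothetical illustrations chosen for this sketch, not examples drawn from the text:

```python
from abc import ABC, abstractmethod

class Transport(ABC):
    """Abstract product: the pattern does not fix the concrete classes."""
    @abstractmethod
    def deliver(self): ...

class Truck(Transport):
    def deliver(self):
        return "by road"

class Logistics(ABC):
    """Factory Method: subclasses decide which concrete product to create,
    while the delivery-planning logic stays in the abstract creator."""
    @abstractmethod
    def create_transport(self) -> Transport: ...

    def plan_delivery(self):
        return self.create_transport().deliver()

class RoadLogistics(Logistics):
    def create_transport(self):
        return Truck()
```

The pattern itself is only the class relationships (abstract creator, abstract product, concrete subclasses); the application supplies the final classes, which is precisely why a pattern is a template rather than finished code.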
In practice, architectural metrics are applied not only upstream in the software development process, during architecture discovery, but also farther downstream, before coding begins, as architectural review. These terms were introduced by Avritzer and Weyuker19 for use at AT&T and will be used here as well.
Common Architectural Design Problems
The most commonly occurring architectural design problems can be grouped into three categories: project management, requirements, and performance. The following list describes problems affecting project management.20 It’s an excellent list that we have reordered to reflect our own experiences with software development management:
- The stakeholders have not been clearly identified.
- No project manager has been identified.
- No one is responsible for the overall architecture.
- No project plan is in place.
- The deployment date is unrealistic.
- The independent requirements team is not yet in place.
- Domain experts have not been committed to the design.
- No software architect(s) have been assigned.
- No overall architecture plan has been prepared.
- No system test plan has been developed.
- No measures of success have been identified.
- No independent performance effort is in place.
- No contingency plans have been written.
- No modification tracking system is in place.
- Project funding is not committed.
- No quality assurance team is in place.
- No hardware installation schedule exists.
Here are the most common issues affecting the definition of requirements for a software development project (again in order of importance according to our experience):
- The project lacks a clear problem statement.
- No requirements document exists.
- The project lacks decision criteria for choosing the software architecture.
- Outputs have not been identified.
- The size of the user community has not been determined.
- Data storage requirements have not been determined.
- Operational administration and maintenance have not been identified.
- Resources to support a new requirement have not been allocated.
Here are the most common performance issues affecting the architecture of a software development project (priority reordered):
- The end user has not established performance requirements.
- No performance model exists.
- Expected traffic rates have not been established.
- No means for measuring transaction time or rates exists.
- No performance budgets have been established.
- No assessment has been made to ensure that hardware will meet processing requirements.
- No assessment has been made to ensure that the system can handle throughput.
- No performance data has been gathered.
In our experience, the leading critical quality issues in each category are either customer requirements issues or aspects of the project management team’s commitment to the customer’s requirements. This leads to our focus on QFD as a means of hearing the voice of the customer at the beginning of the software development project rather than having to listen to their complaints after the software has been delivered.
Pattern Metrics in OOAD
Object-Oriented Analysis and Design (OOAD) has at last come of age; it is the preferred software development technology today. Sophisticated online transaction-oriented applications and systems in aerospace and medical technology have been using C++ for years. The emergence of Java as the preferred language for Internet application development has established OOAD and Object-Oriented Programming (OOP) as major players. Many college and university computer science programs are Java-based nowadays, and many application groups are busy converting COBOL and PL/1 applications to Java. The use of OOAD both in the development of new systems and in their conversion from legacy procedural language implementations favors the discovery and use of design patterns. A design pattern is a microarchitecture that provides a proven solution to design problems that tend to recur within a given application or software development context. A design pattern includes its static structure, the hierarchy of classes and objects in the pattern and their relationships. It also includes the behaviors (to use the Java term) or dynamic relationships that govern how the objects exchange messages. Design patterns fall into three groups: creational, structural, and behavioral. Creational patterns concern object creation. Structural patterns capture class or object composition. Behavioral patterns deal with how classes and objects interact. Most patterns are discovered in existing software packages when they are either modified or rewritten by very senior programmers or software architects who notice their repetitive occurrences. A research group at the Institute for Scientific and Technical Research in Trento, Italy, has developed a set of metrics for OO software applications that allows them to extract design patterns automatically.21
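Metrics-based pattern extraction of this kind works by computing structural features per class and matching them against the signatures of known patterns. The sketch below computes a few such features; it is a simplified stand-in of our own devising, not the metric suite published by the Trento group:

```python
import inspect

def class_metrics(cls):
    """Simple OO structural metrics of the kind usable as features for
    automatic pattern extraction: method count, ancestor count, and
    direct subclass count (a simplified, illustrative feature set)."""
    methods = [m for m, _ in inspect.getmembers(cls, predicate=inspect.isfunction)]
    return {
        "methods": len(methods),
        # __mro__ includes cls itself and object; exclude both.
        "ancestors": len(cls.__mro__) - 2,
        "subclasses": len(cls.__subclasses__()),
    }
```

A class playing the abstract-creator role in a Factory Method, for instance, would typically show at least one subclass and a small method count, and a matcher can use such feature vectors to propose candidate pattern instances for human review.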
Ostensibly the major benefit of understanding and using repeating design patterns in OOP is to enhance code reusability. However, software quality enhancement has significant benefits as well. Not only do patterns reduce the unique SLOC in an application, they also reduce its effective volume, increase development productivity, and simplify later enhancement and maintenance. The technology frontier in software development today is learning how to break high-level architectural design into its architectural components. This is similar to how a building architect breaks his or her overall design into horizontal microarchitectures such as stories and plazas and into vertical microarchitectures or infrastructure components such as HVAC, plumbing, power, elevators, and stairways.
The future of software quality is further automation by automatic program generation. This means the ability to break the overall customer-responsive architecture (form follows function) into successively lower levels of architecture, down to micro-architectures such as design patterns from which quality application software can be generated automatically.
- Historically software quality metrics have measured exactly the opposite of quality—that is, the number of defects or bugs per thousand lines of code.
- Software quality is a multidimensional concept that can be viewed from many professional and user viewpoints.
- Two leading firms in customer-focused software quality are IBM and Hewlett-Packard.
- IBM has a proprietary measure set, whereas HP uses five Juran quality parameters.
- The Naval Air Systems Command coined the term Total Quality Management (TQM) in 1985 to describe its own quality improvement program. It soon spread worldwide.
- The four essential characteristics of a TQM program in any field are customer focus, process improvement, quality culture, and measurement and analysis.
- TQM made an enormous contribution to the quality of enterprise software in the early 1990s, just in time for the Y2K transition.
- Until recently, most software quality metrics were of an in-process nature; metrics to support DFTS must be applied upstream in the development process.
- Defect density varies with program size in a U-shaped curve: small programs (less than 100 LOC) and large programs (more than 1,000 LOC) both exhibit about 1.5 defects per KLOC, whereas medium-sized programs often have only 0.5 defects per KLOC.
- Sophisticated software tools for measuring software quality, such as PAMPA, are beginning to appear.
- OOP goals in software reusability tend to enhance software quality as well.
- M. Shaw and D. Garlan, Software Architecture: Perspectives on an Emerging Discipline (New Jersey: Prentice Hall, 1996), p. 1.
- P. B. Crosby, Quality Is Free: The Art of Making Quality Certain (New York: McGraw-Hill, 1979).
- J. M. Juran and F. M. Gryna, Jr., Quality Planning and Analysis: From Product Development Through Use (New York: McGraw-Hill, 1970).
- S. H. Kan, Metrics and Models in Software Quality Engineering (Singapore: Pearson Education, 2003), p. 7.
- P. C. Patton and L. May, “Making Connections: A five-year plan for information systems and computing,” ISC, University of Pennsylvania, 1993.
- IEEE, Standard for a Software Quality Metrics Methodology (New York: IEEE, Inc., 1993).
- Ibid., p. 10.
- Kan, op. cit., p. 312.
- Kan, op. cit., p. 327.
- S. D. Conte, H. E. Dunsmore, V. Y. Shen, Software Engineering Metrics and Models (Menlo Park, CA: Benjamin/Cummings, 1986).
- IFPUG, International Function Point User’s Group Standard (IFPUG, 1999).
- D. B. Simmons, N. C. Ellis, H. Fujihara, W. Kuo, Software Measurement: A Visualization Toolkit for Project Control and Process Measurement (New Jersey: Prentice-Hall, 1998).
- Kan, op. cit., p. 55.
- Simmons et al., op. cit., p. 250.
- Ibid., p. 257.
- D. Garlan, R. Allen, J. Ockerbloom, “Architectural mismatch: why reuse is so hard,” IEEE Software, Nov. 1995, pp. 17–26 (p. 20).
- D. Garlan and D. Perry, “Introduction to the special issue on software architecture,” IEEE Transactions on Software Engineering, April 1995, pp. 269–274 (p. 269).
- Shaw and Garlan, op. cit., p. 1.
- G. Avritzer and E. J. Weyuker, “Investigating Metrics for Architectural Assessment,” Proceedings of the Fifth International Software Metrics Symposium, pp. 2–10.
- Ibid., p. 8.
- G. Antoniol, R. Fiutem, L. Cristoforetti, “Using Metrics to Identify Design Patterns in Object-Oriented Software,” Proceedings of the Fifth International Software Metrics Symposium, pp. 23–34.
About the Authors
Bijay Jayaswal is the CEO of Agilenty Consulting Group, LLC. He has held senior executive positions and has consulted on quality and strategy for the last 20 years. He has helped introduce corporate-wide initiatives in reengineering, Six Sigma, and Design for Six Sigma, and has worked with senior executive teams on the effective implementation of such initiatives.
Peter Patton has been a leader of large development projects and is the author of seventeen books, book chapters, and monographs on computer hardware and software architecture. Both authors have done extensive writing and college teaching, as well as consulting to software development groups.
About the Source of the Material
Design for Trustworthy Software: Tools, Techniques, and Methodology of Developing Robust Software
By Bijay Jayaswal and Peter Patton