American Association for Clinical Chemistry
Better health through laboratory medicine
August 2008 Clinical Laboratory News: ROC Curves

 

August 2008: Volume 34, Number 8 


ROC Curves
Uncovering the Pearls and Avoiding the Pitfalls 
By David Plaut

 

When clinicians order blood tests for patients, in essence they are asking for help in making a diagnosis. Their training tells them to request particular tests based on what they observe and what patients describe. However, this information often indicates several possible disorders or a cluster of similar diseases. To decide what to do next, whether that is another lab test, a radiology test, a biopsy, or even making a diagnosis, clinicians rely upon information from the lab.

It goes without saying that clinicians expect lab results to be accurate and precise; otherwise, the diagnosis could be wrong. Accuracy and precision are analytical aspects of lab tests. But frequently, these qualities are confused with the clinical efficacy of a test, which is how well the test identifies patients with a particular disease or set of diseases that share common signs and symptoms.

Many diagnostic tests performed by clinical labs are quantitative and use a cutpoint to distinguish a normal from an abnormal result. But to establish the cutpoint, researchers determine the extent to which the test results vary among people who do or do not have the diagnosis of interest. This is not a trivial exercise and requires statistical analysis. The receiver operating characteristic (ROC) curve is one such statistical analysis method; it was developed in the 1950s for evaluating radar signal detection. Today, ROC curves are used routinely in medicine to evaluate diagnostic tests.

This article will describe how ROC curves are obtained, what they mean, and what laboratorians should understand about them.

In the Beginning

Before new tests become part of any diagnostic protocol, researchers often evaluate them in blinded clinical trials. However, whether the study is blinded or not, it is at this evaluation phase that ROC curves come into play. Think of ROC curves as a trade-off between the rates of false-negative results and false-positive results, there being no “perfect test.”

In a research setting, ROC curves begin with two pieces of data on each of a series of patients; one of these is a lab result, and the other is a diagnosis provided by the clinician without regard for the result of the test being evaluated. It is in this sense that a test is referred to as blinded. The clinician can use other tests from the lab, radiology, or a biopsy, as well as the signs and symptoms she sees, but not the data from the test being evaluated as a diagnostic tool. Clearly, developing ROC curves for a lab test depends on the cooperative efforts of multiple disciplines, and laboratorians should be included when the data are reviewed and decisions are made about the test’s clinical efficacy.

While a research study of a new lab test is necessary to validate its utility, most labs accept the data cleared by the FDA, as well as data published in journals, rather than doing a ROC curve study for each test added to the menu. Take, for example, a serum glucose test or a hemoglobin method. Most hospital labs would not perform ROC curve analysis of these analytes unless their population significantly differed from that in the validation study.

The Basics of ROC Curves

To understand how ROC curves are derived and how they are used, it’s helpful to work through an example. Table 1A shows what a data collection worksheet for the ROC curve analysis might look like. In a research study, the number of patients tested could run into the hundreds, but in a community hospital, data may be collected on only a few dozen patients. There is no ideal number for the sample size, although the six data points in our example are not sufficient to determine a ROC curve.

From these data points, a 2 x 2 grid is prepared (Table 1B). In this example, a cutpoint, also referred to as a cutoff, of 15 was selected to label a test result as positive. Any value below 15 would be considered a negative result. It is important to recognize, however, that any value can be chosen as the cutoff value. In essence, the cutoff value represents a compromise between false-positive and false-negative results. If the cutoff value were set at 16 in this example, so that only results above 16 counted as positive, there would be no false positives.

Table 1
A. Data Collection Table for ROC Curve Study

Patient ID   Lab result   Medical diagnosis   Comment
123          9            -
254          10           -
436          16           -
764          22           +
325          34           +
654          36           +

B. A 2 x 2 Grid of the Data

                  Diagnosis Neg   Diagnosis Pos
Lab Result Neg    2               0
Lab Result Pos    1               3
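The step from Table 1A to Table 1B can be sketched in a few lines of Python. This is only an illustration: the names PATIENTS and two_by_two are made up for this sketch, and it assumes that a result at or above the cutoff is read as positive, which is one reasonable reading of the cutoff of 15 described above.

```python
from collections import Counter

# Table 1A: (patient ID, lab result, clinician's diagnosis is positive?)
PATIENTS = [
    (123, 9, False),
    (254, 10, False),
    (436, 16, False),
    (764, 22, True),
    (325, 34, True),
    (654, 36, True),
]


def two_by_two(patients, cutoff):
    """Count the four cells of the 2 x 2 grid at a given cutoff.

    Assumption: a lab result >= cutoff is read as positive.
    """
    counts = Counter()
    for _patient_id, result, diagnosed_positive in patients:
        lab_positive = result >= cutoff
        if lab_positive and diagnosed_positive:
            counts["TP"] += 1  # positive lab, positive diagnosis
        elif lab_positive:
            counts["FP"] += 1  # positive lab, negative diagnosis
        elif diagnosed_positive:
            counts["FN"] += 1  # negative lab, positive diagnosis
        else:
            counts["TN"] += 1  # negative lab, negative diagnosis
    return {cell: counts[cell] for cell in ("TP", "FP", "TN", "FN")}


grid = two_by_two(PATIENTS, cutoff=15)
print(grid)  # {'TP': 3, 'FP': 1, 'TN': 2, 'FN': 0}

# Moving the cutoff changes the grid; each cutoff gives one point on a ROC curve.
for cutoff in (15, 17, 23):
    print(cutoff, two_by_two(PATIENTS, cutoff))
```

Under this convention, a cutoff of 17 removes the single false positive (patient 436), while a cutoff of 23 starts to produce a false negative (patient 764).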

Definitions

Five statistics can be calculated from these diagnoses and the corresponding laboratory results.

Sensitivity = (Positive result with Positive Dx) / (All Positive Dx)

Specificity = (Negative result with Negative Dx) / (All Negative Dx)

Positive predictive value = (Positive result with Positive Dx) / (All Positive Results)

Negative predictive value = (Negative result with Negative Dx) / (All Negative Results)

Efficiency = (Number of correct results) / (Total patients)
 = [(Pos lab and Pos Dx) + (Neg lab and Neg Dx)] / (Total patients)
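As a worked example, the five definitions can be applied to the counts in Table 1B. The sketch below is illustrative only; the function name diagnostic_statistics and the abbreviations tp, fp, tn, and fn (true and false positives and negatives) are not from the article.

```python
def diagnostic_statistics(tp, fp, tn, fn):
    """Apply the five definitions to the cells of a 2 x 2 grid.

    tp: positive lab result, positive diagnosis
    fp: positive lab result, negative diagnosis
    tn: negative lab result, negative diagnosis
    fn: negative lab result, positive diagnosis
    """
    total = tp + fp + tn + fn
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "positive predictive value": tp / (tp + fp),
        "negative predictive value": tn / (tn + fn),
        "efficiency": (tp + tn) / total,
    }


# Counts from Table 1B (cutoff of 15)
stats = diagnostic_statistics(tp=3, fp=1, tn=2, fn=0)
for name, value in stats.items():
    print(f"{name}: {value:.0%}")
# sensitivity: 100%
# specificity: 67%
# positive predictive value: 75%
# negative predictive value: 100%
# efficiency: 83%
```

The sketch does not guard against empty rows or columns; a grid with no positive diagnoses, for instance, would cause a division by zero.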

 

Easier Said Than Done

All this seems quite straightforward, but don’t be lulled into believing that it is. While ROC curves are visually appealing and simple to read, there are many caveats to consider. I wish to draw your attention to them and ask that you keep them in mind as you read the literature and package inserts or look at ROC curves on slides at a meeting.

Let’s first look at the lab. In the example, at a cutoff value of 15, patient 436 is positive in the lab but negative to the clinician. This is considered a false-positive result. You might ask, “What went wrong?” Among the many possible answers are an incorrect patient ID, analytical interference in the test, or a test that simply is not specific for the disease.

To better understand the specificity of the test, it’s instructive to look at its imprecision. If the test has a 5% coefficient of variation (%CV) at a level of 16, the result could have been as low as 14.4 or as high as 17.6 (1 SD = 16 x 0.05 = 0.8; 2 SDs = 0.8 x 2 = 1.6). In other words, there’s a real possibility that retesting the same sample would produce a “negative” result. This is not a trivial issue when only a handful of patient samples are tested. In this example, the specificity is 67% when the cutoff value is set at 15. However, taking into account the imprecision of the method, if the result for patient 436 had been 14.5, that patient would have been read as negative and the test would have a specificity of 100%.
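The arithmetic in this paragraph can be checked with a short sketch. The function name two_sd_range is invented for illustration, and the sketch assumes, as the paragraph does, that the SD is simply the %CV applied to the measured value.

```python
def two_sd_range(value, cv_percent):
    """Return (low, high) for value +/- 2 SD, where SD = value x %CV / 100."""
    sd = value * cv_percent / 100.0
    return value - 2 * sd, value + 2 * sd


cutoff = 15
low, high = two_sd_range(value=16, cv_percent=5)
print(round(low, 1), round(high, 1))  # 14.4 17.6

# The lower end of the range falls below the cutoff, so a retest of
# patient 436 could be read as negative.
print(low < cutoff)  # True

# Specificity among the three patients diagnosed negative:
negatives = 3
print(f"{(negatives - 1) / negatives:.0%}")  # 67% with one false positive
print(f"{negatives / negatives:.0%}")        # 100% if the retest reads negative
```

With so few patients, one borderline result moves the specificity by a third, which is why the imprecision near the cutoff matters.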

Bear in mind that in studies of this type, the test is being performed on patients who have met certain criteria. In the emergency department, the criteria for ordering a test for troponin may include shortness of breath, chest pain, and being younger than 44 years. This is to say that the patients are selected before the test is ordered. Clearly, patient selection can have an effect on both the diagnosis and laboratory parts of the ROC curve data sheet.

When reading a package insert or an article in a journal, make certain you understand the criteria used to select the patients studied. Then ask yourself: Is my patient population similar? Or do my physicians use other criteria to select the patients for whom they request this test? Are they perhaps looking for a different disease? It is these questions that may prompt a laboratory not engaged in a research study to do its own ROC curve study with fewer patients.

In addition to the imprecision inherent in any test, there are various non-analytical factors. One of these is timing. It is possible that the physician made the correct diagnosis, but the patient presented earlier, or later, in the disease continuum than the test can detect. Some examples are: a heart attack patient presenting within 30 minutes of the occurrence, a pregnant woman presenting the morning after conception, or a patient who had a thrombus in her leg a week ago and now has a negative D-dimer result.

Another reason that the lab result and the diagnosis may differ comes down to the individual clinician. One possibility is that the clinician’s diagnosis is incorrect. There are many ways for this to happen. First, the patient may not give the clinician all the information about her symptoms or the patient may be vague about the symptoms or even evasive. In addition, many disease states have signs and symptoms—such as shortness of breath or abdominal pain—that are present in more than one disease.

Incorrect diagnosis can also be attributed to human error, and there are many opportunities for this. The clinician must interpret the patient’s symptoms from visual, tactile, olfactory, and auditory input. Moreover, clinicians are frequently rushed and must establish a diagnosis within a short time, and patients are often anxious to have an immediate decision. There are any number of other explanations of how the clinician could diagnose a patient as negative, when in fact, the patient is positive. In the data collection, only a yes or no answer is recorded, but in reality, many factors go into making diagnostic decisions. The process is not black and white.

Which Test Is Better?

ROC curve data appear in many journal articles, and sometimes only a part of the ROC curve data is published when two methods are being compared. For example, imagine two tests, A and B, are being compared for their ability to detect a given disease. The authors present the data shown in Table 2A.

Which test is better? The data do not make it easy to answer this question. Look back at Figure 1A and you will see part of the problem: as sensitivity increases, specificity decreases and vice versa. It is possible that these two tests are quite similar in their ability to detect the disease. The problem here is that it appears that the cutoff for test A selected specificity over sensitivity, whereas the cutoff for test B chose sensitivity over specificity. Had the researcher first chosen a sensitivity of, say, 90%, and then looked at the corresponding value for specificity of test A and B, the data might have appeared more like those in Table 2B.

Table 2
Comparing Tests: Which One is Better?

A
Statistic         Test A   Test B
Sensitivity (%)   82       93
Specificity (%)   94       80
Here the cutoff value for A was selected to be highly specific for the disease, and B’s cutoff value was selected to be highly sensitive for the diagnosis.

B
Statistic         Test A   Test B
Sensitivity (%)   90       90
Specificity (%)   84       85
Here data are presented for test A and B at 90% sensitivity. The tests appear to have roughly equal specificity for the diagnosis.
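The idea behind Table 2B, fixing a sensitivity first and then reading off each test's specificity, can be sketched as follows. Everything in this example is hypothetical: the results for the two tests are invented purely for illustration and are not the data behind Table 2, and the function name specificity_at_sensitivity is made up. The sketch again assumes that a result at or above the cutoff is read as positive.

```python
def specificity_at_sensitivity(results, diagnoses, target):
    """Find the highest cutoff whose sensitivity reaches the target.

    Returns (cutoff, sensitivity, specificity).
    Assumption: a result >= cutoff is read as positive.
    """
    n_pos = sum(diagnoses)
    n_neg = len(diagnoses) - n_pos
    # Try cutoffs from the highest observed result downward.
    for cutoff in sorted(set(results), reverse=True):
        tp = sum(1 for r, d in zip(results, diagnoses) if d and r >= cutoff)
        if tp / n_pos >= target:
            tn = sum(1 for r, d in zip(results, diagnoses) if not d and r < cutoff)
            return cutoff, tp / n_pos, tn / n_neg
    raise ValueError("no cutoff reaches the target sensitivity")


# Hypothetical data: the same ten patients measured with two tests.
DIAGNOSES = [True] * 5 + [False] * 5
TEST_A = [12, 18, 20, 25, 30, 5, 8, 10, 14, 19]
TEST_B = [11, 16, 21, 24, 28, 4, 7, 9, 12, 13]

for name, results in (("A", TEST_A), ("B", TEST_B)):
    cutoff, sens, spec = specificity_at_sensitivity(results, DIAGNOSES, target=0.8)
    print(f"Test {name}: cutoff {cutoff}, sensitivity {sens:.0%}, specificity {spec:.0%}")
```

Because both tests are held to the same sensitivity, the remaining difference in specificity is a fairer basis for comparison than two figures taken at unrelated cutoffs.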

When the authors also present the actual graph of the data, even though only a finite number of points are plotted as in Figure 1A, more information is available to the reader. It is also helpful when authors include information about the selection of a value for either sensitivity or specificity. Then the reader can make a more informed judgment about which test is better.

Demystifying Area Under the Curve

One more aspect of the ROC curve—the area under the ROC curve (AUC)—is frequently used to describe a test’s validity. Figure 1C shows two variations of AUC. In the ideal world, the data would go from 0.0 instantly up to a sensitivity of 1.00 and then horizontally across at 1.00 to a point on the specificity scale of 1.00. This gives an AUC of 1.0. On the other hand, any value less than 1.00 is less than ideal.

Figure 1C also shows a line with a 45° slope, which corresponds to an AUC of 0.5. In some respects, a diagnosis based on a test with an AUC of 0.5 is equivalent to flipping a coin. Generally, such data signify that the test is useless for diagnosing the disease or condition.
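The two reference values, 1.0 for the ideal curve and 0.5 for the 45° line, can be reproduced with the trapezoid rule. This sketch assumes the common plotting convention in which the horizontal axis is 1 minus specificity (the false-positive rate) and the vertical axis is sensitivity; the function name trapezoid_auc is invented for illustration.

```python
def trapezoid_auc(points):
    """Area under a curve given as (1 - specificity, sensitivity) points."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        # Each pair of neighboring points contributes one trapezoid.
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Ideal test: straight up to a sensitivity of 1.00, then straight across.
ideal = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
# Uninformative test: the 45-degree diagonal.
chance = [(0.0, 0.0), (1.0, 1.0)]

print(trapezoid_auc(ideal))   # 1.0
print(trapezoid_auc(chance))  # 0.5
```

A real study would supply one point per cutoff, and the resulting area would fall somewhere between these two extremes.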

Published studies often include an estimated AUC when comparing two or more tests. Let’s look at an example from a recent article on cardiac markers. “Diagnostic efficiency was compared…by AUC for three strategies: 6-h post-pain CK-MB measurement; Delta CK-MB; and 6-h post-pain cTnT measurement. At 6-h post pain the respective values were: CK-MB 0.939; Delta CK-MB 0.948; and cTnT 0.989.” The authors conclude that cTnT at 6 hours has high diagnostic sensitivity for AMI and is superior to CK-MB mass and Delta CK-MB even using a low cut-off value. However, one might ask if 0.989 is statistically different from 0.948.

Because most tests use only a single cutoff value and the AUC is obtained from multiple cutoffs, you may sometimes wonder about the value of the AUC. The AUC has one direct interpretation: If you sample a randomly selected healthy patient and obtain a result of x and sample a randomly selected diseased patient and obtain a value of y, the AUC is an estimate of the probability that y is greater than x, assuming that large values indicate disease.
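This interpretation can be turned directly into a calculation: compare every healthy patient with every diseased patient and count how often the diseased patient has the higher result. The sketch below uses the six results from Table 1A, which, as noted earlier, are far too few for a real study. The function name auc_by_pairs is invented, counting ties as one half is a common convention assumed here, and the second data set is hypothetical.

```python
def auc_by_pairs(healthy, diseased):
    """Estimate the AUC as the fraction of healthy/diseased pairs
    in which the diseased patient has the higher result.

    Assumptions: large values indicate disease; ties count as one half.
    """
    wins = 0.0
    for x in healthy:
        for y in diseased:
            if y > x:
                wins += 1.0
            elif y == x:
                wins += 0.5
    return wins / (len(healthy) * len(diseased))


# Table 1A, grouped by the clinician's diagnosis
healthy = [9, 10, 16]
diseased = [22, 34, 36]
print(auc_by_pairs(healthy, diseased))  # 1.0

# Hypothetical variation: one healthy patient with a result of 24
# overlaps the diseased group, so 8 of the 9 pairs are ordered correctly.
print(round(auc_by_pairs([9, 10, 24], diseased), 3))  # 0.889
```

In the Table 1A data every diseased result exceeds every healthy result, so the estimated AUC is 1.0 even though the test produced a false positive at a cutoff of 15. The AUC summarizes all possible cutoffs, not the one actually chosen.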

The Take-Home Message

In general, ROC curve analysis is a set of statistical tools that helps select optimal tests and similarly helps discard suboptimal ones. The ROC curves are also helpful for selecting optimal cutoffs for a test and have become a common tool in medicine to determine the clinical accuracy of lab tests.

When you look at a ROC curve, remember that there is more to it than simple lines on a graph. All ROC curves are good representations of the data; it is the other factors (the lab test, the other diagnostic tests, the patients, and the clinicians) that are all imperfect. These details can explain why your clinician does not like a test or why you field callers wondering why a test is positive when “it shouldn’t be.” There are many pearls in a ROC curve, but be aware of the possible pitfalls!

SUGGESTED READING

  1. Barry H and Ebell M. Test Characteristics and Decision Rules. Endocrin and Metabol Clinics North Am 1997;26(1):45–65.
  2. Burke D. Test Selection Strategies. Clinics in Lab Medicine 2002;22(2):xi–xii.
  3. Galen RS and Gambino SR. Beyond normality: the predictive value and efficiency of medical diagnoses. New York, N.Y: Wiley, 1975.
  4. Henderson AR. Assessing test accuracy and its clinical consequences: a primer for receiver operating characteristic curve analysis. Ann Clin Biochem 1993;30:521–539.
  5. Metz C. Basic principles of ROC analysis. Seminars in Nucl Med 1978;8:283–298.
  6. Obuchowski N, Lieber M, Wians F. ROC curves in clinical chemistry: uses, misuses, and possible solutions. Clin Chem 2004;50(7):1118–1125.
  7. Riegelman R and Hirsch R. Studying a study and testing a test: how to read the medical literature. 2nd ed. Boston: Little, Brown, 1989.
  8. Zou K, O’Malley J, and Mauri K. ROC Analysis. Circulation 2007;115:654–657.
  9. Zweig M and Campbell G. Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine. Clin Chem 1993;39:561–577.

David Plaut is a chemist and statistician in Plano, Texas. He spent a number of years developing assays for the clinical laboratory, as well as the first PC-based, intra- and inter-laboratory QC system.