December 2011: Volume 37, Number 12
Proficiency Testing: Making the Grade
When Is Just Getting by Good Enough?
By Bill Malone
Next year, the rules governing proficiency testing (PT) in clinical labs will turn 20 years old. Considering the advances that have occurred over the past 2 decades, federal regulators who oversee the program that requires labs to regularly evaluate their performance think it’s time for a change. The Centers for Medicare and Medicaid Services (CMS) and the Centers for Disease Control and Prevention (CDC) are now working on an overhaul of PT regulations through an in-depth analysis of the list of required analytes, grading measures, and other standards.
PT experts in the lab community also are eager to see changes in the way labs use and understand PT that go beyond receiving a passing grade for regulatory requirements. They stress that while PT can be an agent for improving lab medicine, it is also a tool that is frequently misunderstood. While laboratorians have discovered that PT results can yield insights for quality management, the unknown differences between PT samples and patient samples can leave labs with questions about their true performance. PT providers must often alter samples during multiple steps of processing, preparation, and storage, so labs cannot always expect to see the same results as they would with authentic patient samples.
As a result, laboratorians need to have a clear understanding about the potential—as well as the limits—of PT. For example, passing a PT challenge should not be confused with excellent clinical performance or accuracy for an assay, explained Gary Horowitz, MD, associate professor of pathology at Harvard Medical School and director of clinical chemistry at Beth Israel Deaconess Medical Center in Boston. “This is a basic misunderstanding about proficiency testing. In most cases, getting a passing grade doesn’t tell you whether you’re doing a great job; it just says that you’re not a statistical outlier compared to your peers. And it’s only good for patient care if your peer group is doing well,” Horowitz said. “On the other hand, if labs dig a little deeper and understand PT’s limitations, there are tremendous opportunities to generate a lot of information from PT data to improve quality.” Horowitz has worked on the College of American Pathologists (CAP) chemistry resource committee, which oversees the CAP PT program for clinical chemistry.
An Easy A?
Under the Clinical Laboratory Improvement Amendments (CLIA) that spell out the requirements for PT in the U.S., labs must enroll in three PT events a year for each analyte listed in the regulations that the lab performs. Most events consist of five samples, and a lab must produce results within range on four of the five to pass. In some cases, such as immunohistochemistry, all five responses must be correct. Occasional failures are not uncommon, and CLIA allows a lab to fail one out of three events on a rolling basis. In other words, a lab needs to pass at least two in a row after a failure, irrespective of the calendar year.
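The pass/fail arithmetic described above can be sketched in a few lines of code. This is an illustration of the rules as summarized here, not any regulator's actual scoring software, and the function names are our own:

```python
def grade_event(results_in_range, samples=5, required=4):
    """Pass a PT event if at least `required` of the `samples` responses
    fall within the acceptable range (some specialties require all five)."""
    return results_in_range >= required

def satisfactory_performance(event_outcomes):
    """Rolling rule (sketch): failing two of the last three events --
    which includes any two consecutive failures -- is unsatisfactory,
    so a lab must pass at least two in a row after any failure."""
    return event_outcomes[-3:].count(False) < 2

# One failure followed by passes is tolerated; two of three is not.
print(satisfactory_performance([True, False, True]))    # tolerated
print(satisfactory_performance([True, False, False]))   # unsatisfactory
```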
After an initial failure of two out of three events, or two consecutive events, for an analyte or specialty, CMS may impose training and technical assistance if certain conditions are met. CMS sees repeated failures as a possible indicator of more serious quality problems, however, so a failure still keeps labs on edge: CLIA penalties for repeated failures can mean testing for those analytes must be halted. But apart from the stress, what does passing or failing really say about the lab?
According to Horowitz, the answer is—it depends. For many analytes, PT providers grade events based on peer groups—pooling results from a group of labs using the same instrument and reagent. With peer groups, acceptable answers must fall within a range around a target value, determined according to a formula based on the mean of all participant responses. “Less than one percent of laboratories on a statistical basis are going to have outliers, so the grading is actually designed to find laboratories that are performing very poorly,” Horowitz said.
Moreover, with a grade hinging on a group average, how a peer group happens to come together can significantly alter how difficult a particular challenge really is. “If your peer group is very imprecise, and you’re being graded by plus or minus three standard deviations, your values could be pretty far from the mean value and you’d still pass,” Horowitz explained. “On the other hand, methods that are inherently more reproducible have to meet a higher standard.”
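Peer-group grading of the kind Horowitz describes can be sketched as follows. The peer values and the ±3 SD limit are illustrative assumptions, not any provider's actual algorithm, and the function names are our own:

```python
import statistics

def grade_against_peers(lab_value, peer_values, sd_limit=3.0):
    """Grade one result against its peer group: the target is the peer
    mean, and a result passes if its standard deviation index (SDI)
    falls within +/- sd_limit."""
    mean = statistics.mean(peer_values)
    sd = statistics.stdev(peer_values)
    sdi = (lab_value - mean) / sd
    return abs(sdi) <= sd_limit, sdi

# An imprecise peer group (large SD) tolerates values far from the mean...
imprecise = [90, 110, 80, 120, 100, 95, 105, 85, 115, 100]
# ...while a tight peer group holds the same deviation to a higher standard.
precise = [99, 101, 98, 102, 100, 99.5, 100.5, 98.5, 101.5, 100]

print(grade_against_peers(125, imprecise))  # well inside +/- 3 SD: passes
print(grade_against_peers(125, precise))    # far outside +/- 3 SD: fails
```

Both groups have the same mean (100), so a result of 125 is equally far off in absolute terms; only the group's spread decides whether it passes.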
PT providers are stuck using peer groups because the matrix effects in PT samples are, for the most part, unknown. Fresh frozen, carefully prepared human serum is difficult for PT providers to handle, due to cost and logistics. As a result, PT samples are manufactured in a way that inherently alters their matrices. These alterations introduce unknown biases into PT samples that make comparison to either a gold standard, or to other labs using different reagents and instruments, impossible. How such samples behave across methods is what’s known as commutability: a commutable sample acts like a real patient sample, a standard many PT samples can’t live up to.
Understanding commutability enables labs to interpret PT results wisely, emphasized W. Greg Miller, PhD, who currently serves on the CAP chemistry resource committee. Miller and Horowitz coauthored a review article about PT that appears this month in Clinical Chemistry (Clin Chem 2011;57:1670–1681). “PT can be a very effective tool for evaluating state-of-the-art, but only if the samples are commutable. If not, then you do not have useful information about methods and method comparison,” Miller said. “It’s important for laboratorians to recognize that you cannot predict, for a given sample, whether or not it will be commutable. So you really have to treat all the samples as if they’re non-commutable, unless you have evidence that conclusively demonstrates that they are.” Miller is a professor of pathology and director of clinical chemistry and pathology information systems at Virginia Commonwealth University in Richmond.
A lack of commutability can lead laboratorians to incorrectly assume that two methods for an assay agree. Conversely, laboratorians can also wrongly attribute discrepancies among methods to the matrix effects of the PT sample, when in fact a real bias exists, noted Robert Rej, PhD, director of clinical chemistry and hematology for the Wadsworth Center, New York State Department of Health. “If method A gives you a value of 100 and method B a value of 120, the initial presumption is often that this is a difficulty with the PT samples, rather than necessarily with the analytical system, even though that might not be the case,” Rej said. “Unfortunately, with the huge number of samples needed by each PT provider, and with manufacturers constantly upgrading and changing their methods, it’s just not possible to systematically research every PT fluid and every method for every analyte to discern whether the problem lies with the PT material or with the actual method.”
Near Misses and Fruitful Failures
Even though the dearth of commutable PT samples makes it difficult to compare methods or assess true accuracy, PT data remains a treasure trove for quality management at the level of the individual lab, Miller and his coauthors emphasized in their Clinical Chemistry review. “Peer group evaluation provides valuable information to assess quality, verifying that a laboratory is using a measurement procedure in conformance to the manufacturer’s specifications and with other laboratories using the same technology,” the authors wrote.
Speaking at an October 12 AACC webinar, Making Proficiency Testing Work for You, Horowitz described how PT reports can offer a glimpse of impending quality problems even when the lab is passing its PT challenges. “Even though PT was only designed to identify outliers, that doesn’t preclude us as laboratorians from saying, ‘yes, we had no PT exceptions on this report, but I don’t like being at 1.9 standard deviations. What are we doing differently?’” Horowitz said. “At that point, you graduate from using PT only as a regulatory tool and move into using it as a quality management tool.”
Horowitz offered an illustration from his lab: consecutive PT surveys that all showed passing results. In one example, the surveys boosted his confidence in the lab’s calcium results. The surveys showed that in each challenge, the lab’s results came close to the mean of the peer group—on both sides of the mean—indicating a lack of bias. For bilirubin, however, the surveys picked up a potential problem. Horowitz noticed that in all of the past three surveys, his lab had fallen on the negative side versus his peer group. “I can use this as a warning signal that we haven’t failed yet, but we’re on the verge of failing because there is something going on that’s not quite right,” he said. “Good performance would not have a bias that continues to be on one side of the mean.”
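The warning sign Horowitz describes can be checked mechanically across a run of surveys. A minimal sketch, using hypothetical SDI values patterned on the calcium and bilirubin examples above:

```python
def one_sided_bias(sdi_history, min_surveys=3):
    """Flag a potential systematic bias when every recent result sits on
    the same side of the peer mean, even though each one individually
    passed (|SDI| well under the grading limit)."""
    signs = [s > 0 for s in sdi_history if s != 0]
    return len(signs) >= min_surveys and (all(signs) or not any(signs))

calcium   = [-0.4, 0.6, -0.2, 0.3]   # scattered around the mean: no flag
bilirubin = [-1.1, -1.8, -1.5]       # always negative: warning signal

print(one_sided_bias(calcium))    # False
print(one_sided_bias(bilirubin))  # True
```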
In such situations, Horowitz recommends taking advantage of troubleshooting guides from the lab’s PT provider. He especially likes those from CAP because they display charts representative of various trends of bias and how to investigate them. “Troubleshooting should not be reserved only for when a PT failure occurs,” Horowitz said. “Labs should be examining all of their PT data, even in the absence of failures.” For example, a trend of consistent bias on PT surveys could indicate that the lab has been storing a calibrator improperly. In that case, if the lab doesn’t catch it, results will be consistently off the mark without the lab knowing about it.
When a lab does fail a PT challenge, deliberate and thoughtful troubleshooting is more than helpful. It’s required by CLIA and lab accreditors. Also speaking at the AACC webinar, Judith Yost, MA, MT (ASCP), director of the CMS Division of Laboratory Services, appealed to labs to carefully investigate and document all PT failures. “With an unsuccessful performance, you really need to do a root cause analysis and be sure to document the details of your investigation determining what actually happened,” she said. “Sometimes it can be just an aberration, a random error, but usually not. Something in the laboratory’s systems and processes caused that error to occur.”
After a failure, Horowitz encouraged labs to use a good troubleshooting checklist to make sure no stone is left unturned and that the lab documents each investigation consistently. Accreditors often make such checklists available to labs, or labs can develop their own. The December review article in Clinical Chemistry offers one example. In addition, Horowitz recommended that labs not only pay attention to the sample deemed unacceptable, but go further and review all five samples for that particular challenge. Often they will display a bias that can give clues as to what went wrong. Even in the case of a truly random error, where no serious systematic underlying problem comes to light, faithful adherence to a troubleshooting checklist will demonstrate to accreditors that the lab took the failure seriously.
PT Without the PT
Even for those analytes for which no formal PT survey is available, CLIA still requires a twice-yearly accuracy check, referred to as alternative assessment. According to Yost, CMS surveyors cite labs 6% of the time for failure to properly perform and document alternative assessment. “That is a pretty significant number,” she said. “Labs need to meticulously compare their test menu to their PT enrollment on a regular basis and make sure that those analytes for which no PT is available have that accuracy check documented.”
Sometimes even when a PT provider does have samples available for an analyte, not enough labs participate using the same instruments and reagents to meet the 10-participant minimum for a proper peer group. However, even in such cases, a PT provider’s data can still be mined for a useful quality check, Horowitz maintained. For example, if the PT provider returns results that are ungradeable due to a lack of participants, a lab can perform an alternative assessment by comparing to another peer group. “Too few participants doesn’t necessarily mean that the game is over,” he said. “What we’ve done is go back through the participant summary report and find another method that we thought was comparable because it was an almost identical instrument using the same reagents. Then we did our own evaluation of our results versus those results: our value, minus the mean for the other group, divided by the standard deviation. That is perfectly acceptable on the inspection list as alternative assessment. Of course, what you can’t do is look through the surveys and select something just because it agrees with your value so that you look good.”
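The evaluation Horowitz describes—own value, minus the mean of the comparable group, divided by that group's standard deviation—is a standard deviation index computed against the borrowed peer group. A sketch with made-up numbers:

```python
def alternative_assessment_sdi(lab_value, peer_mean, peer_sd):
    """SDI against a comparable peer group pulled from the participant
    summary report, used when the lab's own group is too small to grade."""
    return (lab_value - peer_mean) / peer_sd

# Hypothetical: our result vs. a near-identical instrument/reagent group
sdi = alternative_assessment_sdi(4.7, peer_mean=4.5, peer_sd=0.2)
print(f"SDI = {sdi:.1f}")  # SDI = 1.0
```

The comparison group must be chosen on instrument and reagent similarity, decided before looking at the numbers; as Horowitz cautions, picking a group after the fact because it happens to agree with your value defeats the purpose.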
If no survey exists for an analyte, a lab must consider other options. For example, Horowitz’s lab offers a qualitative Watson-Schwartz test, used to screen for acute intermittent porphyria. In the absence of any PT surveys, the lab performs an alternative assessment by sending out some of their samples every 6 months to a reference lab and comparing results. For detailed advice on alternative assessment, Miller recommended labs refer to a guideline from the Clinical and Laboratory Standards Institute, GP29-A2: Assessment of Laboratory Tests When Proficiency Testing Is Not Available.
New Approaches for Molecular Dx PT?
The burgeoning field of molecular diagnostics presents unique problems for PT that challenge both regulators and labs. CLIA regulates PT by analyte, not by method, so regulators would not add molecular methods per se, but are considering adding genetic mutations and other analytes (See Box, below).
Particularly in the area of molecular oncology, PT providers have hit a wall with traditional PT schemes. Providers can send out samples of purified DNA to assess the analytical phase of testing for many molecular tests. However, when it comes to oncology assays like KRAS mutation testing, selection and handling of the tumor specimen is just as critical as the analytic step of identifying the genetic mutation, noted Jeffrey Kant, MD, PhD, past chair of the CAP/American College of Medical Genetics Biochemical and Molecular Genetics Resource Committee. “When you’re working with tissue specimens, the samples themselves are much more challenging,” he said. “They may have been fixed and embedded in paraffin previously, requiring the lab to go back and rehydrate the tissue and get the nucleic acid out to test it. Others may have a heterogeneity of tumor spread, or necrosis that has set in with poor blood supply.” Kant is professor of pathology and human genetics at the University of Pittsburgh Medical Center.
Providing enough high quality, standardized specimens for this kind of testing is a formidable undertaking. And the lack of uniformity of tissue specimens can make it appear there are problems with an assay when there are not, according to Kant. “It may be that in your hospital situation, everything is working great,” he said. “But if specimens in a PT survey are suboptimal for reasons the provider can’t always control, the overall performance on a survey can actually be poorer than what the general experience is in the community, and so it misrepresents the true quality of testing. Potentially, regulators could look at this and say we’re doing a terrible job.”
The unique difficulties of tissue specimens also limit the kinds of questions a PT survey can answer, Kant explained. “You really can’t test the sensitivity of someone’s assay, because you can’t reproducibly provide a sample to dozens of labs where you know the percent of the tumor mutation is 10 percent from tissue,” he said. “That just isn’t going to be possible, biologically.”
Until these problems get worked out, Kant sees potential in PT providers breaking out the preanalytical, analytical, and reporting phases of testing to assess the whole as accurately as possible. Although it might not satisfy purists who prefer a more holistic assessment, a piecemeal method currently offers the greatest flexibility, reliability, and utility for a PT program, Kant said.
As powerful molecular tests for oncology increasingly draw the attention of regulators, payers, and the public, PT providers and labs will be under pressure to get it right, Kant commented. “I think this area is going to get more and more visibility because it’s a rapidly growing area, and certainly in the case of companion diagnostic tests, there are critical clinical decisions and a significant amount of money riding on decisions to give these therapies or not. That’s why my bias is to make sure people can get the right analytic results out of standardized materials, and I know that the molecular oncology committee in CAP has been making efforts in that direction.”
Time to Tackle Proficiency Testing Regulations
What Can Labs Expect from the New Rules?
Lab medicine has changed a lot since 1992. However, the rules for proficiency testing (PT) in the Clinical Laboratory Improvement Amendments (CLIA) have not. Regulators are now in the data-crunching phase of a project to review and revise the CLIA PT regulation based on recommendations from the Clinical Laboratory Improvement Advisory Committee (CLIAC). They are aiming to have the draft proposed rule ready for Department of Health and Human Services clearance next year. Following clearance, the proposed rule soliciting public comment will be published in the Federal Register.
Speaking at an October 12 AACC webinar, Making Proficiency Testing Work for You, Judith Yost, MA, MT (ASCP), director of the Division of Laboratory Services at the Centers for Medicare and Medicaid Services (CMS), acknowledged that parts of the regulations are outdated or confusing to labs. For example, the regulations instruct labs to treat PT samples just like patient samples, yet labs can get in serious trouble if they refer a PT sample to another lab for confirmatory testing, a common procedure for certain tests. In the case of HIV antibody testing, for instance, many labs perform immunoassay screening in house and refer reproducibly reactive samples to reference labs for Western blot confirmation. In addition to clarifying the language in the regulation for such issues, CMS plans to tackle adding new analytes to the list that require PT and a review of the grading criteria labs must meet.
CLIA lists 83 analytes, but PT providers offer PT for many more than these, and accreditors often require labs to participate. For example, based on a lab’s test menu, the College of American Pathologists (CAP) requires PT enrollment for close to 400 analytes. About 600 others are available, but optional.
CMS is working together with the Centers for Disease Control and Prevention (CDC) to develop an updated list of analytes and a related grading system. “CLIAC has recommended that we look at four things: availability of PT, testing volume for different analytes, clinical relevance, and cost,” said Nancy Anderson, chief of the Laboratory Practice Standards Branch in the Division of Laboratory Science and Standards at CDC. “We came up with a scheme for looking at those criteria in a logical way, so we started by looking at the availability of PT, and programs that might offer it, even if it’s not already regulated. We considered analytes that are now offered by multiple PT programs, because we’ll be working with them to get data when we get to the point of setting the scoring criteria.”
CDC also examined testing volumes, based on data from Medicare, Medicaid, and private payers. To evaluate clinical relevance of an analyte—the most challenging element, according to Anderson—CDC is reviewing clinical practice guidelines, CDC’s own Morbidity and Mortality Weekly Report, as well as FDA risk classification and other data.
Molecular tests will also be on the table when CDC and CMS collaborate to pen draft regulation. “Under CLIA, PT is based on the analyte and is method-neutral,” Anderson explained. “There are some molecular methods that already have required PT in microbiology.”
CDC and CMS will meet with PT providers and other experts for input on how grading criteria should change, according to Anderson. They’ll discuss what the limits should be around the target value for each analyte. Both new analytes being added to the list, as well as those currently in the regulation, will be considered. “We realize that the grading criteria need to be adjusted, because for some analytes they’re too wide, and for others too narrow,” Yost commented.
According to Robert Rej, PhD, director of clinical chemistry and hematology for the Wadsworth Center, New York State Department of Health, federal regulators face a tough challenge to update PT regulation when the field is constantly changing. Rej has served on both CDC and CLIAC groups that have advised regulators on PT. “There have to be a certain minimum number of labs to make it worthwhile to mount a PT program, and the tests themselves must demonstrate clinical utility,” he said. “A test that is highly useful and of critical clinical importance, even though it may not be performed by a large number of laboratories, might be included over a test that is less clinically important, but offered by a larger number of labs. Laboratory medicine is a dynamic field, with tests going in and out of favor, but the total number of tests appears to be increasing at a steady rate. Maintaining such a list for regulatory purposes is certainly not easy.”
When the Truth Is Not Relative
The cost and limited availability of high-quality patient samples means that for the foreseeable future, PT providers and labs will have to continue to make do with materials of unknown commutability for most PT surveys. In the meantime, labs that want to go the extra mile can take advantage of optional, accuracy-based PT surveys for certain analytes. Accuracy-based surveys use validated commutable samples that labs can compare to a gold standard reference method. This way, labs can compare their PT results to a true value and not solely among themselves via a peer-group mean.
At the level of the individual lab, accuracy-based PT demonstrates the real-world performance of an assay for patient care, Horowitz emphasized during the webinar. On a wider scale, since accuracy-based PT does away with matrix-related bias, such programs can reveal the real state of agreement, harmonization, and accuracy across methods.
For this reason, despite the fact that PT providers can only offer accuracy-based surveys on a limited basis, the lab community should take advantage of these programs to power harmonization efforts, Miller said. In their Clinical Chemistry paper, Miller, Horowitz, and their coauthors set out several ways in which this could occur. They encouraged PT providers to share commutable samples in order to reduce costs, and to share global summary reports of accuracy-based surveys to advance the field. Manufacturers have a role to play as well. For example, companies could use residual commutable samples to calibrate their instruments. “PT/External Quality Assessment providers are in a unique position to add substantial value to the practice of laboratory medicine by identifying analytes that are in need of standardization or harmonization, and by stimulating and sustaining global standardization and harmonization initiatives that are needed to support clinical practice guidelines,” the authors wrote.
In October 2010, when AACC convened a conference of stakeholders from around the world on the issue of harmonization, using commutable samples for PT was a central theme (CLN 2010;36:12). PT will continue to have this crucial role, according to Miller. “As the AACC harmonization initiative takes shape over the next several years, the role of accuracy-based PT becomes quite important in assessing the success of any given harmonization activity,” he said. “I hope that as our profession understands better the importance of and develops the tools to achieve harmonization, that accuracy-based PT becomes more a part of our usual practice.”