Precise clinical decision-making requires reference intervals (RIs) that are applicable to the equipment and patient populations of the regional laboratories that use them. Recently, dozens of Chinese health industry standards have been released with recommended RIs for more than 50 tests based on the Chinese population, including common tests such as aspartate aminotransferase, alanine aminotransferase, and alkaline phosphatase. This has greatly relieved the pressure on most laboratories in China to adopt RIs derived from foreign populations. However, problems remain prominent, including the limited number of tests covered by the recommendations and their poor applicability to special populations such as ethnic minorities (1). Laboratories therefore still urgently need to integrate the existing recommendations and explore feasible paths to establishing RIs with high applicability and high interpretability.

So how should clinical laboratories establish RIs to guide precise clinical diagnosis and treatment? There is no single answer to this question. As early as 2008, the Clinical and Laboratory Standards Institute (CLSI) provided systematic recommendations on the establishment and validation of RIs for laboratory tests. The direct method is currently the most reliable method because it uses authentic measurements from “apparently healthy individuals.” However, this method is cumbersome, time-consuming, and often limited by inadequate sample sizes.

The drawbacks of the recommended direct method are obvious, but there is an alternative path. The indirect method uses data mining algorithms to analyze data derived from routine clinical measurements. It is based on the principle that results from non-pathological individuals account for the bulk of routine test data. Thus, a robust data mining algorithm can effectively separate the results of apparently healthy individuals from the rest, and the RIs can then be calculated from the numeric data. Briefly, the data mining algorithms work on different arithmetic principles, such as graphical recognition, iteration, and parameter searches. While the use of data mining algorithms to establish RIs is cheap, fast, and feasible for most laboratories relative to direct methods, the applicability and performance of the algorithms still need to be evaluated, and heterogeneity in the various data preprocessing steps should also be considered.
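To make the principle concrete, the following is a minimal sketch of a simplified Hoffmann-style graphical approach, written in Python with the standard library only. It is not the implementation evaluated in any study: the function names, the choice of the central 30% to 70% of the data as the "linear" region, and the simulated analyte values are all assumptions made for illustration. The idea is to plot sorted results against standard normal quantiles, fit a straight line through the central portion (assumed to be dominated by non-pathological results), and extrapolate that line to the 2.5th and 97.5th percentiles.

```python
import random
from statistics import NormalDist


def hoffmann_ri(values, central=(0.30, 0.70)):
    """Simplified Hoffmann-style RI estimate (illustrative sketch only).

    `central` is an assumed window of cumulative probabilities treated as
    the linear, predominantly healthy part of the normal quantile plot.
    """
    xs = sorted(values)
    n = len(xs)
    std_normal = NormalDist()

    # Pair each sorted result with the normal quantile of its plotting position.
    points = []
    for i, x in enumerate(xs, start=1):
        p = (i - 0.5) / n
        if central[0] <= p <= central[1]:
            points.append((std_normal.inv_cdf(p), x))

    # Ordinary least squares fit: x = intercept + slope * z.
    mean_z = sum(z for z, _ in points) / len(points)
    mean_x = sum(x for _, x in points) / len(points)
    sxx = sum((z - mean_z) ** 2 for z, _ in points)
    sxy = sum((z - mean_z) * (x - mean_x) for z, x in points)
    slope = sxy / sxx
    intercept = mean_x - slope * mean_z

    # Extrapolate the fitted line to the 2.5th and 97.5th percentiles.
    z975 = std_normal.inv_cdf(0.975)
    return intercept - z975 * slope, intercept + z975 * slope


def simulate_mixed_results(n=20000, seed=0):
    """Hypothetical routine data: 95% 'healthy' N(15, 2), 5% 'pathological' N(25, 3)."""
    rng = random.Random(seed)
    values = []
    for _ in range(n):
        if rng.random() < 0.95:
            values.append(rng.gauss(15.0, 2.0))
        else:
            values.append(rng.gauss(25.0, 3.0))
    return values


if __name__ == "__main__":
    lower, upper = hoffmann_ri(simulate_mixed_results())
    print(f"Estimated RI: {lower:.2f} to {upper:.2f}")
```

On this simulated mixture, the estimate should land close to the interval of the healthy component alone (about 11.1 to 18.9), whereas naive percentiles of the mixed data would be pulled toward the pathological values. Some bias from the residual contamination is expected, which is one reason the choice of algorithm and preprocessing needs to be evaluated rather than assumed.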

We recently assessed five separate data mining algorithms combined with a simplified two-step preprocessing procedure to establish RIs for laboratory tests used for the diagnosis of thyroid disorders: thyrotropin (TSH), total and free thyroxine (TT4 and FT4), and total and free triiodothyronine (TT3 and FT3). The performance of the five data mining algorithms, namely Hoffmann, Bhattacharya, expectation-maximization (EM), kosmic, and refineR, was objectively evaluated. There were three main findings from this study. (1) The simplified, transparent data preprocessing method combined with a data mining algorithm was effective for establishing RIs, especially for laboratories in which outpatients seeking medical examination are the predominant patient population. As evidence of this, algorithm-calculated RIs were comparable to standard RIs calculated from the Reference data set, in which reference individuals were selected following strict inclusion and exclusion criteria. (2) The Hoffmann, Bhattacharya, kosmic, and refineR data mining approaches showed high concordance in calculating RIs when the data had a Gaussian distribution, as we found to be the case for measurements such as FT3, FT4, and TT4 in our study population and laboratory. (3) The EM algorithm demonstrated superior performance for establishing the TSH RI, which had a non-Gaussian, right-skewed distribution. Together, these results imply that some data mining algorithms may be more appropriate than others, depending on the distribution of the data, when establishing RIs using the indirect method.
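The importance of the data distribution can be illustrated with a small sketch, again in standard-library Python. This is not the EM algorithm or any of the other algorithms named above; it only shows, on simulated right-skewed data, how an interval that assumes a Gaussian shape can go wrong and how a simple check of skewness can flag the problem. The log-normal parameters and all function names are hypothetical and do not represent real TSH results.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def sample_skewness(values):
    """Moment-based sample skewness; values near 0 suggest symmetry."""
    n = len(values)
    m = mean(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def gaussian_interval(values):
    """Central 95% interval assuming the data are Gaussian (mean +/- 1.96 SD)."""
    z975 = NormalDist().inv_cdf(0.975)
    m, s = mean(values), stdev(values)
    return m - z975 * s, m + z975 * s


def log_gaussian_interval(values):
    """Same interval computed on log-transformed data, then back-transformed."""
    lower, upper = gaussian_interval([math.log(v) for v in values])
    return math.exp(lower), math.exp(upper)


def simulate_right_skewed(n=20000, seed=0):
    """Hypothetical right-skewed analyte: log-normal with mu=0.5, sigma=0.5."""
    rng = random.Random(seed)
    return [rng.lognormvariate(0.5, 0.5) for _ in range(n)]


if __name__ == "__main__":
    data = simulate_right_skewed()
    print("skewness:", round(sample_skewness(data), 2))
    print("Gaussian interval:", gaussian_interval(data))
    print("log-Gaussian interval:", log_gaussian_interval(data))
```

For this simulated analyte, the Gaussian interval has a lower limit near or below zero, which is not plausible for a concentration, while the interval computed on the log scale stays close to the true central 95% range of the simulated distribution. A check of this kind is one simple way to decide whether a method that relies on a Gaussian shape is a reasonable candidate for a given test.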

We hope these findings will guide clinical laboratories in making appropriate use of data mining algorithms and in paying appropriate attention to data distribution when establishing RIs by the indirect method. Besides data distribution, other factors that may affect the performance of the algorithms should also be explored in future studies.

Reference

1. Ma CC, Wang XL, Wu J, Cheng XQ, Xia LY, Xue F, Qiu L. Real-world big-data studies in laboratory medicine: Current status, application, and future considerations. Clin Biochem 2020; 84:21-30.