Electronic medical record-based deep data cleaning and phenotyping improve the diagnostic validity and mortality assessment of infective endocarditis: medical big data initiative of CMUH

Background: International Classification of Diseases (ICD) code–based claims databases are often used to study infective endocarditis (IE). However, the quality of ICD coding can influence the reliability of IE research. The impact of complementing the ICD-only approach with data extracted from electronic medical records (EMRs) has yet to be explored.
Methods: We selected adult patients with discharge ICD codes for IE (ICD-9: 421, 112.81, 036.42, 098.84, 115.04, 115.14, 115.94, 424.9; ICD-10: I33, I38, I39) during 2005–2016 at China Medical University Hospital. Data extraction was conducted on the basis of the modified Duke criteria to establish a reference group comprising patients with definite or possible IE. Clinical characteristics and in-hospital mortality were compared between ICD-identified and Duke-confirmed cases. The positive predictive value (PPV) was used to quantify the IE identification performance of various phenotyping algorithms.
Results: A total of 593 patients with discharge ICD codes for IE were identified, of whom only 56.7% met the modified Duke criteria. The crude in-hospital mortality rates for Duke-confirmed and Duke-rejected IE were 24.4% and 8.2%, respectively. The adjusted in-hospital mortality for ICD-identified IE was lower than that for Duke-confirmed IE by a difference of 5.1%. The best PPV (0.90, 95% CI 0.86–0.93) was achieved when the major components of the Duke criteria (positive blood culture and vegetation) were integrated with ICD codes.
Conclusion: Integrating EMR data can considerably improve the accuracy of ICD-only approaches in phenotyping IE, which can improve the validity of EMR-based studies and their applications, including real-time surveillance and clinical decision support.


Introduction
The validity of electronic medical record (EMR)-based clinical research relies on accurate disease phenotyping. With advancements in computing technology and medical data extraction, methods for identifying multiple criteria–driven diagnoses of complex diseases should achieve higher accuracy than conventional International Classification of Diseases (ICD) code–based case identification schemes. Coding errors and inconsistencies in claims data have been reported in studies on infectious diseases, such as sepsis and health care–associated infections (HAIs) [1,2]. Rhee et al. reported that the incidence of sepsis was overestimated when claims-based data were used (range, 8% to 12%) relative to estimates obtained using EMR-based clinical data (range, 5% to 6.5%) [1]. A systematic review suggested that ICD codes may be inaccurate for detecting HAIs other than Clostridium difficile or surgical site infections [2]. Moreover, the accuracy of ICD-based phenotyping is affected by variations in the policies and regulations of a health insurance system, the population covered by the healthcare system, and the coding behavior of clinicians, which consequently affect the interpretation and validity of clinical research findings [3–5]. However, few studies have investigated the impact of data curation on the identification of complex diseases requiring multiple clinical criteria. In this study, we used infective endocarditis (IE), a rare but lethal disease requiring multiple diagnostic criteria (i.e., the modified Duke criteria), to demonstrate how data extraction strategies improve the positive predictive value (PPV) of case identification beyond the ICD approach and how such strategies change mortality risk estimation.

Source population
The Big Data Center and the Office of Information Technology of China Medical University Hospital (CMUH) established the CMUH-Clinical Research Data Repository (CRDR) in 2017, which carefully verified and validated data from various clinical sources to unify trackable patient information generated during the healthcare process [6]. The CMUH–CRDR documented unified views of 2,660,472 patients who had sought care at the CMUH between January 1, 2003 and December 31, 2016. Patient information included data on administration and demography, diagnosis, medical and surgical procedures, prescriptions, laboratory measurements, physiological monitoring, hospitalization, and catastrophic illness status. The CMUH–CRDR has been linked to national population-based health-related databases, such as the National Death Registry, which are systematically maintained by the Health and Welfare Data Science Center of the Ministry of Health and Welfare. All patients enrolled in the CMUH–CRDR were followed up until December 31, 2016, or death, whichever occurred earlier.

Case validation
A research assistant (YJC) and an infectious disease specialist (LYL) systematically reviewed the medical charts and classified patients with IE diagnosis codes into definite, possible, or rejected groups according to the modified Duke criteria [7]. Using the Duke criteria as the reference standard, we evaluated the performance of ICD codes and their combinations with different EMR-derived clinical data in identifying patients with IE. We selected three clinical indicators, namely fever, positive blood culture, and cardiac vegetation confirmed through echocardiography reports, because they are objective and easily available, and because positive blood culture and vegetation evidence are the only two major components of the Duke criteria, making them important indicators of IE. We used natural language processing (NLP) to extract keywords for the organism, Gram staining pattern, and antimicrobial susceptibility from microbiology reports. We used text mining to search for the keyword "vegetation" in echocardiography reports.
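The keyword search over echocardiography reports can be illustrated with a minimal sketch. The regular expression, the function name `mentions_vegetation`, and the two sample reports are hypothetical and are not taken from the study's pipeline; the sketch only shows what a plain keyword match does.

```python
import re

# Hypothetical pattern: "vegetation" or "vegetations", case-insensitive.
VEGETATION_PATTERN = re.compile(r"\bvegetations?\b", re.IGNORECASE)


def mentions_vegetation(report_text: str) -> bool:
    """Return True if the free-text report contains the keyword 'vegetation'."""
    return VEGETATION_PATTERN.search(report_text) is not None


# Invented example reports, for illustration only.
reports = {
    "echo-001": "Mobile vegetation (1.2 cm) on the anterior mitral leaflet.",
    "echo-002": "Normal LV systolic function. Trivial mitral regurgitation.",
}

# Map each report identifier to a keyword flag.
flags = {report_id: mentions_vegetation(text) for report_id, text in reports.items()}
print(flags)
```

A plain keyword match of this kind also flags negated statements such as "no vegetation seen", so any real use would likely need negation handling or manual verification of the flagged reports.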

Statistical analysis
We analyzed the PPV for each case identification strategy. The study population was divided into true positive (Duke+ and case identification strategy+), true negative (Duke− and case identification strategy−), false positive (Duke− and case identification strategy+), and false negative (Duke+ and case identification strategy−). PPV was calculated by dividing the number of patients with IE confirmed using the Duke diagnostic criteria (definite or possible) by the total number of patients classified as IE based on different case identification strategies (PPV = TP / (TP + FP)). The age-adjusted mortality was estimated using logistic methods [8]. Data were analyzed using SAS version 9.4 (SAS Institute Inc., Cary, NC). All analyses were two-sided, and the significance level was 0.05. The study was approved by the Big Data Center of CMUH and the Research Ethics Committee/Institutional Review Board of CMUH (CMUH105-REC3-068).
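As a worked illustration of the PPV calculation, the sketch below applies the formula to the ICD-only strategy, using the counts reported in the Results (336 Duke-confirmed patients among 593 with IE codes, hence 257 false positives). The text does not state how the 95% confidence intervals were computed; the Wilson score interval used here is an assumption chosen for illustration, and the function names are hypothetical.

```python
import math


def ppv(tp: int, fp: int) -> float:
    """Positive predictive value: TP / (TP + FP)."""
    if tp + fp == 0:
        raise ValueError("no patients were flagged by the strategy")
    return tp / (tp + fp)


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (assumed method, not the paper's)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# ICD-only strategy: 593 flagged patients, 336 of them Duke-confirmed.
tp, fp = 336, 593 - 336
estimate = ppv(tp, fp)
low, high = wilson_interval(tp, tp + fp)
print(f"PPV = {estimate:.3f}, 95% CI {low:.3f} to {high:.3f}")
```

The point estimate reproduces the reported 56.7%; the interval bounds depend on the assumed method and may differ slightly from those produced by the software used in the study.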

Results
Of 593 adults with ICD codes for IE, only 336 (56.7%) met the modified Duke criteria (224 definite; 112 possible). Patients with Duke-confirmed IE were significantly younger and more likely to have hypertension, diabetes mellitus, chronic liver disease, and chronic kidney disease compared with those who did not meet the Duke criteria (Table 1). Among the patients with Duke-confirmed IE, 4.8% were diagnosed on the basis of minor criteria. Moreover, of the patients with Duke-confirmed IE, 70.8% had two positive blood cultures within 2 weeks of IE diagnosis and 58.3% yielded typical pathogens defined by the Duke criteria. Cardiac vegetation was detected in 88.4% of the patients with Duke-confirmed IE, but the detection rate dropped to 50.1% (297/593) in the entire population with ICD codes for IE (Table 1). Pyuria, hematuria, elevated erythrocyte sedimentation rate, or C-reactive protein was more frequently recorded among the patients with Duke-confirmed IE. The crude in-hospital mortality was threefold higher in the patients with Duke-confirmed IE (24.4%) than in Duke-rejected cases (8.2%; P < 0.0001). The mortality difference between the two groups persisted for at least 1 year after IE diagnosis.
We also evaluated the predictive performance for IE by combining three clinical criteria, namely fever, two positive blood cultures (PBCs), and echocardiographic evidence of vegetation, with ICD codes for IE. The age-adjusted in-hospital mortality rates for the study population (defined only by ICD) and the reference standard (Duke-confirmed IE) were 15.9% and 21.0%, respectively (Table 2).
[Table footnotes: ICD, International Classification of Diseases; PBC, positive blood culture; PPV, positive predictive value. (a) Mortality was adjusted by age using logistic regression. (b) Chart review was performed using the Duke criteria, and definite or possible cases were considered.]
When the Boolean operator "OR" was used to maximize the number of patients with IE identified using the case identification strategies, that is, the study population includes patients who had at least one of the three clinical criteria, the best PPV (0.90; 95% confidence interval [CI], 0.86–0.93) was achieved when PBC and vegetation were included. The corresponding age-adjusted in-hospital mortality was 21.8%, which approximated that of the reference group (Table 2). By contrast, when we applied the Boolean operator "AND" to maximize the specificity of the case identification strategies, that is, the study population includes patients who had two of the three or all three clinical criteria, the PPV was 1.00 whenever vegetation was included in the algorithm. The corresponding adjusted in-hospital mortality increased from 21.5% to 24.4%. When the case identification strategies defined only patients with concomitant fever, PBC, and vegetation as having IE, the adjusted in-hospital mortality was the highest at 24.4%.
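The "OR" and "AND" strategies can be sketched as follows. This is an illustrative reconstruction, not the study's code: the `Patient` record, the function names, and the six-patient cohort are invented, and every patient in the toy cohort is assumed to already carry an IE ICD code. "OR" flags a patient when any selected criterion is present, and "AND" flags a patient only when all selected criteria are present.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Patient:
    duke_confirmed: bool  # reference standard (definite or possible IE)
    fever: bool
    pbc: bool             # positive blood cultures
    vegetation: bool      # "vegetation" found in the echocardiography report


CRITERIA = ("fever", "pbc", "vegetation")


def is_flagged(patient: Patient, criteria: tuple[str, ...], mode: str) -> bool:
    """Apply the selected criteria with Boolean OR (any) or AND (all)."""
    values = [getattr(patient, name) for name in criteria]
    return any(values) if mode == "OR" else all(values)


def strategy_ppv(patients: list[Patient], criteria: tuple[str, ...], mode: str):
    """PPV of one strategy, or None when the strategy flags nobody."""
    flagged = [p for p in patients if is_flagged(p, criteria, mode)]
    if not flagged:
        return None
    return sum(p.duke_confirmed for p in flagged) / len(flagged)


# Synthetic toy cohort, invented for illustration only.
cohort = [
    Patient(True, True, True, True),
    Patient(True, False, True, False),
    Patient(True, True, False, True),
    Patient(False, True, False, False),
    Patient(False, False, True, False),
    Patient(False, False, False, False),
]

# Enumerate every non-empty combination of criteria under both operators.
for size in (1, 2, 3):
    for combo in combinations(CRITERIA, size):
        for mode in ("OR", "AND"):
            print(mode, combo, strategy_ppv(cohort, combo, mode))
```

In this toy cohort, requiring both PBC and vegetation flags fewer patients but raises the PPV, which mirrors the sensitivity versus specificity trade-off described above; the numbers themselves carry no clinical meaning.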
Our original list of IE ICD codes included the ICD-9 code 424.9 (endocarditis, valve unspecified) and the ICD-10 code I38 (endocarditis, valve unspecified), which have not been used in some of the prior studies on IE [4,5,9,10]. When we excluded patients with these two ICD codes, the PPV for the strategy applying only ICD codes (ICD-only strategy) increased to 0.83 (95% CI, 0.79–0.87), and the corresponding adjusted in-hospital mortality was lower by 1.2% relative to the reference strategy (Table 3). Introducing EMR-based phenotyping algorithms into the revised ICD-only approach did improve the PPV whenever PBC or vegetation was incorporated. However, 38 patients with Duke-confirmed IE were missed because they did not have the ICD code 424.9 or I38. These patients were more likely to be older and diagnosed on the basis of the Duke minor criteria compared with those having the ICD code 424.9 or I38 (Supplemental Table 2).
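The original and revised ICD search strategies can be sketched as a simple code filter. The code lists come from the Methods; everything else is an assumption made for illustration: codes are taken to be stored in dotted form, matching is by prefix so that subcodes such as 421.0 or I33.0 are captured, and the revised strategy is interpreted as dropping 424.9 and I38 from the search list rather than excluding every patient who carries them.

```python
IE_ICD9 = ("421", "112.81", "036.42", "098.84", "115.04", "115.14", "115.94", "424.9")
IE_ICD10 = ("I33", "I38", "I39")
UNSPECIFIED_VALVE = ("424.9", "I38")  # codes removed in the revised strategy


def matches_any(code: str, prefixes: tuple[str, ...]) -> bool:
    """Prefix match on a normalized code (assumes dotted storage, e.g. '421.0')."""
    normalized = code.strip().upper()
    return any(normalized.startswith(prefix) for prefix in prefixes)


def has_ie_code(discharge_codes: list[str], exclude_unspecified: bool = False) -> bool:
    """Return True if any discharge code falls under the IE search list."""
    prefixes = tuple(
        p for p in IE_ICD9 + IE_ICD10
        if not (exclude_unspecified and p in UNSPECIFIED_VALVE)
    )
    return any(matches_any(code, prefixes) for code in discharge_codes)


# Invented discharge code lists, for illustration only.
print(has_ie_code(["424.90", "401.9"]))                            # original list
print(has_ie_code(["424.90", "401.9"], exclude_unspecified=True))  # revised list
```

Under this interpretation, a patient coded only with 424.9 or I38 is captured by the original list and dropped by the revised one, while a patient who also has a more specific code such as 421.0 is retained by both.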

Discussion
This study revealed two notable findings. First, the cumulative incidence of IE was overestimated, but the mortality of IE was underestimated when only ICD codes were used as the estimation tool. Second, when EMR-based phenotyping was used, the accuracy of ICD-based phenotyping of IE could be improved. Despite its extensive implementation, the ICD-only approach should be reserved for claims databases.
For certain infectious diseases, such as sepsis and health care–associated infections, increasing bodies of evidence indicate that ICD codes may be inaccurate [1,2]. In particular, the performance of an EMR-based phenotyping algorithm in retrospective databases is quantified by the PPV, although researchers must adjust for the negative predictive value for rare diseases with low prevalence and incidence, such as IE [11]. In our study, only 56.7% of patients with discharge ICD codes for IE met the Duke criteria (i.e., PPV, 0.57). Consistent with our findings, Fawcett et al. revealed that 44% and 56% of patients with IE ICD codes represented definite and possible IE, respectively, in two separate hospitals in the United Kingdom [10]. By contrast, a single-center study conducted in Canada demonstrated that the ICD-only approach could reach both high sensitivity and high specificity for definite or possible IE. However, the PPV based on ICD-10 was only 0.78 (95% CI 0.68–0.85), indicating that this approach cannot be generalized to other institutions [3]. In a study conducted in a US medical center, the PPV was 0.80 (95% CI 75.7%–84.5%) when an ICD extraction strategy similar to ours was used [4]. Although we could adjust the ICD search strategy (i.e., removing 424.9 or I38) to increase the PPV, a total of 38 patients with definite or possible IE were missed, leading to an underestimation of the disease burden and insufficient characterization of disease heterogeneity. Integrating EMR-based information can help avoid false-negative findings caused by the use of the highly sensitive ICD-only search strategy and can thus provide an accurate prevalence profile of IE.
Inaccurate coding may contribute to a moderate PPV and may be caused by clinicians' inexperience or lack of attention to detail. For example, under Taiwan's National Health Insurance system, clinicians might upcode diagnoses to avoid refusal of reimbursement by health insurance agencies [12]. Moreover, patient factors constitute a major reason for upcoding. Aged individuals have a higher prevalence of valvular heart disease (VHD) and an increased risk of VHD-related IE compared with other individuals [13]. For example, the worsening of VHD-related murmurs might cause the misclassification of a minor criterion (VHD with regurgitation) into a major one (endocardial involvement), resulting in the overestimation of IE cases [7]. The between-institution heterogeneity in the validity of ICD-based case identification approaches highlights the importance of in-house validation as a quality assessment strategy for clinical research conducted using EMRs.
Our study revealed that elderly patients with cardiovascular comorbidities tended to be assigned with IE-related ICD codes, indicating that misclassification bias can be differential with respect to mortality risk. This minimizes mortality risk underestimation in the ICD-only approach because studies that have used the ICD-only approach for IE identification have reported in-hospital mortality ranging from 14% to 20.4% [14–16]. By contrast, studies that have used the modified Duke criteria for final case identification have revealed slightly higher in-hospital mortality (ranging from 13% to 38.7%) [9, 16–24]. Although the discrepancy in mortality was not significant, it could affect the validity of the risk evaluation of potential factors, such as causative microorganisms and comorbidities. Researchers should appreciate the impact of case identification algorithms on variations in the risk of mortality due to IE in the literature. Comparison of mortality outcomes for IE that are not defined by the Duke criteria can be confounded by misclassification errors due to inadequate disease phenotyping. In our study, we observed that the mortality associated with each of the three main clinical indicators of IE was different and that patients with PBC tended to have a higher probability of mortality than did those without PBC. Future research should evaluate whether variations in mortality arise from differences in the diagnostic components of the Duke criteria.
With the increased availability of EMR-based data, researchers can now maximize the potential of EMRs by using new computing technology, such as NLP, to improve the accuracy of case identification. Rhee et al. suggested that EMR-based clinical data provide more objective estimates in sepsis surveillance than do claims-based data [1]. Wei et al. also suggested that multiple EMR-based criteria afford higher identification performance than does a single criterion for a selected phenotype [25]. Our results demonstrate that the use of three EMR-based clinical criteria can considerably improve the PPV in identifying patients with definite or possible IE. Manually reviewing medical records to determine patients with IE on the basis of the modified Duke diagnostic criteria is a labor- and time-intensive process and requires trained personnel with clinical knowledge. By contrast, EMR-derived clinical criteria and ICD codes are mutually complementary and can be combined to automatically screen patients for IE in real time. In this study, the EMR-based algorithm identified cases that approximated the Duke-confirmed IE cases when we combined one of the two major components (i.e., PBC or vegetation) of the Duke criteria with ICD. Even when we incorporated a minor component of the Duke criteria, such as fever, with ICD, the identification performance was superior to that of the ICD-only approach. This combination approach can considerably reduce the burden of manual validation in conventional human-in-the-loop case identification processes.
This study has several limitations. First, the generalizability of our findings is limited due to the nature of a single-center setting. However, the differences in PPV and mortality arising from data extraction strategies in EMRs may be extrapolated to other databases and may highlight the importance of in-house data curation. Second, the misdiagnosis of IE was not explored. However, systematic screening of IE is not standard practice. In the future, the use of more advanced and updated NLP methodologies to systematically collect all components of the Duke criteria in EMRs will enable researchers to objectively compare the validity of ICD-only and EMR-driven phenotyping strategies.

Conclusion
In the era of EMR-driven phenotyping and knowledge discovery, integrating structured and unstructured data can considerably improve the accuracy of ICD-only approaches in phenotyping conditions such as IE and can therefore improve the validity of EMR-based retrospective surveillance and cohort studies. In the future, automatically mapping multisource clinical data through EMRs to estimate patients' IE risk can facilitate efficient real-time case identification in clinical research and practice.

Authorship statement
HYC, CCK, and CYC designed the study. CCL, LYL, and YJC performed data quality management and statistical analysis. MYW, SHC, and PHW conducted natural language processing of microbiology text reports. HYC and CYC drafted the manuscript. HYC, CCL, CCK, and CYC critically edited the manuscript. All authors read and approved the final manuscript.

Conflict of interest
All authors declare no conflict of interest.

Appendices
Supplemental Table 1. International Classification of Diseases, 9th Revision, Clinical Modification (ICD-9-CM) diagnosis codes and ICD-10-CM diagnosis codes for defining comorbidities within 1 year of infective endocarditis diagnosis.