Development and validation of a pancreatic cancer risk model for the general population using electronic health records: An observational study


Aim: Pancreatic ductal adenocarcinoma (PDAC) is often diagnosed at a late, incurable stage. We sought to determine whether individuals at high risk of developing PDAC could be identified early using routinely collected data. Methods: Electronic health record (EHR) databases from two independent hospitals in Boston, Massachusetts, providing inpatient, outpatient, and emergency care, from 1979 through 2017, were used with case-control matching. PDAC cases were selected using International Classification of Diseases 910 codes and validated with tumour registries. A data-driven feature selection approach was used to develop neural networks and L2-regularised logistic regression (LR) models on training data (594 cases, 100,787 controls) and compared with a published model based on hand-selected diagnoses (‘baseline’). Model performance was validated on an external database (408 cases, 160,185 controls). Three prediction lead times (180, 270 and 365 days) were considered. Results: The LR model had the best performance, with an area under the curve (AUC) of 0.71 (confidence interval [CI]: 0.67-0.76) for the training set, and AUC 0.68 (CI: 0.65-0.71) for the validation set, 365 days before diagnosis. Data-driven feature selection improved results over ‘baseline’ (AUC 0.55; CI: 0.52-0.58). The LR model flags 2692 (CI 2592-2791) of 156,485 as high risk, 365 days in advance, identifying 25 (CI: 16-36) cancer patients. Risk stratification showed that the high-risk group presented a cancer rate 3 to 5 times the prevalence in our data set. Conclusion: A simple EHR model, based on diagnoses, can identify high-risk individuals for PDAC up to one year in advance. This inexpensive, systematic approach may serve as the first sieve for selection of individuals for PDAC screening programs.

European Journal of Cancer