John W. Robinson, M.D., Ph.D., LLC

Statistics ▪ Analytics ▪ Data Science

Predictive Modeling of Healthcare Cost via Machine Learning

Healthcare organizations often need to predict patients’ expected healthcare costs either prospectively, to forecast future expenditures, or retrospectively, for comparison with observed costs. Cost prediction models typically use demographic and diagnostic data, and sometimes procedural data, as inputs. When these data come from healthcare claims, the number of input variables is potentially vast, since claims can include many thousands of unique diagnosis and procedure codes. Even when these codes are sorted into clinically coherent diagnostic and procedure groups, the number of such groups may still be very large. Moreover, because various combinations of diagnostic and procedure groups can interact in their effects on cost, the number of candidate interaction terms grows combinatorially, and the modeling task can quickly become unmanageable.

Faced with this complexity, organizations often rely on proprietary prediction or risk-adjustment systems developed by outside vendors. To reduce combinatorial complexity, such systems typically incorporate extensive rules (assumptions) regarding how various diagnostic groups, and in some cases procedure groups, interact in their effects on cost. Unfortunately, these rules, based on a vendor’s clinical judgments or observations, might not be appropriate for the population served by a given organization. Moreover, because the rules may be largely hidden from the user as proprietary secrets, the user may be unable to assess their suitability for a given population.

An alternative is for an organization to develop a fully transparent, customized cost prediction model based on data from its own population, via a machine learning algorithm (e.g., boosted regression trees or a neural network). Benefiting from the speed of modern computing, such algorithmic methods can search through the vast input matrix and select those combinations of diagnostic and procedure groups that are most predictive of cost in the population at hand. Furthermore, since such an approach to predictive modeling can be developed into an organizational capability, a customized prediction model can be refit as often as appropriate, in response to evolving organizational aims and population characteristics.
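
To make the idea concrete, the following is a minimal sketch of boosted regression trees written in plain Python. It is an illustration under stated assumptions, not a production method: the data are synthetic rather than real claims, the five diagnostic-group indicators and the cost formula are invented for the example, and all names (fit_tree, fit_boosted, MIN_LEAF, and so on) are hypothetical. The point is only to show that small trees fitted in sequence to residuals can recover an interaction between two groups without that interaction being specified in advance.

```python
import random

MIN_LEAF = 10  # assumed minimum number of patients per leaf, to limit overfitting


def avg(values):
    return sum(values) / len(values)


def fit_tree(X, resid, rows, depth):
    """Fit a small regression tree to residuals over the given patient rows.

    Inputs are 0/1 indicators, so every split is 'group absent vs. present'.
    A leaf is a float (mean residual); a split is (feature, absent, present).
    """
    node_mean = avg([resid[i] for i in rows])
    if depth == 0 or len(rows) < 2 * MIN_LEAF:
        return node_mean
    best = None
    for j in range(len(X[0])):
        absent = [i for i in rows if X[i][j] == 0]
        present = [i for i in rows if X[i][j] == 1]
        if len(absent) < MIN_LEAF or len(present) < MIN_LEAF:
            continue
        sse = 0.0  # squared error remaining after splitting on group j
        for side in (absent, present):
            m = avg([resid[i] for i in side])
            sse += sum((resid[i] - m) ** 2 for i in side)
        if best is None or sse < best[0]:
            best = (sse, j, absent, present)
    if best is None:
        return node_mean
    _, j, absent, present = best
    return (j,
            fit_tree(X, resid, absent, depth - 1),
            fit_tree(X, resid, present, depth - 1))


def predict_tree(tree, x):
    while isinstance(tree, tuple):
        j, absent, present = tree
        tree = present if x[j] else absent
    return tree


def fit_boosted(X, y, n_trees=50, learning_rate=0.3, depth=2):
    """Gradient boosting for squared error: each tree fits current residuals."""
    base = avg(y)
    pred = [base] * len(y)
    rows = list(range(len(y)))
    trees = []
    for _ in range(n_trees):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        tree = fit_tree(X, resid, rows, depth)
        trees.append(tree)
        pred = [p + learning_rate * predict_tree(tree, x)
                for p, x in zip(pred, X)]
    return base, learning_rate, trees


def predict(model, x):
    base, learning_rate, trees = model
    return base + learning_rate * sum(predict_tree(t, x) for t in trees)


# Synthetic population (an assumption for illustration, not real claims data):
# groups 0 and 1 raise cost and interact; groups 2-4 are unrelated to cost.
random.seed(0)
N_PATIENTS, N_GROUPS = 600, 5
X = [[1 if random.random() < 0.3 else 0 for _ in range(N_GROUPS)]
     for _ in range(N_PATIENTS)]
y = [1000 + 2000 * x[0] + 1500 * x[1] + 6000 * x[0] * x[1]
     + random.gauss(0, 200) for x in X]

model = fit_boosted(X, y)
for label, pattern in [("neither group", [0, 0, 0, 0, 0]),
                       ("group 0 only", [1, 0, 0, 0, 0]),
                       ("group 1 only", [0, 1, 0, 0, 0]),
                       ("both groups", [1, 1, 0, 0, 0])]:
    print(label, round(predict(model, pattern)))
```

Trees of depth 2 are used because a depth-1 tree can represent only additive effects, whereas two levels of splitting allow the cost effect of one group to depend on the presence of another. In practice an organization would more likely use a maintained implementation (for example, scikit-learn's GradientBoostingRegressor) with held-out validation, rather than hand-written code of this kind.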