Unsupervised learning methods for efficient geographic clustering and identification of disease disparities with applications to county-level colorectal cancer incidence in California

Abstract: Many public health policymaking questions involve data subsets representing application-specific attributes and geographic location. We develop and evaluate standard and tailored techniques for clustering via unsupervised learning (UL) algorithms on such amalgamated (dual-domain) data sets. The aim of the associated algorithms is to identify geographically efficient clusters that also maximize the number of statistically significant differences in disease incidence and demographic variables across top clusters. Two standard UL approaches, k-means with k-means++ initialization (k++) and the standard self-organizing map (SSOM), are considered along with a new, tailored version of the SOM (TSOM). The TSOM algorithm involves optimization of a customized objective function, with terms promoting individual geographic cluster cohesion while also maximizing the number of differences across clusters, and two hyper-parameters controlling the relative weighting of the geographic and attribute subspaces in a non-Euclidean distance measure within the clustering problem. The performance of these three techniques (k++, SSOM, TSOM) is compared and evaluated on a data set of county-level colorectal cancer incidence in the state of California. Clusters are visualized via choropleth maps, and ordered graphs are used to illustrate disparities in disease incidence among four identity groups. While all three approaches performed well, the TSOM identified t...
Source: Health Care Management Science - Category: Hospital Management - Source Type: research
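
The abstract mentions a distance measure that weights the geographic and attribute subspaces through two hyper-parameters, but does not give its form. The sketch below is only one plausible reading, not the authors' definition: it assumes a weighted sum of a great-circle (haversine) distance between locations and a Euclidean distance between attribute vectors. The names `dual_domain_distance`, `w_geo`, and `w_attr`, the record layout, and the sample values are all hypothetical.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points (degrees)."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def dual_domain_distance(a, b, w_geo=1.0, w_attr=1.0):
    """Weighted sum of geographic and attribute distances.

    ASSUMPTION: this additive form and both weights are illustrative only;
    the source does not specify the actual distance measure.
    """
    geo = haversine_km(a["lat"], a["lon"], b["lat"], b["lon"])
    attr = math.dist(a["attrs"], b["attrs"])  # Euclidean distance on attributes
    return w_geo * geo + w_attr * attr


# Hypothetical records: a location plus a (pre-scaled) attribute vector.
unit_a = {"lat": 34.0, "lon": -118.0, "attrs": (0.0, 0.0)}
unit_b = {"lat": 38.0, "lon": -122.0, "attrs": (3.0, 4.0)}

attribute_only = dual_domain_distance(unit_a, unit_b, w_geo=0.0, w_attr=1.0)
geography_only = dual_domain_distance(unit_a, unit_b, w_geo=1.0, w_attr=0.0)
```

Setting `w_geo` to zero reduces the measure to attribute similarity alone, while raising it pulls clusters toward geographic compactness. In practice the attributes would need to be rescaled so that kilometres and attribute units are comparable before the weights carry any meaning.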