An empirical evaluation of supervised learning approaches in assigning diagnosis codes to electronic medical records

Conclusions We show that datasets at different scale (size of the EMRs, number of distinct codes) and with different characteristics warrant different learning approaches. For shorter narratives pertaining to a particular medical subdomain (e.g., radiology, pathology), classifier chaining is ideal given the codes are highly related with each other. For realistic in-patient full EMRs, feature and data selection methods offer high performance for smaller datasets. However, for large EMR datasets, we observe that the binary relevance approach with learning-to-rank based code reranking offers the best performance. Regardless of the training dataset size, for general EMRs, label calibration to select the optimal number of labels is an indispensable final step.
Source: Artificial Intelligence in Medicine - Category: Bioinformatics Source Type: research