Traditional Chinese medicine clinical records classification with BERT and domain specific corpora

AbstractTraditional Chinese Medicine (TCM) has been developed for several thousand years and plays a significant role in health care for Chinese people. This paper studies the problem of classifying TCM clinical records into 5 main disease categories in TCM. We explored a number of state-of-the-art deep learning models and found that the recent Bidirectional Encoder Representations from Transformers can achieve better results than other deep learning models and other state-of-the-art methods. We further utilized an unlabeled clinical corpus to fine-tune the BERT language model before training the text classifier. The method only uses Chinese characters in clinical text as input without preprocessing or feature engineering. We evaluated deep learning models and traditional text classifiers on a benchmark data set. Our method achieves a state-of-the-art accuracy 89.39% ± 0.35%, Macro F1 score 88.64% ± 0.40% and Micro F1 score 89.39% ± 0.35%. We also visualized attention weights in our method, which can reveal indicative characters in clinical text.
Source: Journal of the American Medical Informatics Association - Category: Information Technology Source Type: research