Multivariate binary classification of imbalanced datasets: A case study based on high‐dimensional multiplex autoimmune assay data

Classifying a population by a specific trait is a major task in medicine, for example in diagnostic settings, where groups of patients with specific diseases are identified, and in predictive medicine, where patients are assigned to disease severity classes that might profit from different treatments. When those subgroups are small, as in rare diseases, imbalance between the classes is more the rule than the exception. It makes statistical classification problematic because many observations are classified as belonging to the majority class: the error rate of the minority class is high, while the error rate of the majority class is low. This case study investigates class imbalance for Random Forests and Powered Partial Least Squares Discriminant Analysis (PPLS‐DA) and evaluates the performance of these classifiers when they are combined with methods that compensate for imbalance (sampling methods and cost‐sensitive learning approaches). We evaluate all approaches with a scoring system that takes the classification results into consideration. The study is based on one high‐dimensional multiplex autoimmune assay dataset that describes immune response to antigens and consists of two classes of patients: Rheumatoid Arthritis (RA) and Systemic Lupus Erythematosus (SLE). Datasets with varying degrees of imbalance are created by successively reducing the class of RA patients. Our results indicate possible benefit of co...
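The imbalance setup described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' procedure: the class sizes are made up, the helper names (`reduce_class`, `random_oversample`, `per_class_error`) are hypothetical, and random oversampling stands in as one simple example of a sampling method, which may differ from the methods the study actually used. The sketch shrinks one class step by step, shows how a classifier that always predicts the majority class gets a low overall error while failing the minority class completely, and then rebalances the data by duplicating minority samples.

```python
import random
from collections import Counter


def reduce_class(labels, target, keep_n, rng):
    """Return indices of all samples, keeping only keep_n samples of class `target`."""
    target_idx = [i for i, y in enumerate(labels) if y == target]
    other_idx = [i for i, y in enumerate(labels) if y != target]
    kept = rng.sample(target_idx, keep_n)
    return sorted(other_idx + kept)


def random_oversample(indices, labels, rng):
    """Duplicate randomly chosen samples of smaller classes until every class
    is as large as the largest one (one simple sampling method among many)."""
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    largest = max(len(v) for v in by_class.values())
    balanced = []
    for idx in by_class.values():
        extra = [rng.choice(idx) for _ in range(largest - len(idx))]
        balanced.extend(idx + extra)
    return balanced


def per_class_error(y_true, y_pred):
    """Error rate computed separately for each true class."""
    totals, wrong = Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t != p:
            wrong[t] += 1
    return {c: wrong[c] / totals[c] for c in totals}


rng = random.Random(0)
labels = ["RA"] * 50 + ["SLE"] * 50  # assumed sizes, not the study's data

# Successively reduce the RA class to create increasing imbalance.
for keep_n in (50, 25, 10, 5):
    idx = reduce_class(labels, "RA", keep_n, rng)
    print(keep_n, dict(Counter(labels[i] for i in idx)))

# A trivial classifier that always predicts the majority class (SLE).
idx = reduce_class(labels, "RA", 5, rng)
y_true = [labels[i] for i in idx]
y_pred = ["SLE"] * len(y_true)
errors = per_class_error(y_true, y_pred)
overall = sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)
print("per-class error:", errors, "overall error:", round(overall, 3))

# Rebalance the reduced dataset by oversampling the minority class.
balanced_idx = random_oversample(idx, labels, rng)
balanced_counts = Counter(labels[i] for i in balanced_idx)
print("after oversampling:", dict(balanced_counts))
```

The overall error alone hides the failure on the minority class, which is why per-class error rates are reported separately in this sketch.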

http://onlinelibrary.wiley.com/resolve/doi?DOI=10.1002%2Fbimj.201600207

Source: Biometrical Journal - May 1, 2017 Category: Biotechnology Authors: Laura Schlieker, Anna Telaar, Angelika Lueking, Peter Schulz‐Knappe, Carmen Theek, Katja Ickstadt Tags: Case Study Source Type: research