Machine learning algorithms, bull genetic information, and imbalanced datasets used in abortion incidence prediction models for Iranian Holstein dairy cattle

The objective of this study was to predict pregnancy loss in Iranian dairy herds. For this purpose, cow history records and bull genetic information from 6 large commercial dairy farms, covering cows that calved between 2005 and 2014, were extracted from on-farm record-keeping software. Using WEKA, 12 commonly used machine learning (ML) algorithms were applied to the data. The algorithms belonged to 5 classifier groups: Bayes, meta, functions, rules, and trees. The original dataset, which included herd-cow factors, was randomly divided into 2 subsets, a training set and a test set, at a ratio of 60:40. The original dataset was also combined with the bull genetic information to create a full dataset. The average abortion rate was 15.4%, which made the dataset imbalanced. Therefore, 2 resampling techniques, down-sampling and up-sampling, were additionally applied to the training subset of the original dataset to create 2 balanced datasets. This ultimately yielded 4 datasets: original, full, down-sampling, and up-sampling. The algorithms and models were evaluated based on F-measure and area under the curve (AUC). Based on the results obtained, ML algorithms performed well in predicting abortion when applied to the balanced datasets. However, their performance varied from 32.3% (poor) to 69.2% (above medium) when applied to the imbalanced original dataset. In addition to the imbalance in the original dataset, the reason for these poor results w...
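The split and resampling steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' WEKA workflow: it uses plain Python, synthetic records, and hypothetical field names (`cow_id`, `aborted`), and it assumes random removal of majority-class records for down-sampling and random duplication of minority-class records for up-sampling, since the abstract does not name the exact techniques.

```python
import random

def split_train_test(records, train_frac=0.6, seed=0):
    """Randomly split records into training and test subsets (60:40 by default)."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_frac))
    return shuffled[:cut], shuffled[cut:]

def _by_class(train):
    """Return (minority, majority) lists of records, ordered by class size."""
    pos = [r for r in train if r["aborted"] == 1]
    neg = [r for r in train if r["aborted"] == 0]
    minority, majority = sorted((pos, neg), key=len)
    return minority, majority

def down_sample(train, seed=0):
    """Balance classes by randomly discarding majority-class records."""
    rng = random.Random(seed)
    minority, majority = _by_class(train)
    return minority + rng.sample(majority, len(minority))

def up_sample(train, seed=0):
    """Balance classes by randomly duplicating minority-class records."""
    rng = random.Random(seed)
    minority, majority = _by_class(train)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return majority + minority + extra

def f_measure(y_true, y_pred, positive=1):
    """F-measure: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Synthetic data only: 1000 hypothetical records with a 15.4% abortion rate.
records = [{"cow_id": i, "aborted": 1 if i < 154 else 0} for i in range(1000)]

train, test = split_train_test(records)
down = down_sample(train)  # balanced, smaller than the training set
up = up_sample(train)      # balanced, larger than the training set
```

Only the training subset is resampled here; the test subset keeps the original class proportions, so evaluation reflects the imbalance a model would face in practice. The `f_measure` helper also suggests why accuracy alone is a weak criterion on such data: a classifier that always predicts "no abortion" would be right for most records yet scores an F-measure of 0.0 for the abortion class.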
Source: Preventive Veterinary Medicine - Category: Veterinary Research - Source Type: research