A new descriptor selection scheme for SVM in unbalanced class problem: a case study using skin sensitisation dataset.
Li S; Fedorowicz A; Andrew ME
SAR QSAR Environ Res 2007 Jul; 18(5-6):423-441
A novel descriptor selection scheme for the Support Vector Machine (SVM) classification method is proposed, and its utility is demonstrated using a skin sensitisation dataset as an example. A backward elimination procedure, guided by the mean accuracy (the average of specificity and sensitivity) of a leave-one-out cross-validation, was devised for the SVM. Subsets of descriptors were first selected using a sequential t-test filter or a Random Forest filter before backward elimination was applied. Different SVM kernels were compared using this descriptor selection scheme. The Radial Basis Function (RBF) kernel worked best when the sequential t-test filter was adopted. The highest mean accuracy, 84.9%, was obtained using SVM with 23 descriptors. The corresponding sensitivity and specificity were 93.1% and 76.6%, respectively. A linear kernel was found to be optimal when the Random Forest filter was used. Its performance with 24 descriptors was comparable to that of the RBF kernel with the sequential t-test filter. For comparison, Fisher's linear discriminant analysis (LDA) was carried out under the same descriptor selection scheme. SVM was shown to outperform LDA.
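The selection criterion described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`mean_accuracy`, `loo_predictions`, `backward_eliminate`, `nearest_centroid`) are hypothetical, the greedy "drop the descriptor whose removal scores best" rule is an assumption because the abstract does not state the exact elimination rule, and a simple nearest-centroid classifier stands in for the SVM so that the example needs only the standard library.

```python
from typing import Callable, List, Sequence, Tuple

Row = Sequence[float]
Classifier = Callable[[List[List[float]], List[int], List[float]], int]


def mean_accuracy(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """Average of sensitivity and specificity, as defined in the abstract."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def nearest_centroid(train_X: List[List[float]], train_y: List[int], test_x: List[float]) -> int:
    """Stand-in classifier (NOT an SVM): predict the class with the closest centroid."""
    best_label, best_dist = train_y[0], float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, t in zip(train_X, train_y) if t == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(test_x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loo_predictions(X: Sequence[Row], y: Sequence[int], fit_predict: Classifier,
                    columns: Sequence[int]) -> List[int]:
    """Leave-one-out: predict each sample from a model trained on all the others."""
    preds = []
    for i in range(len(X)):
        train_X = [[row[c] for c in columns] for j, row in enumerate(X) if j != i]
        train_y = [label for j, label in enumerate(y) if j != i]
        preds.append(fit_predict(train_X, train_y, [X[i][c] for c in columns]))
    return preds


def backward_eliminate(X: Sequence[Row], y: Sequence[int], fit_predict: Classifier,
                       columns: Sequence[int]) -> Tuple[List[int], float]:
    """Greedy backward elimination guided by leave-one-out mean accuracy.

    Assumed rule: at each step drop the descriptor whose removal gives the
    highest score, and remember the best subset seen along the way.
    """
    current = list(columns)
    best_subset = list(current)
    best_score = mean_accuracy(y, loo_predictions(X, y, fit_predict, current))
    while len(current) > 1:
        trials = []
        for c in current:
            trial = [k for k in current if k != c]
            score = mean_accuracy(y, loo_predictions(X, y, fit_predict, trial))
            trials.append((score, trial))
        score, current = max(trials, key=lambda item: item[0])
        if score >= best_score:
            best_score, best_subset = score, list(current)
    return best_subset, best_score
```

Averaging sensitivity and specificity weights both classes equally, which is why this score suits an unbalanced dataset better than plain accuracy. To approximate the setting in the abstract, `nearest_centroid` could be replaced by any callable with the same signature that trains an SVM on the training rows and returns a label for the test row.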
Statistical-analysis; Analytical-methods; Analytical-processes; Skin; Skin-sensitivity; Skin-irritants; Mathematical-models
S. Li, Health Effects Laboratory Division, National Institute for Occupational Safety and Health, Morgantown, WV 26505
SAR and QSAR in Environmental Research