Turkish Journal of Electrical Engineering and Computer Sciences
DOI
10.3906/elk-1807-212
Abstract
In this paper, we describe histogram matching, a metric for measuring the distance between two datasets that share exactly the same features, and embed it into a mixed integer programming formulation that partitions a dataset into fixed-size training and test subsets. The partition is chosen so that the pairwise distances between the dataset and the subsets are minimized with respect to histogram matching. We then conduct a numerical study using a well-known machine learning dataset. We demonstrate that the training set constructed with our approach has feature distributions almost identical to those of the whole dataset, whereas training sets constructed via random sampling show significant differences. We also show that our method introduces neither positive nor negative bias in the prediction accuracy of a decision tree, used as a representative example of a machine learning method.
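The abstract does not state the exact form of the histogram matching metric, so the following is only a minimal sketch under an assumed definition: for each feature, compare the relative bin frequencies of the full dataset and of a subset, and sum the absolute differences. The function names `histogram` and `histogram_distance`, the toy data, and the L1 form itself are illustrative assumptions, not the paper's formulation. The sketch only evaluates the distance for a given subset; it does not solve the mixed integer program that selects the subset.

```python
import random
from collections import Counter


def histogram(rows, feature, bins):
    """Relative frequency of each bin for one categorical feature."""
    counts = Counter(row[feature] for row in rows)
    n = len(rows)
    return [counts.get(b, 0) / n for b in bins]


def histogram_distance(full, subset):
    """Assumed L1 form: sum over features and bins of the absolute
    difference between relative frequencies in `full` and `subset`."""
    n_features = len(full[0])
    total = 0.0
    for f in range(n_features):
        # Bins are the distinct values the feature takes in the full dataset.
        bins = sorted({row[f] for row in full})
        h_full = histogram(full, f, bins)
        h_sub = histogram(subset, f, bins)
        total += sum(abs(a - b) for a, b in zip(h_full, h_sub))
    return total


# Hypothetical toy dataset with two categorical features.
data = [("a", 0), ("a", 1), ("b", 0), ("b", 1),
        ("a", 0), ("b", 1), ("a", 1), ("b", 0)]

# A randomly sampled training set of fixed size 4.
rng = random.Random(0)
random_train = rng.sample(data, 4)

print(histogram_distance(data, data))          # 0.0: identical distributions
print(histogram_distance(data, random_train))  # distance of a random subset
```

Under this assumed definition, a distance of zero means the subset reproduces every per-feature histogram of the full dataset exactly. An optimization-based partition would search for the fixed-size subset that drives this quantity as low as possible, rather than accepting whatever a random draw yields.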
Keywords
Distribution matching, instance selection, training set selection, optimization
First Page
1534
Last Page
1545
Recommended Citation
GENÇ, BURKAY and TUNÇ, HÜSEYİN (2019) "Optimal training and test sets design for machine learning," Turkish Journal of Electrical Engineering and Computer Sciences: Vol. 27: No. 2, Article 60.
https://doi.org/10.3906/elk-1807-212
Available at:
https://journals.tubitak.gov.tr/elektrik/vol27/iss2/60
Included in
Computer Engineering Commons, Computer Sciences Commons, Electrical and Computer Engineering Commons