DOI

10.3906/elk-1807-212

Abstract

In this paper, we describe histogram matching, a metric for measuring the distance between two datasets with exactly the same features, and embed it into a mixed integer programming formulation to partition a dataset into fixed-size training and test subsets. The partition is chosen such that the pairwise distances between the dataset and the subsets are minimized with respect to histogram matching. We then conduct a numerical study using a well-known machine learning dataset. We demonstrate that the training set constructed with our approach yields feature distributions almost the same as those of the whole dataset, whereas training sets constructed via random sampling end up with significant differences. We also show that our method introduces neither positive nor negative bias in the prediction accuracy of a decision tree, used as a representative example of a machine learning method.
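
The abstract does not spell out the histogram matching metric, so the following is only an illustrative sketch under stated assumptions: each feature is split into equal-width bins, and the distance is the sum over features of the absolute gaps between bin shares. The names `histogram` and `histogram_distance`, the bin count, and the synthetic data are hypothetical and not taken from the paper. The sketch only scores a random split; it does not implement the mixed integer programming step that selects the subsets.

```python
import random


def histogram(values, lo, hi, bins):
    """Share of values in each of `bins` equal-width bins on [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins or 1.0  # guard against a constant feature
    for v in values:
        k = min(int((v - lo) / width), bins - 1)  # clamp the maximum into the last bin
        counts[k] += 1
    return [c / len(values) for c in counts]


def histogram_distance(full, subset, bins=10):
    """Assumed distance: per-feature L1 gap between bin shares, summed over features."""
    total = 0.0
    for j in range(len(full[0])):
        col_full = [row[j] for row in full]
        col_sub = [row[j] for row in subset]
        # Bin edges come from the full dataset so both histograms are comparable.
        lo, hi = min(col_full), max(col_full)
        h_full = histogram(col_full, lo, hi, bins)
        h_sub = histogram(col_sub, lo, hi, bins)
        total += sum(abs(a - b) for a, b in zip(h_full, h_sub))
    return total


random.seed(0)
# Synthetic stand-in data: 200 instances, 3 features.
data = [[random.gauss(0, 1) for _ in range(3)] for _ in range(200)]
train = random.sample(data, 150)  # random sampling baseline

d_self = histogram_distance(data, data)      # a dataset matches itself exactly
d_random = histogram_distance(data, train)   # gap left by a random training set
```

Under this assumed definition, a random training set generally leaves a nonzero gap, which is the quantity an optimized partition would try to shrink.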

Keywords

Distribution matching, instance selection, training set selection, optimization

First Page

1534

Last Page

1545

Recommended Citation

GENÇ, BURKAY and TUNÇ, HÜSEYİN (2019) "Optimal training and test sets design for machine learning," Turkish Journal of Electrical Engineering and Computer Sciences: Vol. 27: No. 2, Article 60. https://doi.org/10.3906/elk-1807-212
Available at: https://journals.tubitak.gov.tr/elektrik/vol27/iss2/60

Included in

Computer Engineering Commons, Computer Sciences Commons, Electrical and Computer Engineering Commons

Turkish Journal of Electrical Engineering and Computer Sciences

Optimal training and test sets design for machine learning
