Hi guys,
In a machine learning project, after crawling or collecting data, we have to split it into at least 2 parts: training and validation data. The training set is used to build predictive models, while the validation set is for validating the performance of those models.
Rules for splitting
While splitting data, we should adhere to the following 2 rules:
Good practices
The ratio of training to validation data points is typically 7:3 or 8:2.
K-fold cross-validation is also good practice; in that case the ratio is often 9:1 or, at the extreme, leave-one-out.
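As a quick illustration of an 8:2 split, here is a minimal sketch using sklearn's train_test_split. The toy arrays X and y are made up for the example, and random_state=0 is an arbitrary choice to make the shuffle reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumed for illustration): 100 samples, 2 features each.
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# test_size=0.2 holds out 20% of the samples for validation (an 8:2 split).
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_val.shape)  # (80, 2) (20, 2)
```

Changing test_size to 0.3 would give the 7:3 split instead.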
A 9:1 ratio corresponds to 10-fold cross-validation, meaning we break the whole dataset into 10 equal parts (folds). We then repeat the process 10 times: each time we train the model on 9 parts and validate on the remaining one, so every part serves as the validation set exactly once.
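The 10-fold procedure can be sketched with sklearn's KFold as follows. The 50-sample toy array is an assumption for the example, and the model fitting step is left as a comment since it depends on your model.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data (assumed for illustration): 50 samples, 2 features each.
X = np.arange(100).reshape(50, 2)

kf = KFold(n_splits=10, shuffle=True, random_state=0)

fold_sizes = []
validated = []
for train_idx, val_idx in kf.split(X):
    # Here you would fit on X[train_idx] and evaluate on X[val_idx].
    fold_sizes.append((len(train_idx), len(val_idx)))
    validated.extend(val_idx.tolist())

print(fold_sizes[0])  # (45, 5): 9 parts for training, 1 part for validation
```

Over the 10 iterations, every sample lands in the validation part exactly once.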
A leave-one-out strategy is n-fold cross-validation, where n is the size of the whole dataset. Each time we take only 1 sample for validation. Leave-one-out is powerful but only practical if the dataset is small or the model is very simple (and thus fast to train and validate).
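To see why leave-one-out gets expensive, here is a small sketch with sklearn's LeaveOneOut on a made-up 6-sample array: the number of train/validate rounds equals the number of samples.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

# Toy data (assumed for illustration): 6 samples, 2 features each.
X = np.arange(12).reshape(6, 2)

loo = LeaveOneOut()
splits = list(loo.split(X))

print(loo.get_n_splits(X))  # 6: one round per sample
for train_idx, val_idx in splits:
    # Each round trains on n - 1 samples and validates on the single one left out.
    print(train_idx, val_idx)
```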
After choosing the number of folds for your cross-validation strategy, let’s move on to determining how the data samples are distributed to each part (fold). There are 2 essential types of methods for this: Simple cross-validation and Stratified cross-validation.
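The difference between the two is easiest to see on imbalanced labels. The sketch below assumes that "simple" means splitting without looking at the labels (KFold here) and "stratified" means keeping the class proportions roughly equal in every fold (StratifiedKFold). The 90:10 label array and the helper positives_per_fold are made up for the example.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy imbalanced labels (assumed for illustration): 90 negatives, then 10 positives.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))

def positives_per_fold(splitter):
    """Count the positive samples in each validation fold (hypothetical helper)."""
    return [int(y[val_idx].sum()) for _, val_idx in splitter.split(X, y)]

simple = positives_per_fold(KFold(n_splits=5))
stratified = positives_per_fold(StratifiedKFold(n_splits=5))

print(simple)      # [0, 0, 0, 0, 10]
print(stratified)  # [2, 2, 2, 2, 2]
```

Without shuffling, the simple split leaves four folds with no positive samples at all, while the stratified split gives every fold the same 10% share as the whole dataset.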
Another important point: when your data is not too small, you should consider setting aside a test set alongside the training and validation sets. A separate test set helps you gauge your models’ performance more objectively and accurately. A common ratio for training:validation:testing is 60:20:20.
In some cases, the ratios above may change. For example, if your dataset is very large, say 10,000,000 diverse samples, the ratio could be 99:0.5:0.5, since 0.5% of 10,000,000 is 50,000, which is large enough for making statistical inferences.
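One way to get a 60:20:20 split is to call train_test_split twice: first hold out 20% for testing, then take 25% of the remaining 80% for validation (0.25 × 0.8 = 0.2). This is only a sketch on made-up toy arrays, not the only way to do it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumed for illustration): 100 samples, 2 features each.
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# Step 1: hold out 20% of all samples as the test set.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 2: 25% of the remaining 80% becomes the validation set (20% overall).
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```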
Sklearn’s splitting functions
Sklearn has many pre-built functions for this task. The full list can be found in the documentation of the sklearn.model_selection module. Below, I list the most common ones.
- sklearn.model_selection.train_test_split
- sklearn.model_selection.KFold
- sklearn.model_selection.StratifiedKFold
- sklearn.model_selection.LeaveOneOut
References:
- Sklearn’s model_selection modules: link