Machine learning: simply handling missing features and matching the number of training features and labels

Kuniyoshi Takemoto

3 years ago

This post presents sample code for machine learning that simply drops rows with missing feature values and then matches the number of training features to the number of labels.
How missing values should be handled depends on your modeling approach, so this simple method is handy when you just want to get the code running for the time being.

Note, however, that the sample code is written on the assumption that the training features and the labels share the same index.
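
The shared-index assumption can be checked before any rows are dropped. The following is a minimal sketch, not part of the original sample: the data and the helper name `share_index` are hypothetical and are used only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data (not from the post): features and labels built separately.
features = pd.DataFrame({"FeatureA": [0.1, np.nan, 0.3]}, index=[0, 1, 2])
labels_ok = pd.Series([1, 0, 1], index=[0, 1, 2])
labels_bad = pd.Series([1, 0, 1], index=[10, 11, 12])


def share_index(features, labels):
    """Hypothetical helper: True when both objects have the same index in the same order."""
    return features.index.equals(labels.index)


print(share_index(features, labels_ok))   # True
print(share_index(features, labels_bad))  # False
```

If the check fails, selecting labels by the index of the features would pick the wrong rows or raise an error, so the indexes should be aligned first.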

import pandas as pd
import numpy as np

# Create sample data and insert NaN values
data_X = np.random.randn(6, 2)
data_X[0, 1] = np.nan
data_X[4, 0] = np.nan
print(data_X)
"""
#output
[[ 0.1669884          nan]
 [-0.93169488 -0.80602492]
 [ 1.34485881 -1.15684329]
 [-1.77475068  0.58345764]
 [        nan -1.34413655]
 [ 0.76400682  0.43928072]]
"""

#Create feature dataframe
train_features = pd.DataFrame(data_X, columns=["FeatureA","FeatureB"])
print(train_features)
"""
#output
   FeatureA  FeatureB
0  0.166988       NaN
1 -0.931695 -0.806025
2  1.344859 -1.156843
3 -1.774751  0.583458
4       NaN -1.344137
5  0.764007  0.439281
"""

#Create labels
train_labels = pd.Series([1,0,1,1,0,1])

# Drop rows whose features contain a missing value (NaN)
# how='any' drops a row if at least one of its values is missing
train_features = train_features.dropna(how='any')

# The number of rows in train_features and train_labels no longer matches
print(len(train_features.index.values))  # 4
print(len(train_labels.index.values))  # 6

# Match the number of train_labels to the number of train_features
# In other words, keep only the labels whose index also appears in train_features
# This assumes train_features and train_labels originally shared the same index
train_labels = train_labels.loc[train_features.index]

# The number of rows in train_features and train_labels now matches
print(len(train_features.index.values))  # 4
print(len(train_labels.index.values))  # 4
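
As an alternative to selecting labels by index, the same filtering can be done with a boolean mask applied by position. This is a minimal sketch under the assumption that features and labels have the same length and row order. The data values and the names `complete_rows`, `features_clean`, and `labels_clean` are made up for illustration and do not come from the sample above.

```python
import numpy as np
import pandas as pd

# Hypothetical data with the same shape as the sample (values are made up).
features = pd.DataFrame(
    {
        "FeatureA": [0.2, -0.9, 1.3, -1.8, np.nan, 0.8],
        "FeatureB": [np.nan, -0.8, -1.2, 0.6, -1.3, 0.4],
    }
)
labels = pd.Series([1, 0, 1, 1, 0, 1])

# True for rows in which every feature value is present.
complete_rows = features.notna().all(axis=1).to_numpy()

# Apply the same positional mask to both objects, so index labels are not used.
features_clean = features[complete_rows]
labels_clean = labels[complete_rows]

print(len(features_clean), len(labels_clean))  # 4 4
print(labels_clean.tolist())                   # [0, 1, 1, 1]
```

Because the mask is a plain NumPy array, it filters by row position. It therefore works even when the two objects carry different index labels, as long as their rows are in the same order.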