In: Computer Science
Standardization
Goal: Perform the transformation on validation and test sets the right way. The following code shows two ways to standardize validation and test sets (shown here only for a test set; the same applies to a validation set).
Code:
import pandas as pd
X_train = pd.DataFrame([10, 20, 30])
X_test = pd.DataFrame([5, 6, 7])
mu_train, sigma_train = X_train.mean(axis=0), X_train.std(axis=0)
mu_test, sigma_test = X_test.mean(axis=0), X_test.std(axis=0)
X_train_std = (X_train - mu_train) / sigma_train
X_test_std1 = (X_test - mu_test) / sigma_test
X_test_std2 = (X_test - mu_train) / sigma_train

# Add your code for step 3 here
Since a Jupyter notebook can't be attached, I am adding the code as separate scripts.
1. Run the given code
import pandas as pd
X_train = pd.DataFrame([10, 20, 30])
X_test = pd.DataFrame([5, 6, 7])
mu_train, sigma_train = X_train.mean(axis=0), X_train.std(axis=0)
mu_test, sigma_test = X_test.mean(axis=0), X_test.std(axis=0)
X_train_std = (X_train - mu_train) / sigma_train
X_test_std1 = (X_test - mu_test) / sigma_test
X_test_std2 = (X_test - mu_train) / sigma_train
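To see why the two ways differ, it helps to print both standardized versions. With this toy data, mu_train = 20 and sigma_train = 10, while mu_test = 6 and sigma_test = 1, so the two results are quite different (this is a minimal sketch of the same code, with the printed comparison added):

```python
import pandas as pd

X_train = pd.DataFrame([10, 20, 30])
X_test = pd.DataFrame([5, 6, 7])

mu_train, sigma_train = X_train.mean(axis=0), X_train.std(axis=0)
mu_test, sigma_test = X_test.mean(axis=0), X_test.std(axis=0)

X_test_std1 = (X_test - mu_test) / sigma_test    # uses test-set statistics
X_test_std2 = (X_test - mu_train) / sigma_train  # reuses train-set statistics

# With mu_test = 6, sigma_test = 1: std1 is [-1, 0, 1].
# With mu_train = 20, sigma_train = 10: std2 is [-1.5, -1.4, -1.3].
print(X_test_std1)
print(X_test_std2)
```

Note that std1 makes the test set look perfectly centered on its own scale, hiding the fact that these points lie far below the training distribution; std2 preserves that information.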
2. Apply standardization using sklearn library
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)          # fit on the training data only
scaler.transform(X_test)     # reuse the training parameters on the test data
3. Reusing the parameters from the train set means the scaler is fit to the training data, and those same parameters (mean and standard deviation) are then used to transform the validation and test data. If the validation and test sets are split from the same dataset as the training set, this is the correct approach: it avoids leaking information from the held-out data into the preprocessing step, so the evaluation stays honest. But if the problem is such that the datasets come from very different circumstances (i.e., different distributions), then reusing the parameters obtained from the training data is not a good way to evaluate the validation and test sets, since those parameters no longer describe the data being transformed.
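The fit-on-train approach from sklearn should match the manual X_test_std2 computation, with one subtlety worth knowing: pandas .std() defaults to the sample standard deviation (ddof=1), while StandardScaler uses the population standard deviation (ddof=0), so the raw numbers differ slightly unless you align the two. A small sketch demonstrating the equivalence:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

X_train = pd.DataFrame([10, 20, 30])
X_test = pd.DataFrame([5, 6, 7])

# Fit on the training data only, then reuse those parameters.
scaler = StandardScaler()
scaler.fit(X_train)
X_test_sk = scaler.transform(X_test)

# Manual version using the same (population, ddof=0) statistics.
mu_train = X_train.mean(axis=0)
sigma_train = X_train.std(axis=0, ddof=0)  # pandas defaults to ddof=1
X_test_manual = (X_test - mu_train) / sigma_train

# The two agree; X_test_std2 from step 1 differs slightly only
# because it was computed with pandas' default ddof=1.
print(X_test_sk)
print(X_test_manual)
```

This also shows why calling scaler.fit(X_test) would be a mistake here: it would recompute mean and scale from the test data, reproducing X_test_std1 rather than the train-parameter version the exercise recommends.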
Let me know in the comments if further explanation is needed for any part.