Can anyone explain the R code below?
spam is a dataset.
sample <- sample( c(TRUE, FALSE), nrow(spam), replace=TRUE)
train <- spam[sample,]
test <- spam[!sample,]
When I run train and test, I get two different datasets that are split from the original spam dataset. But I don't understand the first line of code; I'm not sure why this line can split the data into two sets.
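sample() is a built-in R function. It randomly draws elements from the vector passed as the first argument and returns a vector whose length is given by the second argument. With replace=TRUE the draws are made with replacement, so the same value can be picked many times.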
sample <- sample( c(TRUE, FALSE), nrow(spam), replace=TRUE)
Here the sample function returns a logical vector whose length is the number of rows of the spam dataset, filled with randomly chosen TRUE and FALSE values.
Here is what the sample function returns (I've used a random dataset):
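As a minimal sketch of that return value, assume a small made-up data frame named toy in place of spam. Both toy and the name mask are hypothetical, and the seed is only there so repeated runs give the same draw:

```r
# toy is a hypothetical stand-in for the spam dataset
toy <- data.frame(id = 1:10, value = letters[1:10])

set.seed(1)  # assumption: fixed seed, used only for reproducibility

# Draw nrow(toy) values from c(TRUE, FALSE), with replacement
mask <- sample(c(TRUE, FALSE), nrow(toy), replace = TRUE)

print(mask)          # a logical vector with one entry per row of toy
print(length(mask))  # same as nrow(toy)
print(table(mask))   # how many TRUE and FALSE values were drawn
```

The result is named mask here rather than sample only to avoid confusing the vector with the function of the same name.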
train <- spam[sample,]
This line returns the rows whose corresponding value in the sample vector is TRUE,
i.e. it returns the ith row if the ith value in the sample vector is TRUE.
test <- spam[!sample,]
This line returns the rows whose corresponding value in the sample vector is FALSE,
i.e. it returns the ith row if the ith value in the sample vector is FALSE.
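To see the row selection without any randomness, the logical vector can be written by hand. This is only an illustrative sketch: toy and keep are made-up names, and the TRUE/FALSE pattern is chosen arbitrarily:

```r
# toy is a hypothetical five-row dataset
toy <- data.frame(id = 1:5, label = c("a", "b", "c", "d", "e"))

# Hand-written logical vector, one entry per row of toy
keep <- c(TRUE, FALSE, TRUE, TRUE, FALSE)

train <- toy[keep, ]   # rows where keep is TRUE
test  <- toy[!keep, ]  # rows where keep is FALSE (! flips every value)

print(train$id)  # 1 3 4
print(test$id)   # 2 5
```

Replacing the hand-written keep with a vector drawn by sample() gives the same kind of selection, except that which rows are kept changes from run to run.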
That is why train and test contain different rows: every row of spam goes into exactly one of them, so the sample vector has split your dataset.
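A quick way to check that the two parts do not overlap and together cover the whole dataset is sketched below. It assumes a made-up data frame toy with a unique id column, which the original spam dataset may not have:

```r
# toy is a hypothetical dataset with a unique id per row
toy <- data.frame(id = 1:20, value = (1:20)^2)

# Random TRUE/FALSE vector, one entry per row
mask <- sample(c(TRUE, FALSE), nrow(toy), replace = TRUE)

train <- toy[mask, ]
test  <- toy[!mask, ]

# The two parts account for every row exactly once
print(nrow(train) + nrow(test) == nrow(toy))        # TRUE
print(length(intersect(train$id, test$id)) == 0)    # TRUE
print(setequal(c(train$id, test$id), toy$id))       # TRUE
```

Because each row is assigned by an independent random draw, the two parts are only roughly equal in size, and the exact counts change from run to run.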