1. Briefly explain the concept of "weak labels" and when such labels are useful.
2. Briefly explain the concept of Missing At Random (MAR).
3. Explain what the following code does (assume pandas has been imported and aliased as pd):
df=pd.read_csv('college.csv', na_values='.')
Please find the answers below.
1.
Weak labels:
Weak labels are imperfect but inexpensive labels that can be generated programmatically for millions of training examples using heuristics, rules of thumb, existing databases, ontologies, etc. They address the data-labeling bottleneck by reducing the cost of human labeling effort and making it more efficient. They are useful when true labels are scarce or expensive: a large set of weakly labeled data can be used to pretrain a neural network, and its parameters can then be fine-tuned on a small amount of data with true labels. There are three types of weak labels.
Although weak labels are imperfect, they can still be used to build a strong predictive model.
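As a rough illustration of generating weak labels programmatically from heuristics, here is a minimal sketch. The spam/not-spam task, the three rules, and all names (`lf_contains_free`, `weak_label`, etc.) are made up for this example and are not from any particular library; each rule votes or abstains, and a simple majority vote produces the weak label.

```python
# Hypothetical label values for an illustrative spam / not-spam task.
SPAM, HAM, ABSTAIN = 1, 0, -1

def lf_contains_free(text):
    # Heuristic: the word "free" suggests spam.
    return SPAM if "free" in text.lower() else ABSTAIN

def lf_contains_link(text):
    # Heuristic: a URL suggests spam.
    return SPAM if "http://" in text or "https://" in text else ABSTAIN

def lf_short_greeting(text):
    # Heuristic: a short message that opens with a greeting suggests not spam.
    is_greeting = text.lower().startswith(("hi", "hello"))
    return HAM if is_greeting and len(text) < 40 else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_free, lf_contains_link, lf_short_greeting]

def weak_label(text):
    """Combine the heuristic votes by majority; abstain on no votes or a tie."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    spam_votes = sum(v == SPAM for v in votes)
    ham_votes = len(votes) - spam_votes
    if spam_votes == ham_votes:
        return ABSTAIN
    return SPAM if spam_votes > ham_votes else HAM

for message in ["Get a FREE prize at http://example.com",
                "Hi, are we still meeting?",
                "Quarterly report attached"]:
    print(message, "->", weak_label(message))
```

Labels produced this way will be noisy, which is why they are typically used for pretraining and then corrected with a small set of true labels.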
2.
Missing At Random (MAR):
Missing at random (MAR) is a missing-data mechanism (also called a response model) in which the probability that a value is missing is the same only within groups defined by the observed data. In other words, missingness is systematically related to the observed data, but not to the missing values themselves. MAR is more general and more realistic than missing completely at random (MCAR). An example of MAR is a sample drawn from population data where the probability of being included depends on some known property. Most modern missing-data methods start from the MAR assumption.
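To make the definition concrete, here is a small simulation sketch. The column names (`age_group`, `income`) and the missingness probabilities (0.1 and 0.5) are assumptions chosen only for illustration: whether `income` is missing depends on the observed `age_group`, and not on the income value itself.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical data: age_group is always observed, income may go missing.
df = pd.DataFrame({
    "age_group": rng.choice(["young", "old"], size=n),
    "income": rng.normal(50_000, 10_000, size=n),
})

# MAR: the chance that income is missing depends only on the observed group.
p_missing = df["age_group"].map({"young": 0.1, "old": 0.5})
mask = rng.random(n) < p_missing.to_numpy()
df.loc[mask, "income"] = np.nan

# The missing rate differs between groups but is constant within each group.
missing_rate = df["income"].isna().groupby(df["age_group"]).mean()
print(missing_rate)
```

If the probability were the same for every row (for example 0.3 everywhere), the data would be MCAR instead; if it depended on the income value itself, the data would no longer be MAR.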
3.
df=pd.read_csv('college.csv', na_values='.')
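The read_csv() function reads a .csv file into a DataFrame. The na_values parameter specifies additional strings that should be recognized as missing values when the file is parsed.
The above code reads the 'college.csv' file, treats every field whose value is exactly '.' as NaN (in addition to the default missing-value markers), and stores the resulting DataFrame in the df variable.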
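The behavior can be checked without the actual file. The sketch below uses a small in-memory CSV as a stand-in for 'college.csv'; the column names and values are invented for the example, since the contents of the real file are not shown in the question.

```python
import io
import pandas as pd

# Hypothetical stand-in for college.csv, with '.' marking missing fields.
csv_text = (
    "name,enrollment,tuition\n"
    "Alpha College,1200,.\n"
    "Beta University,.,30000\n"
)

df = pd.read_csv(io.StringIO(csv_text), na_values='.')

print(df)
print(df.isna().sum())  # count of missing values per column
print(df.dtypes)        # numeric columns stay numeric instead of becoming strings
```

Without na_values='.', the '.' entries would be read as text, so the enrollment and tuition columns would be parsed as strings rather than numbers.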