Find out the latest Data Collection and Pre-processing Techniques available in the market. Select any one of each and suggest how to improve on it.
ETL is a process in Data Warehousing; it stands for Extract, Transform and Load. An ETL tool extracts data from various source systems, transforms it in the staging area, and finally loads it into the Data Warehouse system.
Let us understand each step of the ETL process:
a) Extraction: data is read from one or more source systems (databases, flat files, applications) and copied into the staging area.
b) Transformation: in the staging area the data is cleaned, validated, deduplicated and converted into the format required by the warehouse.
c) Loading: the transformed data is written into the Data Warehouse, either as an initial full load or as periodic incremental loads.
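To make the three steps concrete, here is a minimal sequential sketch in Python. It is illustrative only: the CSV source sales.csv, its columns, and the SQLite database standing in for the warehouse are all assumptions, not features of any particular ETL tool.

```python
# A minimal sequential ETL sketch: extract from a CSV file, transform in
# memory (the "staging area"), load into a SQLite "warehouse" table.
# File, column and table names are invented for the example.
import csv
import sqlite3

def extract(path):
    """Extract: read raw rows from a source system (here, a CSV file)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: clean and standardize rows in the staging area."""
    return [{"name": row["name"].strip().title(),
             "amount": round(float(row["amount"]), 2)}
            for row in rows]

def load(rows, db_path="warehouse.db"):
    """Load: write the transformed rows into the warehouse table."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (:name, :amount)", rows)
    con.commit()
    con.close()

load(transform(extract("sales.csv")))
```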
The ETL process can also be pipelined: as soon as some data has been extracted, it can be transformed, and while it is being transformed, new data can be extracted. Likewise, while transformed data is being loaded into the data warehouse, already-extracted data can be transformed. A sketch of such a pipeline is shown below.
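The generator-based sketch below models the pipelining idea in plain Python, reusing the assumed file, column and table names from the previous sketch. Each record is handed to the transform and load stages as soon as it is extracted; a real ETL engine would run the stages truly concurrently, e.g., in separate threads or processes.

```python
# A pipelined ETL sketch: each stage is a generator, so a record can be
# transformed while the next one is still being extracted, and loaded
# while later records are still in the transform stage.
import csv
import sqlite3

def extract(path):
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield row  # hand each record to the next stage immediately

def transform(rows):
    for row in rows:
        yield {"name": row["name"].strip().title(),
               "amount": round(float(row["amount"]), 2)}

def load(rows, db_path="warehouse.db"):
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, amount REAL)")
    for row in rows:  # records stream through the whole pipeline
        con.execute("INSERT INTO sales VALUES (:name, :amount)", row)
    con.commit()
    con.close()

load(transform(extract("sales.csv")))
```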
ETL Tools: The most commonly used ETL tools are Sybase ETL, Oracle Warehouse Builder, CloverETL and MarkLogic.
What is Dimensionality Reduction?
In machine learning classification problems, there are often too many factors on the basis of which the final classification is done. These factors are basically variables called features. The higher the number of features, the harder it gets to visualize the training set and then work on it. Sometimes, most of these features are correlated, and hence redundant. This is where dimensionality reduction algorithms come into play. Dimensionality reduction is the process of reducing the number of random variables under consideration, by obtaining a set of principal variables. It can be divided into feature selection and feature extraction.
Why is Dimensionality Reduction important in Machine Learning and Predictive Modeling?
An intuitive example of dimensionality reduction is a simple e-mail classification problem, where we need to classify whether an e-mail is spam or not. This can involve a large number of features, such as whether the e-mail has a generic title, the content of the e-mail, whether the e-mail uses a template, and so on. However, some of these features may overlap. Similarly, a classification problem that relies on both humidity and rainfall can be collapsed onto just one underlying feature, since the two are highly correlated. Hence, we can reduce the number of features in such problems. A 3-D classification problem can be hard to visualize, whereas a 2-D one can be mapped to a simple 2-dimensional plane and a 1-D problem to a simple line; for instance, a 3-D feature space can be projected onto 2-D feature spaces, and if the resulting features turn out to be correlated, their number can be reduced even further.
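As a small illustration of the humidity/rainfall case, the snippet below generates two synthetic, strongly correlated features and checks their correlation coefficient. The data and the 0.9 cutoff are arbitrary choices for the example.

```python
# If two features are almost perfectly correlated, one of them adds
# little information and can be dropped or merged.
import numpy as np

rng = np.random.default_rng(0)
humidity = rng.uniform(40, 90, size=200)
rainfall = 0.8 * humidity + rng.normal(0, 2, size=200)  # tied to humidity

r = np.corrcoef(humidity, rainfall)[0, 1]
print(f"correlation: {r:.2f}")
if abs(r) > 0.9:
    print("features are highly correlated; one can be dropped or merged")
```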
Components of Dimensionality Reduction
The main components of dimensionality reduction (and of data reduction more generally) are:
a) Feature selection: Feature selection (FS) is “the process of identifying and removing as much irrelevant and redundant information as possible”. The goal is to obtain a subset of features from the original problem that still appropriately describes it. This subset is commonly used to train a learner, with added benefits reported in the specialized literature. FS can remove irrelevant and redundant features which may induce accidental correlations in learning algorithms, diminishing their generalization ability. FS can also be used in the data collection stage, saving cost in time, sampling, sensing and the personnel used to gather the data. (A combined code sketch of components a), b) and d) follows this list.)
b) Space transformations: FS is not the only way to cope with the curse of dimensionality by reducing the number of dimensions. Instead of selecting the most promising features, space transformation techniques generate a whole new set of features by combining the original ones. Such a combination can be made according to different criteria. The first approaches were based on linear methods, such as factor analysis and principal component analysis (PCA).
c) Instance reduction: A popular approach to minimizing the impact of very large data sets on data mining algorithms is the use of Instance Reduction (IR) techniques. They reduce the size of the data set without decreasing the quality of the knowledge that can be extracted from it. Instance reduction is a task complementary to FS: it reduces the quantity of data by removing instances or by generating new ones. The most widely used IR approach, instance selection, is described next.
d) Instance selection: Nowadays, instance selection is perceived as a necessity. Its main problem is to identify suitable examples from a very large number of instances and prepare them as input for a data mining algorithm. Instance selection therefore comprises a series of techniques that choose a subset of the data which can replace the original data set while still fulfilling the goal of the data mining application.
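The sketch below, assuming scikit-learn and a synthetic classification data set, illustrates components a), b) and d) side by side. The parameter values (k=5 features, the 10% sample) are arbitrary choices for the example.

```python
# Feature selection, space transformation and instance selection on the
# same synthetic data set (all names and sizes are illustrative).
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, random_state=0)

# a) Feature selection: keep the 5 features most associated with the label.
X_fs = SelectKBest(score_func=f_classif, k=5).fit_transform(X, y)
print("after feature selection:", X_fs.shape)

# b) Space transformation: build 5 new features as linear combinations
#    of the original 20 (here with PCA, one such transformation).
X_pca = PCA(n_components=5).fit_transform(X)
print("after PCA:", X_pca.shape)

# d) Instance selection: keep a stratified 10% sample of the rows,
#    preserving the class proportions of the original data set.
X_small, _, y_small, _ = train_test_split(X, y, train_size=0.1,
                                          stratify=y, random_state=0)
print("after instance selection:", X_small.shape)
```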
Methods of Dimensionality Reduction
The various methods used for dimensionality reduction include Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Correspondence Analysis (CA), and non-linear embedding methods such as multidimensional scaling (MDS) and t-SNE.
How to Improve?
Choose an appropriate method
The abundance of available DR methods can seem intimidating when you have to pick one for your analysis. In truth, you do not need to commit to a single tool; however, you must recognize which methods are appropriate for your application.
Handle categorical input data appropriately
In many cases, the available measurements are not numerical but qualitative or categorical. Such variables must either be encoded numerically or handled with methods designed for categorical data, such as correspondence analysis (CA), before most DR methods can be applied.
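A sketch of the encoding route, using pandas and scikit-learn; the data frame, its columns and their values are invented for the example.

```python
# One-hot encode the qualitative columns so that a numerical method
# such as PCA can be applied to the whole table.
import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame({
    "color":  ["red", "blue", "red", "green", "blue"],
    "size":   ["S", "M", "L", "M", "S"],
    "weight": [1.2, 3.4, 5.6, 3.1, 1.0],
})

encoded = pd.get_dummies(df, columns=["color", "size"])  # 0/1 indicators
reduced = PCA(n_components=2).fit_transform(encoded)
print(reduced.shape)  # (5, 2)
```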
Use embedding methods to reduce similarity and dissimilarity input data
When neither quantitative nor qualitative features are available, the relationships between data points, measured as dissimilarities (or similarities), can be the basis of DR performed as a low-dimensional embedding. Even when variable measurements are available, computing dissimilarities and using distance-based methods might be an effective approach.
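A sketch of this distance-based route, assuming scikit-learn and SciPy: pairwise Euclidean dissimilarities are computed on synthetic data and the points are embedded in 2-D with metric MDS.

```python
# Compute a pairwise dissimilarity matrix, then embed it in 2-D.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))                 # synthetic high-dim data

D = squareform(pdist(X, metric="euclidean"))  # 50x50 dissimilarity matrix
embedding = MDS(n_components=2, dissimilarity="precomputed",
                random_state=0).fit_transform(D)
print(embedding.shape)                        # (50, 2)
```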
Consciously decide on the number of dimensions to retain
When performing DR, choosing a suitable number of new dimensions to compute is crucial. This step determines whether the signal of interest is captured in the reduced data, which is especially important when DR is applied as a preprocessing step preceding statistical analyses or machine learning tasks (e.g., clustering).
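One common heuristic, sketched below with scikit-learn's PCA on the Iris data, is to keep the smallest number of components that explains a chosen fraction of the variance; the 90% threshold is an arbitrary example.

```python
# Pick the number of dimensions from the cumulative explained variance.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
pca = PCA().fit(X)                            # fit with all components
cum = np.cumsum(pca.explained_variance_ratio_)
k = int(np.searchsorted(cum, 0.90)) + 1       # smallest k reaching 90%
print(f"keep {k} of {X.shape[1]} dimensions "
      f"(cumulative variance {cum[k - 1]:.2f})")

# Equivalently, scikit-learn picks k itself when given a float threshold:
X_reduced = PCA(n_components=0.90).fit_transform(X)
print(X_reduced.shape)
```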
Understand the meaning of the new dimensions
Many linear DR methods, including PCA and CA, provide a reduced representation both for the observations and for the variables. Feature maps or correlation circles can be used to determine which original variables are associated with each other or with the newly generated output dimensions.
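For PCA, one concrete way to do this is to inspect the loadings (the components_ matrix), as in the sketch below on the Iris data; large absolute loadings mark the original variables most associated with each new dimension.

```python
# Inspect PCA loadings to see which original variables drive each
# component of the reduced representation.
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

data = load_iris()
pca = PCA(n_components=2).fit(data.data)

loadings = pd.DataFrame(pca.components_.T,
                        index=data.feature_names,
                        columns=["PC1", "PC2"])
print(loadings.round(2))  # large |values| mark the influential variables
```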