Question

In: Computer Science

Find out the latest Data Collection and Pre-processing Techniques available in the market. Select any one of each and give suggestions on how to improve it.

Solutions

Expert Solution

ETL stands for Extract, Transform and Load, and it is a core process in data warehousing: an ETL tool extracts data from various source systems, transforms it in a staging area, and finally loads it into the data warehouse system.

Let us understand each step of the ETL process in depth:

  1. Extraction:
    The first step of the ETL process is extraction. In this step, data is extracted from various source systems, which can come in many formats such as relational databases, NoSQL stores, XML and flat files, into the staging area. It is important to extract the data into the staging area first rather than directly into the data warehouse, because the extracted data arrives in different formats and may also be corrupted. Loading it directly into the data warehouse could damage the warehouse, and rolling it back would be much more difficult. This makes extraction one of the most important steps of the ETL process.
  2. Transformation:
    The second step of the ETL process is transformation. In this step, a set of rules or functions is applied to the extracted data to convert it into a single standard format. It may involve the following processes/tasks:
    • Filtering – loading only certain attributes into the data warehouse.
    • Cleaning – filling NULL values with default values, mapping U.S.A, United States and America to USA, etc.
    • Joining – joining multiple attributes into one.
    • Splitting – splitting a single attribute into multiple attributes.
    • Sorting – sorting tuples on the basis of some attribute (generally a key attribute).
  3. Loading:
    The third and final step of the ETL process is loading. In this step, the transformed data is loaded into the data warehouse. Sometimes the warehouse is refreshed very frequently, and sometimes loading is done at longer but regular intervals. The rate and period of loading depend entirely on the requirements and vary from system to system.
  4. The ETL process can also be pipelined: as soon as some data has been extracted, it can be transformed while new data is being extracted, and while the transformed data is being loaded into the data warehouse, the data already extracted can be transformed in parallel.

    ETL Tools: Commonly used ETL tools include Sybase, Oracle Warehouse Builder, CloverETL and MarkLogic.
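To make the three steps concrete, below is a minimal ETL sketch in Python using pandas and SQLite. The source file sales_raw.csv, the column names and the warehouse table are hypothetical placeholders, so treat this as an illustration of the pattern rather than a production pipeline.

```python
import sqlite3
import pandas as pd

# Extract: pull raw data from the source system into a staging DataFrame.
staging = pd.read_csv("sales_raw.csv")          # hypothetical source file

# Transform: apply the rules described above in the staging area.
staging = staging[["order_id", "country", "amount"]]             # filtering
staging["country"] = staging["country"].replace(
    {"U.S.A": "USA", "United States": "USA", "America": "USA"})  # cleaning
staging["amount"] = staging["amount"].fillna(0.0)                # default values
staging = staging.sort_values("order_id")                        # sorting

# Load: write the transformed data into the warehouse table.
with sqlite3.connect("warehouse.db") as conn:
    staging.to_sql("sales_fact", conn, if_exists="append", index=False)
```

Each transformation line corresponds to one of the tasks listed above (filtering, cleaning, sorting), which is the point of doing this work in the staging area before anything touches the warehouse.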

What is Dimensionality Reduction?

In machine learning classification problems, there are often too many factors on the basis of which the final classification is done. These factors are basically variables called features. The higher the number of features, the harder it gets to visualize the training set and then work on it. Sometimes, most of these features are correlated, and hence redundant. This is where dimensionality reduction algorithms come into play. Dimensionality reduction is the process of reducing the number of random variables under consideration, by obtaining a set of principal variables. It can be divided into feature selection and feature extraction.

Why is Dimensionality Reduction important in Machine Learning and Predictive Modeling?

An intuitive example of dimensionality reduction can be discussed through a simple e-mail classification problem, where we need to classify whether the e-mail is spam or not. This can involve a large number of features, such as whether or not the e-mail has a generic title, the content of the e-mail, whether the e-mail uses a template, etc. However, some of these features may overlap. Similarly, a classification problem that relies on both humidity and rainfall can be collapsed into just one underlying feature, since the two are highly correlated. Hence, we can reduce the number of features in such problems. A 3-D classification problem can be hard to visualize, whereas a 2-D one can be mapped to a simple two-dimensional space, and a 1-D problem to a simple line. For example, a 3-D feature space can first be reduced to 2-D, and if the remaining features are still correlated, reduced even further.

Components of Dimensionality Reduction

The components of dimensionality reduction include the following, each described in more detail below:

  • Feature selection: In this, we try to find a subset of the original set of variables, or features, to get a smaller subset which can be used to model the problem. It usually involves three approaches:
    1. Filter
    2. Wrapper
    3. Embedded
  • Space transformations
  • Instance reduction
  • Instance selection

a) Feature selection: Feature selection (FS) is “the process of identifying and removing as much irrelevant and redundant information as possible”. The goal is to obtain a subset of features from the original problem that still appropriately describe it. This subset is commonly used to train a learner, with added benefits reported in the specialized literature. FS can remove irrelevant and redundant features which may induce accidental correlations in learning algorithms, diminishing their generalization abilities. FS can be used in the data collection stage, saving cost in time, sampling, sensing and personnel used to gather the data.
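As a concrete illustration of the filter approach to FS, here is a small sketch using scikit-learn's SelectKBest on synthetic data; the scoring function and the choice of k = 5 are illustrative assumptions, not recommendations.

```python
# Filter-style feature selection: score each feature independently of any
# learner and keep the top-scoring ones.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)

selector = SelectKBest(score_func=f_classif, k=5)   # keep the 5 best features
X_reduced = selector.fit_transform(X, y)

print(X_reduced.shape)                      # (200, 5)
print(selector.get_support(indices=True))   # indices of the retained features
```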

b) Space transformations: FS is not the only way to cope with the curse of dimensionality by reducing the number of dimensions. Instead of selecting the most promising features, space transformation techniques generate a whole new set of features by combining the original ones. Such a combination can be made obeying different criteria. The first approaches were based on linear methods, such as factor analysis and PCA.

c) Instance reduction: A popular approach to minimizing the impact of very large data sets on data mining algorithms is the use of Instance Reduction (IR) techniques. They reduce the size of the data set without decreasing the quality of the knowledge that can be extracted from it. Instance reduction is a task complementary to FS: it reduces the quantity of data by removing instances or by generating new ones. Instance selection, one of the most widely used of these techniques, is described next.

d) Instance selection: Nowadays, instance selection is perceived as necessary. The main problem in instance selection is to identify suitable examples from a very large amount of instances and then prepare them as input for a data mining algorithm. Thus, instance selection comprises a series of techniques that choose a subset of the data which can replace the original data set while still fulfilling the goal of the data mining application.
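A very simple form of instance selection is stratified random sampling, sketched below with scikit-learn; the 10% sampling fraction and the synthetic data are illustrative assumptions.

```python
# Keep a class-balanced subset of a large dataset instead of using all of it.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10000, n_features=10, random_state=0)

# Select 10% of the instances while preserving the class distribution.
X_subset, _, y_subset, _ = train_test_split(
    X, y, train_size=0.1, stratify=y, random_state=0
)
print(X_subset.shape)  # (1000, 10)
```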

Methods of Dimensionality Reduction

The various methods used for dimensionality reduction include:

  • Principal Component Analysis (PCA)
  • Linear Discriminant Analysis (LDA)
  • Generalized Discriminant Analysis (GDA)
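Of the methods listed above, PCA is the most widely used. A minimal sketch with scikit-learn follows; the Iris data set and the choice of two components are illustrative assumptions, and the data is standardized first because PCA is sensitive to feature scale.

```python
# PCA: project the data onto the directions of maximum variance.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)       # 4 features -> 2 principal components

print(X_2d.shape)                        # (150, 2)
print(pca.explained_variance_ratio_)     # variance captured by each component
```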

How to improve on these techniques?

Choose an appropriate method

The abundance of available dimensionality reduction (DR) methods can seem intimidating when you want to pick one out of the existing bounty for your analysis. The truth is, you don't really need to commit to only one tool; however, you must recognize which methods are appropriate for your application.

Handle categorical input data appropriately

In many cases, the available measurements are not numerical but qualitative or categorical. Such variables need to be encoded numerically, for example by one-hot encoding, before most DR methods can be applied.
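As a small sketch of one way to do this, the snippet below one-hot encodes a categorical column with pandas; the column names and values are made up for illustration.

```python
# One-hot encode a categorical column so that every feature is numeric.
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "size": [1.0, 2.5, 3.0, 2.0],
})

df_encoded = pd.get_dummies(df, columns=["color"])
print(df_encoded.columns.tolist())
# ['size', 'color_blue', 'color_green', 'color_red']
```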

Use embedding methods for reducing similarity and dissimilarity input data

When neither quantitative nor qualitative features are available, the relationships between data points, measured as dissimilarities (or similarities), can be the basis of DR performed as a low-dimensional embedding. Even when variable measurements are available, computing dissimilarities and using distance-based methods might be an effective approach.
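One common distance-based approach is classical multidimensional scaling (MDS) on a precomputed dissimilarity matrix, sketched below with scikit-learn; the random data and the Euclidean metric are illustrative assumptions.

```python
# Embed a dissimilarity matrix into 2-D with MDS.
import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))

D = pairwise_distances(X, metric="euclidean")        # dissimilarity matrix
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
X_embedded = mds.fit_transform(D)                    # 2-D embedding

print(X_embedded.shape)  # (50, 2)
```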

Consciously decide on the number of dimensions to retain

When performing DR, choosing a suitable number of new dimensions to compute is crucial. This step determines whether the signal of interest is captured in the reduced data, which is especially important when DR is applied as a preprocessing step preceding statistical analyses or machine learning tasks (e.g., clustering).
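For PCA-like methods, one common (though not universal) heuristic is to keep enough components to explain a chosen fraction of the variance. The sketch below uses scikit-learn on the digits data set, with a 95% threshold that is purely an illustrative assumption.

```python
# Choose the number of components from the cumulative explained variance.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.searchsorted(cumulative, 0.95)) + 1
print(f"{n_components} components capture ~95% of the variance")
```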

Understand the meaning of the new dimensions

Many linear DR methods, including PCA and CA, provide a reduced representation both for the observations and for the variables. Feature maps or correlation circles can be used to determine which original variables are associated with each other or with the newly generated output dimensions.
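For PCA specifically, the components_ matrix (the loadings) shows how strongly each original variable contributes to each new dimension, which is one way to build such an interpretation; the sketch below uses the Iris data set as an illustrative example.

```python
# Inspect PCA loadings to relate new dimensions back to original variables.
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_iris()
X_scaled = StandardScaler().fit_transform(data.data)

pca = PCA(n_components=2).fit(X_scaled)
loadings = pd.DataFrame(pca.components_.T,
                        index=data.feature_names,
                        columns=["PC1", "PC2"])
print(loadings)  # large absolute values mark the most influential variables
```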

