In: Computer Science
What is data mining? What are the steps in the data
mining pipeline? What are the different kinds of tasks in data
mining?
What are the different types of data? How do you
manage data quality? What is sampling? How do you design algorithms
for sampling? How do you compute similarity and dissimilarity in
data?
How do you mine frequent itemsets and discover
association rules from transaction data? How are space constraints
handled in transactional data while designing algorithms to
discover association rules? How will you use the A-priori
algorithm? How does one operationalize the A-priori
algorithm?
How do you find similar items using locality-sensitive
hashing? How is min-hashing operationalized using
permutations?
What is clustering? How do hierarchical clustering and
point assignment methods work? How would you optimize clustering
algorithms when you have data that does not fit in
memory?
Data mining is the process of sorting through large data sets to identify patterns and establish relationships to solve problems through data analysis. Data mining tools allow enterprises to predict future trends.
--Steps
Data science can’t answer any question without data. So, the most important thing is to obtain the data, but not just any data; it must be “authentic and reliable data.” It’s simple: garbage in, garbage out.
As a rule of thumb, there must be strict checks when obtaining your data. Now, gather all of your available datasets (which can be from the internet or external/internal databases/third parties) and extract their data into a usable format (.csv, JSON, XML, etc.).
This phase of the pipeline is very time-consuming and laborious. Most of the time, data comes with its own anomalies, like missing parameters, duplicate values, irrelevant features, etc. So it becomes very important to do a cleanup exercise and keep only the information relevant to the problem at hand, because the results and output of your machine learning model are only as good as what you put into it. Again, garbage in, garbage out.
The objective should be to examine the data thoroughly to understand every feature of the data you’re working with: identifying errors, filling gaps in the data, removing duplicate or corrupt records, and sometimes dropping an entire feature. Domain-level expertise is crucial at this stage to understand the impact of any feature or value.
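As a rough illustration, a minimal cleanup pass in Pandas might look like the sketch below; the file name, column choices, and thresholds are hypothetical and would depend on the problem at hand.

    import pandas as pd

    # Load the raw extract (file name is a placeholder)
    df = pd.read_csv("raw_data.csv")

    # Drop exact duplicate records
    df = df.drop_duplicates()

    # Drop columns that are mostly empty (the 50% threshold is a judgment call)
    df = df.dropna(axis=1, thresh=int(0.5 * len(df)))

    # Fill remaining numeric gaps with the column median
    numeric_cols = df.select_dtypes(include="number").columns
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

    # Keep only the features judged relevant to the problem (hypothetical names)
    # df = df[["age", "income", "purchase_amount"]]

    df.to_csv("clean_data.csv", index=False)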
Skills Required:
Coding language: Python, R.
Data Modifying Tools: Python libraries (NumPy, Pandas), R.
Distributed Processing: Hadoop, MapReduce/Spark.
During the visualization phase, you should try to find the patterns and relationships your data holds. You should use different types of visualizations and statistical testing techniques to back up your findings. This is where your data will start revealing its hidden secrets through various graphs, charts, and analyses. Domain-level expertise is desirable at this stage to fully understand the visualizations and their interpretations.
The objective is to uncover patterns through visualizations and charts, which also leads into feature extraction: using statistics to identify and test significant variables.
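For instance, a quick exploratory pass with Pandas and Matplotlib could look like the following sketch; the file and column names are placeholders.

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("clean_data.csv")  # placeholder file name

    # Distribution of a single numeric feature
    df["purchase_amount"].hist(bins=30)
    plt.title("Distribution of purchase amount")
    plt.show()

    # Relationship between two features
    df.plot.scatter(x="age", y="purchase_amount")
    plt.show()

    # Correlation matrix of numeric features as a quick screen for significant variables
    print(df.select_dtypes(include="number").corr())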
Skills Required:
Python: NumPy, Matplotlib, Pandas, SciPy.
R: ggplot2, dplyr.
Statistics: random sampling, inferential statistics.
Data Visualization: Tableau.
Machine learning models are generic tools. You can access many tools and algorithms and use them to accomplish different business goals. The better the features you use, the better your predictive power will be. After cleaning the data and identifying the features most important for a given business problem, using relevant models as predictive tools will enhance the decision-making process.
The objective here is in-depth analytics: mainly building relevant machine learning models, such as a predictive model or algorithm, to answer prediction-related questions.
The second important objective is to evaluate and refine your own model. This involves multiple sessions of evaluation and optimization cycles. No machine learning model is at its best on the first attempt. You will have to increase its accuracy by training it with fresh ingestions of data, minimizing losses, etc.
Various techniques and methods are available to assess the accuracy or quality of your model. Evaluating your machine learning algorithm is an essential part of the data science pipeline. Your model may give satisfying results when evaluated with one metric, say accuracy_score, yet perform poorly against others, such as logarithmic_loss. Classification accuracy is a standard way to measure a model's performance, but on its own it is not enough to truly judge a model. So, here you would test multiple models for their performance, error rate, etc., and choose the optimum one for your requirements (a short evaluation sketch follows the list below).
Some of the commonly used methods are:
Classification accuracy.
Logarithmic loss.
Confusion matrix.
Area under curve.
F1 score.
Mean absolute error.
Mean squared error.
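Below is a hedged sketch of scoring a single model against several of these metrics with scikit-learn; the synthetic data and the choice of logistic regression are assumptions made only for illustration.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import (accuracy_score, log_loss,
                                 confusion_matrix, roc_auc_score, f1_score)

    # Synthetic binary classification data, standing in for a real business data set
    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]  # probability of the positive class

    print("Accuracy:        ", accuracy_score(y_test, y_pred))
    print("Log loss:        ", log_loss(y_test, y_prob))
    print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
    print("ROC AUC:         ", roc_auc_score(y_test, y_prob))
    print("F1 score:        ", f1_score(y_test, y_pred))

A model that looks strong on accuracy alone may still show a poor log loss or F1 score, which is why several metrics are compared side by side.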
Skills Required:
Machine Learning: Supervised/Unsupervised algorithms.
Evaluation methods.
Machine Learning Libraries: Python (scikit-learn, NumPy).
Linear algebra and Multivariate Calculus.
Interpreting the data is more like communicating your findings to the interested parties. If you can’t explain your findings to someone, whatever you have done is of little use. Hence, this step becomes very crucial.
The objective of this step is to first identify the business insight and then correlate it to your data findings. You might need to involve domain experts in correlating the findings with business problems. Domain experts can help you in visualizing your findings according to the business dimensions which will also aid in communicating facts to a non-technical audience.
Skills required:
Business domain knowledge.
Data visualization tools: Tableau, D3.js, Matplotlib, ggplot2, Seaborn.
Communication: Presenting/speaking and reporting/writing.
Once your model is in production, it becomes important to revisit and update it periodically, depending on how often you receive new data or how the nature of the business changes. The more data you receive, the more frequent the updates will be. Assume you’re working for a transport provider and, one day, fuel prices go up and the company has to add electric vehicles to its fleet. Your old model doesn’t take this into account, and you must now update the model to include this new category of vehicles. If not, your model will degrade over time, won’t perform as well, and will drag the business down with it. The introduction of new features will alter the model's performance, either directly or through correlations with other features.
---Different Data Mining Tasks
There are a number of data mining tasks such as classification, prediction, time-series analysis, association, clustering, summarization etc. All these tasks are either predictive data mining tasks or descriptive data mining tasks. A data mining system can execute one or more of the above specified tasks as part of data mining.
Predictive data mining tasks come up with a model from the available data set that is helpful in predicting unknown or future values of another data set of interest. A medical practitioner trying to diagnose a disease based on the medical test results of a patient is an example of a predictive data mining task. Descriptive data mining tasks usually find patterns describing the data and come up with new, significant information from the available data set. A retailer trying to identify products that are purchased together is an example of a descriptive data mining task.
a) Classification
Classification derives a model to determine the class of an object based on its attributes. A collection of records is available, each with a set of attributes. One of the attributes is the class attribute, and the goal of the classification task is to assign a class to new records as accurately as possible.
Classification can be used in direct marketing, that is, to reduce marketing costs by targeting the customers who are likely to buy a new product. Using the available data, it is possible to know which customers purchased similar products in the past and which did not. Hence, {purchase, don’t purchase} forms the class attribute in this case. Once the class attribute is assigned, demographic and lifestyle information of customers who purchased similar products can be collected and promotional mail can be sent to them directly.
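As a toy illustration of this idea, the sketch below trains a decision tree on invented customer records and assigns the {purchase, don’t purchase} class to new customers; all features and numbers are made up.

    from sklearn.tree import DecisionTreeClassifier

    # Past records: [age, income_in_thousands]; class: 1 = purchased, 0 = did not
    X_train = [[25, 40], [47, 95], [35, 60], [52, 110], [23, 30], [40, 80]]
    y_train = [0, 1, 0, 1, 0, 1]

    clf = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)

    # Predict the class attribute for new customers
    new_customers = [[30, 50], [50, 100]]
    print(clf.predict(new_customers))  # e.g. [0 1]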
b) Prediction
The prediction task predicts possible values of missing or future data. Prediction involves developing a model based on the available data, and this model is used to predict future values of a new data set of interest. For example, a model can predict the income of an employee based on education, experience, and other demographic factors like place of residence, gender, etc. Prediction is also used in areas including medical diagnosis, fraud detection, etc.
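A minimal sketch of such a prediction model with scikit-learn, using invented education, experience, and income figures:

    from sklearn.linear_model import LinearRegression

    # Features: [years_of_education, years_of_experience]; target: income (hypothetical)
    X = [[12, 1], [16, 3], [16, 8], [18, 5], [12, 10], [20, 12]]
    y = [30000, 45000, 60000, 65000, 50000, 95000]

    model = LinearRegression().fit(X, y)

    # Predict income for a new employee profile
    print(model.predict([[16, 6]]))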
c) Time-Series Analysis
A time series is a sequence of events where the next event is determined by one or more of the preceding events. A time series reflects the process being measured, and there are certain components that affect the behavior of that process. Time-series analysis includes methods to analyze time-series data in order to extract useful patterns, trends, rules, and statistics. Stock market prediction is an important application of time-series analysis.
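As a small illustration, a rolling mean is one simple way to expose the trend component of a series; the synthetic daily series below is invented for the example.

    import pandas as pd
    import numpy as np

    # Synthetic daily series: an upward trend plus noise
    dates = pd.date_range("2023-01-01", periods=90, freq="D")
    values = np.linspace(100, 130, 90) + np.random.normal(0, 3, 90)
    series = pd.Series(values, index=dates)

    # A 7-day rolling mean smooths out the noise and exposes the trend
    trend = series.rolling(window=7).mean()
    print(trend.tail())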
d) Association
Association discovers connections among a set of items; it identifies relationships between objects. Association analysis is used for commodity management, advertising, catalog design, direct marketing, etc. A retailer can identify the products that customers normally purchase together, or even find the customers who respond to promotions of the same kind of products. If a retailer finds that beer and nappies are often bought together, they can put nappies on sale to promote the sale of beer.
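A minimal sketch of the underlying idea on a handful of made-up baskets, computing the support of an item pair and the confidence of one rule:

    from itertools import combinations
    from collections import Counter

    baskets = [
        {"beer", "nappies", "chips"},
        {"beer", "nappies"},
        {"bread", "milk"},
        {"beer", "nappies", "milk"},
        {"bread", "chips"},
    ]

    # Count how often each item and each item pair appears across baskets
    pair_counts = Counter()
    item_counts = Counter()
    for basket in baskets:
        item_counts.update(basket)
        pair_counts.update(combinations(sorted(basket), 2))

    # support(X, Y) = fraction of baskets containing both; confidence(X -> Y) = P(Y | X)
    support = pair_counts[("beer", "nappies")] / len(baskets)
    confidence = pair_counts[("beer", "nappies")] / item_counts["beer"]
    print(f"support(beer, nappies) = {support:.2f}")        # 0.60
    print(f"confidence(beer -> nappies) = {confidence:.2f}")  # 1.00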
e) Clustering
Clustering is used to identify data objects that are similar to one another. The similarity can be decided based on a number of factors like purchase behavior, responsiveness to certain actions, geographical locations and so on. For example, an insurance company can cluster its customers based on age, residence, income etc. This group information will be helpful to understand the customers better and hence provide better customized services.
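A short sketch with scikit-learn's KMeans, clustering invented customers on age and income:

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical customers: [age, annual_income_in_thousands]
    customers = np.array([
        [22, 25], [25, 30], [23, 28],   # younger, lower income
        [45, 80], [50, 95], [48, 85],   # middle-aged, higher income
        [70, 40], [68, 35], [72, 45],   # older, moderate income
    ])

    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
    print(kmeans.labels_)           # cluster assignment for each customer
    print(kmeans.cluster_centers_)  # the centre of each customer segment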
f) Summarization
Summarization is the generalization of data. A set of relevant data is summarized, resulting in a smaller set that gives aggregated information about the data. For example, the shopping done by a customer can be summarized into total products, total spending, offers used, etc. Such high-level summarized information can be useful to the sales or customer relationship team for detailed analysis of customers and purchase behavior. Data can be summarized at different abstraction levels and from different angles.
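For example, Pandas can roll individual transaction rows up into one summary row per customer; the table below is invented for illustration.

    import pandas as pd

    transactions = pd.DataFrame({
        "customer":   ["A", "A", "B", "B", "B", "C"],
        "amount":     [20.0, 35.5, 12.0, 40.0, 8.5, 99.9],
        "offer_used": [True, False, False, True, True, False],
    })

    # One aggregated row per customer: product count, total spend, offers used
    summary = transactions.groupby("customer").agg(
        total_products=("amount", "count"),
        total_spending=("amount", "sum"),
        offers_used=("offer_used", "sum"),
    )
    print(summary)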
--Data types
The 5 Pillars of Data Quality Management
Now that you understand the importance of high-quality data and want to take action to solidify your data foundation, let’s take a look at the techniques behind DQM and the 5 pillars supporting it.
1 – The people
Technology is only as efficient as the individuals who implement it. We may function within a technologically advanced business society, but human oversight and process implementation have not (yet) been rendered obsolete. Therefore, there are several DQM roles that need to be filled, including:
DQM Program Manager: The program manager role should be filled by a high-level leader who accepts the responsibility of general oversight for business intelligence initiatives. He/she should also oversee the management of the daily activities involving data scope, project budget and program implementation. The program manager should lead the vision for quality data and ROI.
Organization Change Manager: The change manager does exactly what the title suggests: organizing. He/she assists the organization by providing clarity and insight into advanced data technology solutions. As quality issues are often highlighted with the use of dashboard software, the change manager plays an important role in the visualization of data quality.
Business/Data Analyst: The business analyst is all about the “meat and potatoes” of the business. This individual defines the quality needs from an organizational perspective. These needs are then quantified into data models for acquisition and delivery. This person (or group of individuals) ensures that the theory behind data quality is communicated to the development team.
2 – Data profiling
Data profiling is an essential process in the DQM lifecycle. It involves:
Reviewing data in detail
Comparing and contrasting the data to its own metadata
Running statistical models
Reporting the quality of the data
This process is initiated to develop insight into existing data and compare it against quality goals. It helps businesses develop a starting point in the DQM process and sets the standard for how to improve their information quality. The data quality metrics of completeness and accuracy are imperative to this step: checking accuracy means looking for disproportionate or implausible numbers, while checking completeness means defining the data body and ensuring that all data points are whole.
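In practice, a first profiling pass is often as simple as the following Pandas checks; the file name and data are assumptions made for the sketch.

    import pandas as pd

    df = pd.read_csv("customer_data.csv")  # placeholder source

    print(df.dtypes)                   # compare actual types against the expected metadata
    print(df.describe(include="all"))  # summary statistics for every column
    print(df.isnull().mean())          # share of missing values per column (completeness)
    print(df.duplicated().sum())       # number of duplicate records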
3 – Defining data quality
The third pillar of DQM is quality itself. “Quality rules” should be created and defined based on business goals and requirements. These are the business/technical rules with which data must comply in order to be considered viable.
Business requirements are likely to take a front seat in this pillar, as critical data elements should depend upon industry. The development of quality rules is essential to the success of any DQM process, as the rules will detect and prevent compromised data from infecting the health of the whole set.
Much like antibodies detecting and correcting viruses within our bodies, data quality rules will correct inconsistencies among valuable data. When teamed together with online BI tools, these rules can be key in predicting trends and reporting analytics.
4 – Data reporting
DQM reporting is the process of removing and recording all compromised data. It should be designed to follow as a natural part of data rule enforcement. Once exceptions have been identified and captured, they should be aggregated so that quality patterns can be identified.
The captured data points should be modeled and defined based on specific characteristics (e.g., by rule, by date, by source, etc.). Once this data is tallied, it can be connected to online reporting software to report on the state of quality and the exceptions that exist within a dashboard. If possible, automated and “on-demand” technology solutions should be implemented as well, so dashboard insights can appear in real time.
Reporting and monitoring are the crux of data quality management ROI, as they provide visibility into the state of data at any moment in real time. By allowing businesses to identify the location and domiciles of data exceptions, teams of data specialists can begin to strategize remediation processes.
Knowledge of where to begin engaging in proactive data adjustments will help businesses move one step closer to recovering their part of the $9.7 billion lost each year to low-quality data.
5 – Data repair
Data repair is the two-step process of determining:
The best way to remediate data
The most efficient manner in which to implement the change
The most important aspect of data remediation is performing a “root cause” examination to determine why, where, and how the data defect originated. Once this examination has been completed, the remediation plan should begin.
Data processes that depended upon the previously defective data will likely need to be re-initiated, especially if their functioning was at risk or compromised by the defective data. These processes could include reports, campaigns, or financial documentation.
This is also the point where data quality rules should be reviewed again. The review process will help determine if the rules need to be adjusted or updated, and it will help begin the process of data evolution. Once data is deemed of high-quality, critical business processes and functions should run more efficiently and accurately, with a higher ROI and lower costs.
--Sampling
Data sampling is a statistical analysis technique used to select, manipulate and analyze a representative subset of data points to identify patterns and trends in the larger data set being examined.
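One classic way to design a sampling algorithm when the full data set is too large to hold in memory is reservoir sampling; the sketch below keeps a uniform random sample of fixed size k from a stream of unknown length.

    import random

    def reservoir_sample(stream, k):
        """Keep a uniform random sample of k items from a stream of unknown length."""
        sample = []
        for i, item in enumerate(stream):
            if i < k:
                sample.append(item)
            else:
                # Replace an existing element with probability k / (i + 1)
                j = random.randint(0, i)
                if j < k:
                    sample[j] = item
        return sample

    print(reservoir_sample(range(1_000_000), 5))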
--Similarity and dissimilarity
Please go through this link: https://newonlinecourses.science.psu.edu/stat508/lesson/1b/1b.2/1b.2.1
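As a brief supplement to the link, the sketch below computes three standard measures on made-up vectors and sets: Euclidean distance (a dissimilarity), cosine similarity, and Jaccard similarity.

    import math

    x = [1.0, 2.0, 3.0]
    y = [2.0, 2.0, 1.0]

    # Euclidean distance: a dissimilarity measure (0 means identical vectors)
    euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    # Cosine similarity: 1 means the vectors point in the same direction
    dot = sum(a * b for a, b in zip(x, y))
    cosine = dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

    # Jaccard similarity between two sets: |intersection| / |union|
    s, t = {"a", "b", "c"}, {"b", "c", "d"}
    jaccard = len(s & t) / len(s | t)

    print(euclidean, cosine, jaccard)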