Data Mining Techniques
Please discuss whether or not the following problems are data mining tasks. Explain why.
(a). Retrieve students' records from a relational table with grade = "A". [5 points]
(b). From the table of students' information, check if attributes last name and address have any correlations. [5 points]
(c). Find all the documents from the text database containing keywords "data mining". [5 points]
(d). Divide the text database into several groups, each group containing near-duplicate or similar documents. [5 points]
(e). Based on historical stock data, as well as other attributes (e.g., gold price, gas price, etc.) for the past few days, predict the trend of a stock tomorrow. [5 points]
(f). Please provide your own example of a data mining task. [5 points]
ANSWER:
A) Relational Model:
Definition: The relational model organizes data into one or more tables of rows and columns, with a unique key identifying each row. Rows are called "tuples" and represent entities (instances); columns are called attributes and describe properties of those instances.
Retrieving students' records from a relational table with grade = "A" is not a data mining task; it is a database management (query processing) task. The query returns only rows that are already stored in the table and discovers no new pattern or knowledge. Data mining does have a branch called relational data mining, which works on relational databases, but it looks for patterns in the tables rather than selecting rows.
The query to retrieve the students' records is:
SELECT * FROM student WHERE grade = 'A'; -- returns the rows of the student table whose grade is "A"
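To make the "retrieval only" point concrete, here is a minimal sketch that runs the same kind of query against an in-memory SQLite database. The table layout and the sample rows are assumptions made up for illustration; the point is that the result is just the stored rows that satisfy the condition.

```python
import sqlite3

# In-memory database with a hypothetical student table (schema and rows are assumptions).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (name TEXT, grade TEXT)")
conn.executemany(
    "INSERT INTO student (name, grade) VALUES (?, ?)",
    [("Ann", "A"), ("Bob", "B"), ("Cara", "A")],
)

# Plain retrieval: the result is exactly the stored rows that satisfy the predicate.
a_students = conn.execute(
    "SELECT name, grade FROM student WHERE grade = ? ORDER BY name", ("A",)
).fetchall()
print(a_students)  # [('Ann', 'A'), ('Cara', 'A')]
```

Nothing in the output was inferred or learned; every row was already in the table.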
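B) Correlations:
Definition:
A correlation is a statistical relationship between two attributes: knowing the value of one tells you something about the value of the other. Checking whether last name and address are correlated is generally regarded as a data mining task (correlation analysis), because the relationship is not stored in the table and has to be discovered from the data.
This should not be confused with a correlated subquery, which is a SQL concept, not data mining: a subquery that uses a value from the outer query. A correlated subquery can list which (last name, address) pairs occur more than once, but it does not by itself measure correlation.
Query:
SELECT DISTINCT o.lastname, o.address
FROM student o
WHERE 1 <
(SELECT COUNT(*)
FROM student i
WHERE i.lastname = o.lastname AND i.address = o.address);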
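Read in the statistical sense, checking whether last name and address are correlated means measuring the association between two categorical attributes. One common measure is Cramér's V, which is derived from the chi-square statistic. The sketch below is one possible way to compute it; the function name `cramers_v` and the sample rows are assumptions for illustration, not part of the original question.

```python
from collections import Counter
from math import sqrt


def cramers_v(pairs):
    """Cramér's V for (x, y) categorical pairs: 0 = no association, 1 = perfect."""
    n = len(pairs)
    x_counts = Counter(x for x, _ in pairs)
    y_counts = Counter(y for _, y in pairs)
    joint = Counter(pairs)
    k = min(len(x_counts), len(y_counts))
    if n == 0 or k < 2:
        return 0.0  # association is undefined with fewer than two categories
    chi2 = 0.0
    for x, nx in x_counts.items():
        for y, ny in y_counts.items():
            expected = nx * ny / n           # count expected under independence
            observed = joint.get((x, y), 0)  # count actually seen in the data
            chi2 += (observed - expected) ** 2 / expected
    return sqrt(chi2 / (n * (k - 1)))


# Hypothetical rows: every "Lee" lives on Elm St and every "Kim" on Oak Ave.
dependent = [("Lee", "Elm St"), ("Lee", "Elm St"), ("Kim", "Oak Ave"), ("Kim", "Oak Ave")]
# Hypothetical rows: each last name appears equally often at each address.
independent = [("Lee", "Elm St"), ("Lee", "Oak Ave"), ("Kim", "Elm St"), ("Kim", "Oak Ave")]

print(cramers_v(dependent))    # 1.0
print(cramers_v(independent))  # 0.0
```

In a real student table, addresses would probably need to be coarsened first (for example to city or ZIP code), since nearly unique values make the contingency table too sparse for the measure to be meaningful.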
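C) This is not a data mining task. Finding all documents that contain the keywords "data mining" is keyword search (information retrieval): it only returns the documents that literally match the query, in the same way the SQL query in part A only returns matching rows.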
D) Dividing the text database into groups of near-duplicate or similar documents is a data mining task (clustering), because the groups are not known in advance and must be discovered from the data. It is closely related to duplicate record detection, the process of identifying different or multiple records that refer to one unique object. Duplicate detection is usually preceded by a data preparation stage, during which data entries are stored in the database in a uniform way; this stage includes parsing, data transformation, and standardization steps. Records usually contain multiple fields, which makes the duplicate detection problem more complicated.
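As an illustration of grouping near-duplicate documents, the sketch below compares documents by the Jaccard similarity of their word shingles and puts any two documents above a threshold into the same group (single-link grouping via union-find). The function names, the shingle size `k=2`, and the threshold `0.5` are assumptions chosen for this example; it is one simple approach, not the only way to cluster text.

```python
def shingles(text, k=2):
    """Set of k-word shingles (overlapping word windows) for one document."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def group_near_duplicates(docs, threshold=0.5, k=2):
    """Group document indices; documents linked by similarity >= threshold share a group."""
    sets = [shingles(d, k) for d in docs]
    parent = list(range(len(docs)))  # union-find forest, one node per document

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    # Pairwise comparison is quadratic; fine for a sketch, too slow for large corpora.
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if jaccard(sets[i], sets[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(docs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


docs = [
    "data mining finds patterns in large data sets",
    "data mining finds patterns in very large data sets",
    "the weather is sunny and warm today",
]
print(group_near_duplicates(docs))  # [[0, 1], [2]]
```

The first two documents share 6 of 9 distinct word pairs (similarity about 0.67), so they land in one group, while the third shares none and stays alone. The groups were not given in advance, which is what makes this a mining task rather than a lookup.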
There are two categories of approaches for matching records:
1) Approaches that rely on training data to learn how to match records.
2) Approaches that rely on domain knowledge to match records.
Some techniques for building the matching models are:
1) Bayes decision rule for minimum error
2) Bayes decision rule for minimum cost
3) Decision with a reject region
4) Supervised and unsupervised learning, etc.
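The three Bayes-style rules in the list above can be sketched for a two-class problem (a record pair is either a match or a non-match). This is a minimal illustration under stated assumptions: the function names, the probabilities, the costs, and the 0.9 confidence level are all made up for the example, and correct decisions are assumed to cost nothing.

```python
def posterior_match(p_x_given_match, p_x_given_nonmatch, prior_match):
    """P(match | x) by Bayes' theorem for a match / non-match problem."""
    joint_match = p_x_given_match * prior_match
    joint_nonmatch = p_x_given_nonmatch * (1.0 - prior_match)
    return joint_match / (joint_match + joint_nonmatch)


def decide_min_error(post_match):
    """Minimum-error rule: choose the class with the larger posterior."""
    return "match" if post_match >= 0.5 else "non-match"


def decide_min_cost(post_match, cost_false_match, cost_false_nonmatch):
    """Minimum-cost rule: choose the decision with the smaller expected cost."""
    risk_if_match = cost_false_match * (1.0 - post_match)  # wrong if truly a non-match
    risk_if_nonmatch = cost_false_nonmatch * post_match    # wrong if truly a match
    return "match" if risk_if_match <= risk_if_nonmatch else "non-match"


def decide_with_reject(post_match, confidence=0.9):
    """Reject rule: defer the pair to manual review unless one class is confident enough."""
    if post_match >= confidence:
        return "match"
    if 1.0 - post_match >= confidence:
        return "non-match"
    return "reject"


# Hypothetical comparison result x: likely among matches, rare among non-matches.
post = posterior_match(p_x_given_match=0.8, p_x_given_nonmatch=0.1, prior_match=0.2)
print(round(post, 3))                                                          # 0.667
print(decide_min_error(post))                                                  # match
print(decide_min_cost(post, cost_false_match=5.0, cost_false_nonmatch=1.0))    # non-match
print(decide_with_reject(post))                                                # reject
```

The same posterior leads to three different decisions: the minimum-error rule accepts the match, the minimum-cost rule declines it because a false match is assumed to be five times as costly, and the reject rule sends the pair to review because neither class reaches the confidence level.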