DATA MINING: Find an interesting data set on the Web. Provide a high-level description of the data set and, at a minimum, give its name, location, number of features (with some discussion of the feature types), and number of entries. Describe how data mining can be applied to it (e.g., for classification) and describe why you think it is interesting.

Expert Solution

The dataset I want to discuss is the KDD Cup 1998 data.

The location of the dataset is http://kdd.ics.uci.edu/databases/kddcup98/kddcup98.html

This dataset was used in the 1998 KDD Cup data mining competition. It was collected by PVA (Paralyzed Veterans of America), a non-profit organisation that provides programs and services for US veterans with spinal cord injuries or disease. PVA raises money through direct mailing campaigns and is interested in lapsed donors: people who have not donated for at least 12 months. The dataset contains a record for every donor who received the 1997 mailing and had not made a donation in the 12 months before it. For each donor, it records whether they donated in response to that mailing and, if so, how much. It also includes data on the previous and current mailing campaigns, personal information, and the giving history of each lapsed donor. Overlay demographics were added as well.

Size:

191779 records: 95412 training cases and 96367 test cases
481 attributes
236.2 MB: 117.2 MB training data and 119 MB test data

Variable                    Description
--------------------------  ------------------------------------------
ODATEDW                     Origin Date. Date of donor's first gift
                            to PVA YYMM format (Year/Month).
                           
OSOURCE                     Origin Source 
                            - (Only first 3 bytes are used)
                            - Defaulted to 00000 for conversion
                            - Code indicating which mailing list the
                              donor was originally acquired from
                            - A nominal or symbolic field.

STATE                       State abbreviation (a nominal/symbolic field)
ZIP                         Zipcode (a nominal/symbolic field)
MAILCODE                    Mail Code
                            " "= Address is OK
                            B = Bad Address
                           
PVASTATE                    EPVA State or PVA State
                            Indicates whether the donor lives in a state 
                            served by the organization's EPVA chapter
                            P = PVA State
                            E = EPVA State (Northeastern US)

There are many more features as well.
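
As a rough illustration of how the coded fields in the table above could be decoded before modeling, here is a minimal Python sketch. The function names are hypothetical, and two points are assumptions rather than facts from the data dictionary: that two-digit years in ODATEDW fall in the 1900s, and that a PVASTATE value other than P or E can be labeled "other".

```python
from typing import Optional, Tuple


def parse_yymm(value: str) -> Optional[Tuple[int, int]]:
    """Split a YYMM string such as '8901' into (year, month).

    Assumption: every two-digit year belongs to the 1900s, which is
    plausible for a dataset built around a 1997 mailing.
    Returns None for missing or malformed values.
    """
    value = value.strip()
    if len(value) != 4 or not (value.isascii() and value.isdigit()):
        return None
    year, month = 1900 + int(value[:2]), int(value[2:])
    if not 1 <= month <= 12:
        return None
    return year, month


def has_bad_address(mailcode: str) -> bool:
    """MAILCODE: a blank means the address is OK, 'B' means a bad address."""
    return mailcode.strip().upper() == "B"


def decode_pvastate(code: str) -> str:
    """PVASTATE: 'P' = PVA State, 'E' = EPVA State.

    Assumption: anything else (including a blank) is reported as 'other'.
    """
    labels = {"P": "PVA State", "E": "EPVA State"}
    return labels.get(code.strip().upper(), "other")
```

Returning None for malformed dates keeps the formatting errors visible instead of silently turning them into valid-looking values.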

As an example of how data mining can be applied to this dataset, a modeling process built around the GainSmarts tool runs as follows:

1. Split the dataset into train (calibration) and test (validation) sets.
2. Explode raw variables into predictors using transformations. A variable such as AGE can be used to create four binary categorical variables based on the distribution of AGE by quartile. Several transformations are created for each variable. For example, AGE can also be transformed into Chi-Square categories, a LOG transform, and a Piece-Wise Linear transform. Each type of transformation of an individual variable is referred to as a set of predictors. GainSmarts arranges these predictors hierarchically and then tests each set to determine the "best" transformation to represent the variable in the subsequent modeling processes.
3. Run a univariate analysis for each individual predictor.
4. Run a correlation analysis by predictor (within the hierarchy) to eliminate highly correlated predictors.
5. Select the best available representation for each attribute. GainSmarts does this with an expert system (rule-based) approach, choosing, for example, either AGE by QUARTILES or the Piece-Wise Linear transform for AGE.
6. Select the best set of attributes using a stepwise methodology.
7. Run a correlation analysis across all remaining attributes to remove highly correlated attributes.
8. Select the final set of predictors in the model, using a rule-based mechanism, to eliminate overfitting. This is achieved by limiting the number of coefficients (or weights), setting parameters properly, and introducing or eliminating entire representations of variables.
9. Estimate and calibrate the parameters.
10. Cross-validate and generate output (to EXCEL).
11. Score the model (or generate code).
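
The transformation step in the list above (exploding a raw variable such as AGE into several sets of predictors) can be sketched in plain Python as follows. This is only an illustration, not GainSmarts' actual implementation: the function names and the knot positions are hypothetical, log1p is used in place of a plain log so that zero values are handled, and the Chi-Square categories are left out.

```python
import math
from statistics import quantiles


def quartile_indicators(values):
    """Return one row per value with four 0/1 columns, one per quartile."""
    q1, q2, q3 = quantiles(values, n=4)  # three cut points
    rows = []
    for v in values:
        bucket = 0 if v <= q1 else 1 if v <= q2 else 2 if v <= q3 else 3
        rows.append([1 if bucket == k else 0 for k in range(4)])
    return rows


def log_transform(values):
    """log(1 + v) for each value; assumes values are non-negative."""
    return [math.log1p(v) for v in values]


def piecewise_linear(values, knots):
    """Hinge basis: the raw value plus max(0, v - knot) for each knot."""
    return [[v] + [max(0.0, v - k) for k in knots] for v in values]


if __name__ == "__main__":
    ages = [22, 35, 41, 48, 53, 60, 67, 75]  # made-up example values
    print(quartile_indicators(ages)[0])       # indicator row for the youngest donor
    print(piecewise_linear(ages, knots=[40, 60])[:2])
```

Each function produces one "set of predictors" for the same raw variable, so a later step can compare the sets and keep only one representation per attribute.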

Note: Steps 2-10 were repeated for both stages of the modeling process. Therefore, each stage could contain its own unique variables with unique transformations.
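
The correlation analysis used to eliminate highly correlated predictors can be illustrated with a simple greedy filter. This is a sketch under stated assumptions, not the rule-based mechanism GainSmarts uses: the 0.9 threshold, the function names, and the policy of keeping whichever predictor comes first are all choices made for this example.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column has no defined correlation
    return sxy / math.sqrt(sxx * syy)


def drop_correlated(columns, threshold=0.9):
    """Greedy filter over a dict of name -> values.

    A predictor is kept only if its absolute correlation with every
    predictor kept so far is at or below the threshold.
    """
    kept = []
    for name in columns:  # dicts preserve insertion order
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept
```

For instance, with a hypothetical "age" column and an "age_months" column that is just age times 12, only the first of the two would survive the filter.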

This is definitely not an easy dataset. To start with, some attributes have many missing values, and some records contain formatting errors. Feature selection is an important issue: there are far too many features, so it is necessary to select the most relevant ones or to construct new features by combining existing ones (the KDD Cup winners claim that the secret of their success lies in good feature selection). Case selection is also important: the training set is huge (95,412 cases) but contains only 5% positive examples. Finally, building a useful model is made harder by the inverse relationship between the probability of donating and the amount donated.
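
The inverse relationship between the probability of donating and the amount donated is one reason to score donors on expected value rather than on probability alone. The sketch below assumes a two-stage setup: one model estimates the probability of a donation and another estimates the amount given that a donation is made. The function names, the example donors, and the 0.68 cost per mailing are placeholders chosen for illustration, not values taken from the text above.

```python
def expected_profit(p_donate, expected_amount, mailing_cost):
    """Expected net return of mailing one donor."""
    return p_donate * expected_amount - mailing_cost


def select_for_mailing(donors, mailing_cost=0.68):
    """Keep donor ids whose expected profit is positive.

    `donors` is an iterable of (donor_id, p_donate, expected_amount),
    where both estimates would come from separately fitted models.
    """
    return [
        donor_id
        for donor_id, p, amount in donors
        if expected_profit(p, amount, mailing_cost) > 0
    ]


if __name__ == "__main__":
    donors = [
        ("a", 0.10, 5.0),   # likely to give, but a small gift
        ("b", 0.02, 50.0),  # unlikely to give, but a large gift
        ("c", 0.05, 10.0),
    ]
    print(select_for_mailing(donors))
```

In this made-up example the donor with the lowest response probability is the only one worth mailing, which is exactly the situation a probability-only classifier would miss.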

venereology answered 3 months ago
