The dataset I want to discuss is the KDD Cup 1998 data.
It is available at http://kdd.ics.uci.edu/databases/kddcup98/kddcup98.html
This dataset was used in the KDD Cup 1998 data mining competition. It was collected by PVA (Paralyzed Veterans of America), a non-profit organisation which provides programs and services for US veterans with spinal cord injuries or disease. PVA raises money via direct mailing campaigns. The organisation is interested in lapsed donors: people who have not donated for at least 12 months. The available dataset contains a record for every donor who received the 1997 mailing and had not made a donation in the 12 months before it. For each donor, the data record whether they responded to that mailing and, if so, how much they gave. Apart from that, data are given about the previous and the current mailing campaigns, as well as personal information and the giving history of each lapsed donor. Overlay demographics were also added.
Size: 95,412 records in the training set.
Variable    Description
----------  ------------------------------------------
ODATEDW     Origin Date. Date of donor's first gift to PVA
            YYMM format (Year/Month).
OSOURCE     Origin Source
            - (Only 1rst 3 bytes are used)
            - Defaulted to 00000 for conversion
            - Code indicating which mailing list the donor
              was originally acquired from
            - A nominal or symbolic field.
STATE       State abbreviation (a nominal/symbolic field)
ZIP         Zipcode (a nominal/symbolic field)
MAILCODE    Mail Code
            " " = Address is OK
            B   = Bad Address
PVASTATE    EPVA State or PVA State
            Indicates whether the donor lives in a state
            served by the organization's EPVA chapter
            P = PVA State
            E = EPVA State (Northeastern US)
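To make the coded fields above a bit more concrete, here is a minimal Python sketch of how a few of them might be decoded. The function names (parse_yymm, decode) and the label strings are my own, not part of the dataset. The two-digit year is assumed to belong to the 1900s, which seems reasonable for data collected up to 1997 but is an assumption, and any code not listed in the excerpt above is simply mapped to None.

```python
from datetime import date

# Code tables copied from the field descriptions; the label text is illustrative.
MAILCODE_LABELS = {"": "address ok", "B": "bad address"}
PVASTATE_LABELS = {"P": "PVA state", "E": "EPVA state"}


def parse_yymm(value, century=1900):
    """Turn a YYMM value such as '8901' (or the int 8901) into a date.

    The day is set to 1 because the field only carries year and month.
    `century` is an assumption: the field itself has only two year digits.
    Returns None for blank or malformed values.
    """
    text = str(value).strip()
    if not text.isdecimal():
        return None
    text = text.zfill(4)  # restore leading zeros lost if the value was read as an int
    if len(text) != 4:
        return None
    year, month = int(text[:2]), int(text[2:])
    if not 1 <= month <= 12:
        return None
    return date(century + year, month, 1)


def decode(value, labels):
    """Look up a one-character code; codes not in the table give None."""
    return labels.get(value.strip())


if __name__ == "__main__":
    print(parse_yymm("8901"))             # first gift in January 1989
    print(decode(" ", MAILCODE_LABELS))   # a blank MAILCODE means the address is OK
    print(decode("E", PVASTATE_LABELS))
```

Returning None for anything unexpected keeps the records with formatting errors visible instead of silently guessing a value for them.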
There are many more features besides these.
This is definitely not an easy dataset. To start with, some of the attributes have quite a lot of missing values, and there are some records with formatting errors. An important issue is feature selection. There are far too many features, so it will be necessary to select the most relevant ones, or to construct your own features by combining existing ones (the KDD Cup winners claim that the secret of their success lies in good feature selection). Case selection will also be important: the training set is huge (95,412 cases), but only about 5% of the cases are positive examples. Finally, building a useful model for this dataset is made more difficult by the fact that there is an inverse relationship between the probability of donating and the amount donated: the people most likely to respond tend to give the least.
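One way to see why the inverse relationship matters is to rank donors by expected profit rather than by response probability alone. The sketch below is only an illustration under stated assumptions: the function names, the three example donors, and the mailing cost of 0.70 are all made up, and in practice the probability and the amount would each come from a fitted model.

```python
def expected_profit(p_donate, expected_amount, mailing_cost):
    """Expected net gain from mailing one donor.

    p_donate        : estimated probability that the donor responds
    expected_amount : estimated gift size, given that the donor responds
    mailing_cost    : cost of sending one piece of mail
    """
    return p_donate * expected_amount - mailing_cost


def select_to_mail(donors, mailing_cost):
    """Keep donors with positive expected profit, most profitable first.

    `donors` is a list of (donor_id, p_donate, expected_amount) tuples.
    """
    scored = [
        (donor_id, expected_profit(p, amount, mailing_cost))
        for donor_id, p, amount in donors
    ]
    profitable = [item for item in scored if item[1] > 0]
    return sorted(profitable, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical donors: (id, probability of donating, expected gift).
    donors = [
        ("A", 0.10, 5.0),   # most likely to respond, small gift
        ("B", 0.02, 50.0),  # least likely to respond, large gift
        ("C", 0.08, 6.0),
    ]
    for donor_id, profit in select_to_mail(donors, mailing_cost=0.70):
        print(donor_id, round(profit, 2))
```

With these made-up numbers only donor B is worth mailing, even though B has the lowest probability of responding. A model that only predicts who will donate would pick A and C first and lose money on both.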