What are some of the difficulties of working with large data sets? What problems will arise when data mining? What information will be lost when reducing dimensions?
What are some of the difficulties of working with large data sets?
Several challenges are associated with using large data sets for clinical research, including issues surrounding access and information security, poor data quality, inconsistency of data within and across institutions, and a shortage of staff with the expertise to manage and manipulate large clinical data sets. Every data set is different; however, at a higher level we can expect the following issues with large data sets:
Time - Large data sets require a large amount of time to collect, process, annotate, and analyze the data, and to build models. How much time do you have?
Storage - If you are dealing with terabytes of data, you have to store it somewhere, and that will not be your typical desktop or laptop. It may be cloud storage or some additional space for which you have to pay extra.
Money - Besides purchasing additional storage, you may need additional computing resources because the data may not fit in your machine's RAM. You may then need to spend additional money on high-end hardware (for example, GPUs).
Resources - If the data volume is really large, one person might not be able to manage it properly, and a team of people may need to work on it together. That, unfortunately, means spending more money.
Privacy - If you are dealing with human data, there are significant ethical issues. In short, you generally cannot release such data publicly, and placing it on public services such as cloud storage may also be restricted.
Missingness - With large data come large problems, and missingness is one of them. No data collection mechanism is foolproof, so be ready with a strategy for dealing with missing values[1].
Mixed Data - The world is not merely numerical; it is categorical, ordinal, textual, and everything in between. Be ready to handle mixed data in a real-world application.
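The last two points (missing values and mixed data) can be sketched in a few lines. This is a minimal illustration only: the records, the column names "age" and "colour", and the helper functions impute_mean and one_hot are all hypothetical, and mean imputation is just one of several possible strategies for missing values.

```python
from statistics import mean

# Hypothetical records: "age" is numeric with one missing value,
# "colour" is categorical.
rows = [
    {"age": 34, "colour": "red"},
    {"age": None, "colour": "blue"},
    {"age": 50, "colour": "red"},
]

def impute_mean(rows, key):
    """Replace None in a numeric column with the mean of the observed values."""
    observed = [r[key] for r in rows if r[key] is not None]
    fill = mean(observed)
    return [{**r, key: fill if r[key] is None else r[key]} for r in rows]

def one_hot(rows, key):
    """Replace a categorical column with one 0/1 indicator column per category."""
    categories = sorted({r[key] for r in rows})
    out = []
    for r in rows:
        encoded = {k: v for k, v in r.items() if k != key}
        for c in categories:
            encoded[f"{key}={c}"] = 1 if r[key] == c else 0
        out.append(encoded)
    return out

# Impute first, then encode, so every row ends up fully numeric.
clean = one_hot(impute_mean(rows, "age"), "colour")
```

After these two steps every record contains only numbers, which is what most modelling tools expect. Whether mean imputation is appropriate depends on why the values are missing.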
What problems will arise when data mining?
Data mining is not an easy task: the algorithms used can be very complex, and the data is not always available in one place, so it needs to be integrated from various heterogeneous data sources. These factors also create some issues. The major issues fall into the following areas:
Mining Methodology and User Interaction
Performance Issues
Diverse Data Types Issues
The following diagram describes the major issues.
What information will be lost when reducing dimensions?
Dimensionality reduction (for example, principal component analysis, which is what the eigenvalues below refer to) is useful because it often does not lose important information. What is lost is the variation along the components with the smallest eigenvalues. This is often higher-frequency detail or noise, and it is usually less important. The large-scale, general trends are captured in the components associated with the larger eigenvalues.
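The mention of eigenvalues suggests principal component analysis (PCA). Under that assumption, the sketch below shows how to measure what is kept and what is lost. The data is synthetic and invented for illustration, and keeping k = 1 component is an arbitrary choice. It relies on NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): 200 samples, 3 features.
# Most of the variation lies along one direction, plus a little noise.
t = rng.normal(size=(200, 1))
X = t @ np.array([[2.0, 1.0, 0.5]]) + 0.1 * rng.normal(size=(200, 3))

# Center the data and eigendecompose its covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]        # reorder to largest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep the k components with the largest eigenvalues.
k = 1
W = eigvecs[:, :k]
Z = Xc @ W        # reduced representation (200 x k)
X_hat = Z @ W.T   # reconstruction in the original (centered) space

# Fraction of total variance retained by the kept components.
retained = eigvals[:k].sum() / eigvals.sum()

# Variance lost in reconstruction; this equals the sum of the
# discarded (smaller) eigenvalues.
lost = np.sum((Xc - X_hat) ** 2) / (len(Xc) - 1)

print(f"variance retained: {retained:.4f}")
print(f"variance lost: {lost:.4f}, discarded eigenvalues: {eigvals[k:].sum():.4f}")
```

The information lost is exactly the variance along the discarded directions. If those eigenvalues are small relative to the total, little of the overall structure is lost. If they are not small, the reduction discards real signal.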