Question

When working with data sets, what are some of the difficulties of working with large data sets? What problems will arise when data mining? What information will be lost when reducing dimensions?

Solutions

Expert Solution

What are some of the difficulties of working with large data sets?

Using large data sets, for example in clinical research, raises several challenges: access and information security, poor data quality, inconsistency of data within and across institutions, and a shortage of staff with the expertise to manage and manipulate large clinical data sets. Every data set is different; at a higher level, however, we can expect the following issues with large data sets:

Time - Large data sets take a large amount of time to collect, process, annotate, and analyze, and to build models from. How much time do you have?

Storage - If you are dealing with terabytes of data, you have to store it somewhere, and that will not be a typical desktop or laptop. It may be the cloud or some additional storage that you have to pay extra for.

Money - Besides purchasing additional storage, you may need additional computing resources because the data will not fit in your machine's RAM. That means spending more on high-end hardware such as GPUs.

Resources - If the data volume is really large, one person might not be able to manage it properly, and a team may need to work on it together. That, unfortunately, means spending more money.

Privacy - If you are dealing with human data, there are serious ethical issues. In short, you generally cannot release such data publicly, and you may not be allowed to store it on public services such as the cloud.

Missingness - Large data sets bring some large problems, and missingness is one of them. No data collection mechanism is foolproof, so have a strategy in place for dealing with missing values [1].

Mixed Data - The world is not merely numerical; it is categorical, ordinal, textual, and everything in between. Be ready to handle mixed data in a real-world application.
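To make the memory and missingness points concrete, here is a minimal sketch in Python that reads a CSV a few rows at a time, so the whole file never has to fit in RAM, and counts missing values per column. The function name missing_counts, the sample columns, and the choice of "" and "NA" as missing markers are assumptions made for illustration, not part of the original answer.

```python
import csv
import io
from collections import Counter
from itertools import islice


def missing_counts(lines, chunk_size=2, missing_tokens=("", "NA")):
    """Count missing values per column, holding only `chunk_size` rows in memory.

    Assumes well-formed rows (no row has more fields than the header).
    """
    reader = csv.DictReader(lines)
    counts = Counter()
    total_rows = 0
    while True:
        chunk = list(islice(reader, chunk_size))  # only this chunk is in memory
        if not chunk:
            break
        total_rows += len(chunk)
        for row in chunk:
            for column, value in row.items():
                # DictReader fills absent trailing fields with None.
                if value is None or value.strip() in missing_tokens:
                    counts[column] += 1
    return total_rows, dict(counts)


# Hypothetical sample with mixed numeric and textual columns.
# In practice `lines` would be an open file handle on a large file.
sample = io.StringIO(
    "age,city,income\n"
    "34,Paris,52000\n"
    ",London,NA\n"
    "29,,61000\n"
    "41,Berlin,\n"
)
total_rows, counts = missing_counts(sample)
print(total_rows, counts)
```

A real pipeline would use a much larger chunk size and would then choose a strategy per column (drop, impute, or flag), but the streaming pattern stays the same.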

What problems will arise when data mining?

Data mining is not an easy task: the algorithms used can get very complex, and the data is not always available in one place, so it has to be integrated from various heterogeneous data sources. These factors create their own issues. The major issues concern:

Mining Methodology and User Interaction

Performance Issues

Diverse Data Types Issues

The following diagram describes the major issues.

What information will be lost when reducing dimensions?

Dimensionality reduction with a method such as principal component analysis (PCA) is useful because it often does not lose important information. What is lost is the variation along the discarded components, those with the smallest eigenvalues. This is often fine-grained, higher-frequency variation or noise, and is often less important. The large-scale, general trends are captured in the components associated with the larger eigenvalues.
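As a small illustration of what is discarded, the sketch below assumes PCA (the answer mentions eigenvalues and components but does not name a method) and works on two-dimensional points using only the standard library. The function name pca_2d_keep_one and the sample points are hypothetical. It keeps the component with the larger eigenvalue and reports the share of total variance that sits in the dropped component.

```python
import math


def pca_2d_keep_one(points):
    """Project 2-D points onto their first principal component.

    Returns (eigenvalues, fraction_of_variance_lost, reconstructed_points).
    Assumes at least two points and nonzero total variance.
    """
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)

    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_trace = (sxx + syy) / 2
    spread = math.hypot((sxx - syy) / 2, sxy)
    large, small = half_trace + spread, half_trace - spread

    # Unit eigenvector belonging to the larger eigenvalue.
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    ux, uy = math.cos(angle), math.sin(angle)

    # Keep only the coordinate along (ux, uy), then map back to 2-D.
    reconstructed = []
    for x, y in centered:
        score = x * ux + y * uy
        reconstructed.append((mean_x + score * ux, mean_y + score * uy))

    lost = small / (large + small)
    return (large, small), lost, reconstructed


# Hypothetical data: points close to the line y = 2x, with small deviations.
points = [(0, 0.1), (1, 1.9), (2, 4.1), (3, 5.9), (4, 8.1)]
eigenvalues, lost, reconstructed = pca_2d_keep_one(points)
print("eigenvalues:", eigenvalues)
print("fraction of variance lost:", lost)
```

For data like this, nearly all the variance lies along one direction, so the lost fraction is small: the general trend survives and only the small deviations from the line are discarded. When the eigenvalues are of similar size, the same reduction would throw away a much larger share of the information.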

