Question

In: Computer Science

Discuss big data, data warehouses, Google's database for big data, and the bootstrapping technique for data analytics, applied to a real-life business scenario.

Writing Requirements

3-5 pages in length, in a Word document

Solutions

Expert Solution

Big data is a combination of structured, semistructured and unstructured data collected by organizations that can be mined for information and used in machine learning projects, predictive modeling and other advanced analytics applications.

Systems that process and store big data have become a common component of data management architectures in organizations. Big data is often characterized by the 3Vs: the large volume of data in many environments, the wide variety of data types stored in big data systems and the velocity at which the data is generated, collected and processed. These characteristics were first identified by Doug Laney, then an analyst at Meta Group Inc., in 2001; Gartner further popularized them after it acquired Meta Group in 2005. More recently, several other Vs have been added to different descriptions of big data, including veracity, value and variability.

Although big data doesn't equate to any specific volume of data, big data deployments often involve terabytes (TB), petabytes (PB) and even exabytes (EB) of data captured over time.

Companies use the big data accumulated in their systems to improve operations, provide better customer service, create personalized marketing campaigns based on specific customer preferences and, ultimately, increase profitability. Businesses that utilize big data hold a potential competitive advantage over those that don't since they're able to make faster and more informed business decisions, provided they use the data effectively.

For example, big data can provide companies with valuable insights into their customers that can be used to refine marketing campaigns and techniques in order to increase customer engagement and conversion rates.

Furthermore, utilizing big data enables companies to become increasingly customer-centric. Historical and real-time data can be used to assess the evolving preferences of consumers, consequently enabling businesses to update and improve their marketing strategies and become more responsive to customer desires and needs.

Big data is also used by medical researchers to identify disease risk factors and by doctors to help diagnose illnesses and conditions in individual patients. In addition, data derived from electronic health records (EHRs), social media, the web and other sources provides healthcare organizations and government agencies with up-to-the-minute information on infectious disease threats or outbreaks.

In the energy industry, big data helps oil and gas companies identify potential drilling locations and monitor pipeline operations; likewise, utilities use it to track electrical grids. Financial services firms use big data systems for risk management and real-time analysis of market data. Manufacturers and transportation companies rely on big data to manage their supply chains and optimize delivery routes. Other government uses include emergency response, crime prevention and smart city initiatives.

Examples of big data

Big data comes from myriad different sources, such as business transaction systems, customer databases, medical records, internet clickstream logs, mobile applications, social networks, scientific research repositories, machine-generated data and real-time data sensors used in internet of things (IoT) environments. The data may be left in its raw form in big data systems or preprocessed using data mining tools or data preparation software so it's ready for particular analytics uses.

Using customer data as an example, the different branches of analytics that can be done with the information found in sets of big data include the following:

Comparative analysis. This includes the examination of user behavior metrics and the observation of real-time customer engagement in order to compare one company's products, services and brand authority with those of its competition.
Social media listening. This is information about what people are saying on social media about a specific business or product that goes beyond what can be delivered in a poll or survey. This data can be used to help identify target audiences for marketing campaigns by observing the activity surrounding specific topics across various sources.
Marketing analysis. This includes information that can be used to make the promotion of new products, services and initiatives more informed and innovative.
Customer satisfaction and sentiment analysis. All of the information gathered can reveal how customers are feeling about a company or brand, if any potential issues may arise, how brand loyalty might be preserved and how customer service efforts might be improved.

More characteristics of big data

Looking beyond the original 3Vs, data veracity refers to the degree of certainty in data sets. Uncertain raw data collected from multiple sources -- such as social media platforms and webpages -- can cause serious data quality issues that may be difficult to pinpoint. For example, a company that collects sets of big data from hundreds of sources may be able to identify inaccurate data, but its analysts need data lineage information to trace where the data is stored so they can correct the issues.

Bad data leads to inaccurate analysis and may undermine the value of business analytics because it can cause executives to mistrust data as a whole. The amount of uncertain data in an organization must be accounted for before it is used in big data analytics applications. IT and analytics teams also need to ensure that they have enough accurate data available to produce valid results.

Some data scientists also add value to the list of characteristics of big data. As explained above, not all data collected has real business value, and the use of inaccurate data can weaken the insights provided by analytics applications. It's critical that organizations employ practices such as data cleansing and confirm that data relates to relevant business issues before they use it in a big data analytics project.

Variability also often applies to sets of big data, which are less consistent than conventional transaction data and may have multiple meanings or be formatted in different ways from one data source to another -- factors that further complicate efforts to process and analyze the data. Some people ascribe even more Vs to big data; data scientists and consultants have created various lists with between seven and 10 Vs.

How big data is stored and processed

The need to handle big data velocity imposes unique demands on the underlying compute infrastructure. The computing power required to quickly process huge volumes and varieties of data can overwhelm a single server or server cluster. Organizations must apply adequate processing capacity to big data tasks in order to achieve the required velocity. This can potentially demand hundreds or thousands of servers that can distribute the processing work and operate collaboratively in a clustered architecture, often based on technologies like Hadoop and Apache Spark.

Achieving such velocity in a cost-effective manner is also a challenge. Many enterprise leaders are reticent to invest in an extensive server and storage infrastructure to support big data workloads, particularly ones that don't run 24/7. As a result, public cloud computing is now a primary vehicle for hosting big data systems. A public cloud provider can store petabytes of data and scale up the required number of servers just long enough to complete a big data analytics project. The business only pays for the storage and compute time actually used, and the cloud instances can be turned off until they're needed again.

To improve service levels even further, public cloud providers offer big data capabilities through managed services that include the following:

Amazon EMR (formerly Elastic MapReduce)
Microsoft Azure HDInsight
Google Cloud Dataproc

In cloud environments, big data can be stored in the following:

Hadoop Distributed File System (HDFS);
lower-cost cloud object storage, such as Amazon Simple Storage Service (S3);
NoSQL databases; and
relational databases.

For organizations that want to deploy on-premises big data systems, commonly used Apache open source technologies in addition to Hadoop and Spark include the following:

YARN, Hadoop's built-in resource manager and job scheduler, which stands for Yet Another Resource Negotiator but is commonly known by the acronym alone;
the MapReduce programming framework, also a core component of Hadoop;
Kafka, an application-to-application messaging and data streaming platform;
the HBase database; and
SQL-on-Hadoop query engines, like Drill, Hive, Impala and Presto.

Users can install the open source versions of the technologies themselves or turn to commercial big data platforms offered by Cloudera, which merged with former rival Hortonworks in January 2019, or Hewlett Packard Enterprise (HPE), which bought the assets of big data vendor MapR Technologies in August 2019. The Cloudera and MapR platforms are also supported in the cloud.

Big data challenges

Besides the processing capacity and cost issues, designing a big data architecture is another common challenge for users. Big data systems must be tailored to an organization's particular needs, a DIY undertaking that requires IT teams and application developers to piece together a set of tools from all the available technologies. Deploying and managing big data systems also require new skills compared to the ones possessed by database administrators (DBAs) and developers focused on relational software.

Both of those issues can be eased by using a managed cloud service, but IT managers need to keep a close eye on cloud usage to make sure costs don't get out of hand. Also, migrating on-premises data sets and processing workloads to the cloud is often a complex process for organizations.

Making the data in big data systems accessible to data scientists and other analysts is also a challenge, especially in distributed environments that include a mix of different platforms and data stores. To help analysts find relevant data, IT and analytics teams are increasingly working to build data catalogs that incorporate metadata management and data lineage functions. Data quality and data governance also need to be priorities to ensure that sets of big data are clean, consistent and used properly.

Big data collection practices and regulations

For many years, companies had few restrictions on the data they collected from their customers. However, as the collection and use of big data have increased, so has data misuse. Concerned citizens who have experienced the mishandling of their personal data or have been victims of a data breach are calling for laws around data collection transparency and consumer data privacy.

The outcry about personal privacy violations led the European Union to pass the General Data Protection Regulation (GDPR), which took effect in May 2018; it limits the types of data that organizations can collect and requires opt-in consent from individuals or compliance with other specified lawful grounds for collecting personal data. GDPR also includes a right-to-be-forgotten provision, which lets EU residents ask companies to delete their data.

While there aren't similar federal laws in the U.S., the California Consumer Privacy Act (CCPA) aims to give California residents more control over the collection and use of their personal information by companies. CCPA was signed into law in 2018 and took effect on Jan. 1, 2020. In addition, government officials in the U.S. are investigating data handling practices, specifically among companies that collect consumer data and sell it to other companies for unknown use.

A data warehouse is a large collection of business data used to help an organization make decisions. The concept of the data warehouse has existed since the 1980s, when it was developed to help transition data from merely powering operations to fueling decision support systems that reveal business intelligence. The large amount of data in a data warehouse comes from many places: internal applications such as marketing, sales, and finance; customer-facing apps; and external partner systems, among others.

On a technical level, a data warehouse periodically pulls data from those apps and systems; the data then goes through formatting and import processes so that it matches the data already in the warehouse. The data warehouse stores this processed data so it’s ready for decision makers to access. How frequently data pulls occur and how the data is formatted vary with the needs of the organization.
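The pull, format and import cycle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two source systems, their row layouts, the `fact_revenue` table and the helper names are all hypothetical, and an in-memory SQLite database stands in for a real warehouse.

```python
import sqlite3

# Hypothetical rows pulled from two operational systems with different formats.
sales_app_rows = [("2023-01-05", "north", "1,200.50"), ("2023-01-06", "south", "830.00")]
finance_app_rows = [("05/01/2023", "NORTH", 300.0), ("06/01/2023", "South", 120.25)]

def normalize_sales(row):
    """Sales app (assumed): ISO dates, amounts as strings with thousands separators."""
    day, region, amount = row
    return day, region.lower(), float(amount.replace(",", ""))

def normalize_finance(row):
    """Finance app (assumed): DD/MM/YYYY dates, mixed-case regions, numeric amounts."""
    date, region, amount = row
    dd, mm, yyyy = date.split("/")
    return f"{yyyy}-{mm}-{dd}", region.lower(), float(amount)

def load_warehouse(conn):
    """Format rows from both sources to one schema and import them."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_revenue "
        "(day TEXT, region TEXT, amount REAL, source TEXT)"
    )
    rows = [normalize_sales(r) + ("sales",) for r in sales_app_rows]
    rows += [normalize_finance(r) + ("finance",) for r in finance_app_rows]
    conn.executemany("INSERT INTO fact_revenue VALUES (?, ?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load_warehouse(conn)

# Once the data shares one format, an analytical query across sources is simple.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM fact_revenue GROUP BY region ORDER BY region"
).fetchall()
print(totals)
```

In a real deployment the pull would run on a schedule and the formatting rules would be far richer, but the shape is the same: normalize each source to the warehouse schema, then load.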

Some benefits of a data warehouse

Organizations that use a data warehouse to assist their analytics and business intelligence see a number of substantial benefits:

Better data: Adding data sources to a data warehouse enables organizations to ensure that they are collecting consistent and relevant data from each source. They don’t need to wonder whether the data will be inaccessible or inconsistent as it comes into the system. This ensures higher data quality and data integrity for sound decision making.
Faster decisions: Data in a warehouse is stored in consistent formats, so it is ready to be analyzed. The warehouse also provides the analytical power and a more complete dataset needed to base decisions on hard facts. Therefore, decision makers no longer need to rely on hunches, incomplete data, or poor-quality data and risk delivering slow and inaccurate results.

What a data warehouse is not

1. It is not a database

It’s easy to confuse a data warehouse with a database, since both concepts share some similarities. The primary difference, however, comes into effect when a business needs to perform analytics on a large data collection. Data warehouses are made to handle this type of task, while databases are not. Here’s a comparison chart that tells the difference between the two:

	Database	Data Warehouse
What it is	Data collected for multiple transactional purposes. Optimized for read/write access.	Aggregated transactional data, transformed and stored for analytical purposes. Optimized for aggregation and retrieval of large data sets.
How it’s used	Databases are made to quickly record and retrieve information.	Data warehouses store data from multiple databases, which makes it easier to analyze.
Types	Databases are used in data warehousing. However, the term usually refers to an online, transactional processing database. There are other types as well, including csv, html, and Excel spreadsheets used for database purposes.	A data warehouse is an analytical database that layers on top of transactional databases to allow for analytics.

2. It is not a data lake

Although they both are built for business analytics purposes, the major difference between a data lake and a data warehouse is that a data lake stores all types of raw, structured, and unstructured data from all data sources in its native format until it is needed. By contrast, a data warehouse stores data in files or folders in a more organized fashion that is readily available for reporting and data analysis.

3. It is not a data mart

Data warehouses are also sometimes confused with data marts. But data warehouses are generally much bigger and contain a greater variety of data, while data marts are limited in their application.

Data marts are often subsets of a warehouse, designed to easily deliver specific data to a specific user, for a specific application. In the simplest terms, data marts can be thought of as single-subject, while data warehouses cover multiple subjects.

The future of the data warehouse: move to the cloud

As businesses make the move to the cloud, so too do their databases and data warehousing tools. The cloud offers many advantages: flexibility, collaboration, and accessibility from anywhere, to name a few. Popular tools like Amazon Redshift, Microsoft Azure SQL Data Warehouse, Snowflake, and Google BigQuery have all offered businesses simple ways to warehouse and analyze their cloud data.

The cloud model lowers the barriers to entry — especially cost, complexity, and lengthy time-to-value — that have traditionally limited the adoption and successful use of data warehousing technology. It permits an organization to scale up or scale down — to turn on or turn off — data warehouse capacity as needed. Plus, it’s fast and easy to get started with a cloud data warehouse. Doing so requires neither a huge up-front investment nor a time-consuming (and no less costly) deployment process.

The cloud data warehouse architecture largely eliminates the risks endemic to the on-premises data warehouse paradigm. You don’t have to budget for and procure hardware and software. You don’t have to set aside a budget line item for annual maintenance and support. In the cloud, the cost considerations that have traditionally preoccupied data warehouse teams — budgeting for planned and unplanned system upgrades — go away.

A data warehouse example

Beachbody, a leading provider of fitness, nutrition, and weight-loss programs, needed to better target and personalize offerings to customers in order to produce better health outcomes for clients and, ultimately, better business performance.

The company revamped its analytics architecture by adding a Hadoop-based cloud data lake on AWS, powered by Talend Real-Time Big Data. This new architecture has allowed Beachbody to reduce data acquisition time by 5x, while also improving the accuracy of the database for marketing campaigns.

Discover the power of the data warehouse

Organizations can get more from their analytics efforts by moving beyond simple databases and into the world of data warehousing. Finding the right warehousing solution to fit business needs can make a world of difference in how effectively a company serves its customers and grows its operations.

Google is probably responsible for introducing people to the benefits of analysing and interpreting Big Data in their day-to-day lives. Big Data is at the heart of Google's business model. Google uses the data from its Web index to initially match queries with potentially useful results. This is augmented with data from trusted sources and other sites that have been ranked for accuracy by machine-learning algorithms designed to assess the reliability of data. Google monetized its search engine by working out how to capture the data it collects from us as we browse the Web, building up vast revenues by becoming the biggest seller of online advertising in the world. It then used the huge resources it was building up to expand rapidly, identifying growth areas such as mobile and the Internet of Things in which to also apply its data-driven business model.

Explanation about Bootstrap

To illustrate the main concepts, the following explanation involves some mathematical definitions and notation, which are kept informal in order to build intuition and understanding.

1. Initial Scenario

Assume we want to estimate the standard error of our statistic in order to make an inference about a population parameter, for example to construct the corresponding confidence interval (just as we have done before!). And:

We don’t know anything about the population.
There is no exact formula for estimating the standard error of the statistic.

Let X1, X2, …, Xn be a random sample from a population P with distribution function F, and let M = g(X1, X2, …, Xn) be our statistic for the parameter of interest, meaning that the statistic is a function of the sample data X1, X2, …, Xn. What we want to know is the variance of M, denoted Var(M).

First, since we don’t know anything about the population, we can’t determine the value of Var(M), which requires known population parameters. So we need to estimate Var(M) with an estimated variance, denoted EST_Var(M), whose square root is the estimated standard error. (Remember the estimated standard error of the sample mean?)
Second, in the real world we usually don’t have a simple formula for EST_Var(M), except for statistics such as the sample mean.

This means we need to approximate EST_Var(M). How? Before answering, let’s introduce a common practical approach, simulation, which assumes that we know P.

2. Simulation

Let’s talk about the idea of simulation. It is useful for obtaining information about a statistic’s sampling distribution with the aid of computers. But it rests on an important assumption: that we know the population P.

Now let X1, X2, …, Xn be a random sample from the population and assume M = g(X1, X2, …, Xn) is the statistic of interest. We can approximate the mean and variance of the statistic M by simulation as follows:

1. Draw a random sample of size n from P.
2. Compute the statistic for the sample.
3. Repeat steps 1 and 2 B times to get B statistics.
4. Compute the mean and variance of these B statistics.

Why does this simulation work? By a classical theorem, the Law of Large Numbers:

The mean of these B statistics converges to the true mean of the statistic M as B → ∞.

And by the Law of Large Numbers together with several theorems on convergence in probability:

The sample variance of these B statistics converges to the true variance of the statistic M as B → ∞.

With the aid of a computer, we can make B as large as we like to approximate the sampling distribution of the statistic M.

The following is example Python code for this simulation in the earlier phone-picks case. I use B=100000, and the simulated mean and standard error of the sample mean are very close to the theoretical results in the last two cells. Feel free to check it out.

Example code for the simulation applied to the earlier phone-picks case starts from cell [10].
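As a minimal, self-contained sketch of the four simulation steps, the snippet below assumes a hypothetical population that we pretend to know, an exponential distribution with rate 0.5, rather than the phone-picks data. The function names are illustrative only. For this population the sample mean has true mean 2 and true variance 4/n, so the simulated values can be compared against theory.

```python
import random
import statistics

def draw_exponential(rng, n, rate=0.5):
    """Step 1: draw a random sample of size n from the (assumed known) population."""
    return [rng.expovariate(rate) for _ in range(n)]

def simulate_statistic(draw_sample, statistic, n, B, seed=0):
    """Steps 1-4: repeat 'draw a sample, compute the statistic' B times,
    then return the mean and variance of the B statistics."""
    rng = random.Random(seed)
    values = [statistic(draw_sample(rng, n)) for _ in range(B)]
    return statistics.mean(values), statistics.variance(values)

n, B = 30, 5000
sim_mean, sim_var = simulate_statistic(draw_exponential, statistics.mean, n, B)

print("simulated mean of M:    ", sim_mean, " (theory: 2.0)")
print("simulated variance of M:", sim_var, " (theory:", 4 / n, ")")
```

Increasing B tightens the agreement with the theoretical values, which is the Law of Large Numbers argument above in action.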

3. The Empirical Distribution Function and Plug-in Principle

We have learned the idea of simulation. Now, can we approximate EST_Var(M) by simulation? Unfortunately, the simulation above requires knowing the population P, and the truth is that we don’t know anything about P. To address this issue, one of the most important components of the bootstrap method is adopted:

Use the empirical distribution function to approximate the distribution function of the population, and apply the plug-in principle to get an estimate of Var(M): the plug-in estimator.

(1) Empirical Distribution Function

The idea of the empirical distribution function (EDF) is to build a cumulative distribution function (CDF) from an existing data set. The EDF usually approximates the CDF quite well, especially for large sample sizes. In fact, it is a common and useful method for estimating the CDF of a random variable in practice.

The EDF is a discrete distribution that gives equal weight to each data point (i.e., it assigns probability 1/n to each of the original n observations). It forms a cumulative distribution function that is a step function, jumping up by 1/n at each of the n data points.
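To make the step-function description concrete, here is a small sketch that builds an EDF from a sample. The helper name `make_edf` and the sample values are made up for illustration.

```python
from bisect import bisect_right

def make_edf(sample):
    """Return F_hat, where F_hat(x) is the fraction of observations <= x."""
    ordered = sorted(sample)
    n = len(ordered)

    def edf(x):
        # bisect_right counts how many sorted observations are <= x.
        return bisect_right(ordered, x) / n

    return edf

sample = [3.1, 1.4, 2.7, 1.4, 5.0]
F_hat = make_edf(sample)

for x in (1.0, 1.4, 3.0, 5.0):
    print(x, F_hat(x))
```

Each observation contributes a jump of 1/n, so a value that appears k times in the sample (1.4 here, twice) produces a jump of k/n at that point.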

(2) Statistical Functional

The bootstrap uses the EDF as an estimator of the population CDF. The EDF is itself a cumulative distribution function, so to use it as an estimator for our statistic M, we need to express M, and the parameter of interest as well, as a function of a CDF so that both share the same basis. A common way to do this is the concept of a statistical functional. Roughly speaking, a statistical functional is any function of a distribution function. Let’s take an example:

Suppose we are interested in parameters of the population. In statistics, a parameter of interest is very often a function of the distribution function; such parameters are called statistical functionals. For example, the population mean E(X) is a statistical functional: E(X) = ∫ x dF(x).

From the above we can see that the population mean E(X) can be expressed in terms of the population CDF F, which makes it a statistical functional. Of course, the same kind of expression applies to quantities other than the mean, such as the variance.

A statistical functional can be viewed as a quantity describing a feature of the population. The mean, variance, median and quantiles of F are all features of the population. Thus, statistical functionals give us a more rigorous way to define population parameters. Therefore, we can say our statistic M can be written as M = g(F), with F the population CDF.
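Written out, the features mentioned above take the following forms as functionals of F. This is the standard textbook rendering, not a formula recovered from the original figures, and the subscripted g symbols are labels introduced here for illustration only.

```latex
\begin{aligned}
\text{mean:}\quad & \mu = g_{\mu}(F) = \int x \, dF(x) \\
\text{variance:}\quad & \sigma^{2} = g_{\sigma^{2}}(F) = \int (x - \mu)^{2} \, dF(x) \\
\text{quantile of level } p\text{:}\quad & q_{p} = g_{q_{p}}(F) = \inf \{\, x : F(x) \ge p \,\}
\end{aligned}
```

The median is the quantile with p = 1/2. In every case the parameter depends on the population only through F, which is what allows an estimate of F to be substituted later.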

(3) Plug-in Principle = EDF + Statistical Functional

We have put our statistic M = g(X1, X2, …, Xn) = g(F) into statistical functional form. However, we don’t know F, so we have to “plug in” an estimator of F into M = g(F) so that M can be evaluated.

This is called the plug-in principle. Generally speaking, the plug-in principle is a method of estimating statistical functionals of a population distribution by evaluating the same functionals on the empirical distribution, which is based on the sample. The resulting estimate is called a plug-in estimate of the population parameter of interest. For example, the median of a population distribution can be approximated by the median of the empirical distribution of a sample. The empirical distribution here is formed from the sample alone, because we don’t know the population. Put simply:

If our parameter of interest, say θ, has the statistical functional form θ = g(F), where F is the population CDF,
then the plug-in estimator of θ = g(F) is defined to be θ_hat = g(F_hat).

From the formula above we can see that we “plug in” θ_hat and F_hat for the unknown θ and F. Here F_hat is estimated purely from the sample data.
Note that both θ and θ_hat are determined by the same function g(.).

Let’s take the mean as an example. Here g(.) averages all data points, and the same function applied to F_hat gives the sample mean. F_hat is formed from the sample as an estimator of F. We say the sample mean is a plug-in estimator of the population mean. (A clearer result is given shortly.)

So, what is F_hat? Remember that the bootstrap uses the empirical distribution function (EDF) as an estimator of the population CDF? In fact, the EDF is also the estimator most widely used for F_hat in the plug-in principle.

Let’s take a look at what our estimator M = g(X1, X2, …, Xn) = g(F) looks like when we plug the EDF into it.

Let the statistic of interest be M = g(X1, X2, …, Xn) = g(F), from a population with CDF F.
We don’t know F, so we build a plug-in estimator for M: M becomes M_hat = g(F_hat).

We know the EDF is a discrete distribution whose probability mass function (PMF) assigns probability 1/n to each of the n observations, so M_hat is evaluated by giving each observation weight 1/n.

Accordingly, for our mean example, the plug-in estimator of the mean μ is just the sample mean: μ_hat = ∫ x dF_hat(x) = (1/n)(X1 + X2 + … + Xn).

Hence, through the plug-in principle, we obtain an estimate of M = g(F), namely M_hat = g(F_hat). Remember that what we want to find is Var(M), and we approximate Var(M) by Var(M_hat). But in general there is no exact formula for Var(M_hat), except for statistics such as the sample mean. This leads us to apply simulation.
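The plug-in idea can be checked numerically. The sketch below evaluates three functionals on the EDF by giving each observation mass 1/n. The function names and the sample are hypothetical, and the quantile uses the "smallest x with F_hat(x) >= p" convention, which is one of several conventions in use.

```python
import math
import statistics

def plug_in_mean(sample):
    """Mean functional evaluated on the EDF: each point carries mass 1/n."""
    n = len(sample)
    return sum(x * (1 / n) for x in sample)

def plug_in_variance(sample):
    """Variance functional on the EDF: divides by n, not n - 1."""
    n = len(sample)
    mu_hat = plug_in_mean(sample)
    return sum((x - mu_hat) ** 2 * (1 / n) for x in sample)

def plug_in_quantile(sample, p):
    """Smallest observation x with F_hat(x) >= p (assumed convention)."""
    ordered = sorted(sample)
    k = math.ceil(p * len(ordered))  # smallest k with k/n >= p
    return ordered[max(k, 1) - 1]

sample = [2.0, 4.0, 4.0, 5.0, 10.0]

print(plug_in_mean(sample), statistics.mean(sample))           # both are the sample mean
print(plug_in_variance(sample), statistics.pvariance(sample))  # both divide by n
print(plug_in_quantile(sample, 0.5))                           # plug-in median
```

Note that the plug-in variance divides by n, so it matches `statistics.pvariance` rather than the usual n - 1 sample variance from `statistics.variance`.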

(4) Bootstrap Variance Estimation

It’s nearly the last step! Let’s review the whole process using the plug-in principle.

Our goal is to estimate the variance of our estimator M, which is Var(M). The bootstrap principle is as follows:

1. We don’t know the population P, whose CDF is denoted F, so the bootstrap uses the empirical distribution function (EDF) as an estimate of F.
2. Use our existing sample data to form an EDF, which serves as the estimated population.
3. Apply the plug-in principle so that M = g(F) can be evaluated with the EDF. M = g(F) thus becomes M_hat = g(F_hat), the plug-in estimator with the EDF F_hat.
4. Use simulation to approximate Var(M_hat).

Recall that in the original version of the simulation, we draw a sample from the population, obtain a statistic M = g(F) from it, repeat the procedure B times, and then take the variance of these B statistics to approximate the true variance of the statistic.

Therefore, to do the simulation in step 4, we need to:

1. Draw a sample from the EDF.
2. Obtain the plug-in statistic M_hat = g(F_hat).
3. Repeat the two steps above B times.
4. Take the variance of these B statistics to approximate the true variance of the plug-in statistic. (This part is easily confused.)

What is this simulation? In fact, it is the bootstrap sampling process that we mentioned at the beginning of this article!

Two questions here (I promise these are the last two!):

What does drawing from the EDF in step 1 look like?
How does this simulation work?

What does drawing from the EDF look like?

We know the EDF builds a CDF from the existing sample data X1, …, Xn, and by definition it puts mass 1/n at each sample data point. Therefore, drawing a random sample from the EDF can be seen as drawing n observations, with replacement, from our existing sample data X1, …, Xn. That is why the bootstrap sample is drawn with replacement, as shown before.
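In code, drawing from the EDF is a single resampling call. The sketch below uses Python's standard `random.Random.choices`, which samples with replacement and with equal weights by default. The data values and the helper name are made up.

```python
import random

def draw_bootstrap_sample(sample, rng):
    """Draw n observations with replacement from the sample.

    Every original observation is equally likely (probability 1/n) on each
    draw, which is exactly what sampling from the EDF means.
    """
    return rng.choices(sample, k=len(sample))

rng = random.Random(42)
original = [12, 15, 9, 20, 17, 11]
resample = draw_bootstrap_sample(original, rng)

print("original:", original)
print("resample:", resample)
```

A resample has the same size as the original data, usually repeats some observations and omits others, and never contains a value that was not observed.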

How does the simulation work?

The variance of the plug-in estimator M_hat = g(F_hat) is what the bootstrap simulation aims to approximate. At the start of the simulation, we draw observations with replacement from our existing sample data X1, …, Xn. Let’s denote these resampled data X1*, …, Xn*. Now, let’s compare the bootstrap simulation with our original simulation again.

Original simulation process for Var(M=g(F)):

Original Simulation Version: Approximate EST_Var(M|F) with known F

Let X1, X2, …, Xn be a random sample from a population P and assume M = g(X1, X2, …, Xn) is the statistic of interest. We can approximate the variance of the statistic M by simulation as follows:

1. Draw a random sample of size n from P.
2. Compute the statistic for the sample.
3. Repeat steps 1 and 2 B times to get B statistics.
4. Compute the variance of these B statistics.

This is the same as the earlier simulation for Var(M).

Bootstrap Simulation for Var(M_hat=g(F_hat))

Bootstrap Simulation Version: Approximate Var(M_hat|F_hat) with the EDF

Now let X1, X2, …, Xn be a random sample from a population P with CDF F, and assume M = g(X1, X2, …, Xn; F) is the statistic of interest. But we don’t know F, so we:

1. Form an EDF from the existing sample data and draw n observations with replacement from X1, …, Xn. These are denoted X1*, X2*, …, Xn*, and we call them a bootstrap sample.
2. Compute the statistic M_hat = g(X1*, X2*, …, Xn*; F_hat) for the bootstrap sample.
3. Repeat steps 1 and 2 B times to get B statistics M_hat.
4. Compute the variance of these B statistics to approximate Var(M_hat).

Simulation for Var(M_hat).
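The bootstrap version of the simulation can be sketched as below. The data are synthetic: 40 draws from a normal distribution with mean 50 and standard deviation 10, standing in for a real sample. The function name, B = 2000 and the seeds are arbitrary choices for illustration. For the sample mean a textbook formula exists (s²/n), which gives a sanity check. For the median no such simple formula exists, which is where the bootstrap earns its keep.

```python
import random
import statistics

def bootstrap_variance(sample, statistic, B=2000, seed=0):
    """Approximate Var(M_hat) by resampling from the EDF B times."""
    rng = random.Random(seed)
    n = len(sample)
    replicates = []
    for _ in range(B):
        resample = rng.choices(sample, k=n)      # step 1: bootstrap sample
        replicates.append(statistic(resample))   # step 2: statistic on it
    return statistics.variance(replicates)       # step 4: variance of B statistics

# Synthetic stand-in for observed data (an assumption, not data from the text).
data_rng = random.Random(1)
sample = [data_rng.gauss(50, 10) for _ in range(40)]

boot_var_mean = bootstrap_variance(sample, statistics.mean)
formula_var_mean = statistics.variance(sample) / len(sample)
boot_var_median = bootstrap_variance(sample, statistics.median)

print("bootstrap Var(mean):  ", boot_var_mean)
print("formula   Var(mean):  ", formula_var_mean)
print("bootstrap Var(median):", boot_var_median)
```

The square root of each variance is the corresponding bootstrap standard error, which is the quantity the initial scenario set out to estimate.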

Do the processes above look familiar? In fact, this is the same process as the bootstrap sampling method we mentioned before!

III. Why Does the Bootstrap Work?

Finally, let’s check how well our simulation works. The approximation we get from this bootstrap simulation is for Var(M_hat), but what we really care about is whether Var(M_hat) approximates Var(M). So there are two questions here:

Does the result of the bootstrap variance simulation, S², approximate Var(M_hat) well?
Does Var(M_hat) approximate Var(M)?

To answer these, let’s look at the two types of error in turn:

From the bootstrap variance estimation, we get an estimate of Var(M_hat), the plug-in estimate of Var(M). The Law of Large Numbers tells us that if the number of simulation replicates B is large enough, the bootstrap variance estimate S² is a good approximation of Var(M_hat). Fortunately, we can make B as large as we like with the aid of a computer, so this simulation error can be made small.
The variance of M_hat is the plug-in estimate of the variance of M under the true F. Is Var(M_hat; F_hat) a good estimator of Var(M; F)? In other words, does a plug-in estimator approximate the estimator of interest well? That is the key point we really care about. The asymptotic properties of plug-in estimators belong to advanced mathematical statistics, but let’s explain the main issues and ideas.

First, we know the empirical distribution function converges to the true distribution function when the sample size is large, say F_hat → F.
Second, if F_hat → F and the corresponding statistical functional g(.) satisfies certain smoothness conditions, then g(F_hat) → g(F). In our case, the statistical functional g(.) is the variance, which satisfies the required continuity conditions. That explains why the bootstrap variance is a good estimate of the true variance of the estimator M.

In general, the smoothness conditions on a functional are difficult to verify. Fortunately, most common statistical functionals, such as the mean, variance or moments, satisfy the required continuity conditions, which is what makes bootstrapping work. And of course, the original sample size should not be too small.
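The effect of sample size on the second kind of error can be illustrated with a small experiment. It assumes a hypothetical Exponential(1) population, chosen because the true variance of the sample mean is then exactly 1/n and can be compared with the bootstrap estimate. The helper names, the seed and B = 1000 are arbitrary.

```python
import random

def mean(values):
    return sum(values) / len(values)

def sample_variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def bootstrap_variance_of_mean(sample, B, rng):
    """Bootstrap estimate S^2 of the variance of the sample mean."""
    n = len(sample)
    replicates = [mean(rng.choices(sample, k=n)) for _ in range(B)]
    return sample_variance(replicates)

rng = random.Random(7)
B = 1000
relative_error = {}
for n in (20, 200, 2000):
    # Assumed population: Exponential(rate=1), so Var(sample mean) = 1/n.
    sample = [rng.expovariate(1.0) for _ in range(n)]
    true_var = 1.0 / n
    boot_var = bootstrap_variance_of_mean(sample, B, rng)
    relative_error[n] = abs(boot_var - true_var) / true_var

for n, err in relative_error.items():
    print(f"n={n:5d}  relative error of bootstrap variance: {err:.3f}")
```

Raising B only reduces the simulation error around Var(M_hat). The gap between Var(M_hat) and Var(M) depends on how close F_hat is to F, so it tends to shrink as n grows, although a single run like this one is not guaranteed to decrease at every step.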

Below is my bootstrap sample code for the pickup case; feel free to check it out.

Bootstrap Recap

[Diagram: recap of the main ideas of the bootstrap.]

venereology answered 1 year ago
