For example, a company has 10 edge sites (two on each continent): europe-east, europe-west, asia-east, asia-west, africa-north, africa-south, americas-north, americas-south, australia-east, and australia-west. Each edge site has 10 servers serving web content, named server1 through server10. Every hour on the hour, you want to bring HTTP web access log files into a centralized Hadoop Distributed File System located in the Chicago data center. How would you organize the directory structure in HDFS under /data/web/access-logs? Illustrate your directory design and explain why you organized it that way.
Given: We have 10 edge sites (two per continent) with 10 servers each, and we receive data from these servers on an hourly basis.
Solution:
We should have three folders within /data/web/access-logs: raw_logs for raw incoming logs, archive for archived raw log files, and processed_logs for processed logs.
Structure of raw_logs:
The data from each server will be kept in a ServerName/YYYYMMDD structure, so the folders will look like the sketch below.
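A minimal sketch of the layout (the server names and dates are illustrative; processed_logs and archive mirror the same ServerName/YYYYMMDD structure):

/data/web/access-logs/
    raw_logs/
        europe-west-server1/
            20240101/
            20240102/
        europe-west-server2/
            20240101/
        ...
    processed_logs/
        europe-west-server1/
            20240101/
        ...
    archive/
        europe-west-server1/
            20240101/
        ...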
This HDFS location will receive hourly data from the log servers. Let's assume the data arrives with a 10-minute delay, so at 10:10 we receive the logs for the 9:00 to 10:00 window. We should schedule another hourly job (Spark/MapReduce) that processes the log files from unstructured/semi-structured into structured data and stores it in ORC format in the processed_logs folder, with the same folder structure as raw_logs. After storing the processed files in processed_logs, move the raw files to the archive/ServerXX/YYYYMMDD folder for future use, leaving raw_logs clean for the next batch. A sketch of such a job follows.
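Below is a minimal PySpark sketch of that hourly job, assuming the layout above. The server name, the date, and the parse_line parser are illustrative placeholders, not a fixed log format or production code.

# hourly_etl.py - illustrative sketch of the hourly raw -> ORC -> archive job
from datetime import datetime, timedelta
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("access-log-etl").getOrCreate()

BASE = "/data/web/access-logs"
# The job runs ~10 minutes past the hour and processes the previous hour's batch.
batch_date = (datetime.now() - timedelta(hours=1)).strftime("%Y%m%d")
server = "europe-west-server1"  # in practice, loop over all servers

raw_path = f"{BASE}/raw_logs/{server}/{batch_date}"
processed_path = f"{BASE}/processed_logs/{server}/{batch_date}"
archive_path = f"{BASE}/archive/{server}/{batch_date}"

def parse_line(line):
    # Hypothetical parser: pull a few fields out of a combined-log-format line.
    parts = line.split(" ")
    return Row(ip=parts[0],
               ts=parts[3].lstrip("["),
               request=" ".join(parts[5:8]).strip('"'))

# Read the raw text logs, structure them, and write ORC for Hive to query.
structured = spark.sparkContext.textFile(raw_path).map(parse_line).toDF()
structured.write.mode("overwrite").orc(processed_path)

# Move the raw files into the archive via the Hadoop FileSystem API.
hadoop = spark.sparkContext._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem.get(spark.sparkContext._jsc.hadoopConfiguration())
fs.mkdirs(hadoop.fs.Path(archive_path).getParent())
fs.rename(hadoop.fs.Path(raw_path), hadoop.fs.Path(archive_path))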
Why do we need raw_logs, archive and processed_logs?
The reason for organizing it this way is that the incoming hourly logs land in raw_logs; we process the data into a structured format, so it can be queried directly from Hive, and save it in processed_logs; and we archive the original logs in the archive folder for future use, in case a new analytical use case requires mining old logs.
Why am I storing data in ServerXX/YYYYMMDD format?
We store the data in this format to take advantage of Hive partitioning: the data is partitioned on server and on date. Most queries will filter on server and date, e.g. "How many failed login attempts were there in europe-west yesterday?" Partitioning on server and date speeds up such queries dramatically, because only the matching partitions are scanned. We can change the partitioning scheme based on analytical requirements. A sketch of the table definition follows.
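As a hedged sketch, a Hive external table over processed_logs might be defined through Spark SQL as below; the table and column names are illustrative. Because the directories use a plain ServerName/YYYYMMDD layout rather than Hive's server=.../log_date=... convention, each partition's location is registered explicitly.

# define_table.py - illustrative Hive table over the processed ORC files
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("access-log-ddl").enableHiveSupport().getOrCreate()

# External table partitioned on server and date; queries filtering on these
# columns read only the matching directories (partition pruning).
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS access_logs (
        ip STRING,
        ts STRING,
        request STRING
    )
    PARTITIONED BY (server STRING, log_date STRING)
    STORED AS ORC
    LOCATION '/data/web/access-logs/processed_logs'
""")

# Register each hourly batch's directory as a partition; the location is
# given explicitly since directory names do not follow key=value naming.
spark.sql("""
    ALTER TABLE access_logs ADD IF NOT EXISTS
    PARTITION (server='europe-west-server1', log_date='20240101')
    LOCATION '/data/web/access-logs/processed_logs/europe-west-server1/20240101'
""")

# Only the europe-west partitions for that date are scanned.
spark.sql("""
    SELECT count(*) FROM access_logs
    WHERE server LIKE 'europe-west%' AND log_date = '20240101'
""").show()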
For querying by continent, we can keep a small mapping table for the servers, (ServerNo, ServerLocation, Continent), which lets us query on continent too; a sketch follows.
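For instance, a minimal sketch of such a mapping table joined against the log table defined above (the rows and names are illustrative):

# continent_query.py - illustrative continent-level query via a mapping table
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("continent-query").enableHiveSupport().getOrCreate()

# Small (ServerNo, ServerLocation, Continent) dimension table.
rows = [
    ("europe-west-server1", "europe-west", "Europe"),
    ("europe-east-server1", "europe-east", "Europe"),
    ("asia-east-server3", "asia-east", "Asia"),
]
mapping = spark.createDataFrame(rows, ["server", "server_location", "continent"])
mapping.createOrReplaceTempView("server_mapping")

# Count requests per continent by joining the logs to the mapping table.
spark.sql("""
    SELECT m.continent, count(*) AS requests
    FROM access_logs a
    JOIN server_mapping m ON a.server = m.server
    GROUP BY m.continent
""").show()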