For example, a company has 10 edge sites (two on each continent): europe-east, europe-west, asia-east, asia-west, africa-north, africa-south, americas-north, americas-south, australia-east, and australia-west. Each edge site has 10 servers serving web content, named server1 through server10. Every hour on the hour, you want to bring HTTP web access log files into a centralized Hadoop Distributed File System located in the Chicago data center. How would you organize the directory structure in HDFS under /data/web/access-logs? Illustrate your directory design and explain why you organized it that way.
Given: We have 10 edge sites (two per continent) with 10 servers each, and we receive data from these servers on an hourly basis.
Solution:
We should have three folders within /data/web/access-logs: raw_logs for raw incoming logs, archive for archived raw log files, and processed_logs for processed logs.
Structure of raw_logs:
The data from each server will be kept in a ServerName/YYYYMMDD structure, so the folders will look like the sketch below.
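A minimal sketch of the layout (the server names and dates are illustrative; processed_logs and archive mirror the same ServerName/YYYYMMDD structure):

/data/web/access-logs/
    raw_logs/
        europe-west-server1/
            20240101/
            20240102/
        europe-west-server2/
            20240101/
        ...
    processed_logs/
        europe-west-server1/
            20240101/
        ...
    archive/
        europe-west-server1/
            20240101/
        ...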
This HDFS location will receive hourly data from the log servers. Let's assume the data arrives with a 10-minute delay, so at 10:10 we receive the logs for the 9:00 to 10:00 window. We should schedule another hourly job (Spark/MapReduce) that processes the log files from unstructured/semi-structured into structured data and stores it in ORC format in the processed_logs folder, with the same folder structure as raw_logs. After storing the processed files in processed_logs, move the raw files to the archive/ServerXX/YYYYMMDD folder for future use, leaving raw_logs clean for the next batch. A sketch of such a job follows.
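Below is a minimal PySpark sketch of that hourly job, assuming the layout above. The server name, the date, and the parse_line parser are illustrative placeholders, not a fixed log format or production code.

# hourly_etl.py - illustrative sketch of the hourly raw -> ORC -> archive job
from datetime import datetime, timedelta
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("access-log-etl").getOrCreate()

BASE = "/data/web/access-logs"
# The job runs ~10 minutes past the hour and processes the previous hour's batch.
batch_date = (datetime.now() - timedelta(hours=1)).strftime("%Y%m%d")
server = "europe-west-server1"  # in practice, loop over all servers

raw_path = f"{BASE}/raw_logs/{server}/{batch_date}"
processed_path = f"{BASE}/processed_logs/{server}/{batch_date}"
archive_path = f"{BASE}/archive/{server}/{batch_date}"

def parse_line(line):
    # Hypothetical parser: pull a few fields out of a combined-log-format line.
    parts = line.split(" ")
    return Row(ip=parts[0],
               ts=parts[3].lstrip("["),
               request=" ".join(parts[5:8]).strip('"'))

# Read the raw text logs, structure them, and write ORC for Hive to query.
structured = spark.sparkContext.textFile(raw_path).map(parse_line).toDF()
structured.write.mode("overwrite").orc(processed_path)

# Move the raw files into the archive via the Hadoop FileSystem API.
hadoop = spark.sparkContext._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem.get(spark.sparkContext._jsc.hadoopConfiguration())
fs.mkdirs(hadoop.fs.Path(archive_path).getParent())
fs.rename(hadoop.fs.Path(raw_path), hadoop.fs.Path(archive_path))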
Why do we need raw_logs, archive and processed_logs?
The reason for organizing it this way is that the incoming hourly logs land in raw_logs; we process the data into a structured format, so it can be queried directly from Hive, and save it in processed_logs; and we archive the original logs in the archive folder for future use, in case a new analytical use case requires mining old logs.
Why am I storing data in ServerXX/YYYYMMDD format?
We store the data in this format to take advantage of Hive partitioning: the data is partitioned on server and on date. Most queries will filter on server and date, e.g. "How many failed login attempts were there in europe-west yesterday?" Partitioning on server and date speeds up such queries dramatically, because only the matching partitions are scanned. We can change the partitioning scheme based on analytical requirements. A sketch of the table definition follows.
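As a hedged sketch, a Hive external table over processed_logs might be defined through Spark SQL as below; the table and column names are illustrative. Because the directories use a plain ServerName/YYYYMMDD layout rather than Hive's server=.../log_date=... convention, each partition's location is registered explicitly.

# define_table.py - illustrative Hive table over the processed ORC files
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("access-log-ddl").enableHiveSupport().getOrCreate()

# External table partitioned on server and date; queries filtering on these
# columns read only the matching directories (partition pruning).
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS access_logs (
        ip STRING,
        ts STRING,
        request STRING
    )
    PARTITIONED BY (server STRING, log_date STRING)
    STORED AS ORC
    LOCATION '/data/web/access-logs/processed_logs'
""")

# Register each hourly batch's directory as a partition; the location is
# given explicitly since directory names do not follow key=value naming.
spark.sql("""
    ALTER TABLE access_logs ADD IF NOT EXISTS
    PARTITION (server='europe-west-server1', log_date='20240101')
    LOCATION '/data/web/access-logs/processed_logs/europe-west-server1/20240101'
""")

# Only the europe-west partitions for that date are scanned.
spark.sql("""
    SELECT count(*) FROM access_logs
    WHERE server LIKE 'europe-west%' AND log_date = '20240101'
""").show()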
For querying by continent, we can keep a small mapping table for the servers, (ServerNo, ServerLocation, Continent), which lets us query on continent too; a sketch follows.
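For instance, a minimal sketch of such a mapping table joined against the log table defined above (the rows and names are illustrative):

# continent_query.py - illustrative continent-level query via a mapping table
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("continent-query").enableHiveSupport().getOrCreate()

# Small (ServerNo, ServerLocation, Continent) dimension table.
rows = [
    ("europe-west-server1", "europe-west", "Europe"),
    ("europe-east-server1", "europe-east", "Europe"),
    ("asia-east-server3", "asia-east", "Asia"),
]
mapping = spark.createDataFrame(rows, ["server", "server_location", "continent"])
mapping.createOrReplaceTempView("server_mapping")

# Count requests per continent by joining the logs to the mapping table.
spark.sql("""
    SELECT m.continent, count(*) AS requests
    FROM access_logs a
    JOIN server_mapping m ON a.server = m.server
    GROUP BY m.continent
""").show()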