Question

QUESTION REGARDING BIG DATA. Please answer those questions as soon as possible.

2) Spark is increasingly popular in the Hadoop market. Describe the following associated
with Spark
a. Spark Core
b. Describe what an RDD is and its importance in a Spark environment
c. The four key APIs of Spark
d. Four reasons why Spark is attractive to users vs Hadoop MapReduce
3) In a Hadoop environment, there are many capabilities which allow for Hadoop to be
integrated as an integral part of a warehouse/analytics ecosystem. There are both open
source options and proprietary options for most. For each of the following tasks, list the
open source and the proprietary option for accomplishing, if they exist.
a. Integrating ETL into a Hadoop environment
b. Creating a Highly Available environment
c. Performing object matching between structured and unstructured data
d. Replicating data between Hadoop clusters
e. Replicating data within a single Hadoop cluster
f. Security and Auditability
g. Running SQL against data in files in HDFS
h. Data movement between a traditional relational and Hadoop environment
i. Dealing with data in motion within a Hadoop environment
j. Administration and Management of a Hadoop environment
4) (a) List the eight file formats that Db2 Big SQL supports, describe the advantages and
disadvantages of each, and provide one example of when you would use each.
(b) List two data types of Db2 Big SQL which are treated differently in Hadoop than
they would be handled in a non-Hadoop environment.
5) Show the SQL commands you would use, with Db2 Big SQL, to accomplish the following
tasks
a. High performing loading of data into Hadoop from a relational database
b. Inserting a single row into a Hadoop file
c. Creating a new Hadoop table in a parquet file
d. Creating a new Hadoop table with a partitioning key
e. Creating a logical schema pointing to a Hadoop csv file which already exists
f. Creating a new table in HDFS in a native Db2 table format
g. Creating a view over two Hadoop tables
h. Dropping a schema called mydb
i. Inserting data into a Hadoop file with the same schema as another data source
j. Running a PL/SQL stored procedure in Db2 Big SQL

Solutions

Expert Solution

2)a) Spark Core is the foundation of the whole project. It provides distributed task dispatching, scheduling, and basic I/O for processing large amounts of data.

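b) An RDD (Resilient Distributed Dataset) is a logical collection of data partitioned across the machines of a cluster. It matters because it is Spark's basic data abstraction: partitions are processed in parallel, and a lost partition can be recomputed from the transformations that produced it. An RDD can be created in two ways:

1. Referencing a dataset in external storage.

2. Applying transformations (map, filter, join, etc.) to an existing RDD.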
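The two creation paths, and the way a partition can be rebuilt from its recorded transformations, can be sketched in plain Python. This is a toy model, not the Spark API: the class `ToyRDD` and its methods are hypothetical names chosen for illustration, and everything runs in a single process.

```python
class ToyRDD:
    """Toy stand-in for an RDD: partitioned source data plus a recorded lineage."""

    def __init__(self, partitions, lineage=()):
        self._partitions = partitions    # source data, already split into partitions
        self._lineage = tuple(lineage)   # transformations to apply, in order

    @classmethod
    def from_collection(cls, data, num_partitions):
        # Way 1: reference an existing dataset and spread it over partitions.
        data = list(data)
        return cls([data[i::num_partitions] for i in range(num_partitions)])

    def map(self, fn):
        # Way 2: derive a new dataset by recording a transformation (nothing runs yet).
        return ToyRDD(self._partitions, self._lineage + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self._partitions, self._lineage + (("filter", pred),))

    def compute_partition(self, index):
        # A single partition can always be rebuilt from the source plus the lineage.
        rows = iter(self._partitions[index])
        for kind, fn in self._lineage:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)

    def collect(self):
        # Action: this is the point where the recorded transformations actually run.
        out = []
        for i in range(len(self._partitions)):
            out.extend(self.compute_partition(i))
        return out


numbers = ToyRDD.from_collection(range(1, 11), num_partitions=3)
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(sorted(even_squares.collect()))
```

Because each derived dataset stores only its source partitions and the list of transformations, recomputing one partition does not require touching the others, which is the idea behind RDD fault tolerance.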

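c) Spark is an open source platform that is easy to use and easy to deploy. It offers programming interfaces for Java, Scala, Python, and R, plus SQL, for filter, select, join, map, reduce, and query-based programs. On top of Spark Core it ships four main libraries: Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph processing).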

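d) 1. Spark performs much better than Hadoop MapReduce. In the 2014 Daytona GraySort benchmark, Spark sorted 100 TB of data about three times faster than the previous Hadoop MapReduce record while using roughly a tenth of the machines.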

2. Spark is not tied to one storage layer or cluster manager: it can run standalone or on YARN, and it reads from HDFS as well as other data stores.

3. Spark is a completely open source platform, i.e. there is no licence cost to install it.

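4. Spark is fault-tolerant because of how RDDs work: a lost partition is recomputed from its lineage of transformations rather than restored from replicated copies.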

3)a) To implement ETL with Hadoop, the following steps are required:

1. Set up a Hadoop cluster.

2. Connect the data sources.

3. Design the metadata.

4. Create the ETL jobs.

5. Implement the workflow.

3)b) Creating a highly available environment requires the following steps:

1. Set up time synchronization on the Control Hub machines.

2. Set up a load balancer for Control Hub, configuring the HTTP header X-Forwarded-For, the IP address and port number of each Control Hub instance, the HTTPS protocol, the URLs, etc.

3. Install the initial Control Hub instance for the highly available environment.

4. Install additional Control Hub instances: install Control Hub on a separate machine, download the JDBC driver for the connections, download the required files, import the variables, and configure the directories.

5. Initialize the Control Hub instances for use.

3)c) Approaches to object matching between structured and unstructured data are:

1. No match, when the two data types have nothing in common.

2. Textual matching.

3. Hard-wired matching.

4. Probabilistic matching.

5. Metadata matching.

3)d) Replicating data between Hadoop clusters: To replicate data from one Hadoop cluster to another, we need a source cluster that holds the data and a target cluster to receive it. Files or directories can be copied between clusters with the hadoop 'distcp' command. You must include a credentials file in your copy request so that you can be validated as authenticated to both the source cluster and the target cluster.
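As a rough illustration, the copy request can be assembled as an argument list before it is handed to a shell or job scheduler. The helper name `distcp_command` and the NameNode host names are assumptions made up for this sketch; only `hadoop distcp` and its `-update` option come from the Hadoop CLI. The helper builds the command and does not run it, and it leaves out the credentials handling mentioned above because that depends on the environment.

```python
def distcp_command(source_uri, target_uri, update=False):
    """Build the argument list for a cluster-to-cluster copy with distcp."""
    args = ["hadoop", "distcp"]
    if update:
        # -update copies only files that are missing or differ on the target.
        args.append("-update")
    args += [source_uri, target_uri]
    return args


# Hypothetical NameNode addresses for the source and target clusters.
cmd = distcp_command(
    "hdfs://source-nn:8020/data/sales",
    "hdfs://target-nn:8020/data/sales",
    update=True,
)
print(" ".join(cmd))
```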

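3)e) Within a single cluster, HDFS replicates each block across DataNodes according to the replication factor (the dfs.replication property, 3 by default). On a single-node cluster you cannot have more than one copy of a block, because there is only one DataNode to hold it.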

3)f) The security features available in Spark are:

1. Encryption

2. Audit

3. Authentication

4. Authorization

Auditability of Spark: Spark jobs run on YARN and read from HDFS, so the audit logs for YARN and HDFS access still apply, and you can use Apache Ranger to view them.

3)g) Running SQL against data in files in HDFS: SQL (Structured Query Language) can be used to query data stored in HDFS. There are several ways to query Hadoop with SQL:

1. Apache Hive

2. Stinger

3. Apache Drill

4. Spark SQL

5. Apache Phoenix

6. Cloudera Impala

7. Oracle Big Data SQL

etc.

3)h) Data movement between a traditional relational and Hadoop environment: Apache Hadoop is a comprehensive ecosystem that now features many open source components, which can fundamentally change an enterprise's approach to storing, processing, and analyzing data. Unlike traditional relational database management systems, Hadoop enables different types of analytical workloads to run on the same set of data and can manage data volumes at a massive scale. CDH, Cloudera's open source platform, is one example of a popular Hadoop distribution.

3)i) In the Hadoop ecosystem, data is stored in both structured and unstructured formats. Several operations are available for working with it, such as data entry, data extraction, and data manipulation. These operations can be performed through different means, such as SQL or program code.

3)j) In the Hadoop world, a Systems Administrator is called a Hadoop Administrator. The Hadoop administrator's roles and responsibilities include setting up Hadoop clusters. Other duties involve backup, recovery, and maintenance. Hadoop administration requires good knowledge of hardware systems and an excellent understanding of the Hadoop architecture.

A Hadoop Administrator deals with:

1. Cluster Maintenance

2. Resource Management

3. Security Management

4. Troubleshooting

5. Cluster Monitoring

6. Backup and Recovery Tasks

The Hadoop maintenance commands let the cluster enter or leave safe mode, which is also called maintenance mode. In this mode, the NameNode does not accept any changes to the namespace, and it does not replicate or delete blocks. The available operations are:

1.Enter Safe mode

2.Leave Safe Mode.

3.Get the status
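These three operations correspond to the `enter`, `leave`, and `get` actions of `hdfs dfsadmin -safemode`. A minimal sketch that builds the matching command lines is shown below; the helper name `safemode_command` is an assumption for illustration, and the commands are only assembled, not executed.

```python
SAFEMODE_ACTIONS = ("enter", "leave", "get")


def safemode_command(action):
    """Return the argument list for one safe mode operation."""
    if action not in SAFEMODE_ACTIONS:
        raise ValueError(f"unsupported safe mode action: {action!r}")
    return ["hdfs", "dfsadmin", "-safemode", action]


for action in SAFEMODE_ACTIONS:
    print(" ".join(safemode_command(action)))
```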

4)a) The file formats that Db2 Big SQL supports include:

1.Optimized Row Columnar (ORC) =

The ORC file format provides a highly efficient way to store data. ORC files store collections of rows in a columnar format, which enables parallel processing of row collections across your cluster. As of Big SQL 5.0.2, the ORC file format is recommended for optimal performance and functionality.

2. Parquet = The Parquet file format is an open source columnar storage format for Hadoop that supports efficient compression and encoding schemes. The following statement sets the HDFS block size to 256 MB (268435456 bytes):

SET HADOOP PROPERTY 'dfs.blocksize' = 268435456

3.Text=

The text file format is the default storage format for a table. The underlying data is stored in delimited form with one record per line and new line characters separating individual records.

You can specify delimiters by using the ROW FORMAT DELIMITED clause in the CREATE TABLE (HADOOP) statement. For example:

ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' ESCAPED BY '\\'
LINES TERMINATED BY '\n'
NULL DEFINED AS '\N'

4. Avro = Avro, an Apache open source project, provides a convenient way to represent complex data structures within the Hadoop environment. By using an Avro SerDe in your CREATE TABLE (HADOOP) statement, you can read or write Avro data as Big SQL tables. The following Avro data types are mapped to Big SQL data types:

Double

Boolean

Integer

Syntax=

TBLPROPERTIES (
'avro.schema.literal' =
'{"namespace": "com.howdy",
"name": "some_schema",
"type": "record",
"fields": [{ "name":"string1","type":"string"}]}'
)
...

5. Record Columnar (RC) = The RC file format is an efficient, high-performance format that uses binary key/value pairs. It partitions rows horizontally into row splits and then partitions each row split vertically. The metadata pertaining to a row split is the key part, and all of the actual data in the row split is stored as the value part of a record.

CREATE TABLE my_table (
i INT,
s STRING)
ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe"
STORED AS RCFILE;

6.Sequence= The sequence file format is used to hold arbitrary data that might not otherwise be splittable. For example, in a text file, newline characters (\n) are used to determine the boundaries of a record, so a DFS block can be processed simply by looking for newline characters. However, if the data in that file is in binary form or is compressed with an algorithm that does not maintain markers for a record boundary, reading that block is impossible.

bzip2

gzip

snappy etc...

4)b). Array = The ARRAY type can be an ordinary ARRAY or an associative ARRAY. You can use the ARRAY syntax that Hive uses to define an ordinary ARRAY data type, or the MAP syntax that Hive uses to define an associative ARRAY data type, but the SQL standard is recommended.

Row = The ROW type contains field definitions that contain a field name and a data type. You can use the STRUCT syntax that Hive uses to define a ROW data type, but the SQL standard is recommended.

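5)a) High-performing loading of data into Hadoop from a relational database: a parallel load splits the source table on a column so that each task reads one range of rows. Each task then runs a range query of the following form: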

select * from table where split_by_col>=1 and split_by_col<2500
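A range query like the one above is typically one of several, each covering a slice of the split column so that the slices can be read in parallel. The sketch below shows one way such ranges could be derived; the function names `split_ranges` and `range_queries` and the bounds 1 to 10000 are assumptions for illustration, not part of any Big SQL or Sqoop API. Identifiers are interpolated without escaping, which is acceptable only because this is a toy.

```python
def split_ranges(lower, upper, num_tasks):
    """Divide the half-open interval [lower, upper) into contiguous ranges."""
    if num_tasks < 1 or upper <= lower:
        raise ValueError("need num_tasks >= 1 and upper > lower")
    total = upper - lower
    bounds = [lower + (total * i) // num_tasks for i in range(num_tasks + 1)]
    # Drop empty ranges, which appear when there are more tasks than values.
    return [(lo, hi) for lo, hi in zip(bounds, bounds[1:]) if lo < hi]


def range_queries(table, column, lower, upper, num_tasks):
    """Build one range query per task over the split column."""
    return [
        f"select * from {table} where {column}>={lo} and {column}<{hi}"
        for lo, hi in split_ranges(lower, upper, num_tasks)
    ]


queries = range_queries("table", "split_by_col", 1, 10000, 4)
for q in queries:
    print(q)
```

With these assumed bounds and four tasks, the first generated query is the same as the one shown above, and the remaining three cover the rest of the column's value range without overlap.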

5)b) Inserting a single row into a Hadoop file: to insert one row into a table, the command is:

INSERT INTO foo (id, name) VALUES (12, 'xyz');

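5)c) Creating a new Hadoop table in a parquet file: the CREATE TABLE statement creates a new table and specifies its characteristics. The general syntax below follows Impala's CREATE TABLE ... LIKE PARQUET form, which derives the column definitions from an existing Parquet file. The optional clauses cover the comment, partitioning, SerDe properties, storage format, location, table properties, and caching: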

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
LIKE PARQUET 'hdfs_path_of_parquet_file'
[COMMENT 'table_comment']
[PARTITIONED BY (col_name data_type [COMMENT 'col_comment'], ...)]
[WITH SERDEPROPERTIES ('key1'='value1', 'key2'='value2', ...)]
[
[ROW FORMAT row_format] [STORED AS file_format]
]
[LOCATION 'hdfs_path']
[TBLPROPERTIES ('key1'='value1', 'key2'='value2', ...)]
[CACHED IN 'pool_name' [WITH REPLICATION = integer] | UNCACHED]
data_type:
primitive_type
| array_type
| map_type
| struct_type

