QUESTION REGARDING BIG DATA. Please answer these questions as soon as possible.
2) Spark is increasingly popular in the Hadoop market. Describe the following associated with Spark:
a. Spark Core
b. Describe what an RDD is and its importance in a Spark environment
c. The four key APIs of Spark
d. Four reasons why Spark is attractive to users vs Hadoop MapReduce
3) In a Hadoop environment, there are many capabilities which allow Hadoop to be integrated as an integral part of a warehouse/analytics ecosystem. There are both open source options and proprietary options for most. For each of the following tasks, list the open source and the proprietary option for accomplishing it, if they exist.
a. Integrating ETL into a Hadoop environment
b. Creating a Highly Available environment
c. Performing object matching between structured and unstructured data
d. Replicating data between Hadoop clusters
e. Replicating data within a single Hadoop cluster
f. Security and Auditability
g. Running SQL against data in files in HDFS
h. Data movement between a traditional relational and Hadoop environment
i. Dealing with data in motion within a Hadoop environment
j. Administration and Management of a Hadoop environment
4) (a) List the eight file formats that Db2 Big SQL supports, describe the advantages and disadvantages of each, and provide one example of when you would use each.
(b) List two data types of Db2 Big SQL which are treated differently in Hadoop than they would be handled in a non-Hadoop environment.
5) Show the SQL commands you would use, with Db2 Big SQL, to accomplish the following tasks:
a. High-performing loading of data into Hadoop from a relational database
b. Inserting a single row into a Hadoop file
c. Creating a new Hadoop table in a parquet file
d. Creating a new Hadoop table with a partitioning key
e. Creating a logical schema pointing to a Hadoop csv file which already exists
f. Creating a new table in HDFS in a native Db2 table format
g. Creating a view over two Hadoop tables
h. Dropping a schema called mydb
i. Inserting data into a Hadoop file with the same schema as another data source
j. Running a PL/SQL stored procedure in Db2 Big SQL
2)a) Spark Core is the foundation of the whole project. It provides distributed task dispatching, scheduling, and basic I/O for handling large amounts of data, and it exposes the RDD abstraction on which the other Spark components are built.
b) An RDD (Resilient Distributed Dataset) is a logical collection of data partitioned across the machines of a cluster. RDDs are important in a Spark environment because they can be cached in memory and rebuilt from their lineage after a failure, which is what makes Spark both fast and fault-tolerant. An RDD can be created in two ways:
1. Referencing a dataset in external storage (for example, a file in HDFS).
2. Applying transformations (map, filter, join, etc.) to an existing RDD.
c) The four key APIs of Spark are Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph processing), all built on top of Spark Core. Spark is an open source platform that is easy to use and easy to embed from Java, Scala, Python, R, and SQL for filter, select, join, map, reduce, and query-based programs.
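As a small illustration of the SQL-facing side, a Spark SQL query can express filter, join, and aggregation declaratively; this is only a sketch, and the orders and customers tables are hypothetical and would have to be registered as tables or views first:
-- Hypothetical tables; Spark SQL runs the query through its optimizer
SELECT c.name, SUM(o.amount) AS total_spent
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.status = 'SHIPPED'
GROUP BY c.name;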
d) 1. Spark's performance is much better than Hadoop MapReduce's; Spark has been used to sort 100 TB of data two to three times faster than Hadoop MapReduce.
2. Spark is not bound to Hadoop's peripheral concerns: it can run standalone, on YARN, or on Mesos, and it keeps intermediate results in memory instead of writing them to disk between stages.
3. Spark is a completely open source platform (Apache licensed), so there are no license costs.
4. Spark is fault-tolerant: RDD lineage lets lost partitions be recomputed automatically rather than requiring the whole job to be rerun.
3)a) To implement ETL with Hadoop, the following steps are required:
1. Set up a Hadoop cluster.
2. Connect the data sources.
3. Design the metadata.
4. Build the ETL jobs.
5. Implement the workflow.
3)b) To create a highly available environment, the following steps are required:
1. Set up time synchronization across the Control Hub machines.
2. Set up a load balancer for Control Hub: configure the HTTP header X-Forwarded-For, the IP address and port of each Control Hub instance, the HTTPS protocol, the URLs, etc.
3. Install the initial Control Hub instance for the highly available environment.
4. Install additional Control Hub instances, each on a separate machine: download the JDBC driver and other required files, import the environment variables, and configure the directories.
5. Initialize the Control Hub instances for use.
3)c) Approaches for performing object matching between structured and unstructured data include:
1. Checking whether there is any mutual (common) element between the two data types.
2. Textual matching.
3. Hard-wired matching.
4. Probabilistic matching.
5. Metadata matching.
3)d) Replicating data between Hadoop clusters: to replicate data from one Hadoop cluster to another, we need a source cluster holding the data and a target cluster to receive it. Files or directories can be copied between clusters with the hadoop 'distcp' command. A credentials file must be included in the copy request so that the source cluster can validate that you are authenticated to both the source cluster and the target cluster.
3)e) Replicating data within a single Hadoop cluster is handled by HDFS itself: each block is copied to multiple DataNodes according to the replication factor (dfs.replication, 3 by default). Note that on a single-node cluster a replication factor greater than one is not useful, because all replicas would sit on the same node.
3)f) Security capabilities available for Spark are:
1. Encryption
2. Auditing
3. Authentication
4. Authorization
Auditability of Spark: Spark jobs run on YARN and read from HDFS, so the audit logs for YARN and HDFS access still apply, and you can use Apache Ranger to view them.
3)g) Running SQL against data in files in HDFS: SQL is a structured query language that can be used to extract data from HDFS. Several ways to query Hadoop with SQL are:
1. Apache Hive
2. Stinger
3. Apache Drill
4. Spark SQL
5. Apache Phoenix
6. Cloudera Impala
7. Oracle Big Data SQL
etc.
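As an illustration, with Apache Hive (and similarly with Db2 Big SQL, which shares the Hive metastore) you can define an external table over files that already sit in HDFS and query them with plain SQL; this is only a sketch, and the /data/weblogs directory and column names are hypothetical:
-- External table over existing comma-delimited files in HDFS (hypothetical path)
CREATE EXTERNAL TABLE weblogs (
  ip  STRING,
  ts  STRING,
  url STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/weblogs';

-- Ordinary SQL over the HDFS files
SELECT url, COUNT(*) AS hits
FROM weblogs
GROUP BY url;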
3)h) Data movement between a traditional relational and Hadoop environment: Apache Hadoop is a comprehensive ecosystem which now features many open source components that can fundamentally change an enterprise's approach to storing, processing, and analyzing data. Unlike traditional relational database management systems, Hadoop enables different types of analytical workloads to run against the same set of data and can also manage data volumes at a massive scale with the supporting hardware and software. Popular distributions of Hadoop include CDH, Cloudera's open source platform.
3)i) In the Hadoop ecosystem, data is stored in both structured and unstructured formats. Several operations are available for dealing with data in motion, such as data ingestion, data extraction, and data manipulation, and these operations can be driven from different sources such as SQL, custom code, etc.
3)j) In the Hadoop world, a systems administrator is called a Hadoop Administrator. Hadoop admin roles and responsibilities include setting up Hadoop clusters; other duties involve backup, recovery, and maintenance. Hadoop administration requires good knowledge of hardware systems and an excellent understanding of the Hadoop architecture.
A Hadoop Administrator deals with:
1. Cluster maintenance
2. Resource management
3. Security management
4. Troubleshooting
5. Cluster monitoring
6. Backup and recovery tasks
The Hadoop maintenance facilities let the cluster enter or leave safe mode, which is also called maintenance mode. In this mode, the NameNode does not accept any changes to the namespace and does not replicate or delete blocks. The administrator can:
1. Enter safe mode
2. Leave safe mode
3. Get the safe mode status
4)a) The file formats that Db2 Big SQL supports include the following:
1.Optimized Row Columnar (ORC) =
The ORC file format provides a highly efficient way to store data. ORC files store collections of rows in a columnar format, which enables parallel processing of row collections across your cluster. As of Big SQL 5.0.2, the ORC file format is recommended for optimal performance and functionality.
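A minimal sketch of creating an ORC-backed Big SQL table (the table and column names are only illustrative):
-- ORC storage is selected with the STORED AS clause
CREATE HADOOP TABLE sales_orc (
  order_id INT,
  amount   DECIMAL(10,2)
)
STORED AS ORC;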
2.Parquet = The Parquet file format is an open source columnar storage format for Hadoop that supports efficient compression and encoding schemes. For example, the HDFS block size can be adjusted before creating a Parquet table:
SET HADOOP PROPERTY 'dfs.blocksize' = 268435456
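A minimal sketch of a Parquet-backed table to go with the property setting above (table and column names are illustrative):
-- Parquet storage is selected with the STORED AS clause
CREATE HADOOP TABLE sales_parquet (
  order_id INT,
  amount   DECIMAL(10,2)
)
STORED AS PARQUETFILE;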
3.Text=
The text file format is the default storage format for a table. The underlying data is stored in delimited form with one record per line and new line characters separating individual records.
You can specify delimiters by using the ROW FORMAT DELIMITED clause in the CREATE TABLE (HADOOP) statement. For example:
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' ESCAPED BY '\\'
LINES TERMINATED BY '\n'
NULL DEFINED AS '\N'
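Putting the clauses together, a sketch of a complete delimited text table (names are illustrative):
-- Text is the default format; STORED AS TEXTFILE makes it explicit
CREATE HADOOP TABLE contacts_txt (
  id   INT,
  name VARCHAR(50)
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
  NULL DEFINED AS '\N'
STORED AS TEXTFILE;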
4.Avro = Avro, an Apache open source project, provides a convenient way to represent complex data structures within the Hadoop environment. By using an Avro SerDe in your CREATE TABLE (HADOOP) statement, you can read or write Avro data as Big SQL tables. Avro data types such as the following are mapped to Big SQL data types:
Double
Boolean
Integer
Syntax=
TBLPROPERTIES (
'avro.schema.literal' =
'{"namespace": "com.howdy",
"name": "some_schema",
"type": "record",
"fields": [{ "name":"string1","type":"string"}]}'
)
...
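A sketch of a complete table definition that uses the standard Hive Avro SerDe classes together with the schema literal above (the table name is illustrative, and depending on the Big SQL version the column list may have to match the Avro schema exactly):
CREATE HADOOP TABLE events_avro (
  string1 STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES (
  'avro.schema.literal' =
    '{"namespace": "com.howdy",
      "name": "some_schema",
      "type": "record",
      "fields": [{"name": "string1", "type": "string"}]}'
);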
5.Record Columnar (RC) = The RC file format is an efficient, high-performance format that uses binary key/value pairs. It partitions rows horizontally into row splits and then partitions each row split vertically. The metadata pertaining to a row split is the key part, and all of the actual data in the row split is stored as the value part of a record. For example:
CREATE TABLE my_table (
i INT,
s STRING)
ROW FORMAT SERDE
"org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe"
STORED AS RCFILE;
6.Sequence= The sequence file format is used to hold arbitrary data that might not otherwise be splittable. For example, in a text file, newline characters (\n) are used to determine the boundaries of a record, so a DFS block can be processed simply by looking for newline characters. However, if the data in that file is in binary form or is compressed with an algorithm that does not maintain markers for a record boundary, reading that block is impossible.
Common compression codecs used with sequence files include:
bzip2
gzip
Snappy
etc.
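A minimal sketch of a sequence-file table (names are illustrative):
-- Sequence files hold binary key/value records
CREATE HADOOP TABLE raw_events_seq (
  event_key VARCHAR(64),
  payload   VARCHAR(2000)
)
STORED AS SEQUENCEFILE;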
4)b). Array = The ARRAY type can be an ordinary ARRAY or an associative ARRAY. You can use the ARRAY syntax that Hive uses to define an ordinary ARRAY data type, or the MAP syntax that Hive uses to define an associative ARRAY data type, but the SQL standard is recommended.
Row = The ROW type contains field definitions that contain a field name and a data type. You can use the STRUCT syntax that Hive uses to define a ROW data type, but the SQL standard is recommended.
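A sketch of a Hadoop table that uses both types in the SQL-standard style (names and sizes are illustrative, and the exact syntax should be checked against your Big SQL version):
CREATE HADOOP TABLE customer_profile (
  id      INT,
  phones  VARCHAR(20) ARRAY[5],                      -- ordinary ARRAY
  prefs   VARCHAR(20) ARRAY[VARCHAR(10)],            -- associative ARRAY
  address ROW(street VARCHAR(30), city VARCHAR(20))  -- ROW type
);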
5)a) High-performing loading of data into Hadoop from a relational database: the extraction from the source table can be split into ranges on a numeric column so that several tasks load in parallel, each running a bounded query such as:
select * from table where split_by_col >= 1 and split_by_col < 2500
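In Db2 Big SQL the usual high-throughput path is the LOAD HADOOP statement, which pulls from a JDBC source with parallel tasks; the following is only a sketch, with the connection URL, credentials, and table names all hypothetical:
-- Parallel load from a relational source into a Big SQL Hadoop table
LOAD HADOOP USING JDBC CONNECTION URL 'jdbc:db2://dbserver:50000/SALESDB'
  WITH PARAMETERS (user = 'dbuser', password = 'dbpassword')
  FROM TABLE SALES_2019
  SPLIT COLUMN SALE_ID
  INTO TABLE bigsql_sales
  APPEND;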
5)b) Inserting a single row into a Hadoop file: to insert a row into a Hadoop table called foo, the command is:
INSERT INTO foo (id, name) VALUES (12, 'xyz');
5)c) Creating a new Hadoop table in a parquet file: the CREATE TABLE statement creates a new table and specifies its characteristics. While creating a table, you can optionally specify aspects such as the following:
CREATE [EXTERNAL] TABLE [IF NOT EXISTS]
[db_name.]table_name
LIKE PARQUET 'hdfs_path_of_parquet_file'
[COMMENT 'table_comment']
[PARTITIONED BY (col_name data_type [COMMENT 'col_comment'],
...)]
[WITH SERDEPROPERTIES ('key1'='value1', 'key2'='value2',
...)]
[
[ROW FORMAT row_format] [STORED AS file_format]
]
[LOCATION 'hdfs_path']
[TBLPROPERTIES ('key1'='value1', 'key2'='value2', ...)]
[CACHED IN 'pool_name' [WITH REPLICATION = integer] |
UNCACHED]
data_type:
primitive_type
| array_type
| map_type
| struct_type
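A minimal concrete statement in Db2 Big SQL style (table and column names are illustrative):
-- New Hadoop table whose data files are written as Parquet
CREATE HADOOP TABLE inventory_parquet (
  item_id INT,
  qty     INT
)
STORED AS PARQUETFILE;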