This question is on Apache Spark setup on both a local machine and the Amazon Web Services (AWS) cloud platform. Please include detailed elaboration and screenshots for all key steps.
- Discuss how to set up Apache Spark using an Amazon EMR cluster. Demonstrate how to plan and execute any one of the built-in PySpark examples in non-interactive mode.
Solution :-
Apache Spark is a distributed processing framework and programming model that supports machine learning, stream processing, and graph analytics on Amazon EMR clusters. Like Apache Hadoop, Spark is an open-source, distributed processing system commonly used for big data workloads.
We can install Spark on an EMR cluster along with other Hadoop applications, and Spark can also use the EMR File System (EMRFS) to access data directly in Amazon S3, as in the sketch below.
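As a brief illustration of that EMRFS access, this minimal PySpark sketch reads and writes s3:// paths directly from a job running on the cluster. The bucket, file paths, and column name are placeholders, not values from the question.

```python
from pyspark.sql import SparkSession

# On an EMR cluster, s3:// paths resolve through EMRFS, so Spark can read
# and write objects in Amazon S3 directly. Bucket and paths are placeholders.
spark = SparkSession.builder.appName("EmrfsDemo").getOrCreate()

# Read a CSV file straight from S3.
df = spark.read.csv("s3://my-bucket/input/data.csv", header=True)

# Write row counts per value of a (hypothetical) column back to S3.
df.groupBy("category").count().write.parquet("s3://my-bucket/output/counts/")

spark.stop()
```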
To launch a cluster with Spark installed:
- Open the Amazon EMR console.
- Choose Create cluster to use Quick Create.
- For Software Configuration, choose an Amazon Release Version.
- For Select Applications, choose either All Applications or Spark.
- Select other options as necessary and then choose Create cluster. (A scripted equivalent is sketched below.)
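The same launch can be scripted. The boto3 sketch below is a hypothetical equivalent of the Quick Create steps above (the cluster name, region, instance types, roles, and log bucket are placeholders), and it also queues one step that plans and executes a built-in PySpark example, the pi estimator shipped with Spark on EMR, non-interactively via spark-submit:

```python
import boto3

# Hypothetical scripted equivalent of the console launch above.
# Names, region, instance types, and the log bucket are placeholders.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="spark-demo-cluster",
    ReleaseLabel="emr-6.9.0",            # any recent Amazon release version
    Applications=[{"Name": "Spark"}],    # mirrors choosing "Spark" in the console
    LogUri="s3://my-emr-logs/",          # placeholder log bucket
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate after the step finishes
    },
    JobFlowRole="EMR_EC2_DefaultRole",   # default EMR instance profile
    ServiceRole="EMR_DefaultRole",       # default EMR service role
    Steps=[
        {
            # Non-interactive run of one of the built-in PySpark examples:
            # the pi estimator that ships with Spark on every EMR node.
            "Name": "builtin-pyspark-pi",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "--deploy-mode", "cluster",
                    "/usr/lib/spark/examples/src/main/python/pi.py",
                    "10",  # number of partitions used by the example
                ],
            },
        }
    ],
)
print("Cluster started:", response["JobFlowId"])
```

Equivalently, once a cluster is running, the same example can be launched by connecting to the master node and running spark-submit --deploy-mode cluster /usr/lib/spark/examples/src/main/python/pi.py 10; the step above simply automates that command through command-runner.jar, and the example's output (the estimated value of pi) appears in the step's stdout logs.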
When running Spark with Docker, make sure the following prerequisites are met:
- The Docker package and CLI are installed only on core and task nodes.
- On Amazon EMR 6.1.0 and later, you can alternatively install Docker on the master node.
- The spark-submit command should always be run from a master instance on the Amazon EMR cluster.
- The Docker registries used to resolve Docker images must be defined using the Classification API, with the container-executor classification key, to define additional parameters when launching the cluster (a sketch follows this list).
- When using Amazon ECR to retrieve Docker images, you must configure the cluster to authenticate itself with Amazon ECR (see the authentication sketch below).
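As a hedged sketch of the container-executor classification mentioned above: the structure below follows the EMR classification format, but the trusted-registry values are placeholders that you would replace with your own registries. The resulting JSON file can be supplied as the cluster's configuration when launching it.

```python
import json

# Sketch of the "container-executor" classification; registry names below
# are placeholders (a local registry, Docker Hub's centos image, and a
# made-up ECR registry endpoint).
container_executor = [
    {
        "Classification": "container-executor",
        "Configurations": [
            {
                "Classification": "docker",
                "Properties": {
                    # Registries that YARN may pull Docker images from.
                    "docker.trusted.registries": "local,centos,123456789012.dkr.ecr.us-east-1.amazonaws.com",
                    "docker.privileged-containers.registries": "local,centos,123456789012.dkr.ecr.us-east-1.amazonaws.com",
                },
            }
        ],
    }
]

with open("container-executor-config.json", "w") as f:
    json.dump(container_executor, f, indent=2)
```

For the ECR authentication, one possible approach (a sketch, assuming boto3 and the Docker CLI are available on the node, and with the region as a placeholder) is to fetch a temporary registry credential and feed it to docker login:

```python
import base64
import subprocess
import boto3

# Fetch a temporary ECR credential; the token is a base64-encoded
# "user:password" pair valid for a limited time.
ecr = boto3.client("ecr", region_name="us-east-1")
auth = ecr.get_authorization_token()["authorizationData"][0]
user, password = base64.b64decode(auth["authorizationToken"]).decode().split(":", 1)

# Log the local Docker daemon in to the registry endpoint returned with the token.
subprocess.run(
    ["docker", "login", "--username", user, "--password-stdin", auth["proxyEndpoint"]],
    input=password.encode(),
    check=True,
)
```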
Thank You Sir!