Select two PaaS cloud service providers and analyze each service provider’s capability to provide big-data cloud services using the following framework:
Preprocessing of unstructured data: Describe the solution. Does the capability exist within the offering or is the capability the customer’s responsibility?
Social graphs, API, and visualization tools: What tools exist within the offering, if any?
Data Analytics software tools: What tools exist? And what are their capabilities?
Machine learning: What capabilities exist, if any?
Data governance and security: What processes and tools are embedded in the offering? How are responsibilities identified between vendor and customer?
Hello Learner,
Thanks for your question.
I will be comparing Google Cloud with AWS.
1. Preprocessing Unstructured Data:
Google: Cloud Dataprep
Google provides a serverless solution for preprocessing
unstructured data called Cloud Dataprep. Cloud
Dataprep is operated by Trifacta. The tool helps users
structure, cleanse, and blend data through UI input, without writing
a single line of code. Once you have provided the data to
Dataprep, it uses Cloud Dataflow to process the unstructured data.
Being a serverless solution, Cloud Dataprep can scale on
demand.
AWS: Kinesis Analytics
Kinesis Analytics will read the unstructured data and will
create a schema having a single column. However, we can use AWS
Lambda to create a schema for the unstructured data based on our
requirements. We need to create a schema because Kinesis
Analytics uses SQL to analyze the data.
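To make the Lambda step more concrete, here is a minimal sketch of a preprocessing function in Python. It assumes the usual record contract for Kinesis Analytics preprocessing (base64-encoded `data`, and a response carrying `recordId`, `result`, and `data` for each record). The log-line format and the `parse_line` helper are hypothetical and only there for illustration.

```python
import base64
import json


def parse_line(raw: str) -> dict:
    """Split one unstructured log line into named fields.

    Assumed (hypothetical) input shape: "<timestamp> <level> <free-text message>".
    Raises ValueError if the line has fewer than three parts.
    """
    timestamp, level, message = raw.strip().split(" ", 2)
    return {"timestamp": timestamp, "level": level, "message": message}


def lambda_handler(event, context):
    """Give each incoming record a structure that a SQL schema can map to."""
    output = []
    for record in event["records"]:
        try:
            raw = base64.b64decode(record["data"]).decode("utf-8")
            structured = json.dumps(parse_line(raw))
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(structured.encode("utf-8")).decode("utf-8"),
            })
        except ValueError:
            # Bad base64, bad UTF-8, or a line that does not match the assumed format.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

With records shaped like this, the SQL schema can expose `timestamp`, `level`, and `message` as separate columns instead of one single column.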
So Google provides a dedicated application for data preprocessing, whereas AWS provides very limited built-in preprocessing capability. On AWS, it is the customer's responsibility to create a schema based on the requirements.
2. Social graphs, API, and visualization tools:
Google - Data Studio:
Data Studio is the tool provided by Google for visualizing BigQuery data. We can use the tool to see trends in the data and make business decisions. To visualize the data, we connect Data Studio to BigQuery and select the data source. Once the data is loaded into Data Studio, we can select the type of chart.
AWS - QuickSight:
QuickSight is the tool provided by Amazon for easy visualization of the data and to get insights from the data, anytime, on any device. QuickSight can take data from different sources like MS Excel, CSV or any of the SaaS applications. QuickSight is a smart Business Intelligence solution for any kind of analytics.
3. Data Analytics software tools
Google - Google Cloud BI solution:
Google Cloud BI solution is the data analytics offering provided by Google. We can load data from any source into BigQuery, which then handles data processing and cataloging. We can then use tools like BigQuery (SQL interface), Data Studio, and Google Sheets to perform ad-hoc analysis, advanced analytics, visualization, and reporting.
AWS- QuickSight:
QuickSight is the BI tool provided by Amazon which helps users or organizations to build visualizations, perform ad-hoc analysis, and quickly get business insights from their data, anytime, on any device. We can upload data from different data sources like Excel, CSV, or any of the SaaS applications. QuickSight helps organizations scale their business analytics capabilities to a huge number of users by using a robust in-memory engine (SPICE, a Super-fast, Parallel, In-memory Calculation Engine). SPICE supports a rich set of calculations to help derive valuable insights from data. Data in SPICE is persisted.
4. Machine learning:
Google- TensorFlow:
TensorFlow is an end-to-end, open-source platform for machine learning developed by Google. It has a flexible ecosystem of tools, libraries, and community resources that lets researchers push the state of the art in ML and lets developers easily build and deploy ML-powered applications. TensorFlow makes machine learning easier by letting users train models with the high-level Keras API. TensorFlow Extended (TFX) can be used for a full production ML pipeline.
AWS: Amazon SageMaker:
Amazon SageMaker is an end-to-end machine learning platform that enables users to quickly build, train, and host machine learning models at any scale of data. SageMaker has three components: Authoring, Model Training, and Model Hosting. To build a machine learning solution, we first create a Jupyter notebook (which comes with all the functionality built in), then use the various algorithms (supervised or unsupervised) to train the model, and then host the model behind HTTPS endpoints to get real-time inferences.
5. Data governance and security:
Google:
Data governance and security are the core of any storage provider.
Google has various processes to provide data security.
1. Data Encryption: Data is encrypted both in transit and
at rest, so only authorized users are able
to view the stored data.
2. Cloud Key Management Service (KMS): Google manages cryptographic
keys and rotates them regularly.
3. Cloud Identity and Access Management (IAM): It is the tool
provided by Google which helps administrators to authorize access
to specific resources, giving them full control and visibility to manage
cloud resources centrally.
4. Data Backup: Data is also automatically replicated and encrypted
for backup and disaster recovery.
5. Data Deletion: When data is ready to be deleted, it is first
marked as "scheduled for deletion," and then it is removed in
accordance with service-specific policies.
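To illustrate the IAM point from the list above, here is a small sketch that models an IAM policy as a list of role bindings, following the general shape of a Google Cloud IAM policy (`bindings`, each with a `role` and a list of `members`). The helper functions `has_role` and `grant` and the example member names are hypothetical; this is a local model for illustration, not a call into the real IAM API.

```python
def has_role(policy: dict, member: str, role: str) -> bool:
    """Return True if the member is bound to the role in this policy."""
    return any(
        binding["role"] == role and member in binding["members"]
        for binding in policy.get("bindings", [])
    )


def grant(policy: dict, member: str, role: str) -> dict:
    """Add the member to the binding for the role, creating the binding if needed."""
    for binding in policy.setdefault("bindings", []):
        if binding["role"] == role:
            if member not in binding["members"]:
                binding["members"].append(member)
            return policy
    policy["bindings"].append({"role": role, "members": [member]})
    return policy


# Example: an administrator authorizes read access to BigQuery data.
policy = {"bindings": []}
grant(policy, "user:analyst@example.com", "roles/bigquery.dataViewer")
print(has_role(policy, "user:analyst@example.com", "roles/bigquery.dataViewer"))
print(has_role(policy, "user:intern@example.com", "roles/bigquery.dataViewer"))
```

The vendor enforces policies like this one, while deciding who gets which role remains the customer's responsibility.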
AWS-
AWS uses a de-identified data lake (DIDL) to provide data security in the cloud. The DIDL architecture approach helps to provide data privacy by de-identifying and protecting sensitive information while it is in transit. DIDL solutions help enterprises get to the root cause of the risk associated with their data architectures and protect PII. A DIDL on AWS can help to discover, identify, catalog, monitor, and protect your data. It removes personally identifiable information before it enters your data lake.
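As a rough illustration of de-identification before ingestion, the sketch below replaces email addresses with keyed pseudonyms using only the Python standard library. It is not the actual DIDL implementation. The regular expression, the `email_` token prefix, and the secret key handling are assumptions made for this example, and a real pipeline would need to cover many more PII types.

```python
import hashlib
import hmac
import re

# Simplified email pattern (an assumption; real PII detection is much broader).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonymize(value: str, key: bytes) -> str:
    """Map a value to a stable token using HMAC-SHA256 with a secret key."""
    digest = hmac.new(key, value.lower().encode("utf-8"), hashlib.sha256).hexdigest()
    return "email_" + digest[:16]


def deidentify(text: str, key: bytes) -> str:
    """Replace every email address in the text before it is stored."""
    return EMAIL_RE.sub(lambda match: pseudonymize(match.group(0), key), text)


record = "Order 1234 placed by alice@example.com"
print(deidentify(record, key=b"hypothetical-secret-key"))
```

Because the token is derived with a keyed hash, the same address always maps to the same token, so records can still be joined for analytics without exposing the original value.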
Please let me know if you need any further information. I will be more than happy to help you.