Google Cloud Tools for Big Data Projects
Among the many cloud platforms on the market, Google Cloud deserves special mention.
Google Cloud Platform is a suite of services for analysing and managing data in large projects and applications.
We will look at what these services are, consider their advantages and disadvantages, and see what alternatives are available.
Google Cloud Pub/Sub
Cloud Pub/Sub is an asynchronous messaging service that helps to send, receive and filter events or data streams. It offers durable message storage, scalable in-order message delivery, and consistent high availability and performance at any scale, from zero to millions of messages per second delivered globally. With Pub/Sub, data producers do not need to change anything when the consumers of their data change, because Pub/Sub takes care of the distribution. This allows more of the services to be entirely stateless. Pub/Sub is set up between services or applications by defining topics and subscriptions: a publisher sends messages to a topic, and every service subscribed to that topic receives them. This makes one-to-many communication much simpler. For example, the service can spread batch image analysis over multiple workers, or send logs from a security system to archiving, processing and analytics services. A steady stream of data can be sent through Pub/Sub into BigQuery or Dataflow for intelligent processing. Pub/Sub is also convenient for notifications that inform the right teams when a system or service goes down. It is a great way to connect multiple services, applications and data sources.
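To make the topic and subscription model more concrete, here is a minimal in-memory sketch in Python. It is only an illustration of the one-to-many idea and is not the Pub/Sub client library: the `InMemoryBroker` class, its methods and the topic and subscription names are all hypothetical.

```python
from collections import defaultdict, deque


class InMemoryBroker:
    """Toy stand-in for a publish/subscribe service (not the Pub/Sub API)."""

    def __init__(self):
        # topic name -> {subscription name -> queue of pending messages}
        self._topics = defaultdict(dict)

    def create_subscription(self, topic, subscription):
        self._topics[topic][subscription] = deque()

    def publish(self, topic, message):
        # The publisher does not know who consumes the data: every
        # subscription attached to the topic gets its own copy.
        for queue in self._topics[topic].values():
            queue.append(message)

    def pull(self, topic, subscription):
        # Drain and return the messages waiting on one subscription, in order.
        queue = self._topics[topic][subscription]
        messages = list(queue)
        queue.clear()
        return messages


broker = InMemoryBroker()
broker.create_subscription("security-logs", "archive")
broker.create_subscription("security-logs", "analytics")
broker.publish("security-logs", b"door opened")
broker.publish("security-logs", b"door closed")

archived = broker.pull("security-logs", "archive")
analysed = broker.pull("security-logs", "analytics")
```

Both subscriptions receive the same two messages independently, so a new consumer can be added later without touching the publishing code.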
The downside of Pub/Sub is that a single message or publish request cannot exceed 10 MB, and each attribute value is limited to 1,024 bytes.
Alternatives to Pub/Sub include Amazon SQS, Azure Service Bus, Apache Kafka, RabbitMQ and other open-source message brokers.
Google Cloud IoT Core
Cloud IoT Core is a service for creating and managing a device registry. It lets users connect devices to Google Cloud Platform and exchange messages with them. IoT Core uses Pub/Sub to receive messages from devices. Its advantage is that it provides secure device connections and lets you manage the entire IoT data network. IoT Core accepts both the MQTT and HTTP/HTTPS protocols. It scales well and has no limits in this respect. It provides end-to-end security and role-level access control, and allows pushing updates to your devices.
Alternatives to Google Cloud IoT Core include AWS IoT Core, Azure IoT, Resin.io and AWS Greengrass.
Google Cloud Dataproc
Dataproc is a managed service for running open-source software (OSS) jobs that support big data processing, including ETL and machine learning. It provides excellent support for the most popular open-source software. Dataproc makes it possible to move on-premises OSS clusters to the cloud to maximise efficiency and scale. It can also be used with Cloud AI Notebook or BigQuery to build an end-to-end data science environment. Dataproc can spin up an IT-governed, auto-scaling cluster in 90 seconds, and it manages cluster creation, monitoring and job coordination. In short, Dataproc simplifies data and analytics processing.
The downsides of Dataproc are the inability to choose a particular version of the framework, the lack of a pause/stop option for a Dataproc cluster and the inability to choose a cluster manager.
Alternatives to Google Cloud Dataproc include Amazon EMR, Azure HDInsight or a self-managed cluster set up on virtual machines.
Google Cloud Dataflow
Dataflow is a serverless, fast and cost-effective data processing service for stream and batch data. It removes operational overhead by automating infrastructure provisioning and by auto-scaling as your data grows. Dataflow is easy to use: all it takes is to read the data from a source, transform it and write it to a sink. It provides portability because a processing pipeline is created with the open-source Apache Beam library in the language of your choice and then submitted as a Dataflow job, which executes the processing on worker virtual machines. Dataflow jobs can be created and run with the Cloud Console UI, the gcloud CLI or the APIs. You can use pre-built or custom templates, write SQL statements to develop pipelines straight from the BigQuery UI, or use AI Platform notebooks. The data is encrypted at rest and in transit, with an option to use customer-managed encryption keys. Additionally, private IPs and VPC Service Controls can be used to secure the environment. Google Cloud Dataflow is a great service for real-time AI, data warehousing or stream analytics.
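The read, transform and write pattern can be sketched in plain Python. This is not Apache Beam or Dataflow code, just a small stand-in built on generators and the standard library. The function names, the record fields (`user`, `amount`) and the threshold are assumptions made for the example.

```python
import io
import json


def read_lines(source):
    # Source step: yield one raw, non-empty record per line.
    for line in source:
        line = line.strip()
        if line:
            yield line


def parse_and_filter(lines, min_amount):
    # Transform step: parse JSON records and keep only the large orders.
    for line in lines:
        record = json.loads(line)
        if record["amount"] >= min_amount:
            yield {"user": record["user"], "amount": record["amount"]}


def write_records(records, sink):
    # Sink step: write one JSON document per line and return the count.
    count = 0
    for record in records:
        sink.write(json.dumps(record, sort_keys=True) + "\n")
        count += 1
    return count


# In-memory text buffers play the role of the source and the sink here.
source = io.StringIO(
    '{"user": "ann", "amount": 120}\n'
    '{"user": "bob", "amount": 15}\n'
    '{"user": "cy", "amount": 300}\n'
)
sink = io.StringIO()
written = write_records(parse_and_filter(read_lines(source), min_amount=100), sink)
```

Because each step only consumes and produces records, the same chain works whether the source is a finite file (batch) or an open-ended feed (stream), which is the idea a Beam pipeline builds on.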
The downside of Dataflow is that it is subject to the same limitations as Apache Beam. In addition, a single element value in Streaming Engine cannot exceed 100 MB.
An alternative to Dataflow is to set up a cluster on virtual machines and run Apache Beam with one of its built-in runners.
Google Cloud Dataprep
Dataprep is an intelligent data service for visually exploring, cleaning and preparing both structured and unstructured data for analysis. It is a browser-based service and works at any scale with no infrastructure to deploy or manage. Cloud Dataprep allows preparing big data for training machine learning models, for visual analytics or for powerful interactive queries that unlock insights from the data. It simplifies the building of ETL pipelines, has a clear web interface and a built-in scheduler, and uses Google Dataflow for ETL jobs.
The downside of Dataprep is that it only works with BigQuery and Google Cloud Storage (GCS).
Google Cloud Composer
Cloud Composer is a managed Apache Airflow service that helps to create, schedule, monitor and manage workflows. Composer workflows can connect data processing and services in Google Cloud, other public clouds and on-premises environments. Because it is built upon open-source Apache Airflow, users are free from lock-in and benefit from the community's active and rich library of connectors and plugins. Composer is easy to interact with: it can be used through the Cloud Console, the gcloud command-line tool or the API with client libraries of your choice. Composer is great for organising ETL pipelines that process data in your data warehouse and data lake. In addition, this service helps to automate analytics and machine learning jobs.
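At the heart of any workflow orchestrator is a graph of tasks in which each task runs only after the tasks it depends on. The sketch below shows that idea with Python's standard `graphlib` module (Python 3.9 or later). It is not Airflow code, and the task names and the `run_workflow` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# task -> set of tasks that must finish first (hypothetical task names)
dependencies = {
    "extract_orders": set(),
    "extract_users": set(),
    "join_tables": {"extract_orders", "extract_users"},
    "load_warehouse": {"join_tables"},
    "train_model": {"load_warehouse"},
}


def run_workflow(dependencies, run_task):
    """Run every task after its upstream tasks and return the order used."""
    order = []
    for task in TopologicalSorter(dependencies).static_order():
        run_task(task)
        order.append(task)
    return order


# A no-op callback stands in for the real work of each task.
executed = run_workflow(dependencies, run_task=lambda name: None)
```

A real orchestrator adds scheduling, retries and monitoring on top of this ordering, and it can run independent tasks such as the two extract steps in parallel.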
Compared to Dataproc, Composer is a fuller service and has all the advantages of Apache Airflow.
The downside of Cloud Composer is that it exposes the Airflow web UI on a public IP address and is subject to the constraints of Apache Airflow.
An alternative to Cloud Composer is a custom-deployed Apache Airflow or another open-source orchestration solution.
Google BigQuery
BigQuery is Google Cloud's fully managed enterprise data warehouse designed to help users ingest, store, analyse and visualise big data with little effort. Data can be ingested into BigQuery either through batch uploads or by streaming it directly to deliver real-time insights. BigQuery uses SQL and can be integrated with all Google Cloud Platform services. It is a great tool for interactive querying and offline analytics, has huge capacity and built-in ML. In addition, BigQuery is serverless and allows sharing datasets between different projects.
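To give a feel for the kind of SQL used in analytical querying, here is a small aggregate query run against an in-memory SQLite database from Python's standard library. SQLite is only a local stand-in: BigQuery has its own SQL dialect and client libraries, and the `page_views` table and its rows are invented for the example.

```python
import sqlite3

# Local, in-memory stand-in for a warehouse table (not BigQuery itself).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (country TEXT, page TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?, ?)",
    [
        ("UA", "/home", 120),
        ("UA", "/pricing", 30),
        ("DE", "/home", 80),
        ("DE", "/blog", 45),
        ("US", "/home", 200),
    ],
)

# Total views per country, largest first.
query = """
    SELECT country, SUM(views) AS total_views
    FROM page_views
    GROUP BY country
    ORDER BY total_views DESC
"""
totals = conn.execute(query).fetchall()
conn.close()
```

The query shape (group, aggregate, order) is the same whether the table holds five rows or billions. What a managed warehouse adds is the capacity to run it at that scale.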
The downside of BigQuery is that it cannot be used outside Google Cloud Platform and has a limit of 1 MB per row.
Alternatives to BigQuery include Azure Cosmos DB and Amazon Redshift.
Google Cloud Bigtable
Cloud Bigtable is a fully managed, scalable NoSQL database for large analytical and operational workloads. It is designed as a sparsely populated table that can scale to billions of rows and thousands of columns, making it possible to store petabytes of data. Bigtable is a key-value store that supports high read and write throughput at low latency, handling millions of requests per second. It is an ideal data source for MapReduce-style operations and integrates easily with existing big data tools, such as Hadoop, Dataflow and Dataproc. It also supports the open-source HBase API standard.
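The idea of a sparsely populated table addressed by row key can be modelled in a few lines of Python. This is a toy model for illustration, not the Bigtable or HBase API. The `SparseTable` class, the `sensor#date` row key layout and the column names are assumptions.

```python
import bisect


class SparseTable:
    """Toy model of a sparse, sorted key-value table (not the Bigtable API)."""

    def __init__(self):
        self._keys = []  # row keys kept in sorted order
        self._rows = {}  # row key -> {column name -> value}

    def write(self, row_key, column, value):
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        # Only the cells that are actually written take up space.
        self._rows[row_key][column] = value

    def read(self, row_key):
        return dict(self._rows.get(row_key, {}))

    def scan_prefix(self, prefix):
        # Sorted keys turn a prefix scan into one contiguous range.
        start = bisect.bisect_left(self._keys, prefix)
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            yield key, dict(self._rows[key])


table = SparseTable()
table.write("sensor42#2021-03-01", "metrics:temp", 21.5)
table.write("sensor42#2021-03-02", "metrics:temp", 22.1)
table.write("sensor42#2021-03-02", "metrics:humidity", 40)
table.write("sensor07#2021-03-01", "meta:location", "Kyiv")

sensor42_rows = list(table.scan_prefix("sensor42#"))
```

Rows hold different sets of columns, and related rows sit next to each other because of how the key is composed, which is why row key design matters so much in this kind of store.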
The downsides of Bigtable are that it performs poorly on 300 GB of data or less, is not suitable for real-time use, does not support ACID operations and needs at least three nodes in the cluster.
An alternative to Bigtable is a custom-deployed Apache HBase.
Google Cloud Storage
Google Cloud Storage is a global, secure and scalable object or blob store for immutable, unstructured data, such as images, videos and documents. When building an app, objects are stored in buckets, which are associated with a project. The service allows uploading and downloading objects to and from a bucket, using the console or the gsutil tool. The data at rest is encrypted by default, but users have the option to secure it with their own encryption keys. The service allows granting permissions to specific members of a team or making objects fully public for mobile and web applications. Cloud Storage offers several storage classes for buckets, depending on your budget, availability and access frequency. It is well suited to hosting static websites, streaming and storing documents, backups and archives. It is also a great data lake choice for Big Data and ML.
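Tools such as gsutil address an object by a `gs://` URI that combines the bucket name and the object name. A small helper for splitting such a URI might look like the sketch below. The `parse_gs_uri` function and the example bucket and object names are hypothetical, and the helper does no validation beyond the basic shape.

```python
def parse_gs_uri(uri):
    """Split 'gs://bucket/object/name' into (bucket, object_name)."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a gs:// URI: {uri!r}")
    # Everything up to the first slash is the bucket, the rest is the object.
    bucket, _, object_name = uri[len(prefix):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name: {uri!r}")
    return bucket, object_name


bucket, object_name = parse_gs_uri("gs://example-bucket/backups/2021/db.sql.gz")
```

Note that the slashes in the object name are simply part of the name: a bucket is a flat namespace, and "folders" are a naming convention rather than real directories.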
The downside of Cloud Storage is that it supports a limited choice of programming languages and has regional restrictions on where data is stored.
Alternatives to Cloud Storage include Amazon S3 and Azure Blob Storage.
As you can see, the variety of cloud tools for big data projects is immense. A basic knowledge and understanding of Google Cloud services will help you choose the right solution for your business. If in doubt, contact us and one of our Magnise experts will guide you to the most suitable service for your project.