
Data Engineering on Google Cloud

  • Length 4 days

Why study this course

This four-day instructor-led course gives participants a hands-on introduction to designing and building data processing systems on Google Cloud. Through a combination of presentations, demonstrations, and hands-on labs, participants will learn how to design data processing systems, build end-to-end data pipelines, analyse data, and carry out machine learning. The course covers structured, unstructured, and streaming data.



What you’ll learn

This course teaches participants the following skills:

  • Design and build data processing systems on Google Cloud

  • Process batch and streaming data by implementing autoscaling data pipelines on Cloud Dataflow

  • Derive business insights from extremely large datasets using Google BigQuery

  • Leverage unstructured data using Spark and ML APIs on Dataproc

  • Enable instant insights from streaming data

  • Understand ML APIs and BigQuery ML, and learn to use AutoML to create powerful models without coding



Google Cloud at Lumify Work

Lumify Work is Australia's only national Google Cloud Authorised Training Partner. Get the skills needed to build, test, and deploy applications on this highly scalable infrastructure, which is engineered to handle the most data-intensive work you can throw at it. Lumify Work can support you through training wherever you are in your cloud adoption journey.


Who is the course for?

This course is intended for experienced developers who are responsible for managing big data transformations, including:

  • Extracting, loading, transforming, cleaning, and validating data

  • Designing pipelines and architectures for data processing

  • Creating and maintaining machine learning and statistical models

  • Querying datasets, visualising query results, and creating reports


Course subjects

Module 1: Introduction to Data Engineering

  • Explore the role of a data engineer

  • Analyse data engineering challenges

  • Introduction to BigQuery

  • Data lakes and data warehouses

  • Transactional databases versus data warehouses

  • Partner effectively with other data teams

  • Manage data access and governance

  • Build production-ready pipelines

  • Review Google Cloud customer case study

  • Lab: Using BigQuery to do Analysis

Module 2: Building a Data Lake

  • Introduction to data lakes

  • Data storage and ETL options on Google Cloud

  • Building a data lake using Cloud Storage

  • Securing Cloud Storage

  • Storing all sorts of data types

  • Cloud SQL as a relational data lake

  • Lab: Loading Taxi Data into Cloud SQL

Module 3: Building a Data Warehouse

  • The modern data warehouse

  • Introduction to BigQuery

  • Getting started with BigQuery

  • Loading data

  • Exploring schemas

  • Schema design

  • Nested and repeated fields

  • Optimising with partitioning and clustering

  • Lab: Loading Data into BigQuery

  • Lab: Working with JSON and Array Data in BigQuery

Module 4: Introduction to Building Batch Data Pipelines

  • EL, ELT, ETL

  • Quality considerations

  • How to carry out operations in BigQuery

  • Shortcomings

  • ETL to solve data quality issues

Module 5: Executing Spark on Dataproc

  • The Hadoop ecosystem

  • Run Hadoop on Dataproc

  • Cloud Storage instead of HDFS

  • Optimise Dataproc

  • Lab: Running Apache Spark jobs on Dataproc

Module 6: Serverless Data Processing with Dataflow

  • Introduction to Dataflow

  • Why customers value Dataflow

  • Dataflow pipelines

  • Aggregating with GroupByKey and Combine

  • Side inputs and windows

  • Dataflow templates

  • Dataflow SQL

  • Lab: A Simple Dataflow Pipeline (Python/Java)

  • Lab: MapReduce in Dataflow (Python/Java)

  • Lab: Side inputs (Python/Java)

Module 7: Manage Data Pipelines with Cloud Data Fusion and Cloud Composer

  • Building batch data pipelines visually with Cloud Data Fusion

  • Components

  • UI overview

  • Building a pipeline

  • Exploring data using Wrangler

  • Orchestrating work between Google Cloud services with Cloud Composer

  • Apache Airflow environment

  • DAGs and operators

  • Workflow scheduling

  • Monitoring and logging

  • Lab: Building and Executing a Pipeline Graph in Data Fusion

  • Optional Lab: An introduction to Cloud Composer

Module 8: Introduction to Processing Streaming Data

  • Process Streaming Data

Module 9: Serverless Messaging with Pub/Sub

  • Introduction to Pub/Sub

  • Pub/Sub push versus pull

  • Publishing with Pub/Sub code

  • Lab: Publish Streaming Data into Pub/Sub

Module 10: Dataflow Streaming Features

  • Streaming data challenges

  • Dataflow windowing

  • Lab: Streaming Data Pipelines

Module 11: High-Throughput BigQuery and Bigtable Streaming Features

  • Streaming into BigQuery and visualising results

  • High-throughput streaming with Cloud Bigtable

  • Optimising Cloud Bigtable performance

  • Lab: Streaming Analytics and Dashboards

  • Lab: Streaming Data Pipelines into Bigtable

Module 12: Advanced BigQuery Functionality and Performance

  • Analytic window functions

  • Use WITH clauses

  • GIS functions

  • Performance considerations

  • Lab: Optimising your BigQuery Queries for Performance

  • Optional Lab: Partitioned Tables in BigQuery

Module 13: Introduction to Analytics and AI

  • What is AI?

  • From ad-hoc data analysis to data-driven decisions

  • Options for ML models on Google Cloud

Module 14: Prebuilt ML Model APIs for Unstructured Data

  • Unstructured data is hard

  • ML APIs for enriching data

  • Lab: Using the Natural Language API to Classify Unstructured Text

Module 15: Big Data Analytics with Notebooks

  • What’s a notebook?

  • BigQuery magic and ties to Pandas

  • Lab: BigQuery in Jupyter Labs on AI Platform

Module 16: Production ML Pipelines

  • Ways to do ML on Google Cloud

  • Vertex AI Pipelines

  • AI Hub

  • Lab: Running Pipelines on Vertex AI

Module 17: Custom Model Building with SQL in BigQuery ML

  • BigQuery ML for quick model building

  • Supported models

  • Lab option 1: Predict Bike Trip Duration with a Regression Model in BigQuery ML

  • Lab option 2: Movie Recommendations in BigQuery ML

Module 18: Custom Model Building with AutoML

  • Why AutoML?

  • AutoML Vision

  • AutoML NLP

  • AutoML Tables


Prerequisites

To get the most out of this course, participants should have:

  • Completed the Google Cloud Big Data and Machine Learning Fundamentals course, or have equivalent experience

  • Basic proficiency with a common query language such as SQL

  • Experience with data modelling and ETL (extract, transform, load) activities

  • Experience developing applications in a common programming language such as Python

  • Familiarity with machine learning and/or statistics


Terms & Conditions

The supply of this course by Lumify Work is governed by the booking terms and conditions. Please read the terms and conditions carefully before enrolling in this course, as enrolment in the course is conditional on acceptance of these terms and conditions.



Personalise your schedule with Lumify USchedule

Interested in a course that we have not yet scheduled? Get in touch, and ask for your preferred date and time. We can work together to make it happen.


