Data Engineering on Google Cloud

Length 4 days

Course overview

Why study this course

This four-day instructor-led course gives participants a hands-on introduction to designing and building data processing systems on Google Cloud. Through a combination of presentations, demonstrations, and hands-on labs, participants will learn how to design data processing systems, build end-to-end data pipelines, analyse data, and carry out machine learning. The course covers structured, unstructured, and streaming data.

Aligns to certification

Google Cloud Certified: Professional Data Engineer

What you’ll learn

This course teaches participants the following skills:

Design and build data processing systems on Google Cloud
Process batch and streaming data by implementing autoscaling data pipelines on Cloud Dataflow
Derive business insights from extremely large datasets using Google BigQuery
Leverage unstructured data using Spark and ML APIs on Dataproc
Enable instant insights from streaming data
Understand ML APIs and BigQuery ML, and learn to use AutoML to create powerful models without coding

Google Cloud at Lumify Work

Lumify Work is Australia's only national Google Cloud Authorised Training Partner. Get the skills needed to build, test, and deploy applications on this highly scalable infrastructure. Google Cloud is engineered to handle the most data-intensive work you can throw at it, and Lumify Work can support you through training wherever you are in your cloud adoption journey.

Who is the course for?

This course is intended for experienced developers who are responsible for managing big data transformations, including:

Extracting, loading, transforming, cleaning, and validating data
Designing pipelines and architectures for data processing
Creating and maintaining machine learning and statistical models
Querying datasets, visualising query results, and creating reports

Course subjects

Module 1: Introduction to Data Engineering

Explore the role of a data engineer
Analyse data engineering challenges
Introduction to BigQuery
Data lakes and data warehouses
Transactional databases versus data warehouses
Partner effectively with other data teams
Manage data access and governance
Build production-ready pipelines
Review Google Cloud customer case study
Lab: Using BigQuery to do Analysis

Module 2: Building a Data Lake

Introduction to data lakes
Data storage and ETL options on Google Cloud
Building a data lake using Cloud Storage
Securing Cloud Storage
Storing all sorts of data types
Cloud SQL as a relational data lake
Lab: Loading Taxi Data into Cloud SQL

Module 3: Building a Data Warehouse

The modern data warehouse
Introduction to BigQuery
Getting started with BigQuery
Loading data
Exploring schemas
Schema design
Nested and repeated fields
Optimising with partitioning and clustering
Lab: Loading Data into BigQuery
Lab: Working with JSON and Array Data in BigQuery

Module 4: Introduction to Building Batch Data Pipelines

EL, ELT, ETL
Quality considerations
How to carry out operations in BigQuery
Shortcomings
ETL to solve data quality issues

Module 5: Executing Spark on Dataproc

The Hadoop ecosystem
Run Hadoop on Dataproc
Cloud Storage instead of HDFS
Optimise Dataproc
Lab: Running Apache Spark jobs on Dataproc

Module 6: Serverless Data Processing with Dataflow

Introduction to Dataflow
Why customers value Dataflow
Dataflow pipelines
Aggregating with GroupByKey and Combine
Side inputs and windows
Dataflow templates
Dataflow SQL
Lab: A Simple Dataflow Pipeline (Python/Java)
Lab: MapReduce in Dataflow (Python/Java)
Lab: Side inputs (Python/Java)

Module 7: Manage Data Pipelines with Cloud Data Fusion and Cloud Composer

Building batch data pipelines visually with Cloud Data Fusion
Components
UI overview
Building a pipeline
Exploring data using Wrangler
Orchestrating work between Google Cloud services with Cloud Composer
Apache Airflow environment
DAGs and operators
Workflow scheduling
Monitoring and logging
Lab: Building and Executing a Pipeline Graph in Data Fusion
Optional Lab: An introduction to Cloud Composer

Module 8: Introduction to Processing Streaming Data

Process Streaming Data

Module 9: Serverless Messaging with Pub/Sub

Introduction to Pub/Sub
Pub/Sub push versus pull
Publishing with Pub/Sub code
Lab: Publish Streaming Data into Pub/Sub

Module 10: Dataflow Streaming Features

Streaming data challenges
Dataflow windowing
Lab: Streaming Data Pipelines

Module 11: High-Throughput BigQuery and Bigtable Streaming Features

Streaming into BigQuery and visualising results
High-throughput streaming with Cloud Bigtable
Optimising Cloud Bigtable performance
Lab: Streaming Analytics and Dashboards
Lab: Streaming Data Pipelines into Bigtable

Module 12: Advanced BigQuery Functionality and Performance

Analytic window functions
Use WITH clauses
GIS functions
Performance considerations
Lab: Optimising your BigQuery Queries for Performance
Optional Lab: Partitioned Tables in BigQuery

Module 13: Introduction to Analytics and AI

What is AI?
From ad-hoc data analysis to data-driven decisions
Options for ML models on Google Cloud

Module 14: Prebuilt ML Model APIs for Unstructured Data

Unstructured data is hard
ML APIs for enriching data
Lab: Using the Natural Language API to Classify Unstructured Text

Module 15: Big Data Analytics with Notebooks

What’s a notebook?
BigQuery magic and ties to Pandas
Lab: BigQuery in Jupyter Labs on AI Platform

Module 16: Production ML Pipelines

Ways to do ML on Google Cloud
Vertex AI Pipelines
AI Hub
Lab: Running Pipelines on Vertex AI

Module 17: Custom Model Building with SQL in BigQuery ML

BigQuery ML for quick model building
Supported models
Lab option 1: Predict Bike Trip Duration with a Regression Model in BigQuery ML
Lab option 2: Movie Recommendations in BigQuery ML

Module 18: Custom Model Building with AutoML

Why AutoML?
AutoML Vision
AutoML NLP
AutoML Tables

Prerequisites

To get the most out of this course, participants should have:

Completed the Google Cloud Big Data and Machine Learning Fundamentals course, or have equivalent experience
Basic proficiency with a common query language such as SQL
Experience with data modelling and ETL (extract, transform, load) activities
Experience with developing applications using a common programming language such as Python
Familiarity with Machine Learning and/or statistics

Terms & Conditions

The supply of this course by Lumify Work is governed by the booking terms and conditions. Please read the terms and conditions carefully before enrolling in this course, as enrolment in the course is conditional on acceptance of these terms and conditions.