
Note: Most of the content in this project is based on Udacity’s Data Engineering Nanodegree Program. If you want to learn more about databases, ETL on the cloud, and data pipelines, consider enrolling in it.

Note: Code for this project is available at my personal repo.

Requirements

Introduction

What is this project about?

Imagine you have just been hired as a data engineer at a new music streaming startup called Sparkify. The company is constantly growing, and now they want you to move their processes and data onto the cloud. As their data engineer, you are tasked with building an ETL pipeline that:

  • Extracts their data from an S3 bucket.
  • Stages the data in Redshift.
  • Transforms data into a set of dimensional tables using SQL statements.

This will help their analytics team continue finding insights into what songs their users are listening to.

So, in this project you will build an ETL pipeline for a database hosted in Redshift.

Project Datasets

The data we are going to use resides in an Amazon S3 bucket. Here are the links for each:

  • Song data: s3://udacity-dend/song_data
  • Log data: s3://udacity-dend/log_data

Log data json path: s3://udacity-dend/log_json_path.json

Alternatively, you can find all this data in the data folder.

Song Dataset

The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song.

These files are partitioned by the first three letters of each song’s track ID. For example, here are filepaths to two files in this dataset.

song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json

And below is an example of what a single song file, TRAABJL12903CDCF1A.json, looks like.

{
    "num_songs": 1,
    "artist_id": "ARJIE2Y1187B994AB7",
    "artist_latitude": null,
    "artist_longitude": null,
    "artist_location": "",
    "artist_name": "Line Renaud",
    "song_id": "SOUPIRU12A6D4FA1E1",
    "title": "Der Kleine Dompfaff",
    "duration": 152.92036,
    "year": 0
}

Log Dataset

The second dataset consists of log files in JSON format generated by this event simulator based on the songs in the dataset above.

The log files in the dataset you’ll be working with are partitioned by year and month. For example, here are filepaths to two files in this dataset.

log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json

And below is an example of what the data in a log file, 2018-11-12-events.json, looks like.

(Image: a preview of the log data in 2018-11-12-events.json)

Project Steps

Decide on a schema for Song Play Analysis

In order to simplify queries and enable fast aggregations, we are going to build a Star Schema from the song and event datasets. The tables will consist of:

1 Fact Table

  • songplays: records in event data associated with song plays.
    • songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent

4 Dimension Tables

  • users: Data from users in the app.
    • user_id, first_name, last_name, gender, level
  • songs: Songs in the music database.
    • song_id, title, artist_id, year, duration
  • artists: Artists in the music database.
    • artist_id, name, location, latitude, longitude
  • time: Timestamps of records in songplays broken down into specific units.
    • start_time, hour, day, week, month, year, weekday
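For reference, below is a minimal sketch of what the DDL for the fact table and one dimension table could look like (the SQL lives in sql_queries.py, described in the next section). Column types, keys and constraints here are illustrative assumptions, not necessarily the exact DDL used in the repo.

# Hypothetical excerpt from sql_queries.py. Types and constraints are assumptions.
songplay_table_create = """
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id INT IDENTITY(0, 1) PRIMARY KEY,
        start_time  TIMESTAMP NOT NULL,
        user_id     INT NOT NULL,
        level       VARCHAR,
        song_id     VARCHAR,
        artist_id   VARCHAR,
        session_id  INT,
        location    VARCHAR,
        user_agent  VARCHAR
    );
"""

user_table_create = """
    CREATE TABLE IF NOT EXISTS users (
        user_id    INT PRIMARY KEY,
        first_name VARCHAR,
        last_name  VARCHAR,
        gender     CHAR(1),
        level      VARCHAR
    );
"""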

Check scripts

  • Support files:
    • sql_queries.py defines the SQL statements used in the project, which will be imported into the script files.
    • dwh.cfg stores all the information required to connect to S3 and Amazon Redshift.
  • Scripts (In order of intended execution):
    1. create_tables.py will create empty staging, fact and dimension tables in Redshift.
    2. etl.py will load data from S3 into staging tables on Redshift and then process that data into your analytics tables on Redshift.
    3. analytics.py will help us verify that the data was successfully integrated into the cluster.
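All three scripts presumably share the same boilerplate: read dwh.cfg with configparser and open a connection to the cluster with psycopg2. A minimal sketch of that pattern (section and key names follow the dwh.cfg example shown later in this post):

import configparser
import psycopg2

# Read cluster credentials from dwh.cfg.
config = configparser.ConfigParser()
config.read("dwh.cfg")

# Build the connection string from the [CLUSTER] section, whose keys are
# ordered as host, dbname, user, password, port.
conn = psycopg2.connect(
    "host={} dbname={} user={} password={} port={}".format(*config["CLUSTER"].values())
)
cur = conn.cursor()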

Create a Redshift cluster

Create a Security Group

Here, you’ll create a security group you will later use to authorize access to your Redshift cluster.

  1. Go to your Amazon EC2 console and under Network and Security in the left navigation pane, select Security Groups.
  2. Choose the Create Security Group button.
  3. Enter redshift_security_group for Security group name.
  4. Enter “authorize redshift cluster access” for Description.
  5. Select the Inbound tab under Security group rules.
  6. Click on Add Rule and enter the following values:
    • Type: Custom TCP Rule.
    • Protocol: TCP.
    • Port Range: 5439 (the default port for Amazon Redshift).
    • Source: select Custom IP, then type 0.0.0.0/0 or use your Current IP.
    • Important: Using 0.0.0.0/0 is not recommended for anything other than demonstration purposes because it allows access from any computer on the internet. In a real environment, you would create inbound rules based on your own network settings.
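If you prefer scripting over clicking through the console, the same security group can be created with boto3. This is only a sketch mirroring the values above; adjust the region (and ideally the CIDR) to your own setup:

import boto3

# Create the security group in the default VPC of the chosen region.
ec2 = boto3.client("ec2", region_name="us-west-2")
response = ec2.create_security_group(
    GroupName="redshift_security_group",
    Description="authorize redshift cluster access",
)
sg_id = response["GroupId"]

# Inbound rule: TCP 5439 from anywhere (demo only; restrict the CIDR in real environments).
ec2.authorize_security_group_ingress(
    GroupId=sg_id,
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 5439,
        "ToPort": 5439,
        "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
    }],
)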

Create an IAM Role

For the cluster to access the S3 bucket, we need to grant it the AmazonS3ReadOnlyAccess policy. This can be achieved by creating a new IAM role called myRedshiftRole with that policy. Follow instructions in Creating an IAM role to proceed.
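Again, this can be scripted with boto3 instead of the console; a sketch (the trust policy simply allows the Redshift service to assume the role):

import json
import boto3

iam = boto3.client("iam")

# Trust policy that lets the Redshift service assume the role.
assume_role_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "redshift.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="myRedshiftRole",
    AssumeRolePolicyDocument=json.dumps(assume_role_policy),
    Description="Allows Redshift clusters to read from S3",
)

# Attach the managed read-only S3 policy and keep the ARN for later steps.
iam.attach_role_policy(
    RoleName="myRedshiftRole",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
)
role_arn = iam.get_role(RoleName="myRedshiftRole")["Role"]["Arn"]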

Start the cluster

Follow “Getting started with Amazon Redshift: Steps 1-3” in the AWS documentation to create a new cluster.

  • The cluster configuration should match the following:
    • Cluster Identifier: redshift-cluster-1
    • Type of machine: dc2.large
    • Number of compute nodes: 1
  • For database configurations:
    • Database name: dev
    • Database port: 5439
    • Master username: awsuser
    • Master user password: this_password_is_PRETTY_exposed_123
  • Make sure to add these additional configurations:
    • Cluster permissions: myRedshiftRole
    • VPC security groups: redshift_security_group
    • Publicly accessible: Yes
    • Enhanced VPC routing: Disabled
  • Leave the rest of the parameters as default.
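The same configuration can also be launched with boto3. The sketch below assumes you kept the security group ID and role ARN from the previous steps (placeholders otherwise):

import boto3

redshift = boto3.client("redshift", region_name="us-west-2")

ROLE_ARN = "arn:aws:iam::<your-account-id>:role/myRedshiftRole"  # from the IAM step
SG_ID = "sg-xxxxxxxx"                                            # redshift_security_group ID

# Single-node dc2.large cluster matching the console configuration above.
redshift.create_cluster(
    ClusterIdentifier="redshift-cluster-1",
    ClusterType="single-node",
    NodeType="dc2.large",
    DBName="dev",
    Port=5439,
    MasterUsername="awsuser",
    MasterUserPassword="this_password_is_PRETTY_exposed_123",
    IamRoles=[ROLE_ARN],
    VpcSecurityGroupIds=[SG_ID],
    PubliclyAccessible=True,
    EnhancedVpcRouting=False,
)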

At the time of writing, starting the cluster takes AWS around 5 minutes.

Beware: The cluster that you create will be live, and you will be charged the standard Amazon Redshift usage fees for the cluster until you delete it. Make sure to delete your cluster each time you’re finished working to avoid large, unexpected costs for yourself. Also, you can set up Billing Alarms to prevent this from happening.

Add redshift database and IAM role to your cfg file

Once the cluster is available, copy down its endpoint and the ARN for myRedshiftRole, and put them in dwh.cfg.

It should look as follows:

[CLUSTER]
HOST = redshift-cluster-1.cnus2ii2liz1.us-west-2.redshift.amazonaws.com <CHANGE THIS WITH YOURS!>
DB_NAME = dev
DB_USER = awsuser
DB_PASSWORD = this_password_is_PRETTY_exposed_123
DB_PORT = 5439

[IAM_ROLE]
ARN = arn:aws:iam::814165424567:role/myRedshiftRole <CHANGE THIS WITH YOURS!>

[S3]
LOG_DATA = s3://udacity-dend/log-data
LOG_JSONPATH = s3://udacity-dend/log_json_path.json
SONG_DATA = s3://udacity-dend/song_data

Create the skeleton of the Data Warehouse

In this step we will run create_tables.py, which will take around 10 seconds.

This script will create the staging tables and those needed for the star schema.
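The heart of create_tables.py is just two loops over the query lists exported by sql_queries.py. A sketch, assuming the lists are named drop_table_queries and create_table_queries (cur and conn come from the connection pattern shown earlier):

from sql_queries import create_table_queries, drop_table_queries

def drop_tables(cur, conn):
    # Drop any existing tables so the script can be re-run from a clean slate.
    for query in drop_table_queries:
        cur.execute(query)
        conn.commit()

def create_tables(cur, conn):
    # Create the staging tables plus the fact and dimension tables.
    for query in create_table_queries:
        cur.execute(query)
        conn.commit()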

Load data into the cluster

In this step we will run etl.py, which will take around 10 minutes.

First, the staging tables will be loaded with data extracted from S3. Then, the data in these staging tables will be transformed into the Star Schema.
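Conceptually, the staging load is a pair of Redshift COPY commands and the Star Schema load is a set of INSERT ... SELECT statements. Below is a hedged sketch of what those queries could look like in sql_queries.py; the column names assume the staging tables mirror the JSON fields, and the exact join and deduplication logic may differ from the repo.

import configparser

config = configparser.ConfigParser()
config.read("dwh.cfg")
LOG_DATA = config["S3"]["LOG_DATA"]
LOG_JSONPATH = config["S3"]["LOG_JSONPATH"]
SONG_DATA = config["S3"]["SONG_DATA"]
ROLE_ARN = config["IAM_ROLE"]["ARN"]

# Load raw log events from S3, using the JSONPaths file to map fields.
staging_events_copy = """
    COPY staging_events
    FROM '{}'
    IAM_ROLE '{}'
    JSON '{}'
    REGION 'us-west-2';
""".format(LOG_DATA, ROLE_ARN, LOG_JSONPATH)

# Song files are plain JSON, so 'auto' field mapping is enough.
staging_songs_copy = """
    COPY staging_songs
    FROM '{}'
    IAM_ROLE '{}'
    JSON 'auto'
    REGION 'us-west-2';
""".format(SONG_DATA, ROLE_ARN)

# Transform: join song plays from the events with the song metadata.
songplay_table_insert = """
    INSERT INTO songplays (start_time, user_id, level, song_id, artist_id,
                           session_id, location, user_agent)
    SELECT TIMESTAMP 'epoch' + e.ts / 1000 * INTERVAL '1 second',
           e.userId, e.level, s.song_id, s.artist_id,
           e.sessionId, e.location, e.userAgent
    FROM staging_events e
    JOIN staging_songs s
      ON e.song = s.title AND e.artist = s.artist_name
    WHERE e.page = 'NextSong';
"""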

Verify the content in the database

Finally, we will run analytics.py, which takes just around 5 seconds.
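Here is a sketch of the kind of loop analytics.py could run to produce the output below (table names taken from the schema section):

# Count the rows in every staging and analytics table to confirm the loads.
TABLES = ["staging_events", "staging_songs", "songplays",
          "users", "songs", "artists", "time"]

def run_analytics(cur):
    for table in TABLES:
        query = "SELECT COUNT(*) FROM {}".format(table)
        print("Running:")
        print(query)
        cur.execute(query)
        print(cur.fetchone()[0])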

Running:
SELECT COUNT(*) FROM staging_events
8056

Running:
SELECT COUNT(*) FROM staging_songs
14896

Running:
SELECT COUNT(*) FROM songplays
333

Running:
SELECT COUNT(*) FROM users
104

Running:
SELECT COUNT(*) FROM songs
14896

Running:
SELECT COUNT(*) FROM artists
10025

Running:
SELECT COUNT(*) FROM time
333

This verifies that our ETL process was successful.

FAQ

Why is it a good idea to build a Data Warehouse in AWS?

Using the cloud gives us many important advantages over on-premises hardware deployments:

  • Scalability (unlimited capability to expand horizontally).
  • Elasticity (rapidly grow and shrink resources).
  • No upfront costs, and the ability to stop paying whenever you want.

However, the cloud will probably incur higher operational costs than keeping on-premises hardware.

What is Amazon Redshift?

Amazon Redshift is a cloud-managed, column-oriented, massively parallel processing (MPP) database that consists of a cluster running multiple nodes.

Internally, Amazon Redshift is a modified PostgreSQL with extensions for custom columnar storage.

It has useful built-in features; for example, when doing ETL with the COPY command, Redshift performs Automatic Compression Optimization.

MPP

Most relational databases are capable of executing multiple queries in parallel, but every query is always executed on a single CPU of a single machine.

Massively Parallel Processing databases are capable of parallelizing the execution of one query on multiple CPUs/machines.

This is done by partitioning tables and distributing the partitions across CPUs, which process them in parallel.

(Image: a single query being executed in parallel across table partitions)

Column-oriented storage

Columnar storage for database tables is an important factor in optimizing analytic query performance because it drastically reduces the overall disk I/O requirements and reduces the amount of data you need to load from disk.

Column-oriented storage is best suited for OLAP workloads*.

*Workloads consisting primarily of read operations, optimized for analytical and ad hoc queries, including aggregations.

Cluster

An Amazon Redshift cluster is composed of:

  • One Leader Node:
    • Coordinates Compute Nodes
    • Handles external communication
    • Optimizes query execution
  • One or more Compute Nodes:
    • Each of them has its own CPU, memory and disk. (Determined by node type)

In order to increase cluster capabilities we can either scale up (get more powerful nodes) or scale out (get more nodes).

Each Compute Node is logically divided into a number of processing slices. Thus, if a cluster has n slices, it can process n partitions of a table simultaneously.
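In Redshift you influence how a table's rows are spread across those slices through its distribution style and key. A generic illustration (not the schema used in this project):

# Illustrative DDL only: distribute rows by song_id so matching rows land on the
# same slice, and sort by start_time to speed up time-range scans.
example_distributed_table = """
    CREATE TABLE songplays_distributed (
        songplay_id INT,
        start_time  TIMESTAMP SORTKEY,
        song_id     VARCHAR DISTKEY,
        user_id     INT
    );
"""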

(Image: Redshift cluster architecture; clients connect to the leader node via JDBC/ODBC)

*Open Database Connectivity (ODBC) is a standard application programming interface (API) for accessing database management systems (DBMS), designed to be independent of any particular database system or operating system.

*Java Database Connectivity (JDBC) is a Java-based data access API that defines how a client may access a database.

What are the costs of running Amazon Redshift?

It depends on the type of cluster you deploy, ranging from less than 0.25 USD to more than 13 USD an hour.

At the time this project was written, prices in Paris (eu-west-3) were as follows:

| Node type | vCPU | Memory | Addressable storage capacity | I/O | Price |
|---|---|---|---|---|---|
| Dense Compute DC2 | | | | | |
| dc2.large | 2 | 15 GiB | 0.16TB SSD | 0.60 GB/s | $0.32 per Hour |
| dc2.8xlarge | 32 | 244 GiB | 2.56TB SSD | 7.50 GB/s | $5.88 per Hour |
| Dense Storage DS2 | | | | | |
| ds2.xlarge | 4 | 31 GiB | 2TB HDD | 0.40 GB/s | $1.00 per Hour |
| ds2.8xlarge | 36 | 244 GiB | 16TB HDD | 3.30 GB/s | $7.98 per Hour |
| RA3 with Redshift Managed Storage | | | | | |
| ra3.4xlarge | 12 | 96 GiB | 64TB RMS | 2.00 GB/s | $3.795 per Hour |
| ra3.16xlarge | 48 | 384 GiB | 64TB RMS | 8.00 GB/s | $15.18 per Hour |
