Did you define your data sources? Did you manage to set up a data lake? Is it ready? Does the pipeline work properly and deliver your data to the data lake?
If there is at least one “no” answer to these questions, please read my previous article. Otherwise, let’s move on — our next station is the data warehouse.
You already know what a data warehouse is, but I want to extend that definition using the new knowledge you've acquired. A data warehouse is a clean, well-organized database that efficiently stores all of a product's data, integrated from all sources.
What storage should be used for a data warehouse? Well, there is only one choice here: a relational analytics database. It’s time to talk about analytics databases in detail.
There are two types of databases: OLTP and OLAP.
Online transactional processing (OLTP)
OLTP enables the real-time execution of large numbers of database transactions by large numbers of people, typically over the internet. OLTP systems are behind many of our everyday transactions, from ATMs to in-store purchases to hotel reservations.
OLTP systems use a database that can do the following:
- Process a large number of relatively simple transactions — usually insertions, updates, and deletions of data.
- Enable multi-user access to the same data, while ensuring data integrity.
- Support very rapid processing, with response times measured in milliseconds.
- Provide indexed data sets for rapid searching, retrieval and querying.
- Be available 24/7/365, with constant incremental backups.
Examples of OLTP databases are PostgreSQL, MySQL, and MongoDB.
Online analytical processing (OLAP)
OLAP is a system for performing multi-dimensional analysis at high speed on large volumes of data. Typically, this data comes from a data warehouse or some other centralized data store.
OLAP is ideal for:
- Data mining.
- Business intelligence and complex analytical calculations.
- Business reporting functions like financial analysis, budgeting and sales forecasting.
- Metric calculations.
- Complex data analysis for smarter decision-making.
An analytics database is a database that belongs to the OLAP group. Examples are Amazon Redshift, Google BigQuery, and Snowflake.
Creating a data warehouse requires data engineering effort because you have to clean, transform and reorganize your data from a data lake into a structure that will be queried by a non-engineering team.
Table and column names are typically created by engineers to be used by the application the data comes from. Table names, column names, and even a column’s purpose for being in the table can be confusing to others.
This makes it challenging for team members to analyze the data without consulting the engineer.
To address these issues, we need to keep the analyst or business user in mind and make all of the fields easy for them to interpret.
The first step is to develop guidelines for how you want to transform the data. The details depend entirely on your data, but there are a couple of basic steps.
Consolidate Data Sources
When your company has used multiple tools to track the same type of data, migrate the data from the previous tools into the latest tool if you can. If this is not an option, use the data warehouse to create a table that consolidates the data from both sources.
Imagine you were tracking sign-ups via HubSpot and after a year decided to switch to Salesforce. Salesforce will then have no data from before the switch, so you have to merge the data from both tools into a single 'sign-ups' table.
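The consolidation above can be sketched as a simple UNION of both tools' tables. This is a minimal illustration using an in-memory SQLite database; the table and column names are invented for the example, so adjust them to your own schema.

```python
import sqlite3

# In-memory database standing in for the warehouse; table and column
# names here are hypothetical.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE hubspot_signups (email TEXT, signed_up_at TEXT)")
cur.execute("CREATE TABLE salesforce_signups (email TEXT, signed_up_at TEXT)")
cur.execute("INSERT INTO hubspot_signups VALUES ('a@x.com', '2020-01-15')")
cur.execute("INSERT INTO salesforce_signups VALUES ('b@x.com', '2021-03-02')")

# Consolidate both tools into one 'signups' table, tagging each row
# with its source so the history stays traceable.
cur.execute("""
    CREATE TABLE signups AS
    SELECT email, signed_up_at, 'hubspot' AS source FROM hubspot_signups
    UNION ALL
    SELECT email, signed_up_at, 'salesforce' AS source FROM salesforce_signups
""")

rows = cur.execute(
    "SELECT source, email FROM signups ORDER BY signed_up_at"
).fetchall()
```

Keeping a `source` column is optional, but it makes it easy to audit which tool a given row originally came from.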
Simplify the Schema
In a data lake the schema reflects the transactional logic of an application and follows best practices (such as third normal form). Build the data warehouse by creating tables on top of the existing schema; there is no need to move away from third normal form.
The main thing is to simplify the schema: exclude tables that only contain app-specific logic and are not useful for analysis.
If you want to make it even easier to work with a specific set of data, you can create a wide table that does all the joins with other tables.
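As a sketch of such a wide table, the example below pre-joins a hypothetical users table with order statistics so analysts never have to write the join themselves. All names and figures are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized "data lake" tables; names are invented for the example.
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL)"
)
cur.executemany("INSERT INTO users VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])
cur.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0)],
)

# A "wide" table that does the join once, up front.
cur.execute("""
    CREATE TABLE wide_users AS
    SELECT u.id, u.name,
           COUNT(o.id) AS order_count,
           SUM(o.amount) AS total_spent
    FROM users u
    LEFT JOIN orders o ON o.user_id = u.id
    GROUP BY u.id, u.name
""")

wide = cur.execute(
    "SELECT name, order_count, total_spent FROM wide_users ORDER BY id"
).fetchall()
```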
Simplify Tables and Columns
When going through and recreating the schema with relevant tables you should also clean up what’s in each table. Exclude irrelevant columns and rename any columns that are confusing. Naming conventions help people analyze data consistently because they clarify what each column is and how it can be used. Examples:
- Extract relevant data from complex data types.
- Change flags and cryptic abbreviations to meaningful values.
- Denormalize data where possible.
- Change column types where needed.
- Include fields with an obvious analytical purpose.
- Remove irrelevant data.
- Fix common data inconsistencies.
- Use plural table names.
- Use id as the primary key.
- Use lowercase, underscored naming.
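Several of the steps above (renaming confusing columns, expanding cryptic flags, lowercase naming) can be expressed as a small row-level cleanup function. This is a minimal sketch; the raw column names and flag codes are invented for illustration.

```python
# Hypothetical mapping from cryptic app columns to analyst-friendly names.
RENAMES = {"usrNm": "user_name", "crtdTs": "created_at", "actFlg": "status"}

# Hypothetical flag codes expanded to meaningful values.
FLAG_VALUES = {"A": "active", "D": "deleted", "S": "suspended"}

def clean_row(raw: dict) -> dict:
    """Rename columns (falling back to lowercase) and expand flag codes."""
    row = {RENAMES.get(k, k.lower()): v for k, v in raw.items()}
    if "status" in row:
        row["status"] = FLAG_VALUES.get(row["status"], "unknown")
    return row

cleaned = clean_row({"usrNm": "ada", "crtdTs": "2021-01-01", "actFlg": "A"})
```

In practice you would express these rules in SQL or in your transformation tool, but the logic stays the same: one explicit, documented mapping per confusing column.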
We've set up a data warehouse, and we now know what transformations should be applied to the data. The last step is to move data from the data lake to the data warehouse. You already know how to move data between databases: say hi to ETL jobs again.
In the last article, I omitted the T (Transform) in ETL because it was not relevant for a data lake. But since a data warehouse is built on top of data transformations, the T is essential here.
Let's explore ETL in detail depending on the data lake architecture. As a reminder, a data lake might be an analytics database or object storage.
Data Lake as an analytics database
In this case, the data lake database can also serve as your data warehouse. All you have to do is create database views that read from the "data lake" tables and produce the data warehouse tables.
For example, you already have a users table from your application database and a users table from the CRM system. This data belongs to the data lake. Create a database view that joins these two tables, removes useless columns, and renames the others.
As a result you have a clean, well-organized 'users' table that belongs to the data warehouse. This process is called ELT, because you Extracted data from the sources, Loaded it into the data lake, and Transformed it in the data warehouse.
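The view-based Transform step can be sketched like this. SQLite stands in for the analytics database, and the table and column names (including the app-specific `internal_flags` column being dropped) are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# "Data lake" tables, loaded as-is from the sources; names are illustrative.
cur.execute("CREATE TABLE app_users (id INTEGER, email TEXT, internal_flags TEXT)")
cur.execute("CREATE TABLE crm_users (email TEXT, company TEXT)")
cur.execute("INSERT INTO app_users VALUES (1, 'a@x.com', 'xyz')")
cur.execute("INSERT INTO crm_users VALUES ('a@x.com', 'Acme')")

# The Transform step of ELT: a view that joins both sources, drops the
# app-specific 'internal_flags' column, and renames the rest.
cur.execute("""
    CREATE VIEW users AS
    SELECT a.id AS user_id, a.email, c.company
    FROM app_users a
    LEFT JOIN crm_users c ON c.email = a.email
""")

result = cur.execute("SELECT user_id, email, company FROM users").fetchall()
```

Because it is a view, the clean 'users' table always reflects the latest data in the lake tables with no extra copy step.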
Data Lake as an object storage
This case is more complicated because you must create an ETL job that Extracts data from your data lake, Transforms it according to the guidelines above, and Loads it into the data warehouse.
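Here is a minimal end-to-end sketch of such a job. A local JSON file stands in for the object storage (a real pipeline would read from S3 or GCS), SQLite stands in for the warehouse, and the field names and flag codes are invented for illustration.

```python
import json
import os
import sqlite3
import tempfile

# Extract: read raw records from "object storage" (a local JSON file here).
raw = [{"usrNm": "ada", "actFlg": "A"}, {"usrNm": "grace", "actFlg": "D"}]
path = os.path.join(tempfile.mkdtemp(), "users.json")
with open(path, "w") as f:
    json.dump(raw, f)

with open(path) as f:
    extracted = json.load(f)

# Transform: rename cryptic columns and expand flag codes to real values.
FLAGS = {"A": "active", "D": "deleted"}
transformed = [
    {"user_name": r["usrNm"], "status": FLAGS.get(r["actFlg"], "unknown")}
    for r in extracted
]

# Load: write the cleaned rows into the warehouse table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (user_name TEXT, status TEXT)")
cur.executemany("INSERT INTO users VALUES (:user_name, :status)", transformed)

loaded = cur.execute("SELECT user_name, status FROM users").fetchall()
```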
Most ETL tools already provide transformation capabilities. For example, Fivetran integrates with dbt, and Amazon's AWS Glue provides a set of built-in transforms that you can use to process your data. Just check the documentation of the tool you use.
A data warehouse is a clean and well-organized database. Before creating one, write data transformation guidelines so you prepare your data properly. Use ETL/ELT jobs to move your data from the data lake to the data warehouse.
Congratulations, we have built a data warehouse and we're ready to start analyzing our product's data. But, as you remember, a data warehouse is merely a database that stores data. We need to deliver an interface to the warehouse to make this data work for us. That's what we will cover in the next article.