ETL Basics

We often think of the foundational skills of data science as being Data Manipulation and Management (SQL), Graphing/Charting, Modeling, and Development (R/Python). But as the toolset changes, so do the skills needed. With the emergence of data science platforms like Dataiku, the development skillset has become less important. Similarly, these tools have nearly removed the need for a separate data engineering stage of the project. As a result, the process of building data models that follow ETL (Extract-Transform-Load) principles becomes a more important skill set for the data scientist.

I’ll have a more detailed post coming shortly with specific ETL guidelines, but for now, let’s focus on what the overall picture of an ETL should look like.

Master Data and Transactional Data

First, we need to talk about two kinds of tables: Master Data and Transactional. Master data describes the objects the database cares about. Frequently this data is not version controlled; it is simply kept “as-is”. An example would be a “customer”, and within the customer master data could be the “sales plan” the customer is on. If you go into the transactional system and look up the customer, you might see a drop-down list of the possible options. In our imaginary system, we’ll say that the customer can be on the “free tier”, “silver”, or “gold” sales plan.
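
To make that concrete, here is a minimal SQL sketch of what such a master data table might look like; the table and column names are hypothetical, not from any particular system:

    -- Hypothetical customer master data table. Note that sales_plan holds
    -- only the current value; there is no built-in change history.
    CREATE TABLE customers (
        customer_id   INTEGER      PRIMARY KEY,
        customer_name VARCHAR(100) NOT NULL,
        sales_plan    VARCHAR(20)  NOT NULL
            CHECK (sales_plan IN ('free tier', 'silver', 'gold'))
    );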

The second type of table, the transactional table, generally records the transactions between the master data elements. For example, if you have a customer, you might have a table that lists all of their orders. Usually, in a normalized database, the customer listed on the order table is just the customer number or some other unique identifier. There is also probably a subsequent table that joins the orders with the products sold. “Products” would be another master data table.
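
Continuing the sketch above, the transactional side might look something like this; again, the names are invented for illustration:

    -- Products is another hypothetical master data table, sketched here so
    -- the foreign keys below resolve.
    CREATE TABLE products (
        product_id   INTEGER      PRIMARY KEY,
        product_name VARCHAR(100) NOT NULL
    );

    -- Orders reference the customer master data by identifier only.
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers (customer_id),
        order_date  DATE    NOT NULL
    );

    -- A subsequent transactional table joins orders to the products sold.
    CREATE TABLE order_items (
        order_id   INTEGER NOT NULL REFERENCES orders (order_id),
        product_id INTEGER NOT NULL REFERENCES products (product_id),
        quantity   INTEGER NOT NULL,
        PRIMARY KEY (order_id, product_id)
    );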

Basic Recommendations for an ETL

ETLs frequently de-normalize the database, pulling multiple tables into one large table. This process is similar to the steps used to build reports and data warehouses. If you have access to someone who has built (good) reporting systems, look to tap their expertise. Otherwise, here are my “basics”:

  • Don’t do everything in Python/R – you are doing data manipulation, so use a database tool
  • Run the extract in the memory of the source database
  • Limit the transaction extract with a variable so that you only pull transactions created after the start date/time of the last successful run (see the extract sketch after this list)
  • Transfer to the destination database prior to joins
  • Execute joins/transformations in the destination database (a join sketch follows this list)
  • Build any aggregations needed to speed performance
  • Build appropriate checks into the process. Every ETL will eventually fail. By setting up the last-successful-run variable, the ETL should be set up so that it automatically catches up. A full re-build may sometimes be necessary, and that is just a matter of setting the date/time of the start of the last good run to the date/time of the first transaction.
  • Build the join tables assuming you will re-use them. Aggregations and transformations may vary as time goes by, so do those in a separate step.
  • Unless you are working with very large data sets, use as many intermediate tables on your destination database as you need. You’ll get better performance results and the extra storage space is no big deal.
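
To make the extract and catch-up points concrete, here is a rough sketch of an incremental extract, assuming a hypothetical etl_run_log table on the destination side and a created_at timestamp on the source transactions; the bind variables would come from whatever tool schedules the ETL:

    -- 1. Look up the start time of the last successful run. A full re-build
    --    is just a matter of resetting this to the date/time of the first
    --    transaction.
    SELECT MAX(run_started_at) AS last_success_start
    FROM   etl_run_log
    WHERE  run_status = 'success';

    -- 2. On the source database, extract only the transactions newer than
    --    that start time; this also lets the ETL catch up automatically
    --    after a failed run.
    SELECT o.order_id,
           o.customer_id,
           o.order_date,
           o.created_at
    FROM   orders o
    WHERE  o.created_at > :last_success_start;

    -- 3. After the load finishes cleanly, record this run so the next one
    --    picks up where it left off.
    INSERT INTO etl_run_log (run_started_at, run_status)
    VALUES (:this_run_start, 'success');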
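
And here is a sketch of the join and aggregation steps on the destination database. The stg_ staging tables and output names are made up; the point is that the reusable de-normalized join lands in its own intermediate table, and the aggregations are built from it in a separate step:

    -- Step 1: de-normalize the staged transactions against the master data
    -- into a reusable intermediate table.
    CREATE TABLE stg_order_detail AS
    SELECT o.order_id,
           o.order_date,
           c.customer_name,
           c.sales_plan,
           p.product_name,
           oi.quantity
    FROM   stg_orders      o
    JOIN   stg_order_items oi ON oi.order_id   = o.order_id
    JOIN   customers       c  ON c.customer_id = o.customer_id
    JOIN   products        p  ON p.product_id  = oi.product_id;

    -- Step 2: build any aggregations needed for performance in a separate
    -- step, since these tend to change more often than the underlying join.
    CREATE TABLE agg_units_by_plan AS
    SELECT sales_plan,
           order_date,
           SUM(quantity) AS units_sold
    FROM   stg_order_detail
    GROUP  BY sales_plan, order_date;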

The Challenge of Turning a Master Data Table into a Transactional Table

One last bit of advice, for when your master data does not contain its change history and you need that history for analysis. Besides taking a baseball bat up to the developers and demanding they re-design the transactional system to benefit you, you’ll need to build a compare function that compares a snapshot to the current master data table. Then capture the deltas between the old table and the current table and append them to a running table of transactions. Use the date that you execute the load as the date inside the transaction table. There is one weakness to this approach that cannot be conveniently worked around: if you skip over any days of processing, then those days of data are “lost”. I’m sorry, but they just are. There is no way to re-create history because all you have is a picture of it before and after. Without the data in between, you are simply at a loss.
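
Here is one way that compare step might look, assuming a customers_snapshot table kept from the previous load and a running customer_plan_history table; all of these names are hypothetical:

    -- Append the deltas between the previous snapshot and the current master
    -- data, stamped with the date the load is executed.
    INSERT INTO customer_plan_history (customer_id, old_plan, new_plan, change_date)
    SELECT cur.customer_id,
           snap.sales_plan,
           cur.sales_plan,
           CURRENT_DATE
    FROM   customers          cur
    JOIN   customers_snapshot snap ON snap.customer_id = cur.customer_id
    WHERE  snap.sales_plan <> cur.sales_plan;

    -- Refresh the snapshot so the next run compares against today's picture.
    DELETE FROM customers_snapshot;
    INSERT INTO customers_snapshot
    SELECT * FROM customers;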

As I mentioned, the process for building ETLs has been around for quite some time. If you come to data science from a background of building them, you have a leg up. Otherwise, you’ll want to incorporate good ETL practices into your process of building models.