Which ETL did you write?

What is your ETL: historical or daily?

ETL processes are one of the most important processes for any data warehouse project.
ETL job design (Photo credit: Wikipedia)

In non-technical terms, an ETL process is a method of moving data from one place to another and preparing it for use in the second location.


    Extract: pull data from a source location.


    Transform: this could be any number of things, for example converting epoch time to a human-readable form.


    Load: move the transformed data into its next destination.
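The three steps above can be sketched in a few lines. This is a minimal, hedged illustration, not any particular tool's API; the `event_time` epoch column and the CSV shape are assumptions made up for the example.

```python
import csv
import io
from datetime import datetime, timezone

def extract(source):
    """Extract: read rows from a CSV source (a file-like object here)."""
    return list(csv.DictReader(source))

def transform(rows):
    """Transform: convert the hypothetical epoch 'event_time' column to UTC text."""
    for row in rows:
        ts = int(row["event_time"])
        row["event_time"] = datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()
    return rows

def load(rows, destination):
    """Load: write the transformed rows to the destination (another file here)."""
    writer = csv.DictWriter(destination, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)

source = io.StringIO("id,event_time\n1,0\n")
dest = io.StringIO()
load(transform(extract(source)), dest)
print(dest.getvalue())  # epoch 0 becomes 1970-01-01T00:00:00+00:00
```

In a real job the source and destination would be databases or files on disk, but the extract/transform/load shape is the same.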

ETL can be as simple as a single SQL script, or as complicated as a workflow in Informatica, Pentaho, or other ETL tools.

ETL processes are not the only mechanism for moving data within the enterprise. An alternative is a service. However, services are not ETL: a service request gives a point-in-time snapshot of a small set of data specifically related to the parameters that call the service. A good way to understand all of the data flows throughout your organization is to create a Data Structure Graph, including ETL, FTP, web-services, and ESB applications as nodes on your graph.
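Such a graph can be as simple as a list of edges. This sketch is purely illustrative; the system names and flow types are invented for the example.

```python
# Hypothetical data-flow graph: each edge is one data movement
# (ETL job, FTP drop, web service, or ESB route) between systems.
data_flows = [
    ("orders_db", "staging", "ETL"),
    ("staging", "warehouse", "ETL"),
    ("warehouse", "reporting", "web-service"),
    ("vendor_sftp", "staging", "FTP"),
]

def downstream(system, flows):
    """Return the systems fed directly by `system`."""
    return sorted(dest for src, dest, _ in flows if src == system)

print(downstream("staging", data_flows))  # -> ['warehouse']
```

Even this trivial structure answers impact-analysis questions like "what breaks downstream if this feed is late?"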

Historically, ETL has been an extract of a large data set or table from one repository, a transformation or even restructuring of that data, and then a load into some type of table.

Generally this process goes like this:
Source System -> Staging Area -> Final Destination.

In data warehouse terminology, this staging area is volatile: it is refreshed with new data before each load into the final structure.
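The source → staging → final flow can be shown with an in-memory SQLite database. The table names (`source_orders`, `stg_orders`, `dim_orders`) are invented for the sketch; the point is that staging is wiped and refreshed on each run.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_orders (id INTEGER, amount REAL);
    CREATE TABLE stg_orders    (id INTEGER, amount REAL);  -- volatile staging area
    CREATE TABLE dim_orders    (id INTEGER, amount REAL);  -- final destination
    INSERT INTO source_orders VALUES (1, 9.99), (2, 24.50);
""")

# Staging is volatile: wipe it, then refresh it from the source system.
conn.execute("DELETE FROM stg_orders")
conn.execute("INSERT INTO stg_orders SELECT id, amount FROM source_orders")

# Load from staging into the final structure.
conn.execute("INSERT INTO dim_orders SELECT id, amount FROM stg_orders")
conn.commit()
print(conn.execute("SELECT COUNT(*) FROM dim_orders").fetchone()[0])  # -> 2
```

A production job would add transformations between staging and the final load, but the shape of the flow is the same.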

Funny enough, some Big Data projects follow this same methodology: pull data from a source system, do some transformations on the data, then store it in a final location for consumption by your users.

This is not the Data Science portion of a big data project. I will write more about data science in a later article; for now, here is where I talk about the difference between Data Science and Business Intelligence.

For any process where data is extracted from a source system, there should be two different methods of extraction: the daily process and the historical process.

The daily process is usually quite simple: just grab all the data from a particular table that has been created or updated since the last extract.

The historical process eliminates the timestamp requirement.
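Side by side, the two extracts differ only in the timestamp filter. This is a simplified sketch; the `updated_at` column name and the in-memory "table" are assumptions for illustration.

```python
from datetime import datetime

# A toy source table: three rows with last-modified timestamps.
rows = [
    {"id": 1, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "updated_at": datetime(2024, 6, 1)},
    {"id": 3, "updated_at": datetime(2024, 6, 2)},
]

def daily_extract(table, last_extract):
    """Daily: only rows created or updated since the last extract ran."""
    return [r for r in table if r["updated_at"] > last_extract]

def historical_extract(table):
    """Historical: the same pull with the timestamp requirement removed."""
    return list(table)

print(len(daily_extract(rows, datetime(2024, 6, 1))))  # -> 1
print(len(historical_extract(rows)))                   # -> 3
```

The one-line difference in the query hides a large difference in data volume, which is exactly where the trouble described below comes from.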

This sounds simple.

Generally it is. However, I have seen on more than one occasion that the "full-historical" data behaves quite differently from the "daily" snapshot.

This is because the daily process is usually designed to handle small quantities of data. A daily process may or may not scale when dealing with volumes much larger than it normally sees.

Do not assume that a daily process can handle the same volume. There have also been times when a quality problem went unnoticed for some period of time, and once it was found, a full historical refresh of the data warehouse was required. We did the math and determined that running two years' worth of daily extract and load processes would take three weeks; with a different process we could get it done in 48 hours.

The processes are optimized for different workloads. Keep this in mind when you design future ETL work.
