ETL (Extract, Transform, and Load) is the process of transferring data from source database to the destination database. The tools that execute these functions are called ETL Solution tools. In software development world, ETL is often used to migrate data from legacy system(s) to a new system or pull data from online transaction database (OLTP) to data warehouse.
If you are in charge of designing and implementing an ETL process, consider these practical tips in the whole lifecycle of the project.
1. Tool Selection
The first question to address is, “Which tool should be used?”. There are many factors to consider when selecting an ETL tool like cost, performance, volume of data, complexity of transformation logic, configuration, implementation, and deployment effort, etc.. The following guidelines should help guide you to the right answer for your situation:
- Consider using internal tools (database script or in-house developed tool) for low complexity projects. For example, if the source and target databases have a similar structure, bulk export and bulk insert commands, along with some scripts to slightly adjust the data may be sufficient. This will bring the best performance with not too much effort.
- For more complex transformations, you should consider using an industry proven ETL tool. For price-sensitive efforts, you may want to follow open source path. Talend, Pentaho, CloverETL are examples of solutions available in this category. However, most users find those editions to be limited for real-life situations. However, since they are open source, you can customize the code to suit your requirements, but of course this requires coding knowledge and effort.
- For additional features, you may consider purchase of more commercially licensed tools. For small projects, there are many affordable ETL tools like Microsoft SSIS, SAS Data Management, Pervasive Data Integrator, etc. I used Microsoft SSIS in most of my projects in this range and it worked perfectly. SSIS can load tens of millions of records in minutes and it also allows you to write your own code (.NET) in the script component to handle any complicated requirements.
- For large and highly complex transformations, enterprise-scale ETL tools like Informatica PowerCenter and IBM InfoSphere DataStage are leading the market with their ability to scale performance in handling very large data volumes in complex, heterogeneous environments. I used Informatica PowerCenter in a large healthcare project with hundreds of millions transactional records per day and Informatica PowerCenter could handle it easily. The only obstacle of those tool is their price, which may go up to six digit license cost.
2. Define Data Mapping
This is the most important step in an ETL project. The output of this step is a data lineage (or source-to-target) document, to be used by both developers and QAs. This document must include the following information:
- Sources and Targets: can be tables, flat files, APIs, etc.
- Mappings: can be straight mappings from source columns to target columns, or business logic to transform source data to target data.
Many database design software include features to design and manage data lineage document. For example, Embarcadero ER/Studio allows us to enter data mapping for each column in Logical model. Then the data architect can export those mappings to an Excel data lineage document at any time.
3. Driver Selection
In general, most ETL tools support all common data sources (flat files, relational databases, NoSQL databases, etc.) with variety of drivers (ODBC, OLEDB, ADO, ADO.NET, FILE, etc.). However, in some cases using different drivers bring noticeable difference in performance. For example, if you use Microsoft SSIS to extract and load data from Oracle, try Microsoft Attunity Connector driver, which is developed specifically for Oracle, and experience the significant improvement (3-5 times faster).
I suggest to search in the community for the best drivers to be used for each source/target database before you start coding. Most products have their own community pages like https://network.informatica.com/welcome for Informatica, https://docs.microsoft.com for Microsoft, https://docs.oracle.com for Oracle. Or you can search on some of my favorite popular sites like https://stackoverflow.com/, https://www.codeproject.com/, or https://www.mssqltips.com/, etc.
4. Data Processing
Most ETL developers adopt one of the two following approaches to process data – either process the data in memory, or on the database server. Amount of data to process, and speed of your servers are key factors influencing this design decision.
Approach 1: Bring all necessary data to memory and process on the fly.
This approach visualizes all processing flows on the UI so it helps developer and reviewer observe the flow easily. It also helps reduce workload to the transactional database server. However, this approach is only suitable if you have a strong ETL server and a fast processing ETL tool. Based on my past experience, if you choose Informatica PowerCenter, this approach fits the best. The reason is Informatica can process data extremely fast and normally the server must be very strong (who installs a million dollar license software to a cheap server?).
Approach 2: Process data as much as possible in the source database server before bringing it to memory.
Some mid-level ETL tools, such as Microsoft SSIS, suggests that we should process in the database server and only leave limited work (for example, joining data from sources located in different server) to ETL server.
5. Error Handling
Of course when an error happens at any point, we will need to log as much error information as possible (error source, error description, etc.) to either log file or log table. The only thing that needs to be considered is to let the process continue or stop and rollback.
Normally, if the destination is a transactional database, I would prefer to stop the process immediately and rollback the data to maintain integrity in the transactional database.
In case the destination is a reporting database or data warehouse, we may let the process continue to load other datasets then come back and reload the failed dataset later.
6. Data Logging
Always capture and store “bad” data (duplicate data, outlier value, etc.) to audit tables and better implement an auto-notification process to alert product or support team to analyze those data.
In case the amount of this kind of data is small, we can store them in one common audit table. This common audit table must include a natural key to identify original data in the source system. However, if the bad data is huge on some datasets (especially transactional tables), consider to create an audit table for each dataset to help the analyst analyze and identify the issue faster. Once the data issue has been analyzed and resolved, this data can be removed manually or by an automatic process after a predefined period.
7. Data Staging
Staging area is the place where we dump our source data before start processing it. This phase helps reduce communication time between our process and source systems. This will bring main benefits: reduce risk of error (when communicating with external source, you would want to keep the transaction as short as possible), and reduce workload to source database (if you keep long transaction, it may hold resources in source database and reduce source database’ performance or locking). So the bottom line is do NOT process source data while loading it. But split the process to two phases: load data from external sources to staging area first, then process data from internal staging area and load to destination.
Staging can be flat files, formatted files, or database. If you need to store the snapshot of staging data to external disk, consider using files to store staging data. However, if you want to rely on database to process data, I suggest to use staging database. One of advantage of using staging database is the ability to merge data from staging database to the destination (of course they must be in the same server). Regarding speed perspective, merging data to destination database typically outperforms inserting data rows into destination table (either single row or bulk insert).
8. Unit Testing
Unit testing is very important because the nature of this process does not lend itself to typical UI validation. It’s heavily technical focused validation, requiring direct access to back end data sources and expertise in query and data analysis tooling.
Data issues can be categorized into these types: data duplication, data truncation, data missing, transformation error. Those kinds of errors can be discovered by using the following checks:
- For dataset with low number of records, query and export both source and target dataset into flat files, remember to order by the same natural key. Then use any text comparing tool to detect the differences.
- For dataset with more data (millions of records), it is tough to test all the records but at least we can do a general test and some spot checks.
- General test hints:
- Compare the total rows of source and target
- For column with discrete values (for example, genders), aggregate the counts of each distinct value in source and target (using count function in combination with a group by), then compare the result
- For number or date column, compare the range (max, min) and the number of distinct values between source and target data
- or character based column, compare the range of the length and the number of distinct values between source and target data
- Spot checks: randomly select some records in source and target tables and compare data in each column
- General test hints:
- Always test initial loaded data and incremental loaded data. Initial test helps discover duplicate, truncate and transformation issues while incremental test helps discover missing data issue.
To sum up, I suggest you investigate the characteristics of your project to pick up the right tool, the key factors are: size of your data, load frequency, source and destination types, budget, available resource – both hardware and developer. After that, have the data architect and data analyst profile data source and analyze destination database to define a precise source-to-target document. At the same time, software architect and developer can design the framework of your project, choose the approach to stage source data, process data, handle error and audit data. Then, you are ready to implement ETL code and don’t forget to do unit test thoroughly.