ETL (extract-transform-load) is a data integration approach that is an important part of the data engineering process. At KORE Software, we pride ourselves on building best-in-class ETL workflows that help our customers and partners win. To do this, as an organization, we regularly revisit best practices: practices that enable us to move more data around the world faster than ever before.
In general, ETL covers the process of how data are loaded from a source system into a data warehouse. This operation is critical for data products, software applications, and analytics / data science & AI work. In most organizations, this process includes a cleaning step which ensures that the highest-quality data is preserved within our partners' central repositories as well as our own. At KORE Software, data cleaning is critically important because of the priority we place on data quality and security. This work is also an important part of our evolving, rigorous master data management (MDM) governance processes. It helps us ensure that the right information is available in the right place and at the right time for every customer, enabling them to make timely decisions with qualitative and quantitative data. By pursuing and prioritizing this work as a team, we avoid creating long-term data problems, inconsistencies, and downstream data issues that are difficult to solve, engineer around, and scale, and which could conspire to prevent our partners from producing great analysis and insights. Below are some key principles:
Rigorously enforce the idempotency constraint: In general, I believe that the result of any ETL run should always be idempotent. This is important, as it means that if a process runs multiple times with the same parameters on different days, times, or under different conditions, the outcome remains the same. One should not end up with multiple copies of the same data within one's environment, assuming that the process has never been modified. If rules change, the target data will be expected to be different. That said, all rule changes should be logged, and logic requirements properly audited.
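One common way to achieve this property (a minimal sketch, using an in-memory SQLite table and a hypothetical `sales` schema) is to delete and rewrite the target slice on every run, so repeated runs with the same parameters always converge on the same final state:

```python
import sqlite3

def idempotent_load(conn, run_date, rows):
    """Overwrite the target slice for run_date, then insert fresh rows."""
    cur = conn.cursor()
    # Remove any data previously written for this run_date ...
    cur.execute("DELETE FROM sales WHERE run_date = ?", (run_date,))
    # ... then write the current batch; re-running changes nothing.
    cur.executemany(
        "INSERT INTO sales (run_date, item, amount) VALUES (?, ?, ?)",
        [(run_date, item, amount) for item, amount in rows],
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (run_date TEXT, item TEXT, amount REAL)")

# Running the same load twice must not duplicate data.
for _ in range(2):
    idempotent_load(conn, "2020-01-01", [("widget", 9.99), ("gadget", 4.50)])

row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
```

The same delete-then-insert (or merge/upsert) pattern scales up to partition overwrites in a real warehouse.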
For efficiency, seek to load data incrementally: When a table or dataset is small, most developers are able to extract the entire dataset in one piece and write it to a single destination using a single operation. Unfortunately, as datasets grow in size and complexity, the ability to do this diminishes. Moreover, with data coming from multiple locations at different times, incremental data execution is often the only alternative. Thus, one should always seek to load data incrementally where possible!
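A minimal sketch of the idea, assuming the source rows carry an `updated_at` column (a hypothetical field name) that can serve as a high-water mark:

```python
source = [
    {"id": 1, "updated_at": "2020-01-01"},
    {"id": 2, "updated_at": "2020-01-02"},
    {"id": 3, "updated_at": "2020-01-03"},
]

def extract_incremental(rows, watermark):
    """Return only rows modified after the stored watermark, plus the new mark."""
    fresh = [r for r in rows if r["updated_at"] > watermark]
    new_mark = max((r["updated_at"] for r in fresh), default=watermark)
    return fresh, new_mark

# First run: everything after 2020-01-01 is new.
batch, mark = extract_incremental(source, "2020-01-01")
# Second run with the advanced watermark picks up nothing new.
batch2, _ = extract_incremental(source, mark)
```

In practice the watermark would be persisted between runs (in a state table or the workflow engine), not held in memory.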
Always ensure that you can efficiently process historic data: In many cases, one may need to go back in time and process historical data at a date that is before the day or time of the initial code push. To ensure this, always make sure that you can efficiently run any ETL process against a variable start parameter, enabling a data process to back-fill data through to that historical start date irrespective of the date or time of the most recent code push. To enable this, one must ensure that all processes are built efficiently, enabling historical data loads without manual coding or programming. This ensures repeatability and simplicity and is a key part of building a scalable data system.
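The variable-start-parameter idea can be sketched as a small driver that replays a daily process over an arbitrary date range (the `process` callable here is a stand-in for any daily ETL task):

```python
from datetime import date, timedelta

def run_backfill(process, start, end):
    """Invoke a daily process once per day from `start` through `end`, inclusive."""
    day = start
    results = []
    while day <= end:
        results.append(process(day))  # each day's run should itself be idempotent
        day += timedelta(days=1)
    return results

# Back-fill four historical days with a trivial stand-in process.
processed = run_backfill(lambda d: d.isoformat(), date(2020, 1, 1), date(2020, 1, 4))
```

Because `start` is a parameter rather than a constant baked into the code, the same process serves both daily runs and historical back-fills.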
Partition ingested data at the destination: This principle is important because it enables developers of ETL processes to parallelize extraction runs, avoid write locks on data that is being ingested, and optimize the system performance when the same data is being read. It also allows developers to efficiently create historical snapshots that show what the data looked like at specific moments, a key part of the data audit process. This enables partitions that are no longer relevant to be archived and removed from the database.
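A minimal sketch of destination partitioning, using an in-memory dict keyed by a hypothetical `event_date` field to stand in for physical partitions:

```python
from collections import defaultdict

def partition_by_date(rows):
    """Route ingested rows into per-date partitions."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["event_date"]].append(row)
    return dict(partitions)

rows = [
    {"event_date": "2020-01-01", "value": 1},
    {"event_date": "2020-01-02", "value": 2},
    {"event_date": "2020-01-01", "value": 3},
]
parts = partition_by_date(rows)

# A partition that is no longer relevant can be archived and dropped
# without touching any other partition.
archived = parts.pop("2020-01-01")
```

Each partition can then be written, snapshotted, or removed independently, which is what enables parallel extraction runs and cheap historical snapshots.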
Rest data between tasks: Resting data between tasks is an important concept. In any system with multiple workers or parallelized task execution, thought needs to be put into how to store data and rest it between various steps. In a perfect world, an operator would read from one system, create a temporary local file, then write that file to some destination system. What one should avoid is depending on temporary data (files, etc.) that are created by one task for use in later tasks downstream. This is because task instances of the same operator can get executed on different workers, where a local resource created by an earlier task won't be there. Thus, it is a good idea to ensure that data is read from services that are accessible to all workers, while also ensuring that data is stored at rest within those services when tasks start and terminate.
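The pattern can be sketched with a dict standing in for shared object storage (e.g., a bucket visible to all workers); the key point is that each task reads its input from, and rests its output in, the shared service rather than a worker-local temp file:

```python
# A stand-in for storage reachable by every worker (S3, GCS, a database, ...).
shared_store = {}

def extract_task(store):
    """Pull raw data and leave it at rest in shared storage."""
    store["raw/orders"] = [3, 1, 2]

def transform_task(store):
    """Read from the shared service, never from another task's local disk."""
    raw = store["raw/orders"]
    store["clean/orders"] = sorted(raw)  # rest the result again for downstream tasks

# The two tasks could run on different workers; only the store connects them.
extract_task(shared_store)
transform_task(shared_store)
```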
Pool resources for efficiency: Efficiency in any system is important, and pooling resources is key. In a basic ETL environment, simple schedulers often have little control over the use of resources within scripts. Best practice, on the other hand, dictates that one should create resource pools before work begins and then require tasks to acquire a token from this pool before doing any work. If the pool is fully used up, other tasks that require a token will not be scheduled until one becomes available when another task finishes. This approach is tremendously useful if you want to manage access to shared resources such as a database, GPU, or CPU.
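The token-pool idea can be sketched with a bounded semaphore: tasks must acquire a token before touching the shared resource, so at most `pool`-size tasks ever run against it concurrently (the pool size of 2 below is an arbitrary example):

```python
import threading

pool = threading.BoundedSemaphore(2)  # e.g. two database connection "tokens"
active = 0
active_peak = 0
lock = threading.Lock()

def task():
    """Acquire a token from the pool before doing resource-bound work."""
    global active, active_peak
    with pool:  # blocks until a token is free
        with lock:
            active += 1
            active_peak = max(active_peak, active)
        # ... do work against the shared resource here ...
        with lock:
            active -= 1

threads = [threading.Thread(target=task) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Workflow engines such as Airflow expose the same idea as named pools that tasks are assigned to declaratively.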
Store all metadata together in one place: Just as pooling resources together is important, the same rules apply to metadata. Within good ETL, one should always seek to store all metadata together. Once this is done, allow the system or workflow engine that you are running to manage logs, job duration, landing times, and other components together in a single location. This reduces the amount of overhead that development teams face when needing to collect this metadata to solve analysis problems.
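A minimal sketch of the idea: every task reports its run metadata to one central log rather than each script keeping its own ad-hoc records (the field names below are illustrative, not a fixed schema):

```python
import time

metadata_log = []  # one central place for all run metadata

def record_run(task_name, fn):
    """Run a task and record its name, landing time, and duration centrally."""
    start = time.time()
    fn()
    metadata_log.append({
        "task": task_name,
        "landed_at": start,
        "duration_s": round(time.time() - start, 3),
    })

record_run("extract", lambda: None)
record_run("load", lambda: None)
```

In practice a workflow engine maintains this log for you; the principle is simply not to scatter it across individual scripts.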
Manage login details in one place: With the theme of keeping like components together and remaining organized, the same can be said for login details and access credentials. In any ETL process, one should always seek to manage login details together in a single place. This could be achieved by maintaining the login details for external services within their own database. This allows users to reference these configurations simply by referring to the name of the connection and making this name available to the operator, sensor, or hook. If one allows the workflow files to contain login details, this creates duplication, which makes changing logins and access complicated. Thus, always keep this principle in mind.
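The lookup-by-name idea can be sketched as a small registry; the connection name `"warehouse"` and its details below are hypothetical placeholders, and in a real system the registry would be a secrets store or the workflow engine's connection database:

```python
# Single authoritative place for credentials (values are examples only).
connections = {
    "warehouse": {"host": "db.internal", "user": "etl", "password": "***"},
}

def get_connection(name):
    """Look up credentials by connection name; fail loudly if missing."""
    if name not in connections:
        raise KeyError(f"unknown connection: {name}")
    return connections[name]

# Operators, sensors, and hooks refer only to the name, never to raw secrets.
conn = get_connection("warehouse")
```

Rotating a password then means changing one registry entry, not editing every workflow file that uses the service.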
Specify configuration details once: When thinking about configuration, one must always follow the DRY (Don't Repeat Yourself) principle. A basic strategy for reducing complexity to manageable units is to divide a system into pieces. Let us assume that one is building a simple system. This system can likely be broken down into components and sub-components. At the lowest level, one will arrive at a point where the complexity is reduced to a single responsibility. These responsibilities can be implemented in a class (assuming we're building an object-oriented application). Classes contain methods and properties. Methods implement algorithms. Algorithms and sub-parts of algorithms calculate or contain the smallest pieces that build your business logic. The DRY principle states that these small pieces of knowledge may occur exactly once in your entire system: every piece of knowledge should have a single, unambiguous, authoritative representation within it. Thus, relating the DRY principle to configuration, one must seek to avoid duplication of configuration details by specifying them once in a single place and then building the system to look up the correct configuration from the code. There are many other examples in the ETL process that illustrate the importance of the DRY principle; that said, it is especially important in our discussion of configurations.
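A minimal sketch of DRY configuration, with hypothetical setting names: details live in a single mapping, and every component looks them up rather than re-declaring them:

```python
# The single, authoritative representation of these settings.
CONFIG = {"batch_size": 500, "target_schema": "analytics"}

def build_extract_step(config=CONFIG):
    """Build an extract command from the shared config, not a local copy."""
    return f"extract --batch {config['batch_size']}"

def build_load_step(config=CONFIG):
    """Build a load command from the same shared config."""
    return f"load --schema {config['target_schema']} --batch {config['batch_size']}"

# Changing batch_size in one place updates every step that uses it.
CONFIG["batch_size"] = 1000
steps = [build_extract_step(), build_load_step()]
```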
Parameterize sub-flows and dynamically run tasks where possible: In many new ETL applications, because the workflow is code, it is possible to dynamically create tasks, or even complete processes, through that code. If one has routine code that runs frequently, such as checking the number of rows in a database and sending that result as a metric to some service, one can design that work so that a factory method in a library instantiates this functionality. One can also choose to do things like create a text file with instructions that describe how to proceed, and allow the ETL application to use that file to dynamically generate parameterized tasks specific to that instruction file.
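The factory idea can be sketched as follows: a declarative instruction list (which could just as well be parsed from a text file) is turned into parameterized task callables; the `table` and `metric` fields are hypothetical examples:

```python
# Declarative instructions; in practice these might come from a file.
instructions = [
    {"table": "orders", "metric": "row_count"},
    {"table": "users", "metric": "row_count"},
]

def make_task(table, metric):
    """Factory: build a parameterized task callable from one instruction."""
    def task():
        # A real task would query `table`; here we just describe the work.
        return f"{metric}({table})"
    return task

# One line of generation replaces a hand-written task per table.
tasks = [make_task(**spec) for spec in instructions]
results = [t() for t in tasks]
```

Adding a new table then means adding one instruction entry, not writing new task code.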
Execute conditionally: Solid execution is important. That said, conditional execution within an ETL has many benefits, including allowing a process to conditionally skip downstream tasks if these tasks are not part of the most recent execution. This principle can also allow workers to ensure that they finish their current work before starting the next piece of work; a principle that can allow data to rest between tasks more effectively.
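A minimal sketch of the skip behavior, with an illustrative `is_latest_run` flag standing in for whatever condition governs execution:

```python
def run_pipeline(is_latest_run):
    """Run upstream work, then conditionally skip or run downstream tasks."""
    log = ["upstream:done"]
    if not is_latest_run:
        # Downstream work is not part of the most recent execution: skip it.
        log.append("downstream:skipped")
        return log
    log.append("downstream:done")
    return log

current = run_pipeline(is_latest_run=True)
stale = run_pipeline(is_latest_run=False)
```

Workflow engines typically express this with branch operators or skip states rather than an explicit flag, but the control flow is the same.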
Develop your own workflow framework and reuse workflow components: Reuse of components is important, especially when one wants to scale up the development process. But just as reusing code itself is important, treating the workflow as code is an important factor, as it can allow one to reuse parts of various ETL workflows as needed. This reduces code duplication, keeps things simple, and reduces system complexity, which saves time. Moreover, if you are fortunate enough to be able to pick one of the newer ETL applications that exist, you can code not only the application process, but the workflow process itself.
The preceding discussion is a high-level overview of some principles that have recently come to light as we work to scale up our ETL practices at KORE Software. They are also principles and practices that I keep in mind through the course of my graduate research work in the iSchool at the University of British Columbia, where I work with Dr. Victoria Lemieux! If you have questions, please do not hesitate to reach out!
Nathaniel Payne is a Data and Engineering Lead at KORE Software
Questions about data engineering?