Step by step ETL in Ruby

Sergey Tsvetkov
5 min read · May 22, 2019

Lately I’ve been spending all my evenings and nights developing a classic ETL process. At Spectoos we are migrating users and their data from the original version of the app, which has existed for many years, to a completely new version that we developed recently and that is currently in private beta. It’s quite a complicated multistep process, since the new implementation has a completely reworked model under the hood, improved and enriched compared to what we had previously.

ETL = Extract, Transform, Load

From the beginning it was decided that once development was more or less finished we would give some of our users the ability to switch and give us feedback without losing their access to Spectoos 1.0. It was important to keep in mind that, as a consequence, the migration is not going to be a one-time process. First of all, because users who participate in the beta can keep using Spectoos 1.0 and we somehow need to keep their accounts synced. And secondly, because a migration is never successful on the first attempt. Or at least you should be suspicious if it was :-)

So, to deal with it I developed a separate app which uses the old Spectoos 1.0 API to get data and the fresh Spectoos 2.0 API to push this data into the new system after transformation. This app also saves the state of the migration and the mapping between entities, so that the migration can be run again if needed without duplicating data or causing other conflicts. For example, if the partner with ID = 123 from Spectoos 1.0 has already been created in Spectoos 2.0 with ID = 456, the app knows it, and the next time you run the migration for the same partner it performs updates instead of creating one more clone.
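The create-or-update behaviour described above can be sketched roughly as follows. This is only an illustration under stated assumptions: IdMapping, FakeTargetApi and migrate_partner are hypothetical names, the mapping lives in memory instead of being persisted, and the fake target stands in for the real Spectoos 2.0 API.

```ruby
# Hypothetical in-memory mapping between 1.0 IDs and 2.0 IDs.
# A real migration app would persist this between runs.
class IdMapping
  def initialize
    @ids = {}
  end

  # Returns the 2.0 ID for a 1.0 entity, or nil if it was never migrated.
  def lookup(kind, old_id)
    @ids[[kind, old_id]]
  end

  def remember(kind, old_id, new_id)
    @ids[[kind, old_id]] = new_id
  end
end

# Hypothetical stand-in for the target API, so the sketch runs on its own.
class FakeTargetApi
  attr_reader :partners

  def initialize
    @partners = {}
    @next_id = 456
  end

  def create_partner(attrs)
    id = @next_id
    @next_id += 1
    @partners[id] = attrs
    id
  end

  def update_partner(id, attrs)
    @partners[id] = attrs
    id
  end
end

# Creates the partner on the first run, updates it on every later run.
def migrate_partner(mapping, target, old_id, attrs)
  new_id = mapping.lookup(:partner, old_id)
  if new_id
    target.update_partner(new_id, attrs)
  else
    new_id = target.create_partner(attrs)
    mapping.remember(:partner, old_id, new_id)
  end
  new_id
end

mapping = IdMapping.new
target = FakeTargetApi.new
migrate_partner(mapping, target, 123, { name: "Acme" })
migrate_partner(mapping, target, 123, { name: "Acme Inc." })
puts target.partners.size # prints 1: the second run updated, not cloned
```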

Many things were quite small and fast to migrate: partners themselves, their widgets and settings, etc. But other information took a lot of time and space. For example, migrating the events we use for building our dashboard can take about an hour for some partners. What I quickly realised after migrating a few clients is that I don’t want all steps to be performed each time I need to adjust something in one place. For instance, if I find out that the code responsible for saving the background color of testimonials in some widgets is broken, it doesn’t mean I want stats to be updated as well. It’s enough to update only widgets, right? So, how can I do it?

Generally speaking, the problem can be formalised like this:

There is a process which consists of many steps. Each step can in turn be split into substeps, and so on. We should be able to define this process tree in our code and control which steps are executed and which are ignored.

What I like to do in such cases is define the desired behaviour first. After some thought and experiments I came to this kind of code:
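A sketch of what such usage might look like. The step names (partners, widgets, stats, settings) and the messages are made up for illustration, and the snippet starts with a compact stand-in for Stepify only so that it runs on its own. It is not necessarily the same as the implementation built step by step in the rest of the article.

```ruby
# Compact stand-in for Stepify so that this sketch is runnable by itself.
module Stepify
  class Step
    def initialize(tree, default: true)
      @tree = tree
      @default = default
    end

    def call(&block)
      instance_eval(&block)
    end

    def step(name, &block)
      subtree = @tree.is_a?(Hash) ? @tree[name] : nil
      enabled = case subtree
                when Hash, true then true
                when false then false
                else @default
                end
      Step.new(subtree, default: @default).call(&block) if enabled
    end
  end

  def self.root(tree, **options, &block)
    Step.new(tree, **options).call(&block)
  end
end

# Which parts of the migration should run (hypothetical step names).
tree = {
  partners: {
    widgets: true,
    stats: false
  }
}

Stepify.root(tree, default: true) do
  step(:partners) do
    step(:widgets)  { puts "migrating widgets" }  # runs: explicitly true
    step(:stats)    { puts "migrating stats" }    # skipped: explicitly false
    step(:settings) { puts "migrating settings" } # runs: not listed, default
  end
end
```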

The Hash stored in the variable tree defines which parts of the code should and should not be executed. The default behaviour is specified using the appropriately named keyword argument of the root function. Looks pretty clean and nice, if you ask me. And it requires minimal changes in the existing migration code. So, can we make it work? Yep, consider it a kind of exercise in Ruby metaprogramming.

First of all, let’s define our Stepify module. It should have a root function which accepts a reference to the root of the tree, additional arguments and a block.

In order to support calling the step method inside the block without an explicit receiver, we have to execute this block in the scope of some object. So, our root method creates this object, passes the received arguments down and asks it to execute our block.
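A minimal sketch of such a root method. The Step class here is only a stand-in with just enough behaviour to show what root hands over to it. The tree and options readers are added purely for the demonstration and are assumptions, not part of the described design.

```ruby
module Stepify
  # Stand-in for the real Step class: stores what it receives and runs
  # the block with itself as `self`.
  class Step
    attr_reader :tree, :options

    def initialize(tree, **options)
      @tree = tree
      @options = options
    end

    def call(&block)
      instance_eval(&block)
    end
  end

  # Builds the root Step, passes the arguments down and lets it run the block.
  def self.root(tree, **options, &block)
    Step.new(tree, **options).call(&block)
  end
end

Stepify.root({ partners: true }, default: false) do
  puts self.class # Stepify::Step, not the top-level main object
  puts options    # the extra arguments that were passed to root
end
```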

Now we can define the Step class itself. Basically it should have 2 main methods: call, which is used to execute the block, and step, which can be used in the context of that block.

Let’s start with the simplest thing and define how we execute the block. To make it possible to call step without an explicit receiver we can use instance_eval in Ruby, which evaluates the block with self set to the receiving object. That is exactly what we need. Here is how it looks:
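A minimal sketch of this first version, assuming the constructor takes the tree plus a default keyword. At this stage step is only a placeholder that reports the name it was given and does not run its block.

```ruby
class Step
  def initialize(tree, default: true)
    @tree = tree
    @default = default
  end

  # Runs the block with this Step as `self`, so that a bare `step(...)`
  # inside the block resolves to the method below.
  def call(&block)
    instance_eval(&block)
  end

  # Placeholder: the name is reported, but the nested block is not run yet.
  def step(name, &block)
    puts "reached step #{name}"
  end
end

Step.new({}, default: true).call do
  step(:partners) do
    step(:widgets) { puts "not printed at this stage" }
  end
end
```

Running it prints only "reached step partners": the outer call works, while the nested block is silently dropped.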

Now it’s basically possible to run our test script, because all required methods are defined and available for execution in the right context. Here is a link to the working example: https://repl.it/@kimrgrey/step-by-step-etl-in-ruby-step-call. The only problem is that the inner blocks are not executed, since we are not doing anything useful in step. So, let’s fix it.

When the step method is called, we get the name of the step in the context of the current level plus a block to execute.

First things first: we check whether this block should be executed according to our configuration. To do that we take the appropriate subtree. If it’s a Hash, good: it probably has substeps, so we should execute. It may also be true or false, directly defining the desired result. If it’s nil, not present or anything else, not a big deal: we have our default.
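The decision itself can be sketched as a small standalone function. The name step_enabled? and the sample tree are hypothetical, chosen only to illustrate the three cases.

```ruby
# Decides whether a step runs, given its entry in the tree and the default.
def step_enabled?(subtree, default)
  case subtree
  when Hash then true           # substeps are configured, so descend
  when true, false then subtree # explicit switch
  else default                  # nil, missing key or anything else
  end
end

tree = { widgets: { colors: true }, stats: false }
puts step_enabled?(tree[:widgets], true)   # true
puts step_enabled?(tree[:stats], true)     # false
puts step_enabled?(tree[:settings], false) # false, falls back to the default
```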

Now, once we have decided whether to go further or not, we can create a substep and call it with our block, passing down our subtree as the context for the next steps. Here is a working example of this version: https://repl.it/@kimrgrey/step-by-step-etl-in-ruby-step-call-and-step.
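Putting the pieces together, this version might look roughly like the sketch below. It is a reconstruction from the description above, not necessarily identical to the linked example, and the step names in the usage part are made up.

```ruby
module Stepify
  class Step
    def initialize(tree, default: true)
      @tree = tree
      @default = default
    end

    def call(&block)
      instance_eval(&block)
    end

    def step(name, &block)
      # Only a Hash can hold configuration for named substeps.
      subtree = @tree.is_a?(Hash) ? @tree[name] : nil
      enabled = case subtree
                when Hash then true
                when true, false then subtree
                else @default
                end
      return unless enabled

      # The subtree becomes the context for any nested steps.
      Step.new(subtree, default: @default).call(&block)
    end
  end

  def self.root(tree, **options, &block)
    Step.new(tree, **options).call(&block)
  end
end

Stepify.root({ partners: { stats: false } }, default: true) do
  step(:partners) do
    step(:widgets) { puts "widgets" } # runs: not listed, default is true
    step(:stats)   { puts "stats" }   # skipped: explicitly false
  end
end
```

One consequence of this sketch: when a step is switched on with a plain true, its substeps have no configuration of their own and simply follow the default.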

Now what’s left is a few cosmetic improvements. For example, we can pass additional args to the step without actually diving into it, so each substep can have its own defaults or something else. Or rename something for the sake of clarity:
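A sketch of how such a polished version could look. The renames (config and node instead of tree and subtree), the private enabled? helper and the rule that options given to a step apply to its substeps are assumptions made for this illustration.

```ruby
module Stepify
  class Step
    def initialize(config, default: true)
      @config = config
      @default = default
    end

    def call(&block)
      instance_eval(&block)
    end

    # Options given to a step (for example default:) apply to its substeps.
    def step(name, **options, &block)
      node = @config.is_a?(Hash) ? @config[name] : nil
      return unless enabled?(node)

      child_options = { default: @default }.merge(options)
      Step.new(node, **child_options).call(&block)
    end

    private

    def enabled?(node)
      case node
      when Hash then true
      when true, false then node
      else @default
      end
    end
  end

  def self.root(config, **options, &block)
    Step.new(config, **options).call(&block)
  end
end

Stepify.root({ partners: { widgets: true } }, default: true) do
  step(:partners, default: false) do
    step(:widgets) { puts "widgets" } # runs: explicitly true
    step(:stats)   { puts "stats" }   # skipped: this branch defaults to false
  end
  step(:cleanup) { puts "cleanup" }   # runs: the root default is true
end
```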

Here is how it looks live: https://repl.it/@kimrgrey/step-by-step-etl-in-ruby-step-complete.

I’m not publishing this code as a separate gem because it’s only 38 lines of code and I believe it can be owned by the project. But maybe I’ll change my mind in the future if the feature set grows. So, let’s stay in touch! ;-) For now, please feel free to just copy / paste it if you find it useful for your own needs.
