Lately I’ve been spending all my evenings and nights on classic ETL process development. We at Spectoos are migrating users and their data from the original version of the app, which has existed for many years, to the completely new version we developed recently, currently in private beta. It’s quite a complicated multistep process, since our new implementation has a completely reworked model under the hood: improved and enriched in comparison to what we had previously.
From the beginning it was decided that once development was more or less finished, we would give some of our users the ability to switch and send us feedback without losing their access to Spectoos 1.0. It was important to keep in mind that, as a consequence, migration was not going to be a one-time process. First, because users who participate in the beta can keep using Spectoos 1.0, and we somehow need to keep their accounts synced. And second, because a migration is never successful on the first attempt. Or at least you should be suspicious if it was :-)
So, in order to deal with it, I developed a separate app that uses the old API of Spectoos 1.0 to fetch data and the fresh Spectoos 2.0 API to push that data into the new system after transformation. This app also saves the state of the migration and the mapping between entities, so that it can run the migration again, if needed, without duplicating data or causing other conflicts. For example, if the partner with ID = 123 from Spectoos 1.0 has already been created in Spectoos 2.0 with ID = 456, the app knows it, and the next time you run the migration for the same partner it performs an update instead of creating one more clone.
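The idea can be sketched like this (a hypothetical illustration, not the actual migration code; `EntityMapping` and the lambdas are my names):

```ruby
# Hypothetical sketch of the idempotency idea: remember which Spectoos 1.0
# entities were already created in Spectoos 2.0, and update them on
# subsequent runs instead of re-creating. In the real app this mapping is
# persisted between runs, not kept in memory.
class EntityMapping
  def initialize(store = {})
    @store = store
  end

  # Returns the Spectoos 2.0 ID for a given Spectoos 1.0 ID, creating the
  # entity only on the first migration run.
  def migrate(old_id, create:, update:)
    if (new_id = @store[old_id])
      update.call(new_id)           # already migrated: update in place
      new_id
    else
      @store[old_id] = create.call  # first run: create and remember
    end
  end
end

mapping = EntityMapping.new
mapping.migrate(123, create: -> { 456 }, update: ->(id) {})  # creates 456
mapping.migrate(123, create: -> { 999 }, update: ->(id) {})  # updates 456, no clone
```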
Many things were quite small and fast to migrate: partners themselves, their widgets and settings, etc. But other information took a lot of time and space. For example, migrating the events we use for building our dashboard can take about an hour for some partners. What I quickly realised after migrating a few clients is that I don’t want every step to run each time I need to adjust something in one place. For instance, if I find out that the code responsible for saving the background color of testimonials in some widgets is broken, it doesn’t mean I want stats to be updated as well. It’s enough to update only the widgets, right? So, how can I do that?
Generally speaking, the problem can be formalised like this:
There is a process which consists of many steps. Each step can also be split into substeps, and so on. We should be able to define this process tree in our code and control which steps are executed and which are skipped.
What I like to do in such cases is define the desired behaviour first. After some thought and experimentation I came to this kind of code:
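The original snippet isn’t preserved here, so below is my reconstruction of what that call might look like, based on the names mentioned in the text (`tree`, `Stepify.root`, `step`). The compressed `Stepify` stand-in at the top only exists to make the sketch runnable on its own:

```ruby
# Compressed stand-in implementation so this sketch runs on its own;
# the rest of the post derives it properly, piece by piece.
module Stepify
  class Step
    def initialize(tree, default: true)
      @tree = tree
      @default = default
    end

    def call(&block)
      instance_eval(&block)
    end

    def step(name, &block)
      subtree = @tree.is_a?(Hash) ? @tree[name] : nil
      enabled =
        case subtree
        when Hash        then true
        when true, false then subtree
        else                  @default
        end
      Step.new(subtree, default: @default).call(&block) if enabled
    end
  end

  def self.root(tree, default: true, &block)
    Step.new(tree, default: default).call(&block)
  end
end

# The interesting part: the tree switches steps on and off, and the
# default: argument of root covers everything the tree doesn't mention.
tree = {
  widgets: {
    background: false  # the broken background-color step: skip it for now
  },
  stats: false         # the slow stats migration: skip it entirely
}

executed = []

Stepify.root(tree, default: true) do
  step(:widgets) do
    step(:settings)   { executed << :settings }    # not in tree, default runs it
    step(:background) { executed << :background }  # false, skipped
  end
  step(:stats) { executed << :stats }              # false, skipped
end

executed  # => [:settings]
```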
The hash stored in the `tree` variable defines which parts of the code should and should not be executed. The default behaviour is specified using the appropriately named keyword argument of the `root` function. Looks pretty clean and nice, if you ask me. And it requires minimal changes to the existing migration code. So, can we make it work? Yep, consider it a kind of exercise in Ruby metaprogramming.
First of all, let’s define our `Stepify` module. It should have a `root` function which accepts a reference to the root of the tree, plus additional arguments, plus a block. In order to support calling the `step` method inside the block without any receiver, we have to execute this block in the scope of some object. So, our `root` method creates this object, passes the received arguments down, and asks it to execute our block.
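The original code isn’t preserved, but based on that description `root` might look roughly like this (a `Step` skeleton is included so the sketch runs on its own; how the block actually gets executed is the subject of the next section):

```ruby
module Stepify
  # root creates the context object, passes the received arguments
  # down, and asks it to execute our block in its scope.
  def self.root(tree, default: true, &block)
    Step.new(tree, default: default).call(&block)
  end

  class Step
    def initialize(tree, default:)
      @tree = tree
      @default = default
    end

    # Placeholder for now: how exactly the block is executed in this
    # object's scope is covered next.
    def call(&block); end
  end
end

Stepify.root({ widgets: true }, default: true) { }  # accepted, not executed yet
```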
Now we can define the `Step` class itself. Basically, it should have two main methods: `call`, which is used to execute the block, and `step`, which can be used in the context of that block.
Let’s start with the simplest thing and define how we execute the block. To make calling `step` possible without any explicit context, we can use `instance_eval` in Ruby. It does exactly that. Here is how it looks:
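A sketch of that `call` method, reconstructed from the description (the class skeleton is repeated so the snippet runs on its own):

```ruby
class Step
  def initialize(tree, default: true)
    @tree = tree
    @default = default
  end

  # instance_eval runs the block with `self` set to this Step instance,
  # so a bare `step(...)` inside the block resolves to Step#step.
  def call(&block)
    instance_eval(&block)
  end

  # Not doing anything useful yet; that's the next thing to fix.
  def step(name, &block); end
end

Step.new({}).call { step(:widgets) { } }  # runs without NameError
```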
Now it’s basically possible to run our test script, because all the required methods are defined and available for execution in the right context. Here is a link to the working example: https://repl.it/@kimrgrey/step-by-step-etl-in-ruby-step-call. The only problem is that the inner context is not executed, since we are not doing anything useful in `step`. So, let’s fix it.
When `step` is called, we get the name of the step in the context of the current level, plus a block to execute.
First things first, we check whether this block should be executed according to our configuration. To do that, we take the appropriate subtree. If it’s a Hash, good: it probably has substeps, so we should execute. It may also be `true` or `false`, directly defining the desired result. If it’s `nil`, not present at all, or anything else, it’s not a big deal: we have our default.
Now, once we’ve decided whether to go further or not, we can create a substep and call it with our block, passing our subtree down as the context for the next steps. Here is a working example of this version: https://repl.it/@kimrgrey/step-by-step-etl-in-ruby-step-call-and-step.
Now what’s left is a few beauty improvements. For example, we can pass additional arguments to a step without actually diving into it, so each substep can have its own defaults. And we can rename a thing or two for the sake of clarity:
Here is how it looks live: https://repl.it/@kimrgrey/step-by-step-etl-in-ruby-step-complete.
I’m not publishing this code as a separate gem, because it’s only 38 lines of code, and I believe it can be owned by the project. But maybe I’ll change my mind in the future if the feature set grows. So, let’s stay in touch! ;-) For now, please feel free to just copy / paste it if you find it useful for your own needs.