Data newsletter

Good Morning Data #4 ☕

JacobJustCoding
4 min read · Sep 23, 2022

Today’s Good Morning Data will focus on ETL testing, LinkedIn’s approach to near real-time personalization, and GitHub’s Arctic Code Vault.

Grab a cup of coffee or any other favourite beverage and start the day with some noteworthy material from the world of data. Enjoy reading 📚

Coffee Mug 3D Illustration by Rakata

[Slawomir Chodnicki] Testing strategies for data integration

Recently, I have been tackling the problem of creating a structured test setup with my team.
ETL code differs from classic application code. That does not make testing impossible, only harder, especially when it comes to planning the test setup.

For this reason, it is crucial to focus on a few basic aspects by which the test environment should be characterised, namely:

  • management of ETL processes configurations,
  • having a dedicated test environment based on a CI server integrated with version control and provisioning, through which the environment can be easily and transparently created and managed,
  • a framework enabling the creation of tests in a structured manner and isolating individual functionalities for testing.

In addition, the granularity of testing should be taken into account, and an appropriate testing strategy selected, in particular:

  • whether unit testing even makes sense for the ETL processes,
  • how to organize test cases so that subsequent changes do not break previous functionality,
  • how integration and functional tests will be implemented,
  • what and when should be covered by stress and performance testing.
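To make the unit-testing point above concrete, here is a minimal sketch of what an isolated ETL test can look like. The transformation (`normalize_customers`) is a hypothetical example invented for illustration, not code from the cited article; the idea is simply that each transformation step is a pure function that can be exercised against small, hand-crafted inputs.

```python
# Minimal sketch of an ETL unit test: isolate one transformation
# step and verify it against a small, hand-crafted input.
# normalize_customers is a hypothetical example transform.

def normalize_customers(rows):
    """Trim whitespace from names and lower-case e-mail addresses."""
    return [
        {"name": r["name"].strip(), "email": r["email"].strip().lower()}
        for r in rows
    ]

def test_normalize_customers():
    raw = [{"name": "  Ada ", "email": "ADA@Example.COM "}]
    assert normalize_customers(raw) == [
        {"name": "Ada", "email": "ada@example.com"}
    ]

if __name__ == "__main__":
    test_normalize_customers()
    print("ok")
```

A test runner such as pytest can discover functions like `test_normalize_customers` automatically, which is one way to keep such cases organised so later changes do not silently break earlier functionality.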

[Felix Klemm] How to overcome the Curse of ETL Regression Testing

A major problem with complex, functionally critical solutions is that changes to their implementation can compromise their stability.

As a result, it is then necessary to reverse the changes that caused the errors, and this rollback can be very costly in terms of time and money.
For this reason, it is necessary to provide a set of regression tests to ensure that well-known functionalities are not affected in the next implementation.

Regression testing means creating tests that defend mainly against the recurrence of known issues.

To carry out regression testing, several principles should be followed, which will simplify the implementation process without jeopardising the stability of the production system.
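A sketch of the idea, under stated assumptions: once a bug has been found and fixed, the input that triggered it is pinned down as a permanent test case. The helper `parse_amount` below is hypothetical, invented for illustration only.

```python
# Sketch of a regression test: the exact input from a past bug
# report is kept as a test case forever, so the fix cannot be
# silently undone. parse_amount is a hypothetical ETL helper.

def parse_amount(value: str) -> float:
    """Parse a monetary amount that may use a comma as the decimal separator."""
    return float(value.replace(",", "."))

def test_regression_comma_decimal_separator():
    # Past bug: European-style "19,99" inputs used to crash the load step.
    assert parse_amount("19,99") == 19.99
```

Because the test encodes the original failing input rather than a generic case, it guards precisely against the known issue recurring in a future implementation.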

[Tutorialspoint] ETL Testing Tutorial

For those who are not yet comfortable with ETL testing, or do not know what it entails, this tutorial is a good introduction: it explains the testing challenges, the available techniques, and what the processes should look like.

[René Bremer] How to build unit tests for Azure Data Factory

As creating unit tests in ADF is not yet straightforward, it is worth considering how to build them so that they meet the basic requirements.
This article covers exactly that: how to start writing unit tests in ADF while following good unit-testing practices.

[LinkedIn] Near real-time features for near real-time personalization

Latency in a recommender system is often crucial to the effectiveness of its suggestions and their relevance to the user at any given moment.

For this reason, in applications requiring fast responsiveness of the recommender system, it is advisable to have a solution which enables the extraction of features based on almost real-time user activity.

This article explains how LinkedIn has approached the problem of personalising recommendations for its users, ensuring the least possible delay in the system.

It is fascinating how a good specification of requirements, changes in system design, and a short development cycle can improve the existing architecture and ultimately better meet user expectations.

[GitHub — Jon Evans] “If you don’t make it beautiful, it’s for sure doomed”: putting the Vault in GitHub’s Arctic Code Vault

As an enthusiast of new technology, I love the idea of the Archive Program on GitHub and especially the Arctic Code Vault realisation.

Those of us working in the IT world daily hardly ever wonder how many new technologies have influenced our civilisational development, because we have simply become accustomed to them.

The realisation of the Arctic Code Vault is a beautiful symbol of remembering part of what our civilisation has achieved. The preservation of a snapshot of multiple repositories, the artistic realisation (the so-called Tech Tree), a snapshot of the entire Wikipedia in five languages, and a data dump of Stack Overflow are something that many programmers would consider an amazing idea.

The concept of the vault surviving 1000 years is undoubtedly fascinating, but I focus less on that than on its symbolic meaning.
Still, who knows whether, in 1000 years, this treasure will serve as a solid record of our IT history?


JacobJustCoding

Data engineering and Blockchain Enthusiast. Love coffee and the world of technology. https://www.linkedin.com/in/jakub-dabkowski/