Good Morning Data #7 ☕

JacobJustCoding
4 min read · Oct 14, 2022

Today’s Good Morning Data will focus on Test Data Management and Container-Managed ETL Applications.

Grab a cup of coffee or any other favourite beverage and start the day with some noteworthy material from the world of data. Enjoy reading 📚

Photo by Pawel Czerwinski on Unsplash

[XenonStack] Test Data Management Tools and Working Architecture — Complete Guide

Test data management (TDM) is a process through which an organisation can provide data of the right quality, in a valid format and at the required time, efficiently and in a supervised manner.
This page is a good starting point for what you need to know about TDM.

[DataKitchen] Build Trust Through Test Automation and Monitoring

Data testing is usually performed in several stages.
In the simplest case, it is divided into three (a short sketch follows the list):

  1. testing the data sources (verifying counts, consistency and validation errors in the source data),
  2. testing the transformation (checking that business rules and consistency are preserved after the data is transformed),
  3. testing the data that is ready to be loaded into the target tables (checking that the output data is consistent).
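
As a rough illustration, here is a minimal sketch of what a check at each of these three stages might look like in Python, assuming pandas DataFrames; the table and column names (order_id, amount) are hypothetical:

```python
import pandas as pd

def test_source(source: pd.DataFrame, expected_count: int) -> None:
    # Stage 1: source checks - row count and basic validation errors.
    assert len(source) == expected_count, "unexpected row count in source"
    assert source["order_id"].notna().all(), "NULL keys in source"

def test_transformation(source: pd.DataFrame, transformed: pd.DataFrame) -> None:
    # Stage 2: business rules must still hold after the transformation.
    assert (transformed["amount"] >= 0).all(), "negative amounts after transform"
    assert len(transformed) == len(source), "rows lost during transformation"

def test_load_ready(output: pd.DataFrame) -> None:
    # Stage 3: output consistency before loading into the target tables.
    assert not output.duplicated(subset=["order_id"]).any(), "duplicate keys in output"
```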

Throughout the process, it is also useful to monitor the current status of testing in the form of notifications at different warning levels, depending on how critical a given error is for the data.
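
A tiny sketch of what such severity-based notifications could look like, using Python’s standard logging module; which checks count as critical is of course an assumption made for illustration:

```python
import logging

# Severity-based notifications for data tests; the mapping of check
# outcomes to warning levels is illustrative.
logger = logging.getLogger("data_tests")

def notify(check_name: str, passed: bool, critical: bool) -> None:
    if passed:
        logger.info("check passed: %s", check_name)
    elif critical:
        # Critical failures should reach the team immediately (e.g. paging).
        logger.error("CRITICAL check failed: %s", check_name)
    else:
        logger.warning("non-critical check failed: %s", check_name)
```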

From the developer’s point of view, simply looking at the data does not always make it possible to determine whether it is correct, owing to a lack of knowledge of specific business aspects. In addition, with large datasets we can only manually analyse the correctness of a given subset of the data.

To this end, it is worth running a basic set of tests as soon as the data is prepared, such as:

  • location balance — checking that the data’s characteristics match business rules at the different stages of data processing,
  • historical balance — historical data that has already been verified as correct provides a good reference point for validating newly loaded data in the development environment,
  • statistical process control — calculating statistics over datasets is a good way to detect errors such as unjustified changes in dataset size, record multiplicity, or significant shifts in statistical values (see the sketch after this list).
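
To make the last two bullets concrete, here is a minimal sketch of a combined historical-balance and statistical-process-control check, assuming per-load statistics computed from earlier loads that were already verified as correct; the load_id column and the 3-sigma threshold are assumptions for illustration:

```python
import pandas as pd

def spc_check(current: pd.DataFrame, history: pd.DataFrame, column: str) -> bool:
    """Flag the new load if its mean drifts more than 3 sigma from verified history."""
    # Historical balance: per-load means of loads already accepted as correct.
    hist_means = history.groupby("load_id")[column].mean()
    # Statistical process control: compare the new load against the historical spread
    # (assumes several verified loads exist, so the std is well defined).
    mu, sigma = hist_means.mean(), hist_means.std()
    return abs(current[column].mean() - mu) <= 3 * sigma
```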

In these tests, automating the statistics calculations is indispensable, since it allows many statistics to be computed at once in a relatively short time. Dedicated tools can help in this respect.
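
For example, pandas can compute a whole table of such statistics in a single call (the columns here are made up for illustration):

```python
import pandas as pd

# Many statistics for every column at once: one table to diff against history.
df = pd.DataFrame({"amount": [10.0, 12.5, 9.9], "quantity": [1, 3, 2]})
stats = df.agg(["count", "mean", "std", "min", "max"])
print(stats)
```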

By carrying out such a basic set of tests, it is possible to ensure:

  1. early error detection (testing layer by layer),
  2. data consistency over time,
  3. compatibility of the data with business rules.

However, to run the tests in parallel and to keep developers productive (they cannot test the data manually all the time), an appropriate automation process must be in place.

[DataKitchen] The DataOps Enterprise Software Industry, 2020

I know this article has been mentioned in a previous issue, but it is a large collection of tool suggestions that is worth revisiting in the context of today’s newsletter topics:

  • Automated Testing and Production Quality and Alerts,
  • Deployment Automation and Development Sandbox Creation.

[Josef Schiefer, Robert M. Bruckner] Container-Managed ETL Applications for Integrating Data in Near Real-Time

Data loading processes in traditional data warehouses have the disadvantage of not always delivering data in a timely manner. Delays occur, the business side falls behind on the reports it has to update, and this ultimately leads to operational problems for companies.

This publication presents an architectural framework for container-managed ETL applications that enables data integration in near real-time.
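
The paper describes its own container framework for ETL applications; the toy sketch below is not that framework, only an illustration of the underlying idea of near real-time integration: continuously processing small micro-batches instead of one nightly bulk load. All function names are hypothetical placeholders:

```python
import time

def run_near_real_time_etl(fetch_new_rows, transform, load_to_warehouse,
                           interval_s: float = 5.0) -> None:
    """Toy micro-batch loop; runs until interrupted."""
    while True:
        rows = fetch_new_rows()                 # extract only the new delta
        if rows:
            load_to_warehouse(transform(rows))  # transform and load the micro-batch
        time.sleep(interval_s)                  # short cycle keeps data latency low
```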

https://www.researchgate.net/publication/221598461_Container-Managed_ETL_Applications_for_Integrating_Data_in_Near_Real-Time

List of articles

If you want to see the complete list of Medium articles on which today’s newsletter was based, the link to it is below:

Test Data Management (5 stories)

Dev Quote ✍️

If debugging is the process of removing software bugs, then programming must be the process of putting them in. — Edsger Dijkstra

Dev Meme 😂


JacobJustCoding

Data engineering and blockchain enthusiast. Loves coffee and the world of technology. https://www.linkedin.com/in/jakub-dabkowski/