Data newsletter

Good Morning Data #6 ☕

JacobJustCoding
8 min read · Oct 7, 2022

Today’s Good Morning Data will focus on DataOps. This field raises plenty of questions, and this collection of articles should clear up the most common doubts at the initial stage of delving into the approach.

Grab a cup of coffee or any other favourite beverage and start the day with some noteworthy material from the world of data. Enjoy reading 📚

Photo by Claudio Schwarz on Unsplash

[Prukalpa] The Rise of DataOps

Due to the ever-increasing variety and complexity of data, traditional data management methods no longer keep up, so new approaches are increasingly being developed to solve these problems.

DataOps is an example of this. Notably, it is not a tool but an organisational culture, one that shapes the mindset of the people working within it so that data teams and the rest of the organisation can work better together.

To familiarise yourself with this concept, it is worth starting by asking what DataOps actually is. Definitions vary, but the one that convinces me most and, in my opinion, best represents the idea is:

DataOps is a collaborative data management practice focused on improving the communication, integration and automation of data flows between data managers and data consumers across an organisation. — Gartner

The second question is where to use it.
As mentioned earlier, DataOps is aimed at organisations whose data teams work with business departments but whose communication is not optimised and whose delivery of data products is ineffective.

The third question is how to implement this in the organisation.
Here it is first necessary to understand the characteristics of DataOps, which rests on four fundamental aspects.

The main messages to be understood within these aspects are:

  • Product Thinking — understanding the purpose of the product (the work the product does) is what you need to meet the expectations of your customers. It sounds trivial, but it is often very difficult. Understanding Data Value Streams and how to use them within Value Stream Mapping helps.
  • Lean — how to optimize processes to maintain quality while eliminating waste and making the data team more efficient?
  • Agile — minimise the time spent on the first iterations of a product by building MVPs. This lets you validate customer expectations as early as possible, adjust based on their feedback, and stay on track during development, avoiding the typical problem of data teams building data products in isolation.
  • DevOps — this is not classic DevOps for software teams, with its release management, operations and monitoring. Rather, it extends DevOps by involving business users in cooperation with data teams so that customers become fully data-driven, i.e. it enables data products to be built end to end, from the raw data sources to fully interactive, self-managing dashboards, reports, etc.

These aspects are like a jigsaw puzzle. Putting them together may not always be easy, but they produce satisfactory results.

The full article is available here:

[DataKitchen] DataOps is NOT Just DevOps for Data

A common way of thinking is to see DataOps as DevOps for data analytics.
However, this is not entirely true.

People, in addition to tools, are essential parts of the data lifecycle. To be effective, DataOps needs to foster collaboration and innovation. To this end, DataOps combines Agile Development with data analytics so that data teams and users can collaborate more effectively and efficiently.

It is worth understanding the differences between DevOps and DataOps, the duality of orchestration and testing, and the sources of DataOps complexity. In detail:

  • the Human Factor
    DataOps is about managing people as much as tools. The two are interconnected and cannot be isolated and managed independently. The stakeholders in DataOps also differ somewhat from those in DevOps, as do their preferences and needs.
  • the difference between processes
    DevOps is a continuous process made up of sub-processes for creating, testing, building, deploying, configuring and monitoring the deployed solutions. It forms an endless loop, because the software life cycle is only complete once the software is archived and taken out of production.
    DataOps, on the other hand, consists of two active and intersecting pipelines:
    - The Value Pipeline delivers value by integrating data from its sources through to the end user.
    - The Innovation Pipeline enables new analytical ideas to be introduced into the existing Value Pipeline.
  • development and deployment processes
    DataOps borrows the DevOps development model. Some stages differ slightly, since the environments differ in scope, but the idea is the same: reduce the time from development to deployment through continuous integration and continuous deployment, while keeping the quality of the data pipelines, and of the data itself, as high as possible.
  • the Duality of Orchestration in DataOps
    Orchestration is carried out twice in the DataOps cycle as there are two main pipelines — Innovation and Value.
    The first oversees the Innovation Pipeline, testing and verifying changes before deployment.
    The second acts as the control unit after deployment, managing the steps of the Value Pipeline, error handling and monitoring.
  • the Duality of Testing in DataOps
    As with orchestration, testing applies to both pipelines, but differently: it targets either the code or the data, depending on which pipeline we are considering.
    In the Value Pipeline, the data is constantly changing, while the analytics (models, transformations, algorithms, etc.) are fixed, with orchestration supervising any change to them. Testing is oriented toward statistical process control (SPC): it catches breaches of statistical thresholds and other data anomalies that degrade quality.
    The Innovation Pipeline is the opposite case: the data is kept constant while the code changes, so that the impact of the change on the data can be analysed. Here testing targets the code, validating the correctness of new analytics before deployment. A passing result should be approved and given the green light for promotion to the next environment.
    Of course, there are also cases where code and data are tested simultaneously. A minimal sketch of this duality follows after this list.
  • DataOps Complexity — Sandbox Management
    Creating a sandbox environment for data teams is not a trivial task, as these teams often use a very diverse set of tools. That diversity makes it difficult to assemble a single toolset capable of providing a complete sandbox package.
  • DataOps Complexity — Test Data Management
    The concept of test data management is a first-order problem for DataOps. The creation of test environments should be automated while ensuring security and meeting data visibility constraints.
  • DataOps Connects the Organization in Two Ways
    A special feature of DataOps is that not only the monitoring teams but also the customers are involved in the Operations area. This enables close cooperation and a fast feedback loop, so the data team can respond to customer requirements more efficiently.
  • Freedom vs Centralization
    A distributed structure in the form of teams working in cooperation with business teams can be extremely valuable but also lead to chaos in managing these teams.
    Centralizing the work of distributed teams around standardized metrics, data quality control and security can prevent these teams from becoming isolated from one another and can increase transparency about the products they deliver within the organization. However, too much centralization limits the ability to develop data products quickly and creatively, so care should be taken.
    DataOps thus makes it possible to take care of the trade-offs on two levels in particular:
    - development/operations
    - distributed/centralized development.
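
To make the duality of testing more concrete, here is a minimal Python sketch. It is my own illustration, not taken from the DataKitchen article, and all function and field names are hypothetical: the Value Pipeline guards changing data with an SPC-style control check while the code stays fixed, and the Innovation Pipeline validates changing code against a frozen data fixture.

```python
# A minimal sketch of the testing duality; the toy "orders" rows and
# every name here (transform, check_value_pipeline_data, etc.) are
# hypothetical, not from the article.
import statistics

def transform(rows):
    """The analytics under test: total order revenue (hypothetical)."""
    return sum(r["amount"] for r in rows)

# Value Pipeline: code is fixed, data flows. An SPC-style test flags
# batches whose row count drifts beyond 3 sigma of recent history.
def check_value_pipeline_data(batch, history):
    mean = statistics.mean(history)
    sigma = statistics.stdev(history)
    if abs(len(batch) - mean) > 3 * sigma:
        raise ValueError(f"Row count {len(batch)} outside control limits")
    if any(r["amount"] is None for r in batch):
        raise ValueError("NULL amounts detected")

# Innovation Pipeline: data is fixed, code changes. A classic unit
# test on a frozen fixture validates new analytics before promotion.
def test_innovation_pipeline_code():
    fixture = [{"amount": 10.0}, {"amount": 5.5}]
    assert transform(fixture) == 15.5

if __name__ == "__main__":
    history = [1000, 980, 1015, 990, 1005]     # recent batch sizes
    batch = [{"amount": 10.0}] * 995           # today's batch
    check_value_pipeline_data(batch, history)  # data test (SPC)
    test_innovation_pipeline_code()            # code test
    print("Both pipelines pass their respective tests")
```

In a real setup these two checks would live in different places: the data test inside the production orchestrator, the code test in the CI job that gates promotion to the next environment.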

[Luis Velasco] Dawn of DataOps: Can We Build a 100% Serverless ETL Following CI/CD Principles?

Surely many developers building ETL processes with GUI tools have dreamt more than once that the life cycle of their projects could be supervised like classic software: automatic releases and testing instead of ticking checkboxes when switching between environments.

The world increasingly praises GUI-based tools because they are easier to use, but as developers we should not dismiss a language or tool that supports our work as worse simply because it is hard to learn. That is the nature of a developer’s job: the more difficult something is, the more opportunity there is to learn, because you have to dig deeper and grow in that area.

It is important to remember that analytics is also code, and it is worth striving to care for analytical software the way we care for classic software.

The article shows, in a simple way, how tools such as dbt help solve the problems mentioned above.
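
As a rough illustration of treating analytics as code, here is a sketch of my own, not from the article. It mimics dbt’s model-plus-test idea using only Python’s standard library; the model SQL and the table names are hypothetical. Because everything is a plain file, it can live in version control and run in any CI job:

```python
# "Analytics is also code": build a SQL model against an in-memory
# SQLite database, then run version-controlled assertions on it.
import sqlite3

MODEL_SQL = """
CREATE VIEW revenue_per_customer AS
SELECT customer_id, SUM(amount) AS revenue
FROM orders
GROUP BY customer_id;
"""

def build_and_test():
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE orders (customer_id INT, amount REAL)")
    con.executemany("INSERT INTO orders VALUES (?, ?)",
                    [(1, 10.0), (1, 5.0), (2, 7.5)])
    con.executescript(MODEL_SQL)  # "build" the model

    # Assertions a CI job could run on every commit, instead of
    # clicking through a GUI when switching between environments.
    rows = dict(con.execute("SELECT * FROM revenue_per_customer"))
    assert rows == {1: 15.0, 2: 7.5}, "unexpected aggregation result"
    nulls = con.execute(
        "SELECT COUNT(*) FROM revenue_per_customer WHERE revenue IS NULL"
    ).fetchone()[0]
    assert nulls == 0, "revenue must never be NULL"

if __name__ == "__main__":
    build_and_test()
    print("model built and tests passed")
```

The point is not the tooling but the principle: when the transformation and its tests are plain code, promotion between environments can be automated rather than clicked through.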

[DataKitchen] What is DataOps — Ten Most Common Questions

Without going into too much detail: to get a grasp of the basic issues in DataOps, it is worth reading this article, which answers the most common questions about the approach.

[DataKitchen] The DataOps Enterprise Software Industry, 2020

Growing organizational interest in DataOps has resulted in a thriving vendor environment.

Even though the post is a couple of years old, I still think it is worth looking at the tools mentioned; that is not a long time, given that DataOps is only just starting to emerge as a prevailing culture in organisations.

In this article, many tools are mentioned under different categories of key DataOps components for:

  • Data Pipeline Orchestration,
  • Automated Testing and Production Quality and Alerts,
  • Deployment Automation and Development Sandbox Creation,
  • Data Science Model Deployment.

In addition to the basic components listed above, various other modules play an important role in the DataOps ecosystem, and how well a platform provides for them is extremely important.

List of articles

If you want to see the full list of articles on which today’s newsletter was based, the link to it is below:

DataOps (a list of 16 stories)

Dev Quote ✍️

If you can get today’s work done today, but you do it in such a way that you can’t possibly get tomorrow’s work done tomorrow, then you lose. - Martin Fowler

Dev Meme 😂


JacobJustCoding

Data engineering and Blockchain Enthusiast. Love coffee and the world of technology. https://www.linkedin.com/in/jakub-dabkowski/