Data newsletter
Good Morning Data #2 ☕
--
Today’s Good Morning Data focuses on Azure Data Factory and the Velox execution engine.
Grab a cup of coffee or any other favourite beverage and start the day with some noteworthy material from the world of data. Enjoy reading 📚
Data in MAANG
[META] Introducing Velox: An open-source unified execution engine
Nowadays, companies run an ever-increasing number of workloads such as analytics, real-time processing, transactional, monitoring, and machine learning, the latter being the biggest consumer of data systems.
The typical approach has been to use a specialised processing engine for each type of workload (Spark, Presto, Scuba, Cubrick, etc.).
However, the downside of this approach is limited reusability and the inconsistencies it exposes to the end user.
What these tools do have in common is a layered architecture:
- Front-end,
- IR (intermediate representation),
- Optimizer,
- Execution Runtime,
- Execution Engine.
The Execution Engine layer stands out in particular as being similar across these tools, which led Meta to conclude that it could be consolidated into a single engine.
This is how they created Velox, an engine (still under development) which accelerates data management primarily by providing:
- a shared library,
- common data processing APIs,
- integration into existing data management systems.
For those who are interested in more details, I encourage you to read this thesis:
[Guang X] Lessons learned from Azure Data Factory
ADF keeps gaining popularity. After using it for some time, its advantages and disadvantages become apparent, and that is exactly what this article is about: the lessons learned.
[Davide Mauri] Azure SQL Managed Instances and Azure Data Factory: a walk-through
To execute or dispatch the activities defined in pipelines, compute infrastructure is needed. In ADF, that compute power is provided by an Integration Runtime.
In general, there are three types of Integration Runtime in Azure Data Factory: Azure, Self-Hosted, and Azure-SSIS.
The typical one is the Azure Integration Runtime, which is used primarily to move data between public clouds.
However, when data needs to be moved from on-premise environments to the cloud, the Self-Hosted Integration Runtime is required, as it is installed inside the on-premise network and can reach local data sources.
This article is a brief walk-through of how to use ADF to move data from an on-premise data source to an Azure SQL Managed Instance.
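As a rough illustration (not taken from the article), here is a minimal sketch of how a Self-Hosted Integration Runtime could be registered with the azure-mgmt-datafactory Python SDK. The subscription, resource group, factory and IR names below are placeholders, and the IR node itself still has to be installed and registered on an on-premise machine afterwards.

```python
# Hypothetical sketch: registering a Self-Hosted Integration Runtime definition
# in a data factory. All resource names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeResource,
    SelfHostedIntegrationRuntime,
)

SUBSCRIPTION_ID = "<subscription-id>"   # placeholder
RESOURCE_GROUP = "<resource-group>"     # placeholder
FACTORY_NAME = "<data-factory-name>"    # placeholder

client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Create the Self-Hosted IR definition; the agent on the on-premise machine
# is installed and linked to this IR in a separate step.
ir = client.integration_runtimes.create_or_update(
    RESOURCE_GROUP,
    FACTORY_NAME,
    "OnPremSelfHostedIR",
    IntegrationRuntimeResource(
        properties=SelfHostedIntegrationRuntime(
            description="Bridges the on-premise network and Azure"
        )
    ),
)
print(ir.name, ir.properties.type)
```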
[Hashmap] Make The Most Of Your Azure Data Factory Pipelines
To use ADF in the best possible way in terms of functionality, security and performance, it is worth following best practices. Of course, like any tool, it has its limitations.
That is what this short post is about: some best practices and thoughts on the limitations of ADF.
MEDIUM MEMBER-ONLY ARTICLES
[Adam Bertram] Getting Started with Azure Data Factory
ADF is a service that may take some time to understand fully, but that should not stop us from practising with it, as the Azure interface makes it easy to get started.
This tutorial shows you how to create an Azure Data Factory instance and use the Copy Data Tool to pull data from an HTTP data source and load it into Azure Data Lake Storage Gen2.
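For a flavour of what the first step looks like outside the portal, here is a hedged sketch (assuming the azure-mgmt-datafactory Python SDK and placeholder resource names) of provisioning the Data Factory instance; the Copy Data Tool part of the tutorial is then driven from the ADF UI.

```python
# Hypothetical sketch: provisioning an Azure Data Factory instance.
# Subscription, resource group and factory names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Create (or update) the factory in the given resource group and region.
factory = client.factories.create_or_update(
    "<resource-group>",
    "<data-factory-name>",
    Factory(location="westeurope"),
)
print(factory.name, factory.provisioning_state)
```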
[René Bremer] How to manage Azure Data Factory from DEV to PRD
The article describes how to manage the DTAP (Development, Testing, Acceptance, Production) cycle for ADF, with an indication of its limitations.
Book of the Month
[Alex Gorelik] The Enterprise Big Data Lake: Delivering the Promise of Big Data and Data Science
The book is an introduction to the Data Lake concept. It describes the context in which the Data Lake emerged, the maturation cycle from Data Puddle through Data Pond and Data Lake to Data Ocean, and the architecture and relevant aspects of implementing a data lake in an organisation.