DATA NEWSLETTER

Good Morning Data #9 ☕

5 min readOct 28, 2022

Today’s Good Morning Data will focus on distributed systems issues, consistency models, and optimizing DWH storage by Netflix.

Grab a cup of coffee or any other favourite beverage and start the day with some noteworthy material from the world of data. Enjoy reading 📚

Data in MAANG

[Netflix] Optimizing data warehouse storage

In Netflix, due to the enormous amount of growing data, the optimization of storage and data processing is very important in terms of increased performance, reduced complexity of the data processing infrastructure, and costs related to architecture.
To this end, the AutoOptimize tool has been developed with the concept of centralizing the optimization of the improvements made.
AutoOptimize is based mainly on several primary operations:

merge — consolidating many small files loaded to the data warehouse from real-time systems into larger, better-organized ones, enabling faster data retrieval and storage savings,
sorting — earlier sorting of data collected on partitions supports downstream querying,
compaction — a collection of delta files in Iceberg to speed up read operations,
metadata management — separation of logical and physical partitioning by mapping file locations allows faster retrieval and querying of data on partitions.

These operations are performed based on predefined principles.

AutoOptimize has been divided into subsystems to enable management and decision-making on optimization independently:

Autotune — has decision-making and executive role in optimization issues,
Iceberg — is based on Apache Iceberg, where the facilities for snapshots and atomic operations are used in a scalable manner,
Autoanalyze — a module based on heuristics for the best possible adaptation of optimization parameters.

In addition, the operations related to file merging and the improvement of AutoOptimize results are well described concisely in the article.

Optimizing data warehouse storage

By Anupom Syam

netflixtechblog.com

[Denise Yu] Why are Distributed Systems so hard? A network partition survival guide

A hilarious talk on the reliability considerations of distributed systems. It is not a very technical talk but interestingly discusses aspects of the CAP Theorem, such as:

the wide range of definitions of consistency and the association with linearisation,
availability as a reference to the acceptable threshold of network latency. When talking about availability, it is impossible not to take into account the network's delay because that allows us to see if our node is not responding or is responding very slowly, which in the end may mean that it is lost and may affect the availability of the system,
partition tolerance as one of those assumptions of the CAP Theorem which must be fulfilled to speak of a distributed system at all, the P provision is then realized at the expense of the C, i.e. we make it possible to read from nodes whose data is not compatible with each other (AP case) or we sacrifice availability and make it impossible to write to a node on the partition side of a non-responding node until the partition disappears (CP case).

In the end, how RabbitMQ and Kafka-based partition issues are resolved was mentioned. This boils down to partition detection and then taking a strategy to resolve the issue.
In the case of RabbitMQ, these are coping strategies:

pause minority,
pause-if-all-down,
auto-heal,
ignore.

On the other hand, in the case of Kafka, which has a different way of detecting and solving the problem, these are consensus methods of the Raft and Paxos type.

https://www.youtube.com/watch?v=uTJvMRR40Ag

[Jepsen.io] Consistency Models

It is an interactive map for understanding the relationships between consistency models in concurrent systems.

Consistency Models

Jepsen analyses the safety properties of distributed systems-most notably, identifying violations of consistency…

jepsen.io

[DhineshSunder Ganapathi] Split brain in distributed systems

In a distributed system organized according to the leader-follower (master-slave) architecture, a situation in which contact with the leader server is lost causes difficulties in communication with other nodes (followers). The reason for this can be a failure of the server or communication between nodes, which is called network partition. The result can be a discrepancy in the same data or competition for resources.

In addition, it is sometimes extremely difficult to determine whether a leader has died or is just experiencing a temporary delay/complexity caused by its hardware/software or by the operation of the network.

It is also common practice in HA clusters to use a private heartbeat network for diagnostic purposes to monitor the status of nodes in the cluster. Of course, this does not solve all problems because what if all private links fail? Then each node thinks it is the only one and serves its clients, but in fact, this leads to the creation of many different datasets due to insert/update operations performed on it.

In any case, a new leader must usually be elected. However, what if we have misdiagnosed the death of a leader and a new leader has already been elected, and the old leader comes back to life (zombie leader) after the election of a new one?
This is the state of the distributed system called the split brain.

In this case, there are different approaches to solving this problem, which are mainly dependent on the requirements, the purpose of the system, and the origin of the problem.

If the requirement is high availability of the system, then this should be maintained at the expense of data consistency. This means that out-of-date calls will work as they did before.
Once the partition is gone, there will be a reconciliation process to make the system state consistent.

Otherwise, i.e. when consistency is necessary to be maintained at the expense of availability, unresponsive nodes on the partition side that are not responding to requests are isolated. Calls are unreachable until the partition is resolved.

Another solution is to rely on the epoch number, which is an incremented number indicating the generation of the server. Then, in each request sent to the followers, this number is included, and the followers are thus able to distinguish which leader is sending the real information about being a leader.

Split brain in distributed systems

In a distributed environment with a central (or leader) server, if the central server dies, the system must quickly…

medium.com

[Martin Kleppmann] Distributed Systems lecture series

For those who like to watch short films with their coffee, I recommend this series of videos as a general overview of different aspects regarding distributed systems.

https://www.youtube.com/watch?v=UEAMfLPZZhE&list=PLeKd45zvjcDFUEv_ohr_HdUFe97RItdiB

Dev Quote ✍️

Programming is the art of telling another human being what one wants the computer to do. ― Donald Ervin Knuth