DATA NEWSLETTER

Good Morning Data #10☕

JacobJustCoding
3 min read · Nov 4, 2022

Today’s Good Morning Data will focus on web scrapers, their architectures at scale, and other tips for creating them.

I have recently started a project that processes many websites at the same time, so the question of how to scale a scraper feels especially relevant to anyone working on something similar: sooner or later you reach the point where responsible management of a distributed architecture is essential for reliability, security, and good performance.

Grab a cup of coffee or any other favourite beverage and start the day with some noteworthy material from the world of data. Enjoy reading 📚

Photo by Ales Nesetril on Unsplash

[Mobigesture] A Reliable web scraping Robot — Architectural Insights

This is a good illustrative introduction to breaking a web scraper's architecture down by responsibility: collecting data with a definable number of spiders, analyzing that data, and processing the collected data together with the analysis results.
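To make that split concrete, here is a minimal sketch of the separation of responsibilities in Python; the function names and the thread-pool setup are my own assumptions, not code from the article.

```python
# A minimal sketch of the responsibility split described above; the function
# names and the thread pool are illustrative assumptions, not the article's code.
from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup


def collect(url: str) -> str:
    """Spider stage: fetch the raw HTML for a single URL."""
    return requests.get(url, timeout=10).text


def analyze(html: str) -> dict:
    """Analysis stage: extract the fields of interest from the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return {"title": soup.title.string if soup.title else None}


def process(record: dict) -> None:
    """Processing stage: persist or forward the analyzed record."""
    print(record)


urls = ["https://example.com"]
with ThreadPoolExecutor(max_workers=4) as pool:  # a definable number of "spiders"
    for html in pool.map(collect, urls):
        process(analyze(html))
```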

[ScrapeHero] Scalable Large Scale Web Scraping — How to build, maintain and run scrapers

Building a large-scale web scraping pipeline depends on several factors that must be considered in the initial phase.

First of all, it is essential to consider what kind of pages will be analyzed, as this primarily determines the choice of tool or framework: the key question is how complex the pages are and how much logic (for example, JavaScript rendering) is embedded in them.
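As a rough illustration of that decision, a static page can be handled with requests and BeautifulSoup, while a page that builds its content with JavaScript usually needs a headless browser; the snippet below is only a sketch of that distinction.

```python
# Static HTML: plain requests + BeautifulSoup is usually enough.
import requests
from bs4 import BeautifulSoup


def scrape_static(url: str) -> list[str]:
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    return [a.get("href") for a in soup.find_all("a")]


# JavaScript-heavy page: render it first with a headless browser, then parse
# the resulting DOM (Playwright shown here as one possible choice).
# from playwright.sync_api import sync_playwright
# with sync_playwright() as p:
#     browser = p.chromium.launch()
#     page = browser.new_page()
#     page.goto(url)
#     html = page.content()
```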

Scraping millions of pages requires a distributed architecture: several servers on which the scrapers run.
To coordinate their operation and communication, a message broker such as Kafka is used to distribute both the URLs to be scraped and the data obtained as a result of the scraping.
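A minimal sketch of that setup with kafka-python might look like the following; the topic names, broker address, and consumer group are my assumptions, not taken from the article.

```python
# Workers consume URLs from one topic and publish scraped pages to another.
import json

import requests
from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
consumer = KafkaConsumer(
    "urls",                       # queue of URLs waiting to be scraped
    bootstrap_servers="localhost:9092",
    group_id="scrapers",          # lets several workers share the topic
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    url = message.value["url"]
    html = requests.get(url, timeout=10).text
    # Publish the result so downstream consumers can parse and store it.
    producer.send("scraped", {"url": url, "html": html})
```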

For data storage, NoSQL databases such as MongoDB are common, as are cloud-hosted databases. Note, however, that the volume of data is enormous and grows very quickly, so sharding and replication may be advisable to ensure adequate performance and reliability.
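For illustration, storing a scraped page in MongoDB with pymongo could look roughly like this; the database, collection, and index choices are assumptions of mine.

```python
from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["scraping"]["pages"]

# A unique index on the URL avoids storing the same page twice and is also a
# natural shard-key candidate if the collection is sharded later.
collection.create_index([("url", ASCENDING)], unique=True)

# Upsert so re-scraping a page overwrites the previous snapshot.
collection.update_one(
    {"url": "https://example.com"},
    {"$set": {"html": "<html>...</html>", "scraped_at": "2022-11-04"}},
    upsert=True,
)
```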

It is also necessary to properly manage IP rotation and proxies to avoid triggering the anti-scraping tools that block scrapers.
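A simple way to rotate proxies with requests is sketched below; the proxy addresses are placeholders, and production setups usually draw them from a managed proxy pool.

```python
# Cycle through a pool of proxies so consecutive requests come from
# different IP addresses.
import itertools

import requests

proxies = itertools.cycle([
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
])


def fetch(url: str) -> str:
    proxy = next(proxies)  # a different proxy for each request
    return requests.get(
        url,
        proxies={"http": proxy, "https": proxy},
        timeout=10,
    ).text
```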

[ZenRows — Ander] Distributed web crawling made easy: system and architecture

This is a good guide explaining how to use the Celery task queue together with Redis to build a distributed web scraper.
It includes code snippets, so you can see how it works if you want to implement it straight away.
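To give a flavour of what the guide builds, a Celery task backed by Redis might look roughly like this; the broker URLs, rate limit, and task body are simplified assumptions rather than the article's exact code.

```python
import requests
from celery import Celery

# Redis acts both as the message broker and as the result backend.
app = Celery(
    "crawler",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/1",
)


@app.task(rate_limit="10/m")  # throttle requests per worker
def crawl(url: str) -> int:
    """Fetch a single URL and return its HTTP status code."""
    response = requests.get(url, timeout=10)
    return response.status_code

# Enqueue work from anywhere:  crawl.delay("https://example.com")
# Run workers with:            celery -A <module with this app> worker --concurrency=4
```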

[ZenRows — Ander] Web Scraping in Python: Avoid Detection Like a Ninja

Since scraping websites, and in particular many URLs belonging to the same service, can get our IP blocked by a bot-detection service, it is crucial to understand both the reasons for such detections and the workarounds for avoiding them.
These include (a short sketch of the first two items follows the list):

  • proxy rotation,
  • user-agent rotation,
  • a complete set of headers,
  • cookies,
  • headless browsers,
  • geoblocking,
  • detection of behavioral patterns,
  • CAPTCHAs,
  • login walls and paywalls.
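As a small sketch of the first two items, rotating the user-agent and sending a fuller, browser-like header set with requests could look like this; the user-agent strings and header values are illustrative assumptions.

```python
# Pick a random user-agent per request and send browser-like headers so the
# traffic looks less like a bare HTTP client.
import random

import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:106.0) Gecko/20100101 Firefox/106.0",
]


def fetch(url: str) -> requests.Response:
    headers = {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Referer": "https://www.google.com/",
    }
    return requests.get(url, headers=headers, timeout=10)
```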

Of course, these bypasses should be used for legitimate scraping purposes only, not for malicious activity.

Dev Quote ✍️

“In some ways, programming is like painting. You start with a blank canvas and certain basic raw materials. You use a combination of science, art, and craft to determine what to do with them.” — Andrew Hunt

Dev Meme 😂

https://devhumor.com/media/never-used-dark-mode


JacobJustCoding

Data engineering and Blockchain Enthusiast. Love coffee and the world of technology. https://www.linkedin.com/in/jakub-dabkowski/