During the past several years at Re Analytics we've spent a lot of time finding best practices for web scraping, to make it scalable and efficient to maintain. It's a cat-and-mouse game: you always need to stay on top of the latest developments, yet the information you need is scattered thinly across the net. For this reason, we started to centralize all the information we collected and the best practices we developed, to build a point of reference for the Python web scraping community. Feel free to contribute to this repository: sharing knowledge with each other boosts its value for everyone.
Our goal is to scrape as many sites as we can, so we've always looked for these key elements of a successful large-scale web scraping project. At the moment they focus on scraping e-commerce websites, because that's what we've done for years, but we're open to integrating best practices from other industries.
- Resilient execution: we want the code to be as low-maintenance as possible.
- Faster maintenance: we work smarter when we adopt standard solutions and don't have to decode creative one-off implementations every time.
- Regulatory compliance: web scraping is a serious matter, and we need to know exactly which tools are used.

The following practices are always evolving, so feel free to suggest yours.
Perform a technology stack evaluation of the target website using the Wappalyzer Chrome extension, paying particular attention to the "Security" block. When a technology is detected under the "Security" section, verify whether this list of technologies includes a specific solution for it.
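If you'd rather run this check from code, the python-Wappalyzer package applies the same detection rules as the extension. A minimal sketch (the target URL is a placeholder):

```python
# pip install python-Wappalyzer
from Wappalyzer import Wappalyzer, WebPage

wappalyzer = Wappalyzer.latest()                       # load the latest rule set
webpage = WebPage.new_from_url("https://example.com")  # placeholder target
technologies = wappalyzer.analyze_with_categories(webpage)

# Flag anything classified under the "Security" category
for tech, info in technologies.items():
    if "Security" in info.get("categories", []):
        print("Security technology detected:", tech)
```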
Does the website have internal or public APIs for fetching price/product data? If so, this is the best scenario available, and we should use them to gather data.
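Internal APIs are usually spotted in the browser DevTools "Network" tab while browsing the catalogue. A sketch of how calling one might look; the endpoint, parameters, and response shape here are all hypothetical:

```python
import requests

# Hypothetical internal catalogue endpoint
API_URL = "https://www.example-shop.com/api/catalog/products"

response = requests.get(
    API_URL,
    params={"category": "shoes", "page": 1, "pageSize": 48},  # assumed parameters
    headers={"Accept": "application/json"},
    timeout=30,
)
response.raise_for_status()
for product in response.json().get("products", []):  # assumed response shape
    print(product.get("sku"), product.get("price"))
```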
Sometimes websites embed JSON in their HTML even when there's no API. Finding it improves the scraper's stability.
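Two common places to look are schema.org JSON-LD blocks and the page-state script that frameworks like Next.js emit. A minimal sketch, assuming a page you already fetched and saved locally:

```python
import json
from bs4 import BeautifulSoup

with open("product_page.html", encoding="utf-8") as f:  # placeholder: page fetched earlier
    soup = BeautifulSoup(f.read(), "html.parser")

# schema.org JSON-LD blocks are common on e-commerce product pages
for tag in soup.find_all("script", type="application/ld+json"):
    data = json.loads(tag.string)
    if isinstance(data, dict) and data.get("@type") == "Product":
        offers = data.get("offers", {})
        print(data.get("name"), offers.get("price"))

# Next.js sites often ship the full page state in this script tag
state_tag = soup.find("script", id="__NEXT_DATA__")
if state_tag:
    state = json.loads(state_tag.string)
```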
How does the website handle pagination of the product catalogue? Internal services that return only the HTML of the catalogue are preferable to loading the full page code; see the sketch below.
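A sketch of paging through such a fragment endpoint until it runs dry; the URL and the markup in the XPath are hypothetical:

```python
import requests
from lxml import html

# Hypothetical fragment endpoint, spotted in DevTools when clicking
# "next page" on the catalogue
FRAGMENT_URL = "https://www.example-shop.com/catalog/fragment"

page = 1
while True:
    resp = requests.get(FRAGMENT_URL, params={"page": page}, timeout=30)
    resp.raise_for_status()
    if not resp.text.strip():        # empty fragment: past the last page
        break
    tree = html.fromstring(resp.text)
    names = tree.xpath('//div[@class="product-card"]/h2/text()')  # assumed markup
    print(f"page {page}: {len(names)} products")
    page += 1
```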
Use JSON if available (embedded in the page HTML or returned by an API): it's less prone to changes.
Use XPath, not CSS selectors, to get clearer code.
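One reason is that XPath can express structural conditions CSS selectors cannot. A small self-contained example with lxml, using made-up markup:

```python
from lxml import html

snippet = """
<div class="product-card">
  <h2>Running shoe</h2>
  <del><span class="price">99.90</span></del>
  <span class="price">79.90</span>
</div>
"""
tree = html.fromstring(snippet)
# "a price that is NOT inside a struck-through (old) price element"
current = tree.xpath('//span[@class="price"][not(ancestor::del)]/text()')
print(current)  # ['79.90']
```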
Don't add rules for cleaning prices or other numeric fields in the scraper: formats vary across countries and are not standardized, so leave this task to the post-scraping phases in the databases.
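To see why in-scraper parsing is risky, note that the same raw string maps to different numbers depending on the locale; the values here are only illustrative:

```python
# "1.299" is one thousand two hundred ninety-nine in Italy or Germany,
# but one point two nine nine in the US or UK. Parsing it inside the
# scraper silently bakes one interpretation into the data.
raw_price = "1.299,00 €"

# Safer: store the raw string plus context, and normalize downstream.
record = {
    "price_raw": raw_price,   # untouched, as scraped
    "country": "IT",          # context needed for later normalization
}
print(record)
```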
Load as few pages as you can. Check whether all the fields you need are already available on the product catalogue pages, and try to avoid entering each single product page.
One of the most basic actions a target website can take against web scraping is banning IPs that make too many requests in a given timeframe. Since web scraping must not interfere with the website's functionality and operations, if this is happening to your scrapers you might consider splitting execution across several machines or routing it via proxies. Nowadays there are plenty of proxy vendors on the market, with proxies for every need; we'll go in-depth in this section. A minimal sketch of routing requests through a proxy follows.
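The requests library accepts a proxies mapping on every call; the endpoint and credentials below are placeholders for whatever your proxy vendor provides:

```python
import requests

# Placeholder credentials and endpoint from your proxy vendor
proxies = {
    "http": "http://user:pass@proxy.example.com:8000",
    "https": "http://user:pass@proxy.example.com:8000",
}

resp = requests.get(
    "https://www.example-shop.com/",  # placeholder target
    proxies=proxies,
    timeout=30,
)
print(resp.status_code)
```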
Beyond simple IP bans, many websites deploy dedicated anti-bot solutions. The most common ones you'll encounter are:
- Akamai
- Cloudflare
- Datadome
- PerimeterX
- Kasada
- F5 Shape Security
- Forter
- Riskified
- Passive fingerprinting techniques, such as TLS and HTTP header fingerprinting
- Browser fingerprinting techniques, such as canvas, WebGL, and font fingerprinting
Here's a list of websites where you can test your scraper and find out how many checks it passes:
- https://bot.incolumitas.com/ one of the most complete sets of tests for your scrapers
- https://pixelscan.net/ checks your IP and your machine
- https://bot.sannysoft.com/ another great list of tests
- https://abrahamjuliot.github.io/creepjs/ a set of fingerprinting tests
- https://fingerprintjs.com/products/bot-detection/ the page about BotD, a JavaScript bot-detection library included in Cloudflare, where you can also test your configuration