What the heck is Web Scraping?
Web scraping is a technique for extracting data from a website: you write a script (a bot) that fetches the data automatically, without you having to do anything.
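To make this concrete, here's a minimal sketch of what such a bot can look like in Python, using the requests and BeautifulSoup libraries (the URL and the CSS selector are made up for the example):

```python
# Minimal scraping sketch: fetch a page and pull data out of its HTML.
# The URL and the CSS selector are hypothetical, for illustration only.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/companies")
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Collect the text of every element matching the (made-up) selector.
company_names = [tag.get_text(strip=True) for tag in soup.select(".company-name")]
print(company_names)
```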
The subject is well known among startups' (digital) marketing teams, driven by the rise of Growth Hacking. For a few years now, everyone has wanted to scrape data for everything.
Make no mistake: everyone extracts data from the web. Among others:
- The startup that wants to enrich its business database or feed its CRM
- The company that wants to monitor its competitors
- The big corporate that analyzes its market penetration
- The SME that sells a product requiring web data
Problem is, most websites, if not all (shall we say 99%?), do not offer direct access to their data. They don't offer an Application Programming Interface (API). Worse, some websites offer badly designed APIs that force us to scrape their content to get everything. What a shame!
Web scraping and hacking
At Captain Data, people often ask us about legality, GDPR, or technical feasibility:
- Legality alone would deserve an entire article :) In short: yes, it's legal (or at least, it's not illegal). Still, you need to behave and respect copyright, for example.
- As for GDPR, no problem: we only deal with business data. We're sometimes asked to extract sensitive information, but in that case we save it directly in our client's database.
- That leaves the most interesting point, technical feasibility: there are more and more scraping services and solutions, and therefore more and more solutions to protect against scraping.
There's a parallel we often like to draw between hacking and scraping: in the end, it's all a matter of means.
Anti-Scraping: a matter of means?
When you learn the basics of IT security, you're often told that no system is perfect.
The more you invest in security solutions, the more you can hope to be protected - provided your employees don't leave their passwords on a small piece of paper!
It's the same for scraping: the more you invest in protection, the better your chances of spotting bots. But remember that it's all about risk management: what (reasonable) bot detection rate are you comfortable with?
However, even when a system seems truly uncrackable from a hacking point of view, we have always managed to deliver our bots.
Protection techniques rely on two major factors:
- The digital footprint
- Machine learning and statistics
Digital Footprint
When you surf the web, you leave what's called a digital footprint: a set of parameters you accumulate over time, such as cookies, your IP address, your browser settings, etc.
This footprint helps determine whether there's a human behind the screen or a bot harvesting the website. Bad news for website owners: recent technologies, including Headless Chrome (a "simplified" version of Google Chrome), make it extremely easy to reproduce these parameters.
In other words: pretending to be human is quite possible, and quite easy, for a bot.
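To illustrate, here's a minimal sketch using Playwright to drive Headless Chrome with human-looking parameters; the user agent, viewport, locale, timezone, and URL below are placeholder values, not a guaranteed recipe:

```python
# Sketch: drive Headless Chrome while reproducing the parameters a real
# user would expose. All values here are placeholders for illustration.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context(
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/120.0.0.0 Safari/537.36",
        viewport={"width": 1366, "height": 768},
        locale="en-US",
        timezone_id="Europe/Paris",
    )
    page = context.new_page()
    page.goto("https://example.com")
    print(page.title())
    browser.close()
```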
Machine Learning
Machine learning makes it possible to build high-quality anti-scraping solutions. In short, companies harvest web data (Big Data) to build behavioral detection patterns.
Statistical analysis (number of IPs, number of sessions per IP, extraction speed, etc.) makes the extraction process much more complicated.
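To give an idea of that statistical angle, here's a toy sketch of a rate-based detector that flags an IP sending too many requests in a short window. The threshold and window size are arbitrary, and real solutions combine many more signals:

```python
# Toy sketch of rate-based bot detection: flag an IP whose request count
# exceeds a threshold within a sliding time window. The threshold and
# window values are arbitrary, for illustration only.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 10
MAX_REQUESTS_PER_WINDOW = 20

requests_by_ip = defaultdict(deque)

def looks_like_a_bot(ip, now=None):
    """Record a request from `ip` and return True if its rate is suspicious."""
    now = time.time() if now is None else now
    window = requests_by_ip[ip]
    window.append(now)
    # Drop timestamps that fell outside the sliding window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    return len(window) > MAX_REQUESTS_PER_WINDOW
```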
There are a few machine-learning-based anti-scraping solutions on the market. When you have to counter one, things can get pretty hard :).
"Fortunately", as in hacking, integrating such technical solutions relies on the expertise of men and women, and is therefore prone to error. And it doesn't take much, just one small detail: a page that isn't protected properly, a piece of internal API exposed when it shouldn't be, etc. And boom, the door is left open for a smart developer.
As scraping experts, what we bring to the table is easy to grasp: we're used to finding the small hidden door(s).
One might think that the company that spends the most will have the last word. In theory, that's true. In practice, however, no protection is unbreakable.
Web scraping as an innovation lever
We're not telling you not to protect yourself! Products like Cloudflare bring much more to the table than anti-scraping protection alone.
We just don't share the same vision of the web :)
At Captain Data, we see scraping as a means to innovate:
- It allows you to enrich a database and maximize its value
- You can create new ways to do business and attract customers
- It facilitates the creation of innovative products
- And finally, it lets you modernize old application layers (legacy software)
Rather than trying to protect your data, you're better off thinking about how you can use it to foster innovation or enhance your offer.
You don't have an API? Build one!
You don't know how? We can help.
Building an API is too expensive? You can always ask us to scrape your application to create one on the fly, and make a smooth transition by modernizing your infrastructure at a lower cost.
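As a sketch of that "API on the fly" idea, here's a minimal Flask service that scrapes a (hypothetical) legacy page on each request and re-exposes the data as JSON; the URL and selector are placeholders:

```python
# Minimal "API on the fly" sketch: a Flask endpoint that scrapes a legacy
# page and serves the result as JSON. URL and selector are placeholders.
import requests
from bs4 import BeautifulSoup
from flask import Flask, jsonify

app = Flask(__name__)

LEGACY_URL = "https://legacy.example.com/products"

@app.route("/api/products")
def products():
    html = requests.get(LEGACY_URL).text
    soup = BeautifulSoup(html, "html.parser")
    # Turn the legacy HTML into clean JSON records.
    items = [row.get_text(strip=True) for row in soup.select(".product-row")]
    return jsonify(products=items)

if __name__ == "__main__":
    app.run()
```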
You read it everywhere: "data is the new oil". So what are you waiting for?
{{automation-component}}