Why we need the next generation of scrapers, and it's more serious than you think
For at least a decade we have heard that data is the new oil, and it’s true: data is the new oil. The catch is that this oil is not in the ground, it’s on the web, and the way to extract it is not with a drill but with a scraper. But why is data so valuable?
Data gained much of its value this decade thanks to the explosion of the data analytics sector, and that was only the first wave. Another wave is coming: the AI wave, driven by the advent of transformers and, subsequently, LLMs, which need an enormous amount of data to be trained on. All the big tech and non-tech companies are launching themselves in this direction, smelling huge profits and striving for supremacy as they chase the coveted AGI. Here are two images to compare these two waves.
![](/assets/img/small_wave.jpeg)
![](/assets/img/giant_wave.jpeg)
Let’s go back to the oil metaphor. Over the past century we went from the rusty land drill to the modern oil rig that can extract oil from the sea, and this is what we need to do with scrapers: the web is not a single site, it’s a “sea” of sites, and we need to extract data from most of them (respecting the rules, obviously). This is not an easy task, because the web is not a static place: sites change their layout and markup all the time. Scrapers need to adapt to this dynamism, and currently they cannot. That is the challenge we must face in the coming years, and it is the first reason why we need the next generation of scrapers.
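To make the brittleness concrete, here is a minimal sketch (with hypothetical HTML, using only Python’s standard library) of a classic scraper that hard-codes a selector. The moment the site redesigns its markup, the scraper silently returns nothing:

```python
from html.parser import HTMLParser


class PriceScraper(HTMLParser):
    """Classic fixed-selector scraper: only matches <span class="price">."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # Hard-coded assumption about the page structure.
        if tag == "span" and ("class", "price") in attrs:
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        if self._in_price:
            self.prices.append(data.strip())


def scrape(html: str) -> list[str]:
    parser = PriceScraper()
    parser.feed(html)
    return parser.prices


old_page = '<div><span class="price">9.99</span></div>'
new_page = '<div><p class="product-price">9.99</p></div>'  # site redesign

print(scrape(old_page))  # ['9.99']
print(scrape(new_page))  # [] -- same data, but the scraper finds nothing
```

The data is still on the page after the redesign; only the markup changed. A next-generation scraper would recover the same information without a human rewriting the selector by hand.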
![](/assets/img/rusty_drill.jpeg)
![](/assets/img/modern_drill.jpeg)
There is another point to consider: we need to avoid at all costs an oligopoly of big tech companies in the AI field. To do that, we need to democratize access to data so that small startups and the open source community can compete with big tech, and one key enabler is acquiring tons of data with smart scrapers and a small amount of money.

One might think at this point: here he is, another tech guy randomly ranting against capitalism. Sorry, but this time you’ve got it wrong. The problem is that large language models (LLMs) are such powerful tools that they can be used at work in many fields, and also as decision-support tools with a strong influence on their users. The ability of arbitrary actors to tune a model toward a specific tendency can become a powerful weapon, conveying a chosen slant of information to the public and manipulating public opinion, and the usage of these tools is only going to increase dramatically. Please, let’s avoid what is happening with social media, where the big tech companies are the only ones with the power to control the flow of information, yet are unable to deal with misinformation, causing many of the problems we see in today’s society.
Since I believe in what I’m saying, and since my life randomly brought me to work with data, I decided to do something about it. I’m working with other people on a project that aims to democratize access to data: ScrapegraphAI, an open source Python library that aims to be part of the next generation of scrapers. We are working hard to make it happen. If you want to know more about it, please visit the repo on GitHub, and if you want to contribute, you are welcome.
![](/assets/img/scrapegraphai_logo.png)
P.S. Maybe a more technical post about the project will come in the future, stay tuned :)