
Web scraping: how to automatically retrieve data from the web

Published on July 4, 2023

Web scraping lets you automatically collect data from the web for competitive intelligence, lead generation, marketing and sales strategy, and more. You still need to know which tools to use and what legal framework governs the practice. We explain it all below.


The web is an invaluable source of data. What if you could exploit this wealth of information for free? This is what “web scraping” offers: an effective technique for fast, automated collection of web data. No more copying and pasting by hand. Many tools perform this task automatically and process thousands of pieces of data in a few seconds. Each solution has its strengths and weaknesses. Some require knowing how to code, others don't.

What's the point? And for whom?

The sectors that use web scraping the most are those that handle a lot of data: e-commerce, finance, social media, real estate, press, science, etc. In these sectors, the roles that use it most are marketing, finance, HR, SEO specialists and data scientists.

Once the data has been collected, companies can use it to fuel their competitive intelligence or enrich their own database at little cost. Here are its main uses:

  • Monitor the prices and availability of products and services in order to conduct competitive intelligence or analyze market trends.
  • Generate leads by automatically retrieving the first and last names, positions and contact details of professionals from LinkedIn, Twitter, Google Maps, Indeed, etc.
  • Optimize a site's SEO: monitor its ranking in search results and its positioning relative to its competitors.
  • Analyze online sentiment by browsing customer reviews and comments on social networks.
  • Automatically check links (see the sketch after this list). This is particularly useful in an affiliate strategy, to ensure that links are not broken or obsolete.
  • Monitor job offers or collect information about potential candidates from job sites or social networks.
  • Build datasets to train an AI.
  • Check for copyright violations (plagiarism of images or texts).
  • Collect information on a specific topic, for example an automatic press review of articles and innovations in battery production.
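
To illustrate the link-checking use case, here is a minimal Python sketch that tests a list of URLs and reports their HTTP status. The list itself is a placeholder; in a real affiliate audit, the links would be extracted from your pages or your database.

import requests

# Placeholder list: in practice, these links would come from your own pages
links = [
    "https://example.com/partner-offer",
    "https://example.org/old-promo",
]

for url in links:
    try:
        # HEAD requests are lighter than GET; follow redirects to reach the final target
        response = requests.head(url, allow_redirects=True, timeout=10)
        status = response.status_code
    except requests.RequestException as error:
        status = f"error ({error.__class__.__name__})"
    print(url, "->", status)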

How does it work?

Web scraping uses a scraper, a software tool that extracts information from websites. The scraper interacts with the sites in the same way as a browser manipulated by a human. But instead of displaying the information, it collects it and saves it for later analysis. This process consists of four stages:

  1. HTTP request: the scraper sends a request to the target URL to obtain the content of a page.
  2. HTML parsing: the scraper analyzes the HTML code and identifies the elements containing the data sought.
  3. Data extraction: the data is extracted using selectors such as XPath, CSS selectors or regular expressions (regex).
  4. Data storage: the information is saved in formats that can be used for analysis (Excel, CSV, JSON, etc.).
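
As an illustration of these four stages, here is a minimal Python sketch using the requests and BeautifulSoup libraries. The URL and the CSS selector are placeholders to adapt to the targeted site.

import csv
import requests
from bs4 import BeautifulSoup

# 1. HTTP request: fetch the content of the target page
response = requests.get("https://example.com/products", timeout=10)
response.raise_for_status()

# 2. HTML parsing: build a navigable tree from the HTML code
soup = BeautifulSoup(response.text, "html.parser")

# 3. Data extraction: select the target elements (CSS selector to adapt to the site)
names = [element.get_text(strip=True) for element in soup.select("h2.product-name")]

# 4. Data storage: save the results in a reusable format (here, CSV)
with open("products.csv", "w", newline="", encoding="utf-8") as output:
    writer = csv.writer(output)
    writer.writerow(["product_name"])
    writer.writerows([name] for name in names)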

How to do web scraping?

For developers

Developers are the masters of scraping. By combining programming languages with frameworks and libraries specialized in data extraction, they can create scrapers perfectly adapted to the targeted website and the data to be extracted. Efficiency, scalability and maintainability are their watchwords.

Which language should you choose? While JavaScript (Node.js), Ruby, C, C++, R and PHP can all be used for web scraping, Python has largely established itself in recent years thanks to two tools, BeautifulSoup (a library) and Scrapy (a framework).

Easy to learn, fast and portable (Linux, Windows, macOS and BSD), Python and its two companion tools will let you carry out any web scraping project.

Here is an example of a Python program that uses the BeautifulSoup library to retrieve all the prices of electric shavers on Amazon.fr pages.

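The sketch below gives an idea of what such a program looks like. The search URL, the request headers and the CSS selectors are illustrative assumptions: Amazon changes its markup frequently and restricts automated access in its terms of use.

import requests
from bs4 import BeautifulSoup

# Illustrative search URL for electric shavers on Amazon.fr
URL = "https://www.amazon.fr/s?k=rasoir+electrique"

# A browser-like User-Agent reduces the chance of being served a blocking page
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

response = requests.get(URL, headers=HEADERS, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# Each result card exposes the price in two parts; the class names below are
# assumptions about Amazon's markup and are likely to change over time
for product in soup.select("div.s-result-item"):
    title = product.select_one("h2")
    whole = product.select_one("span.a-price-whole")
    fraction = product.select_one("span.a-price-fraction")
    if title and whole and fraction:
        price = whole.get_text(strip=True) + fraction.get_text(strip=True) + " EUR"
        print(title.get_text(strip=True), "-", price)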

For non-developers

While developers retain the advantage of being able to build the most efficient scrapers, tailored to their users' needs, they no longer have a monopoly on web scraping. Non-IT professionals, whether they work in marketing, finance or HR, are increasingly scraping the web too, but without coding. To do so, they use the no-code tools that have appeared in recent years. Here are a few.

Browser extensions

This is the simplest way to approach web scraping without coding: install an extension in your web browser. Free and easy to use, these plug-ins all work in the same way: once on the target site, you select the elements you wish to retrieve (text, images, URLs, etc.) and the collection frequency (once per hour, day or week, for example), and the extension takes care of the rest. Among the best known are Web Scraper, Simplescraper, Scraper, Agenty and Instant Data Scraper.

The Web Scraper extension for Chrome and Firefox
The Web Scraper extension lets you choose the elements to retrieve (product name, price, image, etc.) with a simple mouse selection.

Web scraping platforms

Another solution that does not require programming knowledge is to go through one of the many platforms that offer web scraping services, such as Octoparse, Bright Data, Parsehub or PhantomBuster. These tools let you, on a subscription basis, collect data from the web, but also from social platforms such as Facebook, Instagram, Twitter, YouTube, etc. You can retrieve information on hashtags, mentions, comments and likes, data that can then be used to analyze trends and the performance of marketing campaigns.

Scraping a fashion site with ParseHub

Using a web scraping platform, here ParseHub, we can select the data to be processed (in this case, the clothing categories of an e-commerce site), extract it and interpret it. The table below shows average prices by product category and manufacturer.

Pricing structure with ParseHub

The big advantage of these platforms is that they offer no-code solutions running in the cloud 24/7, so you can schedule scrapers to collect data continuously or at flexible intervals. Another advantage: you can choose from dozens of ready-to-use scraper templates, capable of targeting the most popular sites and platforms in just a few clicks. Finally, these cloud platforms get around many of the anti-scraping protections put in place by websites (captchas, IP blocking, infinite scrolling, etc.), notably through IP rotation and the use of proxies.

Data analysis applications

Google Sheets, Power BI, Excel… spreadsheets and data visualization applications make it possible to extract data from the web more or less easily.

Sometimes you will need to use specific functions. This is the case with Google Sheets, which offers two specialized functions: IMPORTXML and IMPORTHTML. You still need to understand the structure of an HTML page to write the formula correctly and get the results you want; for example, =IMPORTXML("https://example.com", "//h1") retrieves every H1 heading from a page (the URL and XPath expression here are purely illustrative).

Excel, Microsoft 365 and Power BI have specialized data extraction modules that are easier to implement. Excel, for example, offers the Power Query module (since Excel 2010) and a web query wizard in the Data menu. Microsoft 365 benefits from the power of its Power Automate automation module. From the same publisher, the Power BI data analysis solution includes web scraping in its Get Data menu. All these wizards automatically offer to retrieve the tables found on the targeted web pages, but other data sources can also be defined.

Web scraping with Microsoft Power Query
Power BI easily retrieves data from the web and displays it in tables and graphs. ©Microsoft

The rise of AI

The democratization of artificial intelligence, particularly generative AI such as ChatGPT, Bing Conversation or Bard, is a game changer. With this type of AI, it is easy to retrieve information from the web or from PDF files extremely quickly. However, these AIs have the disadvantage of being general-purpose tools that cannot easily export data in structured formats.

For that, you need to turn to AI tools specialized in web scraping, such as Scrapestorm, kadoa.com, Nimbleway API or Browse.ai, with solutions starting at around twenty euros per month.

With such AI, there is no need to program. Simply define the data to extract (prices, for example), one or more data sources (one or more websites) and specify the frequency of data retrieval (every week, for example).

The AI takes care of everything else: creating a program configured according to your choices, extracting the data and delivering it to you in the format you have defined (Excel, CSV, JSON, etc.).

In addition to their ease of use, efficiency and speed, these web scraping AIs are capable of recognizing and retrieving any type of data (text, images, videos, links, other files). They are also not blocked by dynamic pages or by the usual security measures implemented by targeted sites, such as captchas or IP address blocking.

What are the limits of web scraping?

Web scraping contributes heavily to the proliferation of bots, the software robots that crawl the web in search of data. They are even becoming an invasive species: a study by the American cybersecurity company Imperva found that bots accounted for 47% of internet traffic in 2022!

Technical limits

Among all these bots are malicious programs, but also web scraping bots that overexploit the servers they target. By multiplying requests, they degrade server performance and sometimes even crash the servers outright. A nightmare for any system administrator!

Another unfortunate consequence, this time for marketing: this non-human traffic distorts the audience measurements of the targeted websites and therefore compromises the company's digital marketing strategy.

Websites are fighting back, however. To protect themselves from this invasion, more and more sites deploy technical countermeasures against bots: captcha puzzles asking users to prove they are human, IP address bans, automatic limits on the number of requests coming from the same IP, etc.
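
As an illustration of that last measure, here is a minimal sketch of per-IP request throttling for a Python web application built with Flask. The threshold of 60 requests per minute is an arbitrary example, and behind a proxy the client IP would have to be read from a forwarded header instead.

import time
from collections import defaultdict, deque

from flask import Flask, abort, request

app = Flask(__name__)

MAX_REQUESTS = 60       # arbitrary example: at most 60 requests...
WINDOW_SECONDS = 60     # ...per sliding window of 60 seconds
history = defaultdict(deque)   # request timestamps per client IP

@app.before_request
def limit_request_rate():
    now = time.time()
    timestamps = history[request.remote_addr]
    # Forget requests that fall outside the sliding window
    while timestamps and now - timestamps[0] > WINDOW_SECONDS:
        timestamps.popleft()
    if len(timestamps) >= MAX_REQUESTS:
        abort(429)  # HTTP 429 "Too Many Requests"
    timestamps.append(now)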

Another solution is to complicate and regularly change the structure of the pages served:

  • change the URL architecture (change the order of parameters);
  • edit the HTML code (change class names and IDs, change the order of DOM elements), as in the sketch after this list;
  • rotate templates (if your site runs on a CMS, you can create several page templates and alternate between them);
  • obfuscate the code (minification, obfuscation of variable and function names, loading content with JavaScript, encoding data);
  • alternate the data structure (CSV, JSON, etc.);
  • change the API structure, if your site offers one.
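
As a sketch of the class-name rotation mentioned above, a site can derive its CSS class names from a secret that changes at each deployment, so that scrapers relying on fixed selectors break regularly. The logical names and the secret below are placeholders, and the same mapping must also be applied to the stylesheets.

import hashlib

# Placeholder secret, changed at each deployment (for example via an environment variable)
DEPLOY_SECRET = "2023-07-release"

def rotated_class(logical_name: str) -> str:
    """Derive an opaque, deployment-specific class name from a logical one."""
    digest = hashlib.sha1(f"{DEPLOY_SECRET}:{logical_name}".encode()).hexdigest()
    return "c-" + digest[:8]

# Templates keep using logical names; the rendered HTML only exposes the rotated ones
print(rotated_class("product-price"))   # an opaque token that changes with DEPLOY_SECRET
print(rotated_class("product-title"))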

With each of these changes, the scraper's code or settings have to be updated to keep up, which reduces its effectiveness as well as its value.

Legal limits   

Like any data collection technique, scraping is subject to French and European regulations, but also to the terms and conditions of use (T&Cs) specific to each website.

The use of web scraping is therefore subject to three conditions:

1. Respect the T&Cs of the data source site

In its T&Cs, LinkedIn expressly prohibits web scraping:

“You agree not to develop, support or use any software, devices, scripts, robots or any other means or processes (including spiders, browser plug-ins and add-ons, or any other technology) intended to web scrape the Services or otherwise copy profiles and other data from the Services”

Breaching these T&Cs exposes the offender to penalties. Ignoring them is risky, especially as more and more major platforms are introducing tools to detect web scraping.

2. Comply with GDPR

In addition to respecting the T&Cs, the web scraper must comply with the GDPR, which applies to any data processing. In the event of non-compliance, fines can reach 20 million euros or, for a company, up to 4% of global annual turnover.

3. Respect copyright

Databases are protected by copyright and by a sui generis right (a right of its own kind) protecting their producer (Articles L. 112-3 and L. 341-1 of the French Intellectual Property Code). The maximum penalties for infringement are a fine of 300,000 euros and three years' imprisonment.

On February 2, 2021, the courts thus ruled in favor of the Leboncoin.fr site, whose real estate advertisements had been extracted by a competing site.

In addition to the legal framework, certain good practices should be respected:

  • Practice web scraping outside the site's peak hours, when servers are more available;
  • Limit the data you collect to what you really need;
  • Use the APIs and other resources (datasets, etc.) offered by the targeted site to avoid resorting to web scraping.

Web scraping means regularly updating your skills

Despite its technical and legal challenges, web scraping remains popular. It is a great tool for automatically collecting data. ORSYS offers numerous face-to-face and remote training courses on the solutions presented here.

Developers and data scientists (data analyst, data engineer, etc.) will be able to train in big data, using Python libraries specialized in web scraping (scrapy, BeautifulSoup, Selenium, etc.).

We also offer training to learn how to clean and manipulate the data collected.

Web developers, CISOs and system administrators may be interested in our website security training to prevent web scraping and data theft.  

CIOs, DPOs and lawyers will be able to follow our training courses to master the legal issues of the GDPR.

Finally, non-IT professionals who want to benefit from web scraping in their work can turn to our Excel, Google Sheets or Power BI training courses.

There is something for everyone.

Our expert

Made up of journalists specialising in IT, management and personal development, the ORSYS Le mag editorial team [...]
