Data Scientist in Law: Is Web Scraping Legal?

Mari Galdina
Published in Analytics Vidhya
4 min read · Feb 27, 2021


Our world is full of information and data for research, and for data scientists almost any piece of it can be useful in a project. We can download ready-made datasets from the internet or collect our own. Usually, companies offer a dataset as a CSV file or provide access via an Application Programming Interface (API). But occasionally we need data from web pages that offer no convenient way to download it. How can we collect it?

Photo by Mark Rohan on Unsplash

What is Web Scraping?

Web scraping is the technique of automatically extracting data from websites using software or a script.

Web scraping restructures that data into a more convenient format for your work. As we know, each website visit is a request to a web server. Our browser sends a request to see some information, and the web server sends back files that tell the browser how to render the page for us.
This is a very simplified description; a lot more happens behind the scenes. But we need only the main content of the web page. Usually this information is contained in the HTML files, and web scraping takes the data from them.

Here you can find the main information about HyperText Markup Language (HTML).

Simple steps in web scraping:

1. Import a library that helps to download the whole page. In Python, this library is called requests.
import requests

2. Get the page and save it in a variable. Every browser sends a GET request to the web server to fetch a web page, and the requests library issues the same kind of request.

page = requests.get("<add link to site for web scraping>")

Now you have all the HTML content of the webpage and can print it out:

page.content

As we can see, there is a lot of unnecessary information. The request gives us the full page description with all its HTML tags.

3. Parse the page. A good way to do that is with the BeautifulSoup library.

from bs4 import BeautifulSoup

This library allows us to move through the HTML structure one level at a time.

# Create an instance of the BeautifulSoup class to parse our document
soup = BeautifulSoup(page.content, 'html.parser')
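The parsing step can be sketched end to end as follows. This is only an illustration: the HTML string, its tag layout, and the class name "item" are invented for the example and stand in for page.content from a real request.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical stand-in for page.content. In a real run this would come
# from page = requests.get(url).
html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1>Prices</h1>
    <p class="item">Apples: 3</p>
    <p class="item">Pears: 5</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Move one level down at a time: document -> <body> -> its direct child tags.
body_tags = [child.name for child in soup.body.children if isinstance(child, Tag)]

# Or jump straight to the elements we care about.
title = soup.title.get_text()
items = [p.get_text() for p in soup.find_all("p", class_="item")]

print(body_tags)
print(title)
print(items)
```

Walking through children is handy for exploring an unfamiliar page, while find_all is usually shorter once you know which tags hold the data.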

Is Web Scraping Legal?

The answer to this question is not that simple, and no one can give a plain yes or no. This is a complex and evolving area of law. Imagine a spectrum with benign web scraping at one end and web scraping that attracts trouble at the other: most real web scraping falls somewhere in the middle.

Why does that happen?

  • Usually, websites do not offer clear guidance about using information from their pages. Before scraping a site, we should check its terms and conditions page. Sometimes it sets out scraping rules, and we should follow them.
  • Web scraping consumes the host website’s server resources, which can cost a business time and money. If pages load slowly for potential customers, the company can lose income, and that gives it a reason to fight web scraping.
  • The reason for scraping matters: scientific research or commercial use. There was a precedent in late 2019, when a US Court of Appeals (hiQ Labs v. LinkedIn) denied LinkedIn’s request to prevent an analytics company from scraping its data. But that is still a single case, and it involved public information.
  • Information available only behind a login can have a cost and an original owner who holds the copyright. For example, Facebook’s rules forbid logged-in web crawlers from downloading user data.
  • Countries and states have different laws about the internet. So where are you located: California or Florida? Or perhaps Russia?
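Besides the terms and conditions page, some sites publish machine-readable crawling rules in a robots.txt file. That file is not discussed above and does not replace reading the terms, but checking it is easy to automate. Here is a minimal sketch using Python's standard urllib.robotparser; the rules, the bot name, and the example.com URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents. A real site serves its own file,
# usually at https://<site>/robots.txt.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 10",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# "my-research-bot" is an invented user agent name.
allowed_public = parser.can_fetch("my-research-bot", "https://example.com/articles/1")
allowed_private = parser.can_fetch("my-research-bot", "https://example.com/private/data")
delay = parser.crawl_delay("my-research-bot")

print(allowed_public, allowed_private, delay)
```

A result of False for a URL is a clear sign that the site owner does not want it fetched automatically. A result of True is not legal permission.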

We can see that details matter when you use web scraping.

The rule for legal web scraping: details matter.

Best rules for practicing web scraping in a way that is polite and profitable for all:

  • don’t scrape the same site or web page more often than you need to (you can create a schedule based on how often the information is updated);
  • don’t download the same content every time you run your code (you can cache the data);
  • don’t overwhelm servers with too many requests in too short a timespan (you can use a function that pauses your code for a while).
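The caching and pausing ideas from the list above could look roughly like this. It is a sketch, not a finished tool: the PoliteFetcher class, its parameters, and the two-second delay are assumptions made for the example, and a fake download function replaces real network calls so the behavior is easy to see.

```python
import time


class PoliteFetcher:
    """Caches downloaded pages and waits between real downloads."""

    def __init__(self, fetch, min_delay=2.0, sleep=time.sleep, clock=time.monotonic):
        self._fetch = fetch          # function that downloads one URL
        self._min_delay = min_delay  # seconds to leave between downloads
        self._sleep = sleep
        self._clock = clock
        self._cache = {}
        self._last_request = None

    def get(self, url):
        # Serve repeated URLs from the cache instead of hitting the server again.
        if url in self._cache:
            return self._cache[url]
        # Pause if the previous download happened too recently.
        if self._last_request is not None:
            wait = self._min_delay - (self._clock() - self._last_request)
            if wait > 0:
                self._sleep(wait)
        body = self._fetch(url)
        self._last_request = self._clock()
        self._cache[url] = body
        return body


# Demo with stand-ins: no network access and no real waiting.
calls = []
pauses = []

def fake_fetch(url):
    calls.append(url)
    return f"<html>{url}</html>"

fetcher = PoliteFetcher(fake_fetch, min_delay=2.0, sleep=pauses.append, clock=lambda: 0.0)
fetcher.get("https://example.com/a")
fetcher.get("https://example.com/a")  # answered from the cache
fetcher.get("https://example.com/b")  # new URL, so the fetcher pauses first

print(calls)
print(pauses)
```

In real use you would pass something like lambda url: requests.get(url).content as fetch and keep the default sleep and clock. The cache here lives only in memory, so it is lost between runs; saving pages to disk would cover that case.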

Summary

Is web scraping always bad? Remember the search engines that index web content, or the price comparison services that gather prices from various stores to save you money. They all use web scraping and give us tips and advice. For me, as a data scientist, web scraping helps to collect data for projects. For example, government websites provide data for the general public, which is frequently available over APIs. But sometimes the scale of the work is unexpectedly large and you cannot do it manually. Web scraping comes to help here!
