Web Scraping and Legal Issues

Web Scraping is the process of extracting data from websites, typically using a program that simulates human browsing by sending plain HTTP requests or by emulating a full web browser. Web Scraping, Content Scraping, Screen Scraping, Web Harvesting and Web Data Extraction are all names for the same thing. In general, anything that you can see on the internet can be extracted, and the process can be automated.

There is a close resemblance between web scraping and web indexing. One stark difference, however, is that web scraping is focused on gathering a particular type of data, like contact information, whereas the objective of web indexing is to gather all the data that is present. Web scraping has been used effectively in many fields, like online price comparison (BuyHatke) and web mashups (Frrole).

How do you scrape the data?

The easiest way to do it is, of course, to open the content in your browser and copy the data you see. That process is, however, monotonous and prone to errors. What you do instead is use a program to get the data for you!

In Python, there is a library called Beautiful Soup, which parses HTML documents for you. You fetch the URL through urllib and feed the response to Beautiful Soup. For instance,

from urllib import urlopen
from bs4 import BeautifulSoup as B

# Fetch the page over HTTP
webpage = urlopen('http://dada.theblogbowl.in')

# Parse the raw HTML into a searchable tree
soup = B(webpage.read())
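
Once the page is parsed, you can pull out the elements you need. As a minimal sketch, the following lists every link on the page; the tag and attributes here are only an illustration, and you would adapt them to the page you are actually scraping.

# Print the URL and text of every anchor tag on the page
for link in soup.find_all('a'):
    print link.get('href'), link.get_text()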

urllib identifies itself as a Python library, and if a website has checks in place for such requests, your request might not get the proper response. In that case, you need to emulate a web browser. A popular Python library that helps you do this is mechanize.

import mechanize
import cookielib

# Set Browser
br = mechanize.Browser()

# Set Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)

# Set Browser options
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)

# Follows refresh 0 but doesn't hang on refresh > 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)

# Want debugging messages? Just uncomment the following lines
#br.set_debug_http(True)
#br.set_debug_redirects(True)
#br.set_debug_responses(True)

# User-Agent (this is cheating!) We are basically setting the headers to make it look like
# the request is coming from a Firefox browser on Fedora
br.addheaders = [('User-agent',
'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
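
With the browser set up, fetching a page is a single call. Here is a minimal sketch, reusing the same example URL as before; the response can be fed straight into Beautiful Soup just like the urllib version.

# Open the page through the emulated browser and read the raw HTML
response = br.open('http://dada.theblogbowl.in')
html = response.read()

# Parse it exactly as before
soup = B(html)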

Now that you have emulated a browser, your next task is to get data that may be scattered over thousands of pages. There are two issues you will face: pagination and AJAX-based views.

To counter pagination, you need to observe the URL structure: whether the pages change through a GET parameter (like www.example.com?p=5) or through a different path segment within the URL (like www.example.com/page/5/). Then you can change your program to loop over the range of pages and send requests to each one, as shown below.
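
As a minimal sketch, assuming the site uses a GET parameter and that 50 pages exist (both assumptions you would verify first):

# Loop over the page numbers and fetch each page through the emulated browser
for page in range(1, 51):
    url = 'http://www.example.com?p=%d' % page
    html = br.open(url).read()
    # ... parse html with Beautiful Soup and store what you need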

To get data from AJAX-based views, you need to open Firebug or the Chrome Inspector and find the AJAX URLs being requested (like www.example.com/feeds/ajax/5/), then send requests to that URL while changing the page number. Do note that some AJAX-based views return the correct data only when the request comes in as AJAX, so make sure you change your HTTP headers accordingly.
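
For instance, a minimal sketch that marks the request as an AJAX call by adding the X-Requested-With header; the feed URL here is the same hypothetical one mentioned above.

# Many endpoints check this header before returning AJAX-only data
br.addheaders += [('X-Requested-With', 'XMLHttpRequest')]
ajax_html = br.open('http://www.example.com/feeds/ajax/5/').read()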

Is it legal?

Now that you have a general idea of how to do it, let’s talk about the legal issues surrounding the matter. Web scraping was generally considered a gray area until 2000, when eBay filed for an injunction against Bidder’s Edge. Although the case was settled out of court, it raised the question that is still being asked today: is web scraping legal?

Many websites state in their terms of service that scraping is not allowed. In such cases, it is generally advised to stay away from those sites. Although scraping public data might not be an outright criminal offense, if the source wants you to stop, it can very well get an injunction against you, as in the case of American Airlines and FareChase.

Although most of these injunctions are against services that rely on continued scraping, don’t be under the illusion that a one-off scraping task can’t be traced back to you. In 2011, Internet activist Aaron Swartz was arrested for downloading academic journal articles from JSTOR. The resulting United States vs Aaron Swartz case received widespread public attention, until Swartz committed suicide earlier this year.

With advancements in technology, it is becoming a tug of war between content scrapers and those who actively try to block such requests. Since there are no proper laws that define the scope and legality of web scraping, it’s always a better idea for content owners to use services that block scraping attempts than to file lawsuits later. Perhaps it’s time that more definitive laws are framed so that the legalities surrounding web scraping become clearer.
