Data scraping, in its most general form, refers to a technique in which a computer program extracts data from the output generated by another program. Its most common form is web scraping, in which an application extracts valuable information from a website.
Why scrape website data?
Typically companies do not want their unique content to be downloaded and reused for unauthorized purposes. As a result, they don’t expose all data via a consumable API or another easily accessible resource. Scraper bots, on the other hand, are interested in getting website data regardless of any attempt at limiting access. As a result, a cat-and-mouse game exists between web scraping bots and various content protection strategies, with each trying to outmaneuver the other.
The process of web scraping is fairly simple, though the implementation can be complex. Web scraping occurs in three steps:
First, the piece of code used to pull the information, which we call a scraper bot, sends an HTTP GET request to a specific website.
When the website responds, the scraper parses the HTML document for a specific pattern of data.
Once the data is extracted, it is converted into whatever specific format the scraper bot’s author designed.
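The three steps above can be sketched with Python's standard library. This is a minimal illustration, not a description of any particular scraper: the names fetch, ReviewParser, extract_reviews and to_json, the User-Agent string, the "review" class name, and the choice of JSON as the output format are all assumptions made for the example.

```python
import json
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def fetch(url):
    """Step 1: send an HTTP GET request and return the response body as text."""
    # The User-Agent value is a made-up placeholder.
    request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class ReviewParser(HTMLParser):
    """Step 2: collect the text of every <p> element whose class includes "review"."""

    def __init__(self):
        super().__init__()
        self.reviews = []
        self._in_review = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "p" and "review" in classes:
            self._in_review = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_review:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._in_review:
            self.reviews.append("".join(self._buffer).strip())
            self._in_review = False


def extract_reviews(html):
    """Run the parser over an HTML document and return the matched text."""
    parser = ReviewParser()
    parser.feed(html)
    parser.close()
    return parser.reviews


def to_json(reviews):
    """Step 3: convert the extracted data into the format the author chose."""
    return json.dumps({"reviews": reviews})
```

A call such as to_json(extract_reviews(fetch(url))) chains the three steps together. The parser depends on the page keeping the same tag and class name, which is the dependency that markup-based defenses target.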
Scraper bots can be designed for many purposes, such as:
Content scraping – content can be pulled from a website in order to replicate the unique advantage of a particular product or service that relies on content. For example, a product like Yelp relies on reviews; a competitor could scrape all the review content from Yelp and reproduce the content on their own site, pretending the content is original.
Contact scraping – a lot of websites contain email addresses and phone numbers in plaintext. By scraping locations like an online employee directory, a scraper is able to aggregate contact details for bulk mailing lists, robocalls, or malicious social engineering attempts. This is one of the primary methods both spammers and scammers use to find new targets.
How is web scraping mitigated?
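Typically, all content a website visitor is able to see must be transferred onto the visitor’s machine, and any information a visitor is able to access can be scraped by a bot. Scraping therefore cannot be prevented entirely, but a website can limit how much of it occurs. Common methods include the following: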
Rate limit requests – for a human visitor clicking through a series of web pages on a website, the speed of interaction with the website is fairly predictable; you’ll never have a human browsing 100 web pages a second, for example. Computers, on the other hand, can make requests orders of magnitude faster than a human, and novice data scrapers may use unthrottled scraping techniques to attempt to scrape an entire website very quickly. By rate limiting the maximum number of requests a particular IP address is able to make over a given window of time, websites are able to protect themselves from exploitative requests and limit the amount of data scraping that can occur within a certain window.
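A per-IP limit over a window of time can be sketched as a sliding-window counter. This is one possible approach under stated assumptions, not a reference implementation: the class name SlidingWindowRateLimiter, its parameters, and the injectable clock are hypothetical, and the state is kept in memory in a single process.

```python
import time
from collections import defaultdict, deque


class SlidingWindowRateLimiter:
    """Allow at most max_requests per IP address within window_seconds."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits = defaultdict(deque)  # IP address -> timestamps of accepted requests

    def allow(self, ip):
        now = self._clock()
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: the caller would reject or delay the request
        hits.append(now)
        return True
```

With SlidingWindowRateLimiter(100, 60.0), for instance, the 101st request from one address within a minute is refused while other addresses are unaffected. A scraper that spreads its requests across many IP addresses would not be caught by this check alone.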
Modify HTML markup at regular intervals – data scraping bots rely on consistent formatting in order to effectively traverse website content and parse out and save useful data. One method of interrupting this workflow is to regularly change elements of the HTML markup so that consistent scraping becomes more complicated. Nesting HTML elements, or changing other aspects of the markup, hinders or thwarts simple data scraping efforts. On some websites, content protection modifications are randomized and applied each time a webpage is rendered. Other websites change their markup occasionally to prevent longer-term data scraping efforts.
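One way to vary markup on every render is to replace stable class names with random tokens. The sketch below is illustrative only: randomize_class_names is a hypothetical helper, it handles only double-quoted class attributes, and it assumes the site rewrites its own stylesheets and scripts with the returned mapping so the page still displays correctly.

```python
import re
import secrets

# Matches double-quoted class attributes only; a simplification for this sketch.
CLASS_ATTR = re.compile(r'class="([^"]*)"')


def randomize_class_names(html):
    """Return (rewritten_html, mapping) with every class name replaced by a random token."""
    mapping = {}

    def rename(match):
        names = match.group(1).split()
        for name in names:
            if name not in mapping:
                # A fresh token per class name per call, so each render differs.
                mapping[name] = "c" + secrets.token_hex(8)
        return 'class="' + " ".join(mapping[name] for name in names) + '"'

    return CLASS_ATTR.sub(rename, html), mapping
```

A scraper that selects elements by a fixed class name such as "price" finds nothing in the rewritten page. A scraper that relies on element position or on the visible text is unaffected, which is why the text describes this as hindering simple efforts rather than stopping scraping outright.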
Use CAPTCHAs for high-volume requesters – in addition to using a rate-limiting solution, another useful step in slowing content scrapers is requiring a website visitor to answer a challenge that’s difficult for a computer to surmount. While a human can reasonably answer the challenge, a headless browser* engaging in data scraping most likely cannot, and certainly cannot do so consistently across many instances of the challenge. However, constant CAPTCHA challenges can negatively impact the user experience.
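The trade-off between slowing scrapers and burdening human visitors can be sketched as a gate that challenges only clients above a request threshold and remembers a solved challenge for a while. Everything here is an assumption made for illustration: the ChallengeGate class, its threshold and pass_ttl_seconds parameters, and the "serve" and "challenge" return values. The challenge itself is out of scope, and the request counts never decay, which a real deployment would likely handle with a time window.

```python
import time


class ChallengeGate:
    """Decide whether a client is served directly or must solve a challenge first."""

    def __init__(self, threshold, pass_ttl_seconds, clock=time.monotonic):
        self.threshold = threshold
        self.pass_ttl_seconds = pass_ttl_seconds
        self._clock = clock  # injectable so the gate can be tested without sleeping
        self._counts = {}  # client id -> requests seen since the last solved challenge
        self._passed_at = {}  # client id -> time of the last solved challenge

    def record_request(self, client_id):
        """Count one request and return "serve" or "challenge"."""
        self._counts[client_id] = self._counts.get(client_id, 0) + 1
        passed = self._passed_at.get(client_id)
        if passed is not None and self._clock() - passed < self.pass_ttl_seconds:
            return "serve"  # a recent pass spares the visitor repeated challenges
        if self._counts[client_id] > self.threshold:
            return "challenge"
        return "serve"

    def record_solved(self, client_id):
        """Note that the client answered a challenge correctly."""
        self._passed_at[client_id] = self._clock()
        self._counts[client_id] = 0
```

Low-volume visitors never see a challenge under this policy, and a visitor who solves one is left alone until the pass expires.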
Another less common method of mitigation calls for embedding content inside media objects like images. Because the content does not exist as a string of characters, copying it is far more complex and requires optical character recognition (OCR) to pull the data from an image file. But this can also hinder web users who need to copy content such as an address or phone number from a website instead of memorizing or retyping it.
*A headless browser is a type of web browser, much like Chrome or Firefox, but it doesn’t have a visual user interface by default, which allows it to run much faster than a typical web browser.