ExpressVPN Glossary
Data scraping
What is data scraping?
Data scraping refers to the practice of using automated tools to gather large amounts of data from websites and other digital platforms. The collected datasets are often used to support market research, machine learning (ML) applications, and more.
How does data scraping work?
Scraper bots send automated requests to the platforms they seek to gather information from. Most use lightweight tools that communicate directly with websites, but some use automated browser tools that load and render web pages much like a human user.
From there, they parse the response received from the target and locate the information being sought. This is then extracted and stored, either as the original code (usually HTML) or converted into a structured format (like a spreadsheet).
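The parse, extract, and store steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample markup, the `ProductParser` class, and the `name` and `price` fields are hypothetical, and an inline string stands in for a page that a real scraper would fetch over HTTP.

```python
import csv
import io
from html.parser import HTMLParser

# Stand-in for a response body. A real scraper would fetch this over HTTP
# (for example with urllib.request.urlopen). The markup is hypothetical.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Widget</span><span class="price">9.99</span></li>
  <li class="product"><span class="name">Gadget</span><span class="price">24.50</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collects the text of <span class="name"> and <span class="price"> elements."""

    def __init__(self):
        super().__init__()
        self.rows = []        # one dict per <li> item
        self._field = None    # field currently being read, if any
        self._current = {}    # fields gathered for the item in progress

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "span" and cls in ("name", "price"):
            self._field = cls

    def handle_data(self, data):
        # Only keep text that sits inside a span we care about.
        if self._field:
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li" and self._current:
            self.rows.append(self._current)
            self._current = {}


def extract_products(html):
    """Parse the markup and return the extracted rows."""
    parser = ProductParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def to_csv(rows):
    """Store the rows in a structured, spreadsheet-friendly format."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = extract_products(SAMPLE_HTML)
csv_text = to_csv(rows)
print(csv_text)
```

Only the standard library is used here to keep the sketch self-contained. Scrapers that must render pages like a human user would typically rely on a browser automation tool instead.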
Data scraping uses
Data obtained through scraping can be used to:
- Enable rapid market intelligence: Organizations may use data scraping techniques to monitor publicly available information such as product availability and pricing.
- Power AI development: Data scrapers can efficiently create the massive and diverse datasets that ML systems require for training.
- Generate leads: Sales or marketing teams may gather dispersed but publicly available information to build lists of possible customers.
Misuse of data scraping
Data scraping is sometimes used illegally or unethically by cybercriminals to harvest personal information. For example, threat actors may scrape forums and community platforms for contact information, which they then use to target spear phishing or spam campaigns.
Legal, ethical, and privacy considerations
Note: This information is for general educational purposes and does not constitute legal advice.
Organizations considering scraping should assess applicable laws and contractual restrictions before proceeding. Responsible scrapers respect the law, user privacy, and the terms of service of the sites they target, but even careful scraping can unintentionally collect sensitive data.
The following ethical and privacy considerations apply:
- Informed consent: Users may object to their data being harvested and used for other purposes, even if they’ve posted it publicly.
- Personal privacy: Harvested data may be used in profiling and can facilitate phishing attacks.
- Legal exposure: Scraping that gathers personal data, whether intentionally or not, may violate laws such as the General Data Protection Regulation (GDPR).
- Copyright infringement: Scraped content may include copyrighted material or proprietary databases, which may be protected under intellectual property laws.
- Resource burden: High-volume requests from scrapers can strain the infrastructure of targeted sites, which is why many platforms ban the practice in their terms of service.
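One common way to reduce the resource burden described above is to honor a site's robots.txt file and pause between requests. The sketch below assumes the target publishes such a file. The robots.txt content, the `example-bot` user agent, the URLs, and the helper names are all hypothetical, and the `fetch` function is passed in by the caller so that no real network requests are made. Following robots.txt does not replace checking a site's terms of service or applicable law.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real scraper would download it from
# the target site (RobotFileParser.set_url() followed by .read() does this).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "example-bot"  # hypothetical user agent name


def build_rules(robots_txt):
    """Parse robots.txt text into a rule set."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def allowed_urls(rules, urls, user_agent=USER_AGENT):
    """Keep only the URLs the site permits this user agent to fetch."""
    return [u for u in urls if rules.can_fetch(user_agent, u)]


def polite_delay(rules, user_agent=USER_AGENT, default=1.0):
    """Use the site's Crawl-delay if it declares one, else a default pause."""
    delay = rules.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


def crawl(urls, rules, fetch, sleep=time.sleep):
    """Fetch permitted URLs one at a time, pausing between requests."""
    delay = polite_delay(rules)
    results = []
    for i, url in enumerate(allowed_urls(rules, urls)):
        if i > 0:
            sleep(delay)  # spread requests out instead of sending a burst
        results.append(fetch(url))
    return results
```

For example, `crawl(urls, build_rules(ROBOTS_TXT), fetch=my_fetch)` would skip anything under `/private/` and wait two seconds between the remaining requests, where `my_fetch` is whatever download function the caller supplies.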
Further reading
- How to check for social media data breaches
- Protect your privacy: A guide to data removal services
- Personal data removal laws: Do they protect you?