To Scrape or Not to Scrape

C4ADS relies on web scraping to gather publicly available information (PAI) critical to its investigations and analysis. Web scraping has come under increased public scrutiny because of its connection with generative AI training data, but it remains an important tool for open-source intelligence (OSINT) analysis. Evaluating whether to scrape a given source requires weighing ethical, privacy, technical, and operational considerations within a consistent framework. This post outlines that framework and explains why ethical reasoning cannot be separated from OSINT research in the modern data environment.
What is Publicly Available Information?
OSINT analysis, including C4ADS’ work, rests on PAI. PAI is a primary data source accessible for public consumption, such as a public social media account or notes from a city hall meeting.
PAI forms the bedrock of OSINT analysis because illicit actors must embed themselves in licit systems, such as trade or financial networks, to operate. When illicit actors exploit licit systems, they create a data trail that analysts can map, track, and use to disrupt criminal activities. C4ADS uses these trails to investigate illegal fishing operations, map atrocities in Sudan, trace vulnerable supply chains, and much more.
“Publicly available” does not necessarily mean free. A data set qualifies as PAI as long as anyone can pay to access it. Trade data, for example, can be purchased openly by anyone from reputable data vendors. This differs from non-publicly available information, such as classified information, which should not or cannot be available to everyone, even to those who offer to pay for it.[1] PAI also differs from secondary sources, which synthesize data rather than contain original records, and excludes data from private accounts and devices, such as social media or email accounts.[2] PAI can, however, include data that has been made public by a third party, such as email leaks from sanctioned entities or illicit networks.
Why Web Scraping?
The majority of PAI originates in government-released documents or data sets, such as corporate registries, daily gazettes, procurement data, or property records. This data is often released either because of a legal obligation, as with the U.S. Federal Register, or in an effort to promote transparency. Even so, “publicly available” does not always mean “easily accessible.”
Even if the ultimate goal is transparency, data is not always released in a useful format. Analysts may find that a website is difficult to search, poorly organized, or prone to crashing.[3] A more common obstacle is volume: official sources often display only one record at a time, forcing analysts to review numerous entries individually. This can prevent analysts from identifying trends or patterns that are only apparent when aggregating data in bulk. Web scraping allows researchers to reconfigure PAI into a format that is actually useful for investigations. At C4ADS, that means aggregating disparate sources into a single interface: Horizons, C4ADS’ investigative platform.
PAI can also disappear at any time. On several occasions, C4ADS has published investigative reports only to find that the underlying source has been taken down. Web scraping preserves access to these sources, maintaining robust sourcing for ongoing and future investigations.
The Decision-Making Framework
The scrutiny surrounding web scraping raises an important question: How do data-driven investigators balance ethics and analytic needs when scraping data?
When C4ADS evaluates whether to scrape a given source, four questions form the foundation of that process:
Question 1: Does this source qualify as PAI?
This is the first and most basic threshold. If anyone can access or purchase that information or source, it qualifies as PAI. If access requires credentials, classification clearance, or legal standing the public does not have, it does not.
Question 2: Does collecting it violate internal standards?
PAI status is the first barrier, but it is not sufficient on its own. Organizations need to self-impose additional ethical constraints on what type of PAI is acceptable to collect. C4ADS policy, for instance, prohibits collecting certain types of information, such as biometric data, health records, and information on minors, even when that data technically qualifies as PAI. Every organization should define its own ethical floor.
Question 3: Does the collection process itself introduce unacceptable risk?
The act of scraping can cause harm independent of what the data contains, so the collection process requires its own risk evaluation, separate from the data question. Two risks stand out, and either one can jeopardize access to a source entirely.
First, scraping can cause a poorly designed website to crash, resembling a distributed denial-of-service (DDoS) attack, in which servers are overwhelmed by traffic. Second, scraping can expose C4ADS’ identity to a site administrator, and an unusual volume of requests can prompt the administrator to take the source down or block traffic entirely.
When a site is at risk of crashing, data scientists can modify the scraping script to run more slowly, giving the server time to handle repeated requests. If scraping continues to harm a website regardless, it is important to weigh the analytic value against the risk. Equally, if a source is high-value and scraping risks exposing a researcher’s identity or cutting off access entirely, it is often worth abandoning the scrape to protect long-term investigative access.
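As a rough illustration of what "running more slowly" can look like, the sketch below fetches pages one at a time, pauses between requests, and doubles the pause whenever the server signals strain. It is a minimal example under stated assumptions, not C4ADS' actual tooling: the function name `polite_fetch_all`, the injected `fetch` callable, the delay values, and the choice of HTTP 429 and 503 as strain signals are all hypothetical and would need tuning for each source.

```python
import time
from typing import Callable, Iterable

# Assumption: these status codes are treated as signs of an overloaded
# or rate-limiting server (429 Too Many Requests, 503 Service Unavailable).
STRAIN_CODES = {429, 503}


def polite_fetch_all(
    urls: Iterable[str],
    fetch: Callable[[str], tuple[int, str]],
    base_delay: float = 2.0,
    max_delay: float = 60.0,
    max_attempts: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, str]:
    """Fetch URLs sequentially, pausing after every request.

    `fetch` is a hypothetical callable returning (status_code, body).
    The pause doubles each time the server shows strain and resets to
    `base_delay` once a request is answered normally.
    """
    results: dict[str, str] = {}
    delay = base_delay
    for url in urls:
        for _ in range(max_attempts):
            status, body = fetch(url)
            sleep(delay)  # wait before sending any further request
            if status in STRAIN_CODES:
                delay = min(delay * 2, max_delay)  # back off, then retry
                continue
            if status == 200:
                results[url] = body
            delay = base_delay  # server answered normally; reset the pause
            break
    return results
```

In practice `fetch` would wrap whatever HTTP client the scraper already uses. If a URL still shows strain after `max_attempts`, the sketch moves on with the longer pause in place, which is the point at which the analytic value should be weighed against the risk.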
Question 4: Does the analytic value justify the risks?
If a source clears the first three criteria but the data is low-priority or obtainable through less invasive means, the risk does not justify the reward. Value and risk must be weighed together.
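The four questions can be read as an ordered checklist that stops at the first failure. The sketch below shows that ordering only. The `SourceAssessment` fields and the `evaluate` function are hypothetical names, and reducing each question to a yes/no answer is a simplification: in practice each answer is an analyst's judgment call.

```python
from dataclasses import dataclass


@dataclass
class SourceAssessment:
    """Hypothetical record of an analyst's answers to the four questions."""
    publicly_accessible: bool         # Q1: anyone can access or purchase it
    violates_internal_policy: bool    # Q2: e.g. biometric or health data
    collection_risk_acceptable: bool  # Q3: server load, identity exposure
    value_justifies_risk: bool        # Q4: analytic value weighed against risk


def evaluate(source: SourceAssessment) -> tuple[bool, str]:
    """Return (proceed, reason), stopping at the first question that fails."""
    if not source.publicly_accessible:
        return False, "Question 1: source does not qualify as PAI"
    if source.violates_internal_policy:
        return False, "Question 2: collection violates internal standards"
    if not source.collection_risk_acceptable:
        return False, "Question 3: collection process is too risky"
    if not source.value_justifies_risk:
        return False, "Question 4: analytic value does not justify the risk"
    return True, "All four questions cleared"


if __name__ == "__main__":
    # A source that is public, permitted, low-risk, and valuable.
    print(evaluate(SourceAssessment(True, False, True, True)))
```

Checking the questions in order reflects the framework's structure: a source that fails the PAI threshold is rejected before value is ever considered.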
The Framework Under Pressure: Edge Cases
These questions do not always produce clean answers. The scenarios below examine the framework under pressure and illustrate how the choices involved are often not straightforward.
One frequent edge case: How does an OSINT analyst handle official websites that require users to create a free account to access records? Such websites usually fall within a gray area in the definition of PAI. Each situation requires independent evaluation.
Scenario 1: A Free Account for Public Data
New York City (NYC) law mandates a tree census every 10 years. The data becomes publicly available after the census concludes. The agency conducting the census requires anyone requesting a copy of the data to create a free account on its website. Is it ethical to create an account to scrape the website?

Image: Pierre Blaché/Pixabay
Working through the framework: the data is collected under a public mandate and its public release is legally required, so it clears Question 1. The city’s account requirement does not alter the data’s PAI status. The data does not contain sensitive categories of information prohibited under C4ADS policy, clearing Question 2. There is no potential security risk in creating an account with a NYC municipal agency, clearing Question 3. The analytic value is reasonable and there is no alternative way to get the data, clearing Question 4.
Scenario 1: Continued
NYC later restricts the census data to account-holders after an AI company begins selling NYC-specific modeling software to municipal departments, even though the original data came from NYC itself.
Scraping this data remains ethically permissible as long as it is intended for analytic use. Storing it in an internal system, such as Horizons, C4ADS’ investigative platform, is permissible. Selling that data, however, would cross ethical boundaries.
Scenario 2: Credentialed Access to Protected Records
A pharmaceutical company allows doctors to view the side effects and adverse reactions that patients have reported for a specific medication. The records include patients’ addresses and phone numbers, as well as the results of a genetic screening test. While the database is accessible only to specific healthcare providers, an OSINT analyst has received login credentials from a source. Is it ethical to use that login to scrape the website?
This data does not fit the definition of PAI: it fails at Question 1. Access is restricted to credentialed healthcare providers, meaning the public cannot obtain it regardless of willingness to pay. That alone ends the analysis.
Even if the data could arguably meet the threshold for PAI, which it does not, it would still fail Question 2. The records contain protected medical information, and the genetic screening results carry privacy implications not only for the subject of an investigation but also for their entire family. Scraping this database is not ethical under any reading of the framework.
Conclusion
Web scraping is neither inherently ethical nor unethical, but it is a powerful, and often necessary, process for preserving disappearing records, aggregating data that governments release but make practically inaccessible, and maintaining robust sourcing for investigations. The stakes are practical: without these techniques, investigations become harder to defend, patterns harder to surface, findings harder to verify, and the illicit actors C4ADS tracks harder to hold accountable. While the framework above does not guarantee clean answers, it does ensure that the right questions get asked, and that the decision to scrape, or not to scrape, receives the consideration it deserves.
As we continue to navigate a rapidly changing OSINT and data landscape, we would love to hear how other data-driven investigators are navigating these decisions!
[1] Buying classified data is espionage. C4ADS neither condones nor participates in espionage.
[2] It is not that data scientists can’t get that data in many cases. It is just highly unethical (and often extremely illegal) to do so.
[3] Whether this reflects deliberate restriction or poor web design, the effect is the same: the data is often difficult to access.

