In the ICO’s view, five of the six lawful bases for processing are unlikely to be available to organisations carrying out web-scraping. The ICO therefore focused its consultation on legitimate interests as the potential lawful basis. It does not explain why the other lawful bases would be inappropriate, presumably because the reasons appear self-explanatory. For example, consent must be freely given, specific, informed and unambiguous to be valid, and it is unlikely that data subjects would have consented to such use of their personal data published on the internet.
The ICO further highlights that, for the processing to be lawful, it must also not infringe upon any other laws beyond data protection. Developers, therefore, must be mindful of the requirements of other laws when carrying out web-scraping or obtaining web-scraped data from another organisation.
Organisations relying on legitimate interests as the lawful basis for processing personal data must carry out the following three-part test:
Purpose test: are you pursuing a legitimate interest?
The first step is for the organisation to formulate its legitimate interest in web-scraping data. In the ICO’s view, an organisation needs to do so in a “specific, rather than open-ended way, based on what information they have access to at the time of collecting the training data.”
Such legitimate interests could include commercial gain as well as wider societal interests. The key is for developers to be able “to evidence the model’s specific purpose and use” to make sure that downstream use of the Generative AI model will comply with data protection requirements and respect individuals’ rights and freedoms.
Necessity test: is the processing necessary for that purpose?
The organisation needs to evidence that the processing is necessary to achieve the purpose and that the same results cannot reasonably be achieved in a less intrusive way. The ICO recognises that “most Generative AI training is only possible using the volume of data obtained through large-scale scraping”.
Balancing test: do the individual’s interests override the legitimate interest?
The third and final step is to balance the organisation’s interest(s) against the rights and freedoms of individuals.
Collecting data through web-scraping is a form of ‘invisible processing’, where individuals do not know that their personal data is being processed in this way. This can lead to individuals losing control over their personal information and how organisations use it, making it hard for them to exercise their rights under UK data protection law. Both invisible processing and AI-related activities are considered high-risk and need a Data Protection Impact Assessment (DPIA), as recommended by the ICO’s guidelines.
As well as the upstream risks discussed above, there may be downstream risks and harms, such as the model generating inaccurate information that leads to distress or reputational damage. They also include misuse of the model, for example to create phishing emails as part of social engineering tactics or to carry out other adversarial attacks.
The degree to which organisations developing Generative AI can reduce downstream risks and harms depends on how the models are brought to market. The ICO’s consultation outlines risk mitigation factors to consider when carrying out the balancing test. These vary with how the model is deployed: by the initial developer, by a third party through an API, or provided to a third party.
The ICO notes that where the initial developer makes an AI model available to third parties, it will have much less control over the model’s downstream use. This means that any wider societal interest may not be realised in practice. For this reason, the ICO recommends that organisations carefully consider the balancing test, especially where they will not be able to exercise meaningful control over the model’s downstream use.