Call for Applications: Web Archives for Social Sciences Datathon
Do you want to use web data at scale for socio-economic research, but don’t know how to start? Are you interested in exploring large Web Archives for your research but unsure where to begin?
We’ve got you covered.
On 27–28 November 2025 we are organising a Web Archives for Social Sciences Datathon at the University of Bristol. This is in collaboration with our partners: the Common Crawl and UK Web Archive at The British Library. The datathon will take place at the BDFI Neutral Lab.
This two-day event will build capacity in the social science research community to use large-scale Web Archive data for policy-relevant, socio-economic research. Participants will work in teams with curated data extracts from the Common Crawl to address real-world research challenges. They will be supported by our expert facilitators.
Who should apply?
Early-career and more established researchers in the social sciences who currently use, or wish to use, web data for their research, especially data from Web Archives.
Researchers with technical backgrounds (e.g. data science) who want to apply their skills to Web Archive data for policy-relevant research.
Some prior experience with code-based tools (e.g. Python or R) is expected.
What will you gain?
By the end of the Datathon, you will be:
More confident in using Web Archives to address your own research questions.
Familiar with the structure, challenges, and opportunities of these data.
Equipped with technical expertise for working with Web Archive data at scale.
Experienced in using LLMs to analyse such web data.
Practical details
When: 27–28 November 2025
Where: Bristol, BDFI Neutral Lab
Support: Limited financial assistance for travel and/or accommodation is available.
How to apply
Send a one-page document to Emmanouil Tranos (e.tranos at bristol dot ac dot uk) by 15 October 2025, including:
Your current position
Your research interests, particularly how you are using (or plan to use) Web Archive data
Your technical skills
Your location and whether you require financial support.


BDFI’s Neutral Lab. Source: BDFI webpage.
This Datathon is part of the Atlas of Economic Activities project and is funded by SDR UK.
Datathon materials
Agenda
Thursday 27/11/2025
09.30-10.00 Arrival at the BDFI; coffee and snacks available
10.00-10.15 Welcome and introductions [Emmanouil Tranos, University of Bristol]
10.15-10.30 Introduction to the Common Crawl [Laurie Burchell and Thom Vaughan, Common Crawl]
10.30-10.45 Web data and LLMs [Leonardo Castro Gonzalez, University of Bristol]
10.45-11.00 Problems and teams [Emmanouil Tranos, University of Bristol]
11.00-13.00 Work [facilitators will be around; snacks and coffee available]
13.00-14.00 Lunch and Q&A
14.00-18.00 Work [facilitators will be around; snacks and coffee available]
18.00-19.30 Pizza and brief team updates
20.00 Leave the BDFI
Friday 28/11/2025
09.00-12.30 Work [facilitators will be around; snacks and coffee available]
12.30-13.30 Lunch and Q&A
13.30-15.30 Work [facilitators will be around; snacks and coffee available]
15.30-16.45 Group presentations
16.45-17.45 Closing remarks and reception
Facilitators: Emmanouil Tranos, Leonardo Castro Gonzalez, Laurie Burchell and Thom Vaughan.
Guest star: Jon Reades.
Problems
This is a cache of web data containing all commercial websites (.co.uk, landing webpages only) archived by the Common Crawl (2021) that (1) include at least one UK postcode in their web text, and (2) we believe represent Financial Services. Identify the different sub-classes (industries) within Financial Services and highlight the websites that provide Fintech services.
This is a cache of web data containing all commercial websites (.co.uk, landing webpages only) archived by the Common Crawl (2021) that (1) include at least one UK postcode in their web text, and (2) we believe represent the Creative Industries. Identify the different sub-classes (industries) within the Creative Industries and highlight the CreaTech websites.
This is a cache of web data containing all commercial websites (.co.uk, landing webpages only) archived by the Common Crawl (all of 2021 and 2024) that include at least one postcode from Manchester or Birmingham in their web text. The 2021 commercial websites are classified by economic activity, while the 2024 websites are not. Use the 2021 data to classify the 2024 data for both cities. Compare the industrial structure of these cities and analyse how it evolved over time. Data: Manchester data for 2021 and 2024, and Birmingham data for 2021 and 2024.
This is a cache of web data (part 1 and part 2) containing all the UK governmental webpages (.gov.uk) that have been archived by the Common Crawl (February-March 2024). Identify the key policy areas Local Authorities in the UK focus on. Are there any Local Authorities that have implemented distinct policies or actions?
This is a cache of web data containing all the UK governmental webpages (.gov.uk) that have been archived by the Common Crawl at two points in time: February-March 2024 (part 1 and part 2, the same as in Problem 4) and October 2025. Identify changes in a specific policy domain associated with the new government elected in July 2024.
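For Problem 3, one possible starting point is a simple nearest-neighbour baseline: assign each unlabelled 2024 website the most common label among the most similar labelled 2021 websites. The sketch below is an illustration only, not the organisers' method. It assumes the `clean_content` and `label_name` columns are present in the 2021 data, uses made-up example rows, and uses plain token overlap (Jaccard similarity) as a stand-in for stronger text representations such as embeddings or LLM-based classification. The function names are hypothetical.

```python
import re
from collections import Counter


def tokens(text):
    """Return the set of lowercase word tokens in a text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity between two token sets (0.0 if either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def classify(text, labelled, k=3):
    """Return the majority label among the k most similar labelled rows."""
    query = tokens(text)
    ranked = sorted(
        labelled,
        key=lambda row: jaccard(query, tokens(row["clean_content"])),
        reverse=True,
    )
    votes = Counter(row["label_name"] for row in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical 2021 rows; real data would come from the data packages.
labelled_2021 = [
    {"clean_content": "Chartered accountants offering tax and payroll services",
     "label_name": "Accountancy"},
    {"clean_content": "Tax returns, bookkeeping and payroll for small firms",
     "label_name": "Accountancy"},
    {"clean_content": "Family restaurant serving pizza and pasta",
     "label_name": "Restaurants"},
    {"clean_content": "Italian restaurant with wood-fired pizza",
     "label_name": "Restaurants"},
]

# Hypothetical unlabelled 2024 text.
print(classify("Payroll and tax advice from local accountants", labelled_2021))
```

A baseline like this is mainly useful as a sanity check before trying heavier approaches: if a richer model cannot beat token overlap on a held-out part of the 2021 data, something is likely wrong with the setup.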
Data description
You will find the following columns in the different data packages. Please be aware that not all columns are present in all data packages.
id: an ID, not linked to any other external data
content: the web text from the landing page of each website
summary: an LLM-generated summary of the web text
explanation: an LLM-generated explanation of how the summary was produced
clean_content: LLM-cleaned web text from the landing page of each website
url: the website URL
parent_url: the website domain
postcodes: UK postcodes found in the web text
The rest of the fields were produced by our TNT-LLM inspired pipeline that classifies websites into a two-level typology (for now) of economic activities (high level clusters and low level clusters) using LLMs. These fields are:
partition_id: high level cluster ID (imagine something like sector)
label_id: low level cluster ID (imagine something like industry)
label_name: low level cluster name (imagine something like industry)
label_description: low level cluster description (imagine something like industry)
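Because not all columns are present in all data packages, it may help to check a package's header against the columns described above before starting any analysis. The following is a minimal sketch, assuming the package can be read as CSV (the actual file format may differ). The sample rows are invented for illustration and `check_columns` is a hypothetical helper.

```python
import csv
import io

# Column names taken from the data description above.
BASE_COLUMNS = [
    "id", "content", "summary", "explanation",
    "clean_content", "url", "parent_url", "postcodes",
]
CLUSTER_COLUMNS = [
    "partition_id", "label_id", "label_name", "label_description",
]


def check_columns(header):
    """Split the documented columns into those present in and missing from a header."""
    documented = BASE_COLUMNS + CLUSTER_COLUMNS
    present = [c for c in documented if c in header]
    missing = [c for c in documented if c not in header]
    return present, missing


# Hypothetical two-row extract standing in for a real data package.
sample = io.StringIO(
    "id,url,parent_url,postcodes,label_name\n"
    "1,https://example.co.uk/,example.co.uk,BS8 1TH,Accountancy\n"
    "2,https://example.org.uk/,example.org.uk,M1 1AE,Design studio\n"
)

reader = csv.DictReader(sample)
present, missing = check_columns(reader.fieldnames)
rows = list(reader)

print("present:", present)
print("missing:", missing)
print("rows:", len(rows))
```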
Groups
| Name | Problem |
|---|---|
| Gabriel Adam Pierzynski | 1 |
| Thomas Carey-Wilson | 1 |
| Timothy Monteath | 1 |
| Kelly Yubini | 1 |
| Giovanni Maria Pala | 1 |
| Nirat Rujimora | 2 |
| Meihui He | 2 |
| Camilo Andres Lopez Barra | 2 |
| Do Ngoc Thao | 2 |
| Rita Rasteiro | 2 |
| Jia Zhao | 3 |
| Jo Kent | 3 |
| Wander Demuynck | 3 |
| Paddy Smith | 3 |
| James Thomas | 3 |
| Christina Palantza | 4 |
| Fanqi Zeng | 4 |
| Céline Van Migerode | 4 |
| Filippo Dionigi | 4 |
| Nora Ramsey | 4 |
| Meng Le Zhang | 5 |
| Aditi Dutta | 5 |
| Esha Sadia Nasir | 5 |
| Mariam Cook | 5 |
| Helena Byrne | 5 |
| Wong E Chern | 5 |
Access to VM
To access the VM where you can work on the data, please use this link.