Call for Applications: Web Archives for Social Sciences Datathon

Do you want to use web data at scale for socio-economic research, but don’t know how to start? Are you interested in exploring large Web Archives for your research but unsure where to begin?

We’ve got you covered.

On 27–28 November 2025 we are organising a Web Archives for Social Sciences Datathon at the University of Bristol. This is in collaboration with our partners: the Common Crawl and UK Web Archive at The British Library. The datathon will take place at the BDFI Neutral Lab.

This two-day event will build capacity in the social science research community to use large-scale Web Archive data for policy-relevant, socio-economic research. Participants will work in teams with curated data extracts from the Common Crawl to address real-world research challenges. They will be supported by our expert facilitators.

Who should apply?

  • Early-career and more established researchers in the social sciences who currently use, or wish to use, web data for their research, especially data from Web Archives.

  • Researchers with technical backgrounds (e.g. data science) who want to apply their skills to Web Archive data for policy-relevant research.

  • Some prior experience with code-based tools (e.g. Python or R) is expected.

What will you gain?

By the end of the Datathon, you will be:

  • More confident in using Web Archives to address your own research questions.

  • Familiar with the structure, challenges, and opportunities of these data.

  • Equipped with technical expertise for working with Web Archive data at scale.

  • Experienced in using LLMs to analyse such web data.

Practical details

When: 27–28 November 2025

Where: Bristol, BDFI Neutral Lab

Support: Limited financial assistance for travel and/or accommodation is available.

How to apply

Send a one-page document to Emmanouil Tranos (e.tranos at bristol dot ac dot uk) by 15 October 2025, including:

  • Your current position

  • Your research interests, particularly how you are using (or plan to use) Web Archive data

  • Your technical skills

  • Your location and whether you require financial support.

BDFI’s Neutral Lab. Source: BDFI webpage.

This Datathon is part of the Atlas of Economic Activities project and is funded by SDR UK.

Datathon materials

Agenda

Thursday 27/11/2025

  • 09.30-10.00 Arrival at the BDFI; coffee and snacks available

  • 10.00-10.15 Welcome and introductions [Emmanouil Tranos, University of Bristol]

  • 10.15-10.30 Introduction to the Common Crawl [Laurie Burchell and Thom Vaughan, Common Crawl]

  • 10.30-10.45 Web data and LLMs [Leonardo Castro Gonzalez, University of Bristol]

  • 10.45-11.00 Problems and teams [Emmanouil Tranos, University of Bristol]

  • 11.00-13.00 Work [facilitators will be around; snacks and coffee available]

  • 13.00-14.00 Lunch and Q&As

  • 14.00-18.00 Work [facilitators will be around; snacks and coffee available]

  • 18.00-19.30 Pizza and brief team updates

  • 20.00 Leave the BDFI

Friday 28/11/2025

  • 09.00-12.30 Work [facilitators will be around; snacks and coffee available]

  • 12.30-13.30 Lunch and Q&As

  • 13.30-15.30 Work [facilitators will be around; snacks and coffee available]

  • 15.30-16.45 Group presentations

  • 16.45-17.45 Closing remarks and reception

Facilitators: Emmanouil Tranos, Leonardo Castro Gonzalez, Laurie Burchell and Thom Vaughan.

Guest star: Jon Reades.

Problems

  1. This is a cache of web data containing all commercial websites (.co.uk, landing webpages only) archived by the Common Crawl (2021) that (1) include at least one UK postcode in their web text, and (2) we believe represent Financial Services. Identify the different sub-classes (industries) within Financial Services and highlight the websites that provide Fintech services.

  2. This is a cache of web data containing all commercial websites (.co.uk, landing webpages only) archived by the Common Crawl (2021) that (1) include at least one UK postcode in their web text, and (2) we believe represent the Creative Industries. Identify the different sub-classes (industries) within the Creative Industries and highlight the CreaTech websites.

  3. This is a cache of web data containing all commercial websites (.co.uk, landing webpages only) archived by the Common Crawl (all of 2021 and 2024) that include at least one postcode from Manchester or Birmingham in their web text. The 2021 commercial websites are classified by economic activity, while the 2024 websites are not. Use the 2021 data to classify the 2024 data for both cities. Compare the industrial structure of these cities and analyse how it evolved over time. Data: Manchester data for 2021 and 2024, and Birmingham data for 2021 and 2024.

  4. This is a cache of web data (part 1 and part 2) containing all the UK governmental webpages (.gov.uk) that have been archived by the Common Crawl (February-March 2024). Identify the key policy areas Local Authorities in the UK focus on. Are there any Local Authorities that have implemented distinct policies or actions?

  5. This is a cache of web data containing all the UK governmental webpages (.gov.uk) that have been archived by the Common Crawl at two points in time: February-March 2024 (part 1 and part 2, the same as in Problem 4) and October 2025. Identify changes in a specific policy domain associated with the new government elected in July 2024.
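For Problem 3, one very naive starting point for carrying the 2021 labels over to the 2024 websites is a bag-of-words nearest-centroid classifier. The sketch below is only a baseline under stated assumptions, not the organisers' pipeline: it assumes each row is a dictionary with the `content` and `label_name` fields described under "Data description", and the example texts and label names ("Banking", "Design") are made up for illustration. Teams will probably want to replace it with embeddings or an LLM-based approach.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens of three or more letters."""
    return re.findall(r"[a-z]{3,}", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b.get(term, 0) for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fit_centroids(labelled_rows):
    """Sum the term counts of all texts sharing a label (one vector per label)."""
    centroids = defaultdict(Counter)
    for row in labelled_rows:
        centroids[row["label_name"]].update(tokenize(row["content"]))
    return dict(centroids)


def predict(text, centroids):
    """Return the label whose summed vector is most similar to the text."""
    vector = Counter(tokenize(text))
    return max(centroids, key=lambda label: cosine(vector, centroids[label]))


# Hypothetical toy rows standing in for the labelled 2021 data.
labelled_2021 = [
    {"content": "We offer mortgages, savings accounts and personal loans.",
     "label_name": "Banking"},
    {"content": "Independent studio for graphic design, branding and illustration.",
     "label_name": "Design"},
]
# Hypothetical toy texts standing in for the unlabelled 2024 data.
unlabelled_2024 = [
    "Compare savings accounts and loans online",
    "Branding and logo design studio",
]

centroids = fit_centroids(labelled_2021)
predictions = [predict(text, centroids) for text in unlabelled_2024]
print(predictions)
```

Because cosine similarity ignores vector length, summing term counts per label behaves like averaging them. Common words such as "and" still contribute to every score, so a stop-word list or TF-IDF weighting would be a sensible next step.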

Data description

You will find the following columns in the different data packages. Please be aware that not all columns are present in all data packages.

  • id: a record identifier, not linked to any external data

  • content: the web text from the landing page of each website

  • summary: an LLM-generated summary of the web text

  • explanation: an LLM-generated explanation of how the summary was produced

  • clean_content: LLM-cleaned web text from the landing page of each website

  • url: the website URL

  • parent_url: the website domain

  • postcodes: UK postcodes found in the web text
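To illustrate how a field like `postcodes` might be derived from `content`, here is a minimal sketch that pulls postcode-like strings out of text and reads off the postcode area (for example `M` for Manchester or `B` for Birmingham). The regular expression is a deliberate simplification and an assumption on our part: it is not necessarily how the data packages were built, and it will accept some strings that are not valid postcodes. The sample sentence is made up.

```python
import re

# Simplified UK postcode shape: area letters, district, then the inward code.
# This is intentionally loose and does not validate against real postcodes.
POSTCODE_RE = re.compile(r"\b([A-Z]{1,2})(\d[A-Z\d]?)\s*(\d[A-Z]{2})\b")


def extract_postcodes(text):
    """Return normalised postcodes found in text, in order, without duplicates."""
    found = []
    for area, district, inward in POSTCODE_RE.findall(text.upper()):
        postcode = f"{area}{district} {inward}"
        if postcode not in found:
            found.append(postcode)
    return found


def postcode_area(postcode):
    """Return the leading letters of a postcode, e.g. 'M' for 'M1 1AE'."""
    return re.match(r"[A-Z]+", postcode).group(0)


# Made-up sample text for illustration only.
sample = "Visit us at 1 Example Street, Manchester m1 1ae or in Birmingham B1 1BB."
codes = extract_postcodes(sample)
areas = {postcode_area(code) for code in codes}
print(codes, areas)
```

Grouping websites by postcode area in this way is one possible route to splitting a data package by city.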

The remaining fields were produced by our TNT-LLM-inspired pipeline, which uses LLMs to classify websites into a two-level typology (for now) of economic activities: high-level clusters and low-level clusters. These fields are:

  • partition_id: high-level cluster ID (imagine something like sector)

  • label_id: low-level cluster ID (imagine something like industry)

  • label_name: low-level cluster name (imagine something like industry)

  • label_description: low-level cluster description (imagine something like industry)
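Once every website carries a `label_name` (given for 2021, and predicted by your own method for 2024), comparing industrial structures comes down to comparing label shares. The following is a rough sketch under assumptions: rows are represented as plain dictionaries, and the label names and counts are invented toy values rather than anything taken from the data packages.

```python
from collections import Counter


def industry_shares(rows, key="label_name"):
    """Return {label: share of websites} for rows that carry the given key."""
    counts = Counter(row[key] for row in rows if row.get(key))
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}


def share_changes(before, after):
    """Return {label: after share minus before share} for labels in either period."""
    labels = set(before) | set(after)
    return {label: after.get(label, 0.0) - before.get(label, 0.0) for label in labels}


# Hypothetical toy rows; real rows would be read from the data packages.
rows_2021 = [
    {"id": 1, "label_name": "Retail banking"},
    {"id": 2, "label_name": "Retail banking"},
    {"id": 3, "label_name": "Insurance"},
    {"id": 4, "label_name": "Insurance"},
]
rows_2024 = [
    {"id": 5, "label_name": "Retail banking"},
    {"id": 6, "label_name": "Insurance"},
    {"id": 7, "label_name": "Insurance"},
    {"id": 8, "label_name": "Insurance"},
]

shares_2021 = industry_shares(rows_2021)
shares_2024 = industry_shares(rows_2024)
changes = share_changes(shares_2021, shares_2024)
print(changes)
```

Rows without a `label_name` are skipped rather than counted, which matters because not all columns are present in all data packages.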

Groups

Name                         Problem
Gabriel Adam Pierzynski      1
Thomas Carey-Wilson          1
Timothy Monteath             1
Kelly Yubini                 1
Giovanni Maria Pala          1
Nirat Rujimora               2
Meihui He                    2
Camilo Andres Lopez Barra    2
Do Ngoc Thao                 2
Rita Rasteiro                2
Jia Zhao                     3
Jo Kent                      3
Wander Demuynck              3
Paddy Smith                  3
James Thomas                 3
Christina Palantza           4
Fanqi Zeng                   4
Céline Van Migerode          4
Filippo Dionigi              4
Nora Ramsey                  4
Meng Le Zhang                5
Aditi Dutta                  5
Esha Sadia Nasir             5
Mariam Cook                  5
Helena Byrne                 5
Wong E Chern                 5

Access to VM

To access the VM where you can work on the data, please use this link.