The Flux by Epistemix

Welcome to The Flux - where we talk data, decisions, and stories of people asking the what-if questions to create an intentional impact on the future.

Mining Meaning: Laura Sheppard on Gender, Academia, and the Power of Public Data

April 22, 2025 • Season 1 • Episode 18

In this episode of The Flux, we talk with Laura Sheppard, a research fellow at University College London’s Centre for Longitudinal Studies, about how data mining can uncover powerful insights from unexpected sources. Laura shares her work using the British Library’s EThOS dataset, a comprehensive record of UK doctoral theses, to explore gender inequality in higher education. We discuss the process of inferring gender from names, the challenges of working with messy or incomplete data, and how publicly available datasets can be creatively repurposed to answer socially important questions. Laura also touches on her work with The Trussell Trust, combining geographic and census data to map food bank accessibility in the UK.

This conversation is a deep dive into the art and science of data mining, and how thoughtful assumptions, transparent methodology, and a bit of creativity can turn raw data into meaningful research.

Hello and welcome to The Flux, where we talk to experts who are using data to help people make better decisions. In this episode, I'm talking to Laura Sheppard, a research fellow at University College London’s Centre for Longitudinal Studies, about data mining. So welcome, Laura.

Laura: Hello. Nice to be here.

First of all, could you describe, for people who’ve never heard of it, what data mining is?

Laura: So, in my experience, data mining is all about working with an unclean dataset in its rawest form. It contains useful information, but not necessarily the exact information we want for our research or the questions we’re trying to answer.

I've always approached it as the process of extracting the useful information within the raw dataset to help identify patterns, answer research questions, and uncover insights that aren’t obvious at first glance, especially when working with newer or less traditional forms of data.

People often think of data as just numbers and huge spreadsheets, but I’ve used a lot of text mining too, like searching for certain phrases within the data to infer specific information. And we’ll probably talk more about this later, but I’ve also worked on using names to infer personal characteristics about people. So there are lots of options and opportunities. At least in my experience, there’s been quite a bit of that.

I come at this from an academic standpoint, but the data mining techniques I've used, and that others use in academia, can absolutely be applied in the business or corporate world too. Just thinking about the information social media companies collect and what they can mine from that - it's pretty fascinating.

Yeah, I think one thing that really stood out in our previous conversation was the idea that you can use data mining to access valuable information from datasets that weren’t originally designed for research. With some creativity and the right techniques, you can repurpose them for entirely new uses.

Laura: Yeah, definitely. And I think when you’ve got a really interesting research question, there’s often data out there you can use, even if it wasn’t originally collected for research purposes. I’ve seen a lot of people moving toward these new sources of data. It's a really exciting area right now.

Could you tell us a bit about the work you did with the British Library’s EThOS dataset? I think it ties in perfectly with this idea: something collected organically and repurposed for meaningful, socially relevant research.

Laura: Yeah, absolutely. So, for anyone who doesn’t know, and I imagine most listeners won't, the British Library’s EThOS service is an online record of all doctoral theses in the UK. It includes about 98% of all UK doctorates, so you can think of it as a kind of census of doctoral students. It’s not just a sample; it's nearly everyone.

There are around 625,000 records currently in EThOS. What it does is scrape new thesis records from university repositories and post the metadata (the information about the thesis) on the site. This gives us a really rich source of data. For example, when I finished my PhD, my record (though it might not be live yet) would eventually include my full name, the year I completed it, the university, the discipline, the title, possibly my supervisors, and the department.

This metadata is available for all 625,000 theses. And it's not just PhDs; there are professional doctorates too, like in clinical or educational psychology. So there's a huge amount of information on people doing really interesting research.

The thing with PhDs, in particular, is that although some people do publish during their PhD (what’s called a PhD by publication), they’re a smaller group. A lot of the research doesn’t get widely shared. So this dataset contains a wealth of interesting and often underutilized information.

My main research focus is on gender and higher education inequalities, and EThOS provides a unique data source for that. Other datasets like those from the Higher Education Statistics Agency give high-level statistics, but not at the individual level. UK Research and Innovation (UKRI), the main PhD funding body, also publishes data, but it only covers about a quarter of PhD students. So we don’t get that full-picture, individual-level insight anywhere else.

The great thing is that the EThOS data is publicly available. Anyone can go to the British Library site and download the dataset I used. That ties in with the broader open science and open research movement. You can even replicate my study with my code, or explore the data in different ways.

Now, in my research, since EThOS doesn’t include personal characteristics like gender, age, or ethnicity, I had to infer gender using students' names. I also mined other information and created new variables from the metadata I received, and some additional data from the British Library. That transformation from raw data to a fully usable dataset is a form of data mining in itself.

For gender inference, there are algorithms that can do this. Some are built into coding languages like R and Python, and there are paid services like Gender API, Genderize, and NamSor. NamSor is the one we ended up using because it was able to combine first and last names to infer gender, which was helpful given the international nature of the UK PhD community.

A good example I often use is the name Andrea. In English, Andrea is typically female, but in Italian, it’s typically male. By including the surname, NamSor helps clarify that ambiguity.

How do you handle genuinely gender-neutral names in English like Lindsay?

Laura: Some of the algorithms actually take the year into account. So, for example, Lindsay might have been more common as a male name in the early 1900s but more often used for females in the 2000s. That time component can help improve accuracy.

These algorithms also give you a confidence score, like 0.95 for high confidence or 0.55 for low confidence. We also looked at how large their sample size was. If a prediction was based on 100,000 examples, we could be more confident in it.

However, they typically only predict binary genders, male or female, so they don’t account well for nonbinary or other gender identities. In those cases, I preferred to label names as "unknown" rather than risk an inaccurate guess.
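As an illustration for readers (not part of the conversation), the idea of keeping a prediction only when it is well supported, and falling back to "unknown" otherwise, might be sketched as follows. The function name, the 0.9 confidence cut-off, and the minimum sample size of 100 are all assumptions chosen for the example; the episode does not state which thresholds the study used.

```python
from typing import Optional


def label_gender(
    predicted: Optional[str],
    confidence: float,
    sample_size: int,
    min_confidence: float = 0.9,  # hypothetical cut-off, not from the study
    min_samples: int = 100,       # hypothetical cut-off, not from the study
) -> str:
    """Keep a binary prediction only when it is well supported.

    Anything missing, low-confidence, or based on too few examples is
    labelled 'unknown' rather than guessed.
    """
    if predicted not in ("female", "male"):
        return "unknown"
    if confidence < min_confidence or sample_size < min_samples:
        return "unknown"
    return predicted


# The two confidence scores mentioned in the conversation:
print(label_gender("female", 0.95, 100_000))
print(label_gender("male", 0.55, 100_000))
```

With these assumed thresholds, the 0.95 prediction is kept and the 0.55 prediction becomes "unknown".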

That really highlights how, in data science, especially with exercises like this, there’s a level of artistry involved. You're making assumptions, but the key is to be transparent and document them, as I assume you did in your thesis, which is presumably now on EThOS?

Laura: Yeah, exactly. And as we pulled in additional data sources, like the year someone submitted their thesis, it allowed us to add more context. It really is like pulling on a thread: you discover new angles and insights as you go.

People think quantitative research is all black and white, but there’s actually a lot of nuance. You have to make judgment calls along the way. For example, many people in our dataset only had initials instead of full first names. That made gender inference tough. But sometimes they had a middle name, so we used that instead.

Out of about 60,000 records with initials, only about 1,000 to 2,000 had usable middle names, but even that helped expand our dataset and improve accuracy. It’s all about digging into the details and being intentional about the choices you make.
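As an illustration for readers (not part of the conversation), the fallback from an initial to a middle name could look something like the sketch below. It assumes the given names arrive as a single string such as "J. Robert"; the actual layout of the EThOS name field is not described in the episode, and the helper names are hypothetical.

```python
import re
from typing import Optional


def is_initial(token: str) -> bool:
    """True for a single letter with an optional full stop, e.g. 'J' or 'J.'."""
    return re.fullmatch(r"[A-Za-z]\.?", token) is not None


def name_for_inference(given_names: str) -> Optional[str]:
    """Return the first given name that is a full name rather than an initial.

    Returns None when every given name is an initial, in which case the
    record cannot be used for name-based inference.
    """
    # Put a space after each full stop so that 'J.Robert' splits into two tokens.
    for token in given_names.replace(".", ". ").split():
        if not is_initial(token):
            return token
    return None


print(name_for_inference("J. Robert"))  # first name is an initial, middle name is usable
print(name_for_inference("J. R."))      # initials only, nothing usable
```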

So what were some of the headline findings from the project?

Laura: Looking at the UK PhD landscape, we found both general and gender-specific trends. The top PhD-awarding institutions were Oxford, Cambridge, UCL, Imperial, and Edinburgh, probably not a surprise. For disciplines, the top ones were medicine and health, engineering and technology, social sciences, physical sciences, and biological sciences.

Out of those, four are STEM fields, which already hints at gender dynamics. Male-skewed disciplines included computer science, physical sciences, engineering, and math. Female-skewed ones were arts and design, languages and literature, and education.

Even as more women earn PhDs, the gender gap in some fields, especially physical sciences and engineering, hasn’t closed much. We limited our dataset to theses from 1990 onward, partly because of gender inference issues. In the 1990s, the gender balance was much more skewed, with only around 35% of PhD students being female. That’s improved to about 44–45% in recent years.

We also looked at institutions that were gender-skewed. Roehampton and Queen Margaret University in Edinburgh were more female-skewed. Roehampton began as four women’s teacher training colleges, and Queen Margaret started with a focus on domestic and home economics education for women.

On the male-skewed side, Cranfield and Heriot-Watt stood out. Cranfield was founded for aerospace and engineering, and Heriot-Watt was the world’s first mechanics institute. Both were very male-dominated historically, and those patterns persist.

Yeah, I can’t imagine Cranfield wants to be seen as a “male institution,” but it’s interesting how that legacy shows up in data. Do we know if the gender balance changes further along the academic path, for example, between undergraduate, PhD, and professor levels?

Laura: It varies by institution and discipline. Some undergraduate math courses might be 50-50 male and female, while others skew more. But in male-dominated STEM fields, the proportion of women tends to drop off as you go further into academia. You’ll often find more women among PhD students than among full professors. In the UK, only about 30–31% of full professors are women.

That’s why places like Roehampton matter: they can have a cyclical effect, attracting female students who may go on to become academics themselves.

And by breaking the data into decades, we could see that things are improving. In the 1990s, only around 35% of PhD students were women. That’s risen to over 45% in the 2010s. The Higher Education Statistics Agency says that 51% of postgraduate researchers are now women, but that includes professional doctorates and research master’s degrees too.

Still, the most male- and female-skewed disciplines haven’t really changed across the three decades we studied.
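As an illustration for readers (not part of the conversation), breaking records into decades and computing the share of women in each might be sketched like this. The records below are made-up toy data, and excluding "unknown" genders from the denominator is an assumption for the example rather than a documented detail of the study.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def female_share_by_decade(records: Iterable[Tuple[int, str]]) -> Dict[int, float]:
    """Share of theses by women in each decade.

    records holds (year, gender) pairs. Records whose gender is not
    'female' or 'male' (for example 'unknown') are left out entirely.
    """
    counts = defaultdict(lambda: {"female": 0, "male": 0})
    for year, gender in records:
        if gender in ("female", "male"):
            decade = (year // 10) * 10  # 1994 -> 1990, 2015 -> 2010
            counts[decade][gender] += 1
    return {
        decade: c["female"] / (c["female"] + c["male"])
        for decade, c in sorted(counts.items())
    }


# Made-up toy records, for illustration only.
toy_records = [
    (1991, "female"), (1995, "male"), (1998, "male"), (1999, "male"),
    (2012, "female"), (2015, "male"), (2016, "unknown"),
]
print(female_share_by_decade(toy_records))
```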

Yeah, it makes sense. I think your work really shows the value of repurposing publicly available datasets. There's some overlap here with what Epistemix is doing: we're building a synthetic population for the U.S., linking individual-level data across areas like health, work, and family. We're hoping to use it for modeling and analysis, even though much of the data comes from separate, publicly available sources.

Laura: Yeah, definitely. I think my project shows what’s possible with just one large dataset. But once you start linking datasets, the power increases dramatically. It’s about what variables you can create, what you can infer, and what new insights that can lead to.

Exactly. The total becomes more than the sum of its parts.

Laura: Yeah. That reminds me of a project I did during my PhD with the Trussell Trust, the UK’s largest food bank network. I took a three-month research assistant role to work on accessibility and demand for food banks across the UK.

We combined data from different sources (Trussell Trust locations, independent food banks, and census data) to understand where demand was high and accessibility was low. We used addresses to map locations, calculate travel times, and assess whether people could realistically reach a food bank. If someone’s nearest one is an hour’s walk away, are they really going to be able to get there?

This involved a lot of data mining, even just finding and cleaning the data, matching postcodes to regions, calculating travel times, and combining everything with census data on deprivation, work status, housing, etc. It was a great example of how new and traditional data sources can be used together to answer important questions.
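As an illustration for readers (not part of the conversation), a very rough version of the "how long is the walk to the nearest food bank" question could be sketched as below. It is a simplification under stated assumptions: it uses straight-line (haversine) distance and an assumed walking speed of 5 km/h, whereas the episode does not say how the project calculated travel times. The function names are hypothetical.

```python
import math
from typing import Iterable, Tuple

Coordinate = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points on Earth."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def nearest_walk_minutes(
    origin: Coordinate,
    food_banks: Iterable[Coordinate],
    speed_kmh: float = 5.0,  # assumed walking speed, not from the project
) -> float:
    """Estimated walking time in minutes to the closest food bank.

    Straight-line distance understates real walking routes, so treat the
    result as a lower bound.
    """
    nearest_km = min(haversine_km(*origin, *bank) for bank in food_banks)
    return nearest_km / speed_kmh * 60
```

A real analysis would more likely use a routing service or road-network data, since straight-line distance ignores rivers, railways, and the actual street layout.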

Yeah, and it’s empowering to know that even if you don’t have perfect data at the start, you can still build something meaningful with what’s available.

Laura: Exactly. It’s all about focusing on the question, then figuring out how to answer it with whatever data you can find or create.

As a field, that’s really exciting. So before we wrap up, do you want to talk about what you're working on now?

Laura: Sure. I'm only a couple of months into my new role, my first job post-PhD. I'm continuing to focus on educational equality, working at the Centre for Longitudinal Studies. Right now, we’re exploring educational attainment and how it’s shaped by both genetic and environmental factors using cohort data, basically following people of the same age over time.

It’s a different kind of dataset, but I’m sure I’ll still be doing lots of data mining, variable creation, and developing new methodologies. It's been both interesting and challenging so far - a good new challenge after finishing the PhD.

Great. My own challenge after my PhD was joining Epistemix, which has also been exciting and challenging.

Okay, well thank you very much for your time, Laura. This has been fascinating.

Laura: Thanks so much. I’ve really enjoyed it. It’s nice to share academic research on a broader scale.

People on this episode

John Cordier

Host