Project deliverables
The project’s main deliverable is a curated database of metadata and associated replication data related to English-language publications on the topic of ‘social trust’ in social science disciplines, broadly construed, over the past three decades (since 1990).
The database will consist of three subsets that will help address several different aims and research questions:
A. Thematic dataset (THEME)
This master database will contain general publication metadata on the theme of “trust” research in the social sciences (broadly construed). Its main content will be generic bibliometric information obtained from databases such as Web of Science (WoS), Scopus, Google Scholar, and/or OpenAlex. This will be complemented with additional features that can be extracted with an acceptable level of reliability using machine learning and artificial intelligence tools, such as broad methodological classifications (e.g. theoretical; empirical-qualitative; empirical-quantitative).
Data downloaded via OpenAlex covering the period 1990–2025 contains around 88,000 publications that have an associated abstract. Including all the identified publications – those with missing abstracts as well – would double the number of entries, but these are likely to contain outputs such as reviews, which are not the focus here. The cleansed final “thematic dataset” will be smaller, but still rather extensive.
Main tasks:
Manual selection and downloading of bibliometric metadata from databases available through open or institutional access options;
Automated dataset reduction and cleansing
Dataset augmentation (creating additional thematic and methodological classification variables)
Main challenges:
Identifying all the publications that have an important “trust” component may be a challenge for several reasons: (a) we want to treat “social trust” as a broad concept that encompasses various definitions and operationalisations of “trust” in a social science context (not only “generalised” trust, but other forms of interpersonal trust and institutional trust as well), so there are many keywords to consider; (b) we can only rely on titles and abstracts, which may not mention a “social trust” variable as a central focus of the analysis even when it is an important variable explored and reported on in the methodology and findings sections; (c) the word “trust” is often used in titles and abstracts as shorthand for social/generalised/interpersonal/institutional/etc. trust, since the type of trust in focus is implicit in the discipline/journal that provides the context; (d) for these reasons, the most encompassing approach is to identify all the publications that mention “trust” in titles, abstracts or keywords, and then use some other approach to determine whether the focus of the output really is the type of trust we are interested in;
For the reasons above, the first version of the Thematic dataset is very large and will need an innovative approach to identify suitable entries from a variety of clues, such as textual, publication and disciplinary context; most likely, this will require a finely calibrated LLM tool to perform this initial large-scale data cleansing task with human supervision and fine-tuning. One advantage at this stage is that online models can be applied, given that the data on which they will operate (titles, abstracts, keywords) are to a large extent openly available.
Aims and questions:
Give us a broad overview of trust research over the past three decades (time trends; geographies; disciplinary distributions; open-access; etc.)
Methodological trends: what is the proportion of theoretical vs. empirical-qualitative vs. empirical-quantitative studies? How have methodological trends changed over time?
B. Meta-analytical dataset (META)
This subset of the master dataset focuses only on the identified primary quantitative empirical studies (i.e. excluding meta-analyses). Depending on the size of the initial quantitative subset, the dataset may be further restricted based on publication prestige, impact etc. This dataset will contain further added features/variables describing more in-depth findings about “trust” research (e.g. proportion of primary vs. secondary quantitative data sources; whether “trust” is the dependent or independent variable (or something else); replication/data access statements; etc.).
I estimate that this “meta-analytical dataset” would consist of a few hundred publications.
Main tasks:
Manual review of the publications identified as empirical quantitative studies in the thematic dataset;
Manual dataset reduction and cleansing based on quality, influence, etc.;
Manual sourcing of full texts available through open or institutional access;
Main challenges:
The first version of this Meta-analytical dataset may still be very large and therefore in need of automation;
Even a more reduced version will be rather large, and securing the full texts manually will require considerable time. Tools such as Zotero’s “Find Full Text” utility may come in handy, but they have a high failure rate, especially for older publications. JSTOR can also speed up the process. The main challenge will be to identify the most resource-efficient way of accessing a large number of PDFs ethically through institutional log-in access;
Setting up a fine-tuned local LLM capable of accurately identifying the main features to be extracted from the full texts of the articles (i.e. (a) reproducibility; (b) data sources; (c) statistical method(s); (d) dependent/independent status; (e) causal claim) will be a significant challenge;
Aims and questions:
What proportion of quantitative “trust” research adheres to different reproducibility standards?
What are the main sources of data for “trust” research in various social science disciplines?
Is quantitative “trust” research more interested in understanding the factors contributing to differences in social trust, or rather in how social trust may affect other outcomes and aspects of life? Is there a disciplinary divide in this respect? Has this changed over time?
What causal claims are being made in respect to “trust”?
What different statistical methods are being applied in quantitative “trust” research?
C. Methodological dataset (METHOD)
The smallest and most detailed subset, it consists of studies whose findings can be replicated. The studies in this dataset rely entirely on data that are publicly available or that have been made fully available by the authors, either as part of a replication package or on request by the TReMeDa research team. The inclusion criteria for this database will be methodologically driven, and its primary purpose is to serve as a teaching resource underpinning a textbook on Quantitative Sociological Methodology Applied to Researching Trust, while remaining available to instructors in general.
Studies in this dataset will be accompanied by detailed data documentation, formatted analytical datasets, and analytical code that allows (parts of) the original analyses to be reproduced by students and other researchers.
I estimate that around 50 publications will make up the final methodological dataset and they will cover a broad variety of social science disciplines and methods.
Main tasks:
Classify the studies according to their potential pedagogical contributions (i.e. which data analysis methods can be introduced using the study and its dataset: linear regression; logistic regression; statistical tests; causal analysis; time series; multilevel modelling; SEM; etc.). Ideally, this task will rely on the semi-automated statistical methods classification of the Meta-analytical dataset;
Compile analytical datasets from the original data sources using reproducible and traceable (documented) R code (and potentially other languages);
Create detailed documentation for the datasets and the data manipulation process;
Create annotated analytical code files in R (and potentially other languages) that reproduce (parts of) the originally published analyses;
Main challenges:
The datasets included in METHOD should be able to cover a very wide range of methods from different disciplines;
Compiling reproducible and well documented code will require a significant amount of time, effort and attention;
It is also essential that the code serves well as a pedagogical resource: it should use the best packages for making the code readable and easy to implement for students starting out in coding, while ensuring that the code remains usable even if the packages it relies on are updated or fall out of active maintenance;
The choice of which parts of the originally published analyses should be reproduced – and therefore the content of the analytical datasets – will require careful pedagogical and methodological thinking and decision-making;
Aims and questions:
- The main aim of the METHOD database is to provide datasets and code that demonstrate how various statistical methods have been - or can be - applied in real-life research, primarily as a pedagogical tool for enhancing the teaching of quantitative research methods in the social sciences.
Research Assistant roles
Research Assistants will support work on the TReMeDa project across two distinct but complementary workstreams:
- computational tasks centred on building and running AI-assisted processing pipelines that underpin the creation of the THEME and META datasets;
- various research-oriented tasks across THEME, META and METHOD, including literature evaluation and review, methodological judgement and manual coding, as well as - crucially - conceptual work to operationalise the criteria for AI algorithms and perform AI output validation.
These two sets of activities require different skill profiles and can in large part proceed in parallel, supporting each other. They are mapped onto two Research Assistant roles:
Research Assistant 1: Literature Screening and Methodological Validation
We are seeking a Research Assistant to support the final stages of the Trust Research Methodology Database (TReMeDa), a research project based in the School of Geography, Politics and Sociology at Newcastle University. The project is building an open-access, AI-assisted database of academic publications on social trust, cataloguing their research methodologies, data availability, and suitability as teaching resources. Outputs will underpin a forthcoming research methods textbook and a public dataset for use by educators and early-career researchers.
The primary focus of this post is the research-facing dimension of the project: screening publications for relevance and quality, reviewing and validating the outputs generated by the project’s AI-assisted extraction system, and contributing expert methodological judgement at key decision points. This is a role for someone with strong grounding in quantitative social science research methods and familiarity with the academic literature in sociology, political science, or a related social science discipline.
The RA will work closely with the Principal Investigator (PI) and, where applicable, alongside the complementary computational RA post. Work will be coordinated via shared project files and a version-controlled platform (GitHub). Regular one-to-one meetings with the PI will be scheduled throughout.
This role offers excellent professional development opportunities for doctoral researchers or recent postgraduates, including experience with systematic review methodology, AI-assisted research workflows, open science and reproducibility practices, and contribution to a textbook-level research output.
Key Responsibilities
Literature screening: Review publications drawn from a large bibliographic dataset covering social, institutional, and interpersonal trust research; assess relevance, disciplinary scope, publication quality, and likely methodological detail.
Validation of AI-generated classifications: Review and correct structured metadata extracted by the project’s large language model (LLM) pipeline, including classifications of research methodology, trust variable roles, statistical models used, and data/code availability statements.
Methodological coding: Apply controlled vocabularies and classification schemas to describe the design, data sources, and statistical techniques used in reviewed publications; flag ambiguous cases for discussion with the PI.
Pedagogical assessment: Evaluate whether individual studies and associated datasets are suitable for use as teaching resources, and at what level (introductory, intermediate, or advanced quantitative methods).
Quality control and documentation: Maintain clear, version-controlled records of all coding and validation decisions; ensure consistency and auditability of the classification process.
Schema refinement: Provide feedback on the project’s AI prompt templates and classification ontologies based on validation experience; assist in improving extraction accuracy through annotated examples.
Dataset and replication assessment: Where relevant publications include replication materials (data files, code, supplementary documentation), assess completeness and usability of those materials.
Research Assistant 2: Computational Pipeline and Data Engineering
About the Role
We are seeking a Research Assistant to support the computational and data engineering dimension of the Trust Research Methodology Database (TReMeDa), a research project based in the School of Geography, Politics and Sociology at Newcastle University. The project is constructing an open-access database of academic publications on social trust, using an AI-assisted pipeline to automatically extract and classify methodological information from a large body of published research. Outputs will underpin a research methods textbook and a public, versioned dataset for educators and researchers.
The primary focus of this post is building and operating the technical infrastructure that processes publications at scale. This includes setting up and running PDF ingestion tools, submitting batch extraction jobs to the university’s High Performance Computing (HPC) facility using locally deployed large language models (LLMs), maintaining the output database, and ensuring that all components of the workflow are documented and reproducible. The role also involves close collaboration with the PI and, where applicable, the complementary literature-validation RA post.
This role offers significant professional development opportunities for doctoral researchers or recent postgraduates with interests in computational social science, reproducible research infrastructure, and applied AI. The RA will gain direct experience with open-weight LLMs in an HPC research environment, end-to-end reproducible data pipelines, systematic review automation, and contributing to an open-source academic codebase. The project’s technical outputs (codebase and data pipeline documentation) are planned for public release, giving the RA co-authorship credit in an open-source research context.
Key Responsibilities
PDF ingestion and preprocessing: Convert academic article PDFs to structured text using open-source parsing tools (e.g. GROBID); implement and refine section-segmentation scripts to isolate abstracts, methods sections, tables, and supplementary materials.
LLM batch processing: Submit and manage batch extraction jobs on the university’s HPC GPU infrastructure using locally deployed open-weight large language models (e.g. DeepSeek, Qwen, GLM families); monitor job outputs and logs.
Prompt engineering and schema development: Develop, test, and refine task-specific prompt templates designed to produce structured JSON outputs covering methodological classifications, variable roles, statistical models, and data availability; work iteratively with the PI and validation RA to improve prompt accuracy.
Database integration: Merge AI-extracted metadata with bibliographic records from the master metadata dataset; apply controlled vocabularies and standardised coding; flag and resolve consistency issues.
Script development and documentation: Write and maintain reproducible R and/or Python scripts for all pipeline stages; ensure all code is version-controlled, commented, and executable from raw inputs.
Database publication and versioning: Prepare structured datasets for public release; maintain version-controlled repository on GitHub; assist with archiving stable releases (e.g. via Zenodo DOI assignment); contribute to data dictionary and user documentation.
Open-access and compliance checks: Assist in identifying open-access article versions (e.g. via Unpaywall API) for inclusion in the processing pipeline, in compliance with institutional access policies; maintain accurate records of file acquisition status.
Research Assistant work-plan (tentative)
The work-plan below maps the project’s six stages onto its three deliverable datasets — THEME, META, and METHOD — and assigns responsibilities across the PI, the literature/validation RA (RA1), and the computational/pipeline RA (RA2). Stages 1–2 produce the THEME dataset, Stages 3–4 produce the META dataset, and Stage 5 produces the METHOD dataset. Stage 6 covers publication and dissemination across all three.
| Stage | Focus | Contributors |
|---|---|---|
| Stage 1: Metadata collection and cleansing | THEME | PI (75%) + RA1 (25%) |
| Stage 2: AI-assisted broad classification | THEME | RA2 (75%) + PI (15%) + RA1 (10%) |
| Stage 3: Quantitative subset selection and full-text acquisition | META | RA1 (60%) + PI (25%) + RA2 (15%) |
| Stage 4: AI-assisted full-text extraction and validation | META | RA2 (80%) + RA1 (10%) + PI (10%) |
| Stage 5: Replication dataset construction | METHOD | RA1 (50%) + PI (30%) + RA2 (20%) |
| Stage 6: Publication, dissemination, and archiving | All | PI (60%) + RA2 (30%) + RA1 (10%) |
Detailed work plan (tentative)
Stage 1: Metadata Collection and Cleansing (→ THEME)
1.1 Purpose
Build a comprehensive master metadata dataset of English-language academic publications (1990–2025) on the topic of social trust in the social sciences. OpenAlex data already identifies approximately 88,000 publications with abstracts; this stage consolidates and cleanses that corpus using additional database sources.
1.2 Search Strategy
Core Keywords
Combinations of the following terms in titles, abstracts, and keywords:
- “social trust”, “institutional trust”, “interpersonal trust”
- “political trust”, “generalised trust” / “generalized trust”
- “confidence in institutions”
These should be combined with discipline filters where available (e.g. sociology, political science, economics, social psychology).
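The keyword combinations above can be sketched as a single boolean search expression. The snippet below is a minimal illustration, assuming an OpenAlex-style `title_and_abstract.search` filter (the exact filter name and URL encoding should be verified against the current OpenAlex API documentation before use):

```python
# Sketch: compose a boolean search expression from the project's core keywords
# and assemble an OpenAlex-style query URL. The filter name
# (title_and_abstract.search) is an assumption based on OpenAlex's public API
# docs; verify before relying on it.
from urllib.parse import quote

CORE_KEYWORDS = [
    "social trust", "institutional trust", "interpersonal trust",
    "political trust", "generalised trust", "generalized trust",
    "confidence in institutions",
]

def build_query(keywords):
    """Join quoted phrases with OR into a single search expression."""
    return " OR ".join(f'"{kw}"' for kw in keywords)

def openalex_url(keywords, from_year=1990, to_year=2025):
    """Assemble a works-endpoint URL filtered by search terms and year range."""
    filters = (
        f"title_and_abstract.search:{build_query(keywords)},"
        f"from_publication_date:{from_year}-01-01,"
        f"to_publication_date:{to_year}-12-31"
    )
    return "https://api.openalex.org/works?filter=" + quote(filters, safe=":,")

print(build_query(CORE_KEYWORDS[:2]))
# → "social trust" OR "institutional trust"
```

Discipline filters (e.g. OpenAlex concept or topic IDs) could be appended to the same filter string in the same way.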
Databases
Primary sources: OpenAlex (API-based, open metadata), Web of Science (WoS), Scopus (if accessible).
Secondary / complementary sources: Google Scholar (for coverage checks), Semantic Scholar.
1.3 Export Formats
Web of Science exports: tab-delimited (.txt) as the primary working format; RIS as a secondary archival format for Zotero integration.
Recommended workflow: (1) Export tab-delimited metadata for quantitative processing; (2) Export RIS files for bibliographic management in Zotero; (3) Maintain a unique internal identifier (e.g. wos_uid, openalex_id) to link records across formats and sources.
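Step (3) of the workflow – linking records across formats and sources via internal identifiers – can be sketched with pandas. The column names below are illustrative, not the exact export headers of either database:

```python
# Sketch: link a WoS tab-delimited export to OpenAlex records on normalised
# DOI, carrying both native identifiers (wos_uid, openalex_id).
# Column names and sample rows are illustrative only.
import pandas as pd

wos = pd.DataFrame({
    "wos_uid": ["WOS:0001", "WOS:0002"],
    "doi": ["10.1000/abc", "10.1000/def"],
    "title": ["Trust and cooperation", "Institutional trust in Europe"],
})
openalex = pd.DataFrame({
    "openalex_id": ["W111", "W222", "W333"],
    "doi": ["10.1000/abc", "10.1000/xyz", "10.1000/def"],
    "abstract": ["...", "...", "..."],
})

# Normalise DOIs before joining: lower-case, strip any URL prefix.
for df in (wos, openalex):
    df["doi"] = (df["doi"].str.lower()
                          .str.replace("https://doi.org/", "", regex=False))

# indicator=True records which source(s) each row came from.
linked = wos.merge(openalex, on="doi", how="outer", indicator=True)
print(linked["_merge"].value_counts().to_dict())
```

An outer join keeps source-specific records visible, so coverage gaps between databases can be audited rather than silently dropped.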
1.4 Cleansing and Deduplication
- Remove entries without abstracts (unless clearly relevant based on title and source)
- Deduplicate across database sources using DOI and fuzzy title matching
- Standardise journal names, author formats, and publication years
- Filter to English-language publications only
- Restrict to 1990–2025 publication window
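The deduplication step above can be sketched as a two-pass rule: exact match on normalised DOI first, then fuzzy title matching for DOI-less entries. The threshold below is illustrative and would need calibration:

```python
# Sketch: deduplicate merged records, first on normalised DOI, then by fuzzy
# title similarity for entries without a DOI. The 0.95 threshold is a
# placeholder, not a project-agreed value.
from difflib import SequenceMatcher

def norm_title(t):
    """Lower-case, drop punctuation, collapse whitespace for comparison."""
    return " ".join("".join(ch for ch in t.lower()
                            if ch.isalnum() or ch == " ").split())

def similar(t1, t2, threshold=0.95):
    return SequenceMatcher(None, norm_title(t1), norm_title(t2)).ratio() >= threshold

def dedupe(records):
    """records: list of dicts with optional 'doi' and required 'title'."""
    kept, seen_dois = [], set()
    for rec in records:
        doi = (rec.get("doi") or "").lower()
        if doi:
            if doi in seen_dois:
                continue
            seen_dois.add(doi)
        elif any(similar(rec["title"], k["title"]) for k in kept):
            continue  # fuzzy title match against an already-kept record
        kept.append(rec)
    return kept

recs = [
    {"doi": "10.1/a", "title": "Social Trust and Health"},
    {"doi": "10.1/A", "title": "Social trust and health"},   # same DOI, case
    {"doi": None, "title": "Social  Trust and Health!"},     # fuzzy duplicate
    {"doi": None, "title": "Political trust in new democracies"},
]
print(len(dedupe(recs)))  # → 2
```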
1.5 Tools
R
- bibliometrix: importing WoS/Scopus data, citation counts, journal metrics
- tidyverse: data cleaning and management
- stringr: keyword matching and text filtering
Python
- pandas: metadata handling
- pybliometrics: Scopus API (if available)
- openalex-python or direct API calls to OpenAlex
Note: Web of Science access is rate-limited and largely manual; RAs should follow institutional terms of use strictly.
1.6 Output
A cleansed master metadata table including: unique article ID, title, abstract, authors, year, journal, DOI, citation counts (if available), database source(s), and language.
Stage 2: AI-Assisted Broad Classification (→ THEME)
2.1 Purpose
Enrich the THEME dataset with broad methodological classifications derived from abstracts using locally deployed LLMs. This stage adds the analytical layer that distinguishes the THEME dataset from a raw bibliometric download.
2.2 Classification Tasks
Each publication is classified along the following dimensions:
- Broad methodology: theoretical; empirical-qualitative; empirical-quantitative; mixed-methods; review/meta-analysis; other
- Discipline: sociology, political science, economics, social psychology, other (where inferable from abstract and journal)
- Open-access status: flagged via Unpaywall/OpenAlex metadata
2.3 LLM Pipeline
- Open-weight LLMs (e.g. DeepSeek, Qwen, GLM families) deployed on the university’s GPU-enabled HPC nodes
- Batch processing of abstracts using structured prompt templates that return JSON-formatted classifications
- RA2 submits and manages batch jobs via the HPC scheduler; monitors logs and outputs
No data may be processed using external APIs.
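The prompt-and-parse step can be sketched as follows; everything runs locally against a deployed model, consistent with the no-external-API rule. The label set mirrors section 2.2, but the prompt wording and validation logic are illustrative and would be tuned iteratively with the PI:

```python
# Sketch: a structured classification prompt plus a strict validator for the
# JSON a local model returns. Invalid outputs yield None so the record can be
# flagged for human review rather than silently mis-coded.
import json

METHOD_LABELS = {"theoretical", "empirical-qualitative", "empirical-quantitative",
                 "mixed-methods", "review/meta-analysis", "other"}

PROMPT_TEMPLATE = """You are classifying a social-science abstract.
Return ONLY a JSON object: {{"methodology": <one of {labels}>}}.

Title: {title}
Abstract: {abstract}
"""

def build_prompt(title, abstract):
    return PROMPT_TEMPLATE.format(labels=sorted(METHOD_LABELS),
                                  title=title, abstract=abstract)

def parse_classification(raw_output):
    """Parse model output; return the label only if it is in the schema."""
    try:
        obj = json.loads(raw_output)
    except json.JSONDecodeError:
        return None
    label = obj.get("methodology")
    return label if label in METHOD_LABELS else None

# A well-formed (hypothetical) model response:
print(parse_classification('{"methodology": "empirical-quantitative"}'))
# → empirical-quantitative
print(parse_classification('{"methodology": "quant"}'))  # → None
```

Rejecting anything outside the controlled vocabulary keeps the THEME classifications auditable: every non-conforming output becomes a flagged case rather than a hidden coding error.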
2.4 Validation
- RA1 reviews a stratified random sample of AI classifications for accuracy
- Misclassification patterns feed back into prompt refinement (RA2)
- Borderline or ambiguous cases are flagged for discussion with the PI
2.5 Tools
R / Python
- scikit-learn / tidytext: TF-IDF on abstracts as a supplementary signal for methodology classification
- bibliometrix: journal-level indicators for discipline assignment
2.6 Output
The completed THEME dataset: ~88,000 publications with bibliographic metadata and broad methodological/disciplinary classifications. This dataset supports the project’s bibliometric research questions (time trends, disciplinary distributions, methodological trends, open-access patterns).
Stage 3: Quantitative Subset Selection and Full-Text Acquisition (→ META)
3.1 Purpose
Identify the subset of primary quantitative empirical studies from the THEME dataset and acquire their full texts. This reduced dataset (estimated at a few hundred publications) forms the basis for deeper AI-assisted extraction.
3.2 Selection Criteria
Starting from all publications classified as “empirical-quantitative” in Stage 2, apply the following filters:
- Substantive centrality: trust is a central variable (DV or key IV), not peripheral
- Study type: primary empirical studies only (exclude meta-analyses, reviews, commentaries)
- Publication quality: journal ranked Q2 or above (Scimago), or high citation count relative to publication year
- Recency and coverage: prioritise the last 10–15 years, with selective inclusion of influential older studies
RA1 reviews candidate publications and applies these criteria. Automated scoring (keyword density, citation metrics, journal rank) supports but does not replace human judgment. Borderline cases are flagged for discussion with the PI.
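A transparent scoring rule combining the automated signals named above (keyword density, citations relative to publication year, journal quartile) might look like the sketch below. All weights, caps, and the cut-off logic are placeholders for calibration with the PI, not project-agreed values:

```python
# Sketch: a simple, auditable relevance score to support (not replace) RA1's
# manual review. Weights and caps are illustrative placeholders.
CORE_TERMS = ("social trust", "institutional trust", "interpersonal trust",
              "political trust", "generalised trust", "generalized trust")

def keyword_density(text, terms=CORE_TERMS):
    """Core-term mentions per word of abstract text."""
    words = max(len(text.split()), 1)
    hits = sum(text.lower().count(t) for t in terms)
    return hits / words

def candidate_score(abstract, citations, pub_year, journal_quartile,
                    current_year=2025):
    age = max(current_year - pub_year, 1)
    cites_per_year = citations / age
    quartile_bonus = {1: 1.0, 2: 0.5}.get(journal_quartile, 0.0)  # Q2+ filter
    return round(100 * keyword_density(abstract)
                 + min(cites_per_year, 20)      # cap to limit outlier dominance
                 + quartile_bonus, 2)

score = candidate_score(
    abstract="We examine how social trust shapes cooperation; social trust is "
             "measured with survey items.",
    citations=120, pub_year=2015, journal_quartile=1)
print(score)  # → 27.29
```

Because every component is a named, bounded term, RA1 can see exactly why a candidate ranked where it did when flagging borderline cases for the PI.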
3.3 Full-Text Acquisition
Primary method: manual download via institutional library access (publisher websites).
Secondary methods (ethical and permitted):
- DOI-based Unpaywall queries for open-access versions
- Publisher supplementary material links
Fully automated scraping of paywalled content is not permitted. Semi-automated OA detection is acceptable.
Tools: unpaywallR (R), Unpaywall API (Python).
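A DOI-based Unpaywall query can be sketched with the standard library alone. The endpoint shape and the `is_oa` / `best_oa_location` fields follow the public Unpaywall API documentation, but should be re-checked against the live API; the email address below is a placeholder (Unpaywall requires an identifying address):

```python
# Sketch: Unpaywall lookup by DOI, plus an offline-testable parser for the
# response. Endpoint and field names follow Unpaywall's documented v2 API;
# the email is a placeholder to be replaced with a project address.
import json
from urllib.request import urlopen

def unpaywall_lookup(doi, email="project@example.org"):
    url = f"https://api.unpaywall.org/v2/{doi}?email={email}"
    with urlopen(url) as resp:        # network call; respect rate limits
        return json.load(resp)

def best_pdf_url(record):
    """Return an OA PDF link if one is reported, else None."""
    if not record.get("is_oa"):
        return None
    loc = record.get("best_oa_location") or {}
    return loc.get("url_for_pdf") or loc.get("url")

# Offline example with a minimal (hypothetical) response:
sample = {"is_oa": True,
          "best_oa_location": {"url_for_pdf": "https://example.org/paper.pdf"}}
print(best_pdf_url(sample))  # → https://example.org/paper.pdf
```

Splitting the network call from the parser keeps the OA-detection logic testable without hitting the API, and makes it easy to log acquisition status per DOI.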
3.4 File Management
- Store PDFs using a consistent naming scheme (e.g. Author_Year_Journal.pdf)
- Maintain links to Zotero records
- Record acquisition status (acquired / OA / not available) in the metadata table
3.5 Output
A reduced metadata dataset with acquisition status for each entry, plus a corpus of full-text PDFs and supplementary materials ready for AI processing.
Stage 4: AI-Assisted Full-Text Extraction and Validation (→ META)
4.1 Purpose
Extract detailed methodological and substantive information from the full texts acquired in Stage 3 using locally deployed LLMs. RA2 builds and operates the extraction pipeline; RA1 validates outputs against the original publications.
4.2 Preprocessing
- Convert PDFs to structured text using GROBID (or comparable open-source parser)
- Segment text into sections (abstract, methods, results, tables, references)
- Ingest any supplementary code and documentation
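The segmentation step can be sketched against GROBID's TEI-XML output using only the standard library. The TEI namespace is fixed; the element paths below match typical GROBID output but should be verified against real converted files:

```python
# Sketch: extract the abstract and section headings from GROBID TEI-XML.
# Paths reflect typical GROBID output (teiHeader abstract; body div/head);
# verify against actual files before batch use.
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
TEI = {"tei": TEI_NS}

def abstract_text(tei_xml):
    root = ET.fromstring(tei_xml)
    node = root.find(".//tei:abstract", TEI)
    return " ".join(node.itertext()).strip() if node is not None else ""

def section_heads(tei_xml):
    root = ET.fromstring(tei_xml)
    body = root.find(".//tei:body", TEI)
    if body is None:
        return []
    return [h.text for h in body.iter(f"{{{TEI_NS}}}head")]

sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><profileDesc>
    <abstract><p>We study social trust.</p></abstract>
  </profileDesc></teiHeader>
  <text><body>
    <div><head>Methods</head><p>Survey data...</p></div>
    <div><head>Results</head><p>Trust predicts...</p></div>
  </body></text>
</TEI>"""
print(abstract_text(sample))   # → We study social trust.
print(section_heads(sample))   # → ['Methods', 'Results']
```

Matching on section headings ("Methods", "Results", etc.) then lets downstream prompts target only the segments relevant to each extraction task.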
4.3 LLM-Based Extraction Tasks
For each publication, extract:
- Trust variable role: dependent variable, independent variable, mediator/moderator, other
- Data source: primary data collection vs. secondary data (and which dataset)
- Statistical models used: e.g. OLS regression, logistic regression, multilevel modelling, SEM, causal inference methods
- Data availability statement: present/absent; data publicly available / available on request / not available
- Replication materials: code available / supplementary files available / none
- Key statistical effects: direction and significance of main trust-related findings
4.4 LLM Deployment and HPC Usage
- Open-weight LLMs (e.g. DeepSeek, GLM, Qwen families) hosted on GPU-enabled HPC nodes
- RA2 develops and refines task-specific prompt templates producing structured JSON output
- Batch jobs submitted via the HPC scheduler; logs and outputs monitored systematically
No data may be processed using external APIs.
4.5 Validation
RA1 reviews all AI-extracted records against the original publications for:
- Logical consistency of classifications
- Alignment with the source text
- Correct interpretation of trust variable roles and statistical methods
Systematic disagreements between AI output and human review inform prompt refinement (RA2). The PI resolves persistent ambiguities.
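The feedback loop can be made quantitative with a simple per-field agreement check between AI-extracted and human-validated records, surfacing which fields' prompts most need refinement. Field names mirror section 4.3; the data shown are invented:

```python
# Sketch: per-field agreement between AI-extracted and human-validated
# records, to prioritise prompt refinement. Records are paired by position;
# sample values are invented.
from collections import defaultdict

def field_agreement(ai_records, human_records):
    """Return {field: share of records where AI and human coding agree}."""
    agree, total = defaultdict(int), defaultdict(int)
    for ai, hu in zip(ai_records, human_records):
        for field in hu:
            total[field] += 1
            agree[field] += (ai.get(field) == hu[field])
    return {f: agree[f] / total[f] for f in total}

ai = [{"trust_role": "DV", "model": "OLS"},
      {"trust_role": "IV", "model": "SEM"}]
human = [{"trust_role": "DV", "model": "OLS"},
         {"trust_role": "DV", "model": "SEM"}]
print(field_agreement(ai, human))  # → {'trust_role': 0.5, 'model': 1.0}
```

Tracking these rates per model version and prompt revision also provides the documentation trail that Stage 6 requires.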
4.6 Output
The completed META dataset: a few hundred publications with bibliographic metadata, detailed methodological classifications, trust variable characterisations, data/code availability indicators, and key statistical findings. This dataset supports the project’s meta-analytical research questions.
Stage 5: Replication Dataset Construction (→ METHOD)
5.1 Purpose
From the META dataset, identify studies whose analyses can be reproduced using publicly available data, and build a curated collection of annotated analytical datasets and code. This is the most labour-intensive stage per publication and produces the project’s primary teaching resource — approximately 50 studies covering a broad range of social science disciplines and quantitative methods.
5.2 Selection Criteria
A study is eligible for the METHOD dataset if:
- Its data are publicly available or have been made available by the authors (as part of a replication package or on request by the TReMeDa team)
- Its analytical approach is sufficiently documented to allow partial or full reproduction
- It offers clear pedagogical value — i.e. it can be used to introduce or illustrate a specific quantitative method
5.3 Pedagogical Classification
Each selected study is classified according to its potential teaching contributions:
- Which data analysis methods can be introduced using the study (e.g. linear regression, logistic regression, statistical tests, causal analysis, time series, multilevel modelling, SEM)
- Appropriate level: introductory, intermediate, or advanced
- Disciplinary context: sociology, political science, economics, social psychology, etc.
5.4 Dataset Compilation and Code Development
For each study in the METHOD dataset:
- Obtain and document the source data: download the original dataset; create detailed documentation of variables, coding, and any data manipulations applied
- Compile the analytical dataset: using R (and potentially other languages), prepare the dataset used in the original analysis from the raw source data
- Create annotated analytical code: write reproducible R scripts (and potentially Python/Julia equivalents) that replicate key parts of the originally published analyses, with pedagogical annotations explaining each step
- Verify reproduction: confirm that the reproduced results match (or closely approximate) the published findings; document any discrepancies
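The verification step can be sketched as a tolerance-based comparison of reproduced against published figures, returning discrepancies for documentation rather than failing silently. The tolerance and example values are illustrative:

```python
# Sketch: compare reproduced statistics with published values under a relative
# tolerance, collecting mismatches for the discrepancy notes. The 1% tolerance
# and the example figures are placeholders.
import math

def compare_results(published, reproduced, rel_tol=0.01):
    """Return a list of (name, published, reproduced) for mismatches."""
    discrepancies = []
    for name, pub_val in published.items():
        rep_val = reproduced.get(name)
        if rep_val is None or not math.isclose(pub_val, rep_val,
                                               rel_tol=rel_tol):
            discrepancies.append((name, pub_val, rep_val))
    return discrepancies

published = {"beta_trust": 0.31, "se_trust": 0.07, "n": 2041}
reproduced = {"beta_trust": 0.309, "se_trust": 0.07, "n": 1980}
print(compare_results(published, reproduced))  # → [('n', 2041, 1980)]
```

Small rounding differences (here, the coefficient) pass, while a genuinely different sample size is flagged, exactly the kind of discrepancy that should be written up for students using the replication materials.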
5.5 Tools
- R (tidyverse, haven, renv): data preparation, analysis reproduction, and dependency management
- Quarto: documentation and annotation of analytical code
- Python (optional): alternative implementations where pedagogically useful
5.6 Output
The completed METHOD dataset: approximately 50 studies, each accompanied by:
- Bibliographic and methodological metadata (inherited from META)
- Pedagogical classification and level
- Documented analytical dataset(s)
- Annotated, reproducible analytical code
- Notes on any discrepancies with published results
Stage 6: Publication, Dissemination, and Archiving
6.1 Database Integration
Merge all three dataset layers (THEME, META, METHOD) into a harmonised, publicly releasable database:
- Use controlled vocabularies for methods, models, and variable roles
- Use binary or categorical indicators for data and code availability
- Record uncertainty and provenance (human-coded vs. AI-extracted vs. both) where applicable
- Ensure consistent identifiers link records across the three dataset tiers
6.2 Public Release
The final database will be released as:
- A versioned open-access repository (GitHub)
- Accompanying data dictionary and user documentation
- Stable archived releases with DOI assignment (e.g. Zenodo)
No copyrighted full texts will be shared. The METHOD dataset will include only data and code that are already publicly available or released with author permission.
6.3 Reproducibility and Reusability
- All scripts must be reproducible end-to-end from raw inputs
- New publication batches can be appended and reprocessed through the pipeline
- LLM model versions, prompt schemas, and validation decisions must be documented
- Example queries and use cases will be prepared for end users
6.4 Documentation
- Data dictionary covering all variables across the three dataset tiers
- Pipeline documentation: how to rerun extraction and classification
- User guide: how to query and use the database for research or teaching
Notes for RAs
- Prioritise transparency over speed
- Flag uncertainty rather than guessing
- Maintain clear version control and documentation
- Coordinate regularly with the PI and with each other where stages overlap
This guide should be treated as a living document and updated as tools, models, and workflows evolve.