MSc Dissertation  ·  2026
98%Coursework GPA
84%Dissertation GPA
Academic ExcellenceFinal Grade

Evaluating Hybrid Human-AI Systems for Discovering Undeclared Foreign Key Relationships in Large-Scale Relational Databases

MSc Applied Data Science & AI · OPIT – Open Institute of Technology ↗ · Supervised by Professor Lokesh Vij
92.64%
F1 Score
238M
Rows Processed
9 mins
Full Pipeline Runtime
368
Tables
310
Known Foreign Keys
94.02%
Precision
91.29%
Recall
+15
percentage point F1 gain vs. name matching alone

Built entirely in Python, evaluated on real open-source data.

Python
PostgreSQL
SQLAlchemy
OpenAI API (GPT-4)
SQLite
pandas
scikit-learn
Rich
1  ·  The Problem

Foreign keys are missing. Nobody notices until it's too late.

Foreign key relationships are the backbone of any relational database. They define how tables connect and depend on each other. In theory, these relationships are always documented. In practice, they often are not.

In enterprise environments, foreign key constraints are frequently dropped for performance reasons, lost during migrations, or simply never documented in legacy systems. The result is databases that nobody fully understands, where data integration and analysis become guesswork.

A database with 1,800 columns has over 3 million possible column pair combinations. Finding the meaningful ones manually is not realistic. My dissertation asked whether a system could do it automatically, at scale, with enough accuracy to be trusted.

Example of a messy, undocumented database schema

Example of a real-world database with missing foreign key documentation

2  ·  The Solution

A six-step pipeline combining statistics, naming patterns, and human expertise.

I built a six-step pipeline that combines statistical analysis, naming pattern recognition, and human expertise to discover undeclared foreign key relationships automatically.

System Architecture Diagram

Figure 3.2: System Architecture Diagram

1
Schema Reading
The system connects to the database and reads its structure: tables, columns, data types, and indexes. Foreign key information is deliberately excluded, simulating the real-world scenario where it is absent.
2
Column Profiling
Every column is sampled to build a statistical profile, capturing how many distinct values it has, null rates, and whether it looks like a key column.
3
Candidate Generation
The system identifies column pairs that could plausibly be foreign key relationships, using naming patterns and structural rules to narrow down millions of possible combinations.
4
Multi-Signal Scoring
Each candidate is scored using two signals. The heuristic signal checks name similarity and type compatibility. The value containment signal checks whether the values in one column actually appear in the other. These are combined into a single confidence score.
5
Human Review
Candidates the system is uncertain about are surfaced to a human reviewer with full context: confidence score, signal breakdown, sample values from both columns side by side, and a plain-English explanation. The reviewer confirms, rejects, or marks as uncertain.
6
Knowledge Base
Every human decision is stored and used to improve future runs, adjusting signal weights and learning new naming patterns over time. The more decisions are made, the fewer candidates require review in subsequent sessions.
3  ·  The Human Review Interface

Every decision is explainable. No black boxes.

The centrepiece of the system is the human review interface. It is a CLI tool that presents uncertain candidates to a reviewer with full context: confidence score, signal breakdown, sample values from both columns side by side, and a plain-English explanation generated by GPT-4.

Reviewers are never shown a black box score. Every decision is fully explainable.

Human Review Interface CLI screenshot

Human Review Interface: CLI screenshot

4  ·  Results

92.64% F1: over 15 percentage points above either signal alone.

The system was evaluated against 310 known foreign key relationships in MusicBrainz, a real open-source music database with 368 tables and 238 million rows.

Testing each component in isolation revealed how much each signal contributed:

Configuration F1 Score
Name matching only 76.90%
Value analysis only 78.61%
Combined (final system) 92.64%

Neither signal alone was sufficient. Combined, they produced a result over 15 percentage points higher than either individually.

Key Findings
GPT-4 hurt, not helped.
Adding GPT-4 as a scoring signal decreased F1 by 15 percentage points. In a database with consistent naming conventions, statistical signals were simply stronger. GPT-4 was retained for generating plain-English explanations, a task it is well suited to.
The bottleneck was candidate generation, not scoring.
93.5% of missed relationships were never generated as candidates in the first place. Every single one of them had perfect value containment when checked manually, meaning the scoring algorithm would have caught them. The gap is in naming pattern coverage, not accuracy.
Small samples were sufficient.
Sampling just 500 values out of 238 million rows produced the same accuracy as sampling 1,000 or 2,000. For this type of statistical task, the signal is strong enough that large samples add noise rather than improve precision.
Error analysis revealed a structural naming problem.
A subset of tables in MusicBrainz use generic column names like entity0 and entity1 instead of descriptive names like artist_id. These could not be matched by name-based patterns alone, accounting for the majority of missed relationships. This finding directly informed a targeted fix applied to the candidate generator.