CIS 6930 · Spring 2026 · University of Florida

Data Engineering (with LLMs)

Fifteen weeks. Twenty-four graduate students. Real datasets, real systems, real evaluations. The showcase below collects what they built: data pipelines that route among deterministic rules, retrieval over past fixes, and LLM reasoning, each evaluated on a real dataset with measured cost and accuracy.

24 Students

19 Demo videos

8 Application domains

15 Weeks of work

A UF Data Studio course

This is a data engineering course. The discipline has its own canon of pipelines and primitives: extract from sources, integrate heterogeneous schemas, clean and label, evaluate, audit. The course covered the same canon in the Spring 2025 offering and in earlier semesters. What shifted this year is that large language models started reshaping how each stage of the pipeline gets built, from extraction to entity resolution to data cleaning to evaluation. The Spring 2026 cohort set out to examine that shift, asking at every stage where in the pipeline an LLM belongs and what has to surround it so that its contribution remains auditable. The twenty-four student projects below are the cohort’s experiments on that question, run on real data.

An LLM call costs roughly 10,000× more than a regex match and 100× more than a SQL lookup, so the answer is rarely “everywhere.” Each project worked out a different piece of the answer on a real dataset, against a real baseline, with the cost arithmetic written down. Some projects built LLM-orchestrated pipelines from scratch. Some pitted an LLM-augmented approach against a traditional baseline. Some proposed novel architectures that the literature had not yet tried.
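The cost ratios above can be turned into a quick back-of-the-envelope calculation. The sketch below treats a regex match as the unit of cost, which makes a SQL lookup 100 units and an LLM call 10,000 units. The traffic shares are hypothetical, and the model assumes each record is handled by exactly one tier, ignoring the cost of cheaper tiers that were tried first.

```python
# Relative per-call costs with a regex match as the unit.
# The 10,000x and 100x ratios come from the text; SQL = 100 follows from them.
COST = {"regex": 1.0, "sql": 100.0, "llm": 10_000.0}


def expected_cost(shares: dict[str, float]) -> float:
    """Expected relative cost per record, given the share of records each tier handles."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    return sum(COST[tier] * share for tier, share in shares.items())


# Hypothetical traffic mixes, for illustration only.
llm_everywhere = expected_cost({"llm": 1.0})
tiered = expected_cost({"regex": 0.80, "sql": 0.15, "llm": 0.05})

print(f"LLM everywhere: {llm_everywhere:,.0f} units per record")
print(f"Tiered routing: {tiered:,.1f} units per record")
```

Even with only 5% of records reaching the LLM in this made-up mix, the LLM tier still dominates the total, which is why the projects track the LLM call rate so closely.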

Guest speakers

Two industry engineers visited the cohort in February. Dr. Shiree Hughes (General Motors, Big Data) came on Monday, February 9 to walk through the Hadoop, Spark, and Kafka stack that ingests telemetry from millions of connected vehicles, schema-evolution practices, and how data quality is handled at automotive scale. Mikhail Sinanan (Spotify, Global Head of Engineering for Audiobooks Supply Chain) joined on Friday, February 13 for a fireside on Spotify’s data platform, where Kafka handles 50 million events per second and the platform serves 1.4 trillion events per day. The visits covered streaming architectures, platform ownership across teams, and the cost arithmetic that governs what ships versus what does not.

Dr. Shiree Hughes

Software Engineer, Big Data · General Motors

Designs and operates the Hadoop, Spark, and Kafka pipelines that ingest telemetry from connected vehicles. Former President of ACM-Women. Joined an in-class discussion on streaming, schema evolution, and the practical realities of running data infrastructure at automotive scale.

See the visit page →

Mikhail Sinanan

Global Head of Engineering, Audiobooks Supply Chain · Spotify

Architect of foundational systems behind Spotify Wrapped. Scaled the music platform 4× while cutting cloud costs 40%. Joined a fireside chat on processing 1.4 trillion data points a day, Kafka at 50M events/second, and what platform engineering looks like at global scale.

See the visit page →

How the course is built

The skill-building track moves students through six stages over the first eight weeks. Each stage introduces one architectural primitive the cohort then uses in the final-paper projects shown later on this page.

API integration with LLMs (Assignment 0). Collect structured data from a public API, have an LLM extract fields from unstructured responses, and validate against a schema.
MCP data pipelines (Assignment 1). Build a Model Context Protocol server that exposes typed tools, then wire those tools into a data engineering workflow.
MCP on research computing (Assignment 1.5). The same architecture against UF’s NaviGator gateway and HiPerGator infrastructure.
Retrieval-augmented systems (Assignment 2). Vector stores, embeddings, a standard RAG pipeline, and evaluation against a held-out QA set.
Fairness auditing (Assignment 3). Fairness metrics on the Adult Income dataset, mitigation comparisons, and a reflection on the impossibility theorem.
The project track (1,000 points). Proposal, code checkpoint, paper outline, draft paper, final paper, peer review of two peers’ drafts, group presentation, and finals presentation.
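As an illustration of the kind of metric Assignment 3 asks for, here is a minimal sketch of two common group-fairness gaps computed by hand. The function names and the toy labels are assumptions made for this example, not course materials or Adult Income data.

```python
from collections import defaultdict


def group_rates(y_true, y_pred, group):
    """Per-group selection rate and true-positive rate for binary 0/1 labels."""
    stats = defaultdict(lambda: {"n": 0, "pred_pos": 0, "pos": 0, "tp": 0})
    for t, p, g in zip(y_true, y_pred, group):
        s = stats[g]
        s["n"] += 1
        s["pred_pos"] += p
        s["pos"] += t
        s["tp"] += t * p
    return {
        g: {
            "selection_rate": s["pred_pos"] / s["n"],
            "tpr": s["tp"] / s["pos"] if s["pos"] else float("nan"),
        }
        for g, s in stats.items()
    }


def gap(rates, key):
    """Largest difference in a metric between any two groups."""
    values = [r[key] for r in rates.values()]
    return max(values) - min(values)


# Toy data, invented for illustration.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
group = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = group_rates(y_true, y_pred, group)
print("demographic parity gap:", gap(rates, "selection_rate"))
print("equal opportunity gap:", round(gap(rates, "tpr"), 3))
```

In this toy data both groups are selected at the same rate while their true-positive rates differ, a small reminder that fairness criteria can disagree with one another.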

UF research computing

Every student in the cohort gets access to two UF-operated platforms that this course relies on. NaviGator is the AI gateway. HiPerGator is the research computing cluster. Both are free to UF students, faculty, and staff.

NaviGator AI

it.ufl.edu/ai

UF's AI gateway. Provides API access to commercial and open-weight models through a single credential, plus hosted chat and notebook interfaces.

NaviGator Toolkit. API access to GPT, Llama, Gemini, and Claude. Used for most of the cohort's LLM calls.
NaviGator Chat. Web chat interface that can be pointed at custom datasets.
NaviGator Notebook. Hosted notebook environment running Google Gemini for document work.
Available to all UF students, faculty, and staff at no cost.

HiPerGator

it.ufl.edu/rc

UF's research computing cluster. Runs the cohort's MCP servers, model evaluations, and any workload that needs more than a laptop.

CPU. 32 cores maximum per class allocation.
GPU. 8 GPUs maximum, requested in advance rather than allocated automatically.
Storage. 2 TB on the Blue parallel filesystem.
Student accounts expire two weeks after the semester ends. Request resources at least two weeks before the semester starts.

Assignment 1.5 walks students through deploying an MCP server on HiPerGator and pointing it at the NaviGator endpoints. The project track uses both platforms throughout.

Featured projects

The class held group-presentation finals at the end of the term. Five group winners advanced, and the class voted Shane Thomas’s lateral-movement detection work the overall best presentation. The four group finalists (Zachary Zeng, Nikhitha Nagabhyru, Kevin Tran, Zachary Allen) carry the silver-medal badge below. The featured set also includes four papers that did not advance to finals but stood out on methodological grounds: Sanjeev’s pre-registered volatility study, Xiaomeng’s mixed-effects regression across nine essay scorers, Atul’s hallucination-as-metric framing on NYC 311 data, and Adnan’s MCP-orchestrated comparison against four canonical HoloClean baselines. Click any card to watch the student walk through the work.

Tag legend. Method tags identify the architectural approach (MCP, RAG, Entity Resolution, Tiered Routing, Hybrid LLM+Rule, Statistical Eval, Negative Result, Pre-Registered, Agentic, Data Cleaning) and domain tags identify the application area (Healthcare, Cybersecurity, Civic Data, Music, Chemistry, Education, Jobs, Books, Logs, Sports).

🏆 Overall Best Presentation · Class Vote

Two-Phase Agentic Lateral-Movement Detection on LANL

Shane Thomas

Cybersecurity · Agentic · Statistical Eval

"The agent never sees ground truth. The evaluator never sees the agent's internal state. The separation is the whole point."

Shane built a two-phase agent to find adversary lateral movement in the LANL authentication logs. Phase one scans the graph blind and surfaces anomalies. Phase two interrogates each anomaly entity by entity. The cohort voted this the strongest presentation of the term because the experimental design carries the result: thirty balanced cases across three independent gpt-oss-120b runs put mean F1 at 0.747, against a Cypher-rules baseline at 0.732, with the false-positive distribution broken down by failure mode rather than averaged away.

Watch the demo

🥈 Group Finalist · Group C Winner

Decomposition vs Knowledge Augmentation in Multi-Agent Entity Resolution

Zachary Zeng

Entity Resolution · Agentic · Statistical Eval · Negative Result

"Most of the F1 gain prior multi-agent ER papers credited to decomposition was actually hiding inside the retrieval step."

Multi-agent entity resolution papers had been claiming that breaking a matching task into specialist agents lifts F1. Zachary held the base model fixed at gpt-oss-120b and stripped retrieval and external knowledge from the pipeline, so decomposition could be measured alone. Five independent runs on the Abt-Buy benchmark and 115 discordant pooled pairs later, the McNemar p-value sits at 0.26 and Cohen's kappa at 0.942 ± 0.005. Decomposition on its own does not move the needle. The gain prior work credited to it was hiding inside the retrieval step his study removed.
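For readers unfamiliar with McNemar's test, the sketch below computes the exact two-sided version from the two discordant counts (pairs where only one of the two systems was correct). The split used in the example is hypothetical. The text reports only the pooled total of 115 discordant pairs, not how they divide, so this does not attempt to reproduce the paper's p-value, and the paper may have used a different variant of the test.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b = pairs only system A got right, c = pairs only system B got right.
    Under the null hypothesis each discordant pair is a fair coin flip.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts, for illustration only.
p_value = mcnemar_exact(8, 3)
print(f"p = {p_value:.4f}")
```

Concordant pairs (both right or both wrong) never enter the calculation, which is why the discordant count is the sample size that matters.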

Watch the demo

The Canonicalization Gap: A Pre-Registered Volatility Study

Sanjeev Kamath

Pre-Registered · Negative Result · Statistical Eval

"Hypothesis registered before any data was collected. Hypothesis rejected by the data."

Sanjeev wrote down his hypothesis before he collected a single data point. Schema validation should cut LLM output volatility in half, the pre-registration said. Six hundred baseline calls across thirty countries and two model scales (gpt-oss-120b and llama-3.1-8b) later, the data said otherwise. SLMs gained 11.9% and the gain is statistically significant. LLMs gained 6.3% and the gain is not. The residual volatility that schema validation could not catch is what the paper names the canonicalization gap.

Watch the demo

Within-Response Controlled Rewriting for Construct Validity

Xiaomeng Xiong

Education · Statistical Eval

"Encoders go up with rewriting complexity. GPTs go down. The split was non-obvious until mixed-effects regression made it visible."

If a student rewrites the same essay answer at three escalating levels of linguistic complexity while saying exactly the same thing, do automated scorers grade them the same? Xiaomeng built 743 four-version sets from 1,672 ASAP-SAS responses, kept the semantic content identical, and ran nine scorers across the variations. Encoders like BERT and DeBERTa-v3 score the rewrites higher as the prose gets more complex. GPT-4.1, GPT-5.1, and Claude Sonnet 4.6 score them lower. Mixed-effects regression with response-level random intercepts surfaced the split.

Watch the demo

NYC 311 Self-Healing with Hallucination as a First-Class Metric

Atul Arun

MCP · Civic Data · Data Cleaning

"What does an LLM-augmented cleaning pipeline actually risk? Atul put hallucination rate alongside precision and recall."

Atul wanted to know what an LLM-augmented data-cleaning pipeline puts at risk. He took 1,000 NYC 311 records, injected six controlled corruption families across 542 fields, and watched three cleaning strategies handle the wreckage. Drop-and-log, regex repair, and an MCP+LLM agent with thresholded retrieval (sim=0.25) each saw the same test data. The result reads not as a leaderboard but as a map of where each method earns its cost. The 0.1671 hallucination rate sits alongside precision and recall instead of being buried in an appendix.

Watch the demo

🥈 Group Finalist · Group A Winner

Tiered Self-Healing ETL with RAG Memory

Nikhitha Nagabhyru

Tiered Routing · RAG · ETL · Schema Drift

"A 2× reduction in LLM calls at unchanged accuracy, because the router remembers what it has already fixed."

Nikhitha built a self-healing ETL router that learns from its own history. Deterministic rules try first. A RAG memory of past repair decisions tries second. Only the genuinely novel inputs reach the LLM. On a cold cache the router fires the LLM for 31.2% of the Magellan Walmart-Amazon records. On a warm cache, after the system has seen its first test set, the rate drops to 15.6%. F1 stays at 0.995 against the LLM-only baseline, at six times lower cost and five times lower latency. A five-action constrained vocabulary (rename, cast, fillna, drop, add_default) keeps the LLM from ever writing directly into the data.
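The three-tier order described above (rules, then memory, then LLM) can be sketched in a few lines. This is an illustration under stated assumptions, not the project's code: the memory is an exact-match dictionary standing in for RAG retrieval, the rule table and issue strings are invented, and `stub_llm` replaces a real model call. Only the five action names come from the text.

```python
from typing import Callable, Optional

# The five-action vocabulary named in the text.
ALLOWED_ACTIONS = {"rename", "cast", "fillna", "drop", "add_default"}

# Hypothetical deterministic rules keyed by an issue description.
RULES = {"trailing whitespace in column name": "rename"}


class TieredRouter:
    def __init__(self, llm: Callable[[str], str]) -> None:
        self.llm = llm
        self.memory: dict[str, str] = {}  # exact-match stand-in for RAG memory
        self.llm_calls = 0

    def route(self, issue: str) -> tuple[str, str]:
        """Return (tier, action) for one schema issue."""
        action: Optional[str] = RULES.get(issue)
        if action is not None:
            return "rules", action
        action = self.memory.get(issue)
        if action is not None:
            return "memory", action
        self.llm_calls += 1
        action = self.llm(issue)
        if action not in ALLOWED_ACTIONS:
            raise ValueError(f"action outside the vocabulary: {action!r}")
        self.memory[issue] = action  # remember the fix for next time
        return "llm", action


def stub_llm(issue: str) -> str:
    # Stand-in for a real model call; always proposes a cast.
    return "cast"


router = TieredRouter(stub_llm)
issues = [
    "trailing whitespace in column name",
    "price stored as string",
    "price stored as string",
]
results = [router.route(issue) for issue in issues]
print(results, "LLM calls:", router.llm_calls)
```

The second occurrence of the same issue is served from memory, which is the mechanism behind the drop in LLM call rate between a cold and a warm cache. Because the LLM can only name an action from the fixed vocabulary, it never writes values into the data itself.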

Demo video hosted privately.

LLM-Triage as a Precision Filter for Static Security Analysis

Juan Veliz

Cybersecurity · Hybrid LLM+Rule

"Cheap regex finds everything that might be a vulnerability. The LLM only has to recognize the real ones."

Static security analyzers drown teams in false positives. Juan flipped the usual architecture. Cheap regex sweeps first and flags every candidate. Llama-3.3-70B then triages each candidate and discards the noise. The LLM never has to find vulnerabilities; it only has to recognize them. On 1,000 BigCode Stack files at 30% injection rate with seed 42, the architecture cuts false positives 31% at matched recall, for $4.87 per thousand files. The four-configuration ablation (no few-shot, no schema normalization, no context window, threshold sweep) isolates which prompt choices move the needle.
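The sweep-then-triage shape can be sketched as follows. The two regex patterns, the sample file, and `stub_judge` (a stand-in for the LLM triage call) are all assumptions made for illustration. The project's actual patterns and prompts are not described here.

```python
import re

# Hypothetical high-recall patterns: flag anything that might be a problem.
CANDIDATE_PATTERNS = {
    "eval_call": re.compile(r"\beval\s*\("),
    "hardcoded_secret": re.compile(r"\b(password|secret)\s*=\s*[\"']", re.IGNORECASE),
}


def sweep(source: str):
    """Cheap first pass: return (line number, pattern name, line) for every hit."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for name, pattern in CANDIDATE_PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, name, line.strip()))
    return findings


def triage(findings, judge):
    """Second pass: keep only the candidates the judge recognizes as real."""
    return [f for f in findings if judge(f)]


def stub_judge(finding) -> bool:
    # Stand-in for an LLM call: treats hits inside comments as noise.
    return not finding[2].startswith("#")


sample = '''# eval(user_input) was removed last release
result = eval(user_input)
password = "hunter2"
'''

candidates = sweep(sample)
kept = triage(candidates, stub_judge)
print(len(candidates), "candidates,", len(kept), "kept")
```

The judge only ever sees lines the sweep has already flagged, so the expensive step scales with the number of candidates rather than with the size of the codebase.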

Watch the demo

Frequency-Based RAG vs Vector RAG for Music Sessions

Kanishka Dhaundiyal

RAG · Music · Statistical Eval

"Strip the user's demographic profile from the prompt and LLM accuracy collapses 69×."

Kanishka built a frequency-based RAG (not vector RAG) to predict the next track in a music session, over 19M Last.fm events from 992 users. The baseline win is real and verifiable: McNemar's chi-squared 16.12, p=3.76e-5. The more interesting question came from the ablation. When Kanishka stripped the user's demographic profile from the prompt, LLM accuracy collapsed 69-fold, from 14% down to 0.2%. The number quantifies what practitioners feel when deciding whether to spend on LLMs for cold-start cases or stick with frequency.
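To make the distinction from vector RAG concrete, the sketch below retrieves next-track candidates from transition counts alone, with no embeddings. It is a guess at the general shape of frequency-based retrieval, not the project's implementation. The sessions are toy data, and how the retrieved candidates are combined with the LLM prompt is not shown.

```python
from collections import Counter, defaultdict


def build_transitions(sessions):
    """Count how often each track is followed by each other track."""
    transitions = defaultdict(Counter)
    for session in sessions:
        for current, following in zip(session, session[1:]):
            transitions[current][following] += 1
    return transitions


def top_candidates(transitions, current, k=3):
    """Most frequent successors of the current track, best first."""
    return [track for track, _ in transitions.get(current, Counter()).most_common(k)]


# Toy listening sessions, invented for illustration.
sessions = [
    ["a", "b", "c"],
    ["a", "b", "d"],
    ["a", "c"],
    ["b", "c"],
]

transitions = build_transitions(sessions)
print(top_candidates(transitions, "a", k=2))
print(top_candidates(transitions, "b", k=2))
```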

Watch the demo

Hierarchical MCP-as-Navigation for HDFS Logs

Vivek Chenganassery

MCP · Logs

"An LLM cannot read 1.5 GB of logs. Vivek built it a way to navigate them instead."

No LLM can read 1.5 GB of logs, so Vivek built a way for the agent to navigate them instead. Three MCP tools (search_logs, get_error_counts, get_context) let the agent start from a cluster-summary view of 6.8 million HDFS rows and drill in step by step toward the specific traces that matter, at a 170,000:1 compression ratio between corpus and answer. A subsection called "why grep scores 1.0" documents an evaluation artifact that affected the comparison, with the methodology and the correction reported in detail.
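The drill-down idea can be illustrated with three plain Python functions that share the tool names from the text. Everything else is an assumption: the signatures, the in-memory list, and the log lines (invented in an HDFS-like format) are for illustration, and no MCP server or library is involved.

```python
from collections import Counter

# Invented log lines in an HDFS-like format.
LOGS = [
    "081109 203518 INFO dfs.DataNode: Receiving block blk_1",
    "081109 203519 ERROR dfs.DataNode: Exception writing block blk_1",
    "081109 203520 INFO dfs.FSNamesystem: addStoredBlock blk_2",
    "081109 203521 WARN dfs.DataNode: Slow write blk_2",
    "081109 203522 ERROR dfs.FSNamesystem: Lease expired blk_3",
]


def get_error_counts(logs):
    """Coarsest view: number of lines per (level, component)."""
    counts = Counter()
    for line in logs:
        _, _, level, component, *_ = line.split()
        counts[(level, component.rstrip(":"))] += 1
    return counts


def search_logs(logs, keyword, limit=10):
    """Middle view: indices of the first `limit` lines containing the keyword."""
    hits = [i for i, line in enumerate(logs) if keyword in line]
    return hits[:limit]


def get_context(logs, index, radius=1):
    """Finest view: the lines immediately around one hit."""
    start = max(0, index - radius)
    return logs[start:index + radius + 1]


summary = get_error_counts(LOGS)
hits = search_logs(LOGS, "ERROR")
window = get_context(LOGS, hits[0])
print(summary[("ERROR", "dfs.DataNode")], hits, len(window))
```

Each call returns a small, bounded result, so the agent's context holds summaries and short windows rather than the corpus itself.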

Watch the demo

MCP-Orchestrated Cleaning on the HoloClean Hospital Benchmark

Adnan Farid

MCP · Healthcare · Data Cleaning

"An MCP-based cleaner, benchmarked head-to-head against the four canonical probabilistic systems."

The HoloClean hospital dataset is the canonical proving ground for probabilistic-inference cleaners. Adnan built an MCP-based architecture for the same task. One server extracts records, a second discovers candidate violations, a third proposes repairs. The paper benchmarks the system against HoloClean, Holistic, KATARA, and SCARE in a single comparison table, then breaks the result down by error type with full TP/FP/FN counts and a token-cost column. The cost-accuracy tradeoff against probabilistic inference is reported across the comparison.

Watch the demo

Project gallery

The fourteen projects below cover a wider sweep of the architectural space, from purely deterministic ETL baselines to fully agentic detection loops. Several read as practitioner case studies. Vatsal’s cross-city permit work, Rukaiya’s F1 race-strategy paper, and the two HoloClean cleaning projects each pit several architectures against one another on a real dataset and report the disagreement breakdown in full. The Books, PubChem, and skill-extraction projects work at the schema-mapping end of the spectrum, where the LLM’s job is to land a record into the right slot of a known taxonomy. Each card below is worth reading for the methodological choice the author defended in their final paper, not the headline number alone.

Cross-City Permit Integration with the Valentine Matcher Stack

Vatsal Harish Shah

Civic Data · Entity Resolution · Schema Matching

Two cities, two permit ontologies, and a question. Can an LLM line them up better than the established schema-matching toolkit can? Vatsal pitted five Valentine matchers (Cupid, Similarity Flooding, distribution-based, Jaccard, embedding-based) against five LLM endpoints on Gainesville and San Francisco permit data. Neither approach won outright. The strongest signal came when he intersected the matchers' correct predictions with one LLM's predictions, producing P=1 and F1=2/3 from a hybrid consensus neither method could reach alone.
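A precision of 1 together with an F1 of 2/3 implies a recall of 1/2, since F1 = 2PR / (P + R). The sketch below shows how intersecting two prediction sets can produce exactly that profile. The column names and predictions are toy values invented for illustration, not the project's data, and the intersection here is a plain set intersection.

```python
def precision_recall_f1(predicted: set, truth: set):
    """Precision, recall, and F1 for a set of predicted column matches."""
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy ground truth and predictions: (source column, target column) pairs.
truth = {("permit_no", "permit_number"), ("addr", "street_address"),
         ("issued", "issue_date"), ("type", "permit_type")}
matcher_pred = {("permit_no", "permit_number"), ("addr", "street_address"),
                ("issued", "status")}
llm_pred = {("permit_no", "permit_number"), ("addr", "street_address"),
            ("type", "zoning")}

# Keep only the matches both methods agree on.
consensus = matcher_pred & llm_pred
p, r, f1 = precision_recall_f1(consensus, truth)
print(f"P={p:.2f} R={r:.2f} F1={f1:.3f}")
```

Each method's individual mistakes fall outside the intersection, so the consensus is small but clean: perfect precision bought with lower recall.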

Watch the demo

🥈 Group Finalist · Group D Winner

LLM-Guided Query Rewriting for PubChem

Zachary Allen

RAG · Chemistry

PubChem is one of the world's largest chemistry databases, and asking it the right question is half the battle. Zachary built a system that rewrites a user's vague chemistry query into a verified PubChem REST call, with vector retrieval grounding the rewrite. To keep the test fair, he had Claude generate the vague queries, sidestepping prompt-answer leakage. On 500 compounds and 120 evaluation queries, the rewriter hit 70% accuracy against a 34.17% base, a 35.83-point lift. The paper is honest about what it does not solve, listing six limitations including the "re-run until identified" confound.

Watch the demo

Drift-Aware ETL with diff_manifests Audit Trail

Sai Meghana Barla

ETL · Schema Drift · Tiered Routing · Civic Data

Schemas drift in production, and most pipelines silently mishandle the drift until something downstream breaks. Sai Meghana built an ETL pipeline that watches its own schema, derived from a 27-column auto-manifest off a Gainesville 311 snapshot. Her diff_manifests audit trail separates real production drift from synthetic test drift. A Jensen-Shannon-based recovery-stability index sits beside the usual mapping F1. The hybrid-triage policy reaches the same 0.8333 schema-recovery score as the LLM-only baseline at 73.6% fewer tokens.
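The recovery-stability index itself is not defined on this page, so the sketch below shows only its stated building block: the Jensen-Shannon divergence between two discrete distributions, in base 2 so the value lies between 0 and 1. The category counts are hypothetical.

```python
from math import log2


def normalize(counts):
    """Turn raw category counts into a probability distribution."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must have a positive sum")
    return [c / total for c in counts]


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two aligned distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability contribute nothing.
        return sum(a * log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical value counts for one column before and after a drift event.
before = normalize([50, 30, 20])
after = normalize([20, 30, 50])
print(f"JS divergence: {js_divergence(before, after):.4f}")
```

Unlike plain KL divergence, the measure is symmetric and stays finite when a category disappears entirely, which suits comparisons across schema changes.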

Demo video hosted privately.

Synthetic-vs-Real Benchmark Gap on MIMIC-III/IV Vital Signs

Ian Arnold

Healthcare · Hybrid LLM+Rule · Statistical Eval

Clinical NLP papers often validate on synthetic templates, then ship to production EHR text and learn the hard way. Ian quantified the gap. He kept everything constant (the same code path, the same prompts, the same patterns) and swapped only the data, running three backends (regex, Claude Sonnet, hybrid) over synthetic MIMIC-IV templates and real MIMIC-III nursing and physician notes. Macro F1 collapsed 8.2× moving from one to the other. On the real data, the hybrid backend holds 88% of the LLM's macro F1 at 38% of the API cost.

Demo video hosted privately.

MCP vs Non-MCP Route Optimization

Vittal Chintamaneni

MCP · Routing

Vittal pitted Google's base directions, an LLM with no tool calls, and an LLM with MCP-orchestrated context against each other on five routes spanning short and long durations. The non-MCP LLM averaged +46.39 minutes off the Google baseline. The MCP-augmented LLM averaged near zero. A three-variant ablation (no checkpoints, no weather, no LLM) shows which inputs the agent leaned on when it got close.

Watch the demo

🥈 Group Finalist · Group B Winner

LLM vs Hand-Coded ETL on Three Gainesville Open-Data Sources

Kevin Tran

Civic Data · Hybrid LLM+Rule · ETL

Kevin asked a question every working data engineer has stared at. When does an LLM save time over handwritten ETL? He compared four pipelines (handwritten ETL, string similarity, few-shot LLM, zero-shot LLM) on three Gainesville open-data sources (Socrata IDs p798-x3nx, vu9p-a5f7, gvua-xt9q). The answer depends on field type: the per-field breakdown across numeric, date, categorical, and free-text fields shows where each pipeline wins and loses. The sample is small, at 75 held-out records.

Watch the demo

Honest Negative-Finding Job-Extraction Benchmark

Sai Teja Appani

Jobs · Hybrid LLM+Rule · Negative Result

Sai Teja studied structured-field extraction from 100 LinkedIn job postings. Four target fields and a five-config ablation (no requeue, no validation, no few-shot, threshold 0.4, threshold 0.8) map out where the LLM helps and where it does not. The LLM wins on salary, where regex scores F1=0.00 and the LLM scores F1=0.49. The LLM loses on closed-vocabulary fields where the regex grammar is well-defined. The proposed hybrid architecture costs $14.88 to process 124,000 records.

Watch the demo

Controlled-Corruption Cleaning Comparison on HoloClean Hospital

Sanya Chaturvedi

Healthcare · Data Cleaning · MCP

Sanya took the canonical HoloClean hospital dataset and broke it on purpose, at both 5% and 20% severity, across three different error categories. A tool-bounded LLM orchestrator then tried to fix the damage, alongside the rule-based baseline. The per-category breakdown shows where the LLM earns its cost. Typo correction is the clearest case, with the LLM at 0.74 accuracy against the rule's 0.39. For other categories the gap narrows or reverses, with the per-category numbers reported in full.

Watch the demo

Skill Extraction with LLM+MCP+ESCO

Palavalli Shyam

MCP · Jobs

Palavalli built a skill-extractor for Hacker News job postings that decouples the probabilistic part (an LLM proposes candidate skills) from the deterministic part (an MCP tool validates them against the ESCO taxonomy). On 200 postings with silver-standard ground truth, the system reaches 87.8% recall against the regex baseline's 38.0%. The adversarial test added five typos per posting. Recall fell to 50%, still above the baseline.
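The propose-then-validate split can be sketched with the standard library alone. The four-entry vocabulary below is a hypothetical stand-in for a taxonomy lookup and does not reproduce ESCO labels. The fuzzy cutoff and the function name are also assumptions, and the list of proposed skills stands in for LLM output.

```python
import difflib

# Hypothetical stand-in vocabulary; a real system would query the taxonomy.
TAXONOMY = ["python", "sql", "kubernetes", "apache spark"]


def validate_skills(candidates, taxonomy=TAXONOMY, cutoff=0.8):
    """Deterministic check: accept a candidate only if it maps to a taxonomy entry."""
    accepted, rejected = [], []
    for raw in candidates:
        term = raw.strip().lower()
        match = difflib.get_close_matches(term, taxonomy, n=1, cutoff=cutoff)
        if match:
            if match[0] not in accepted:
                accepted.append(match[0])  # canonical label, deduplicated
        else:
            rejected.append(raw)  # not in the vocabulary: never enters the output
    return accepted, rejected


# Stand-in for skills proposed by an LLM, including a typo and an invented skill.
proposed = ["Python", "pyhton", "SQL", "synergy"]
accepted, rejected = validate_skills(proposed)
print(accepted, rejected)
```

The fuzzy match gives the validator some tolerance for typos, while anything with no close taxonomy entry is dropped no matter how confident the proposer was.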

Watch the demo

Four-Method Entity Resolution on News and Tweets

Sri Ashritha Appalchity

Entity Resolution

Sri Ashritha ran four entity-resolution methods head-to-head on 204K All-the-News-2.0 articles and 2,483 labeled CLEF RepLab tweet clusters. The four were rule-based matching, Magellan Random Forest, Ditto S-BERT, and GPT-4o zero-shot. MinHash LSH blocking validated recall at 0.92 on news and 0.87 on tweets. The paper's cost-breakeven analysis lands at roughly 4,700 pairs. At that volume the GPT-4o approach costs the same as Magellan. Below it, the LLM is cheaper.

Watch the demo

Book Metadata Integration with FRBR Schema

Harris Barton

Books · Entity Resolution

Harris wired OpenLibrary and GoogleBooks together end-to-end into the FRBR bibliographic schema. He curated a 200-book ground-truth set by hand, split it 140/60 for train and test, and reported three-run mean and standard deviation across six fields. The prompt-fidelity ablation reports the null result. Low, mid, and high fidelity prompts give F1 of 0.925, 0.921, and 0.928. Prompt engineering did not move the metric on this task, and the paper reports the finding rather than hiding it.

Watch the demo

Reproducibility-First Deterministic ETL Baseline

Siyuan Pan

ETL · Civic Data

Siyuan's project pivoted mid-semester. The proposal was for an LLM-orchestrated pipeline. The final deliverable is a fully deterministic ETL integrating NYC 311 with restaurant inspection data into a 22-column unified schema. The pivot is disclosed up front in the paper rather than buried. The deliverable runs all 19 unit tests in 1.07 seconds at $0.00 marginal cost.

Watch the demo

Negative-Orchestration on F1 Race Strategy

Rukaiya Khan

Sports · RAG · Negative Result · Agentic

On Formula 1 race-strategy decisions, the "no_multi_agent" configuration scored 0.900 and the "full_rag" configuration scored 0.520. Five-seed runs per variant rule out luck. RAGAS metrics (faithfulness, answer-relevancy, context-precision, recall) trace the failure to a collapse in answer-relevancy under multi-agent decomposition. The paper documents where orchestration hurt rather than helped.

Demo video hosted privately.

Hybrid Permit Normalization with Token-Budget Controller

Jiangwei Wang

Civic Data · Hybrid LLM+Rule

Jiangwei normalized 1,000 permit records with a hybrid rule-plus-LLM pipeline under a 20,000-token budget cap. The disagreement matrix is the clearest signal: the hybrid corrected 260 records the rules missed, the rules corrected only 2 records the hybrid missed, both got 683 right, and both missed 55. The marginal contribution of the LLM is the difference between those two counts.
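The four cells of the disagreement matrix are enough to recover each pipeline's accuracy and the LLM's net contribution. The sketch below does that arithmetic, assuming "corrected" means the pipeline's final output for that record was right. The variable names are illustrative.

```python
# Cells of the disagreement matrix, as reported in the text.
both_right = 683
hybrid_only = 260   # hybrid right, rules wrong
rules_only = 2      # rules right, hybrid wrong
both_wrong = 55

total = both_right + hybrid_only + rules_only + both_wrong
rules_accuracy = (both_right + rules_only) / total
hybrid_accuracy = (both_right + hybrid_only) / total
net_gain = hybrid_only - rules_only  # records gained by adding the LLM

print(f"records: {total}")
print(f"rules accuracy:  {rules_accuracy:.3f}")
print(f"hybrid accuracy: {hybrid_accuracy:.3f}")
print(f"net gain from the LLM: {net_gain} records")
```

Only the two off-diagonal cells separate the pipelines. The 683 records both got right and the 55 both missed say nothing about which one is better.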

Demo video hosted privately.

Resources

Course materials: Syllabus · Schedule · Assignments · Projects

Acknowledgments

The work above represents fifteen weeks of student effort. The students chose their own problems, defended their methodological choices in conference-style peer review, and shipped artifacts that hold up against the published literature in their respective subareas. They are the reason this term’s record looks the way it does.

If you are a researcher whose work appears in any of the cohort’s reference lists and you would like to be in touch with the student who cited you, please reach out.
