CIS 6930 Spring 26

Data Engineering at the University of Florida

CIS 6930 · Spring 2026 · University of Florida

Data Engineering (with LLMs)

Fifteen weeks. Twenty-four graduate students. Real datasets, real systems, real evaluations. The showcase below collects what they built: data pipelines that route work among deterministic rules, retrieval over past fixes, and LLM reasoning, each on a real dataset with measured cost and accuracy.

24 Students
19 Demo videos
8 Application domains
15 Weeks of work

This is a data engineering course. The discipline has its own canon of pipelines and primitives: extract from sources, integrate heterogeneous schemas, clean and label, evaluate, audit. The course covered the same canon in the Spring 2025 offering and in earlier semesters. What shifted this year is that large language models started reshaping how each stage of the pipeline gets built, from extraction to entity resolution to data cleaning to evaluation. The Spring 2026 cohort set out to examine that shift, asking at every stop where in the pipeline an LLM belongs and what has to surround it for the contribution to remain auditable when you let it in. The twenty-four student projects below are the cohort’s experiments testing that question on real data.

An LLM call costs roughly 10,000× more than a regex match and 100× more than a SQL lookup, so the answer is rarely “everywhere.” Each project worked out a different piece of the answer on a real dataset, against a real baseline, with the cost arithmetic written down. Some projects built LLM-orchestrated pipelines from scratch. Some pitted an LLM-augmented approach against a traditional baseline. Some proposed novel architectures that the literature had not yet tried.
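
The routing idea can be made concrete with a small sketch. The version below is an illustration only, not any student's pipeline: it tries a regex first, then a lookup table standing in for a SQL query, and calls the LLM only when both fail. The cost units come from the ratios quoted above (regex = 1, lookup = 100, LLM = 10,000). `ZIP_RE`, `KNOWN_FIXES`, and `fake_llm` are hypothetical names invented for the example.

```python
import re
from dataclasses import dataclass

# Relative cost units implied by the ratios in the text:
# an LLM call is 10,000x a regex match and 100x a SQL lookup.
COST_REGEX, COST_LOOKUP, COST_LLM = 1, 100, 10_000

ZIP_RE = re.compile(r"\b\d{5}\b")            # hypothetical rule tier
KNOWN_FIXES = {"gainsville": "Gainesville"}  # hypothetical stand-in for a SQL table


@dataclass
class Routed:
    value: str
    tier: str
    cost: int  # cumulative cost of every tier that was tried


def fake_llm(text):
    """Placeholder for a real model call (assumption: returns cleaned text)."""
    return text.strip().title()


def route(text, llm=fake_llm):
    """Try the cheapest tier first and escalate only on a miss."""
    cost = COST_REGEX
    match = ZIP_RE.search(text)
    if match:
        return Routed(match.group(), "regex", cost)

    cost += COST_LOOKUP
    fix = KNOWN_FIXES.get(text.strip().lower())
    if fix is not None:
        return Routed(fix, "lookup", cost)

    cost += COST_LLM
    return Routed(llm(text), "llm", cost)
```

Under these assumed units, a record that falls through to the LLM costs 10,101 units, so every record the first two tiers catch is a large saving.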

Guest speakers

Two industry engineers visited the cohort in February. Dr. Shiree Hughes (General Motors, Big Data) came on Monday, February 9 to walk through the Hadoop, Spark, and Kafka stack that ingests telemetry from millions of connected vehicles, schema-evolution practices, and how data quality is handled at automotive scale. Mikhail Sinanan (Spotify, Global Head of Engineering for Audiobooks Supply Chain) joined on Friday, February 13 for a fireside on Spotify’s data platform, where Kafka handles 50 million events per second and the platform serves 1.4 trillion events per day. The visits covered streaming architectures, platform ownership across teams, and the cost arithmetic that governs what ships versus what does not.

Dr. Shiree Hughes

Software Engineer, Big Data · General Motors

Designs and operates the Hadoop, Spark, and Kafka pipelines that ingest telemetry from connected vehicles. Former President of ACM-Women. Joined an in-class discussion on streaming, schema evolution, and the practical realities of running data infrastructure at automotive scale.

See the visit page →

Mikhail Sinanan

Global Head of Engineering, Audiobooks Supply Chain · Spotify

Architect of foundational systems behind Spotify Wrapped. Scaled the music platform 4× while cutting cloud costs 40%. Joined a fireside chat on processing 1.4 trillion data points a day, Kafka at 50M events/second, and what platform engineering looks like at global scale.

See the visit page →

How the course is built

The skill-building track moves students through six stages over the first eight weeks. Each stage introduces one architectural primitive the cohort then uses in the final-paper projects shown later on this page.

  1. API integration with LLMs (Assignment 0). Collect structured data from a public API, have an LLM extract fields from unstructured responses, and validate against a schema.
  2. MCP data pipelines (Assignment 1). Build a Model Context Protocol server that exposes typed tools, then wire those tools into a data engineering workflow.
  3. MCP on research computing (Assignment 1.5). The same architecture against UF’s NaviGator gateway and HiPerGator infrastructure.
  4. Retrieval-augmented systems (Assignment 2). Vector stores, embeddings, a standard RAG pipeline, and evaluation against a held-out QA set.
  5. Fairness auditing (Assignment 3). Fairness metrics on the Adult Income dataset, mitigation comparisons, and a reflection on the impossibility theorem.
  6. The project track (1,000 points). Proposal, code checkpoint, paper outline, draft paper, final paper, peer review of two peers’ drafts, group presentation, and finals presentation.
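
The validation step in stage 1 can be sketched in a few lines. This is a minimal illustration using only the standard library, not the assignment's reference solution: the `SCHEMA` fields and the `validate` helper are hypothetical, and a real pipeline might use a dedicated schema library instead.

```python
import json

# Hypothetical target schema: field name -> required Python type.
SCHEMA = {"title": str, "year": int, "tags": list}


def validate(raw, schema=SCHEMA):
    """Parse an LLM response and check it against the schema.

    Returns (record, errors); record is None whenever errors is non-empty.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top-level value is not a JSON object"]

    errors = []
    for field, expected in schema.items():
        if field not in data:
            errors.append(f"missing field: {field}")
            continue
        value = data[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        wrong_bool = isinstance(value, bool) and expected is not bool
        if wrong_bool or not isinstance(value, expected):
            errors.append(f"wrong type for {field}: expected {expected.__name__}")
    for field in sorted(set(data) - set(schema)):
        errors.append(f"unexpected field: {field}")
    return (data if not errors else None), errors
```

Returning the error list rather than raising makes it easy to log rejected responses or send them back to the model for a retry.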

UF research computing

Every student in the cohort gets access to two UF-operated platforms that this course relies on. NaviGator is the AI gateway. HiPerGator is the research computing cluster. Both are free to UF students, faculty, and staff.

NaviGator

it.ufl.edu/ai

UF's AI gateway. Provides API access to commercial and open-weight models through a single credential, plus hosted chat and notebook interfaces.

  • NaviGator Toolkit. API access to GPT, Llama, Gemini, and Claude. Used for most of the cohort's LLM calls.
  • NaviGator Chat. Web chat interface that can be pointed at custom datasets.
  • NaviGator Notebook. Hosted notebook environment running Google Gemini for document work.
  • Available to all UF students, faculty, and staff at no cost.

HiPerGator

it.ufl.edu/rc

UF's research computing cluster. Runs the cohort's MCP servers, model evaluations, and any workload that needs more than a laptop.

  • CPU. 32 cores maximum per class allocation.
  • GPU. 8 GPUs maximum, requested in advance rather than allocated automatically.
  • Storage. 2 TB on the Blue parallel filesystem.
  • Student accounts expire two weeks after the semester ends. Request resources at least two weeks before the semester starts.

Assignment 1.5 walks students through deploying an MCP server on HiPerGator and pointing it at the NaviGator endpoints. The project track uses both platforms throughout.

The class held group-presentation finals at the end of the term. Five group winners advanced, and the class voted Shane Thomas’s lateral-movement detection work the overall best presentation. The four group finalists (Zachary Zeng, Nikhitha Nagabhyru, Kevin Tran, Zachary Allen) carry the silver-medal badge below. The featured set also includes four papers that did not advance to finals but stood out on methodological grounds: Sanjeev’s pre-registered volatility study, Xiaomeng’s mixed-effects regression across nine essay scorers, Atul’s hallucination-as-metric framing on NYC 311 data, and Adnan’s MCP-orchestrated comparison against four canonical HoloClean baselines. Click any card to watch the student walk through the work.

Tag legend. Method tags identify the architectural approach (MCP, RAG, Entity Resolution, Tiered Routing, Hybrid LLM+Rule, Statistical Eval, Negative Result, Pre-Registered, Agentic, Data Cleaning) and domain tags identify the application area (Healthcare, Cybersecurity, Civic Data, Music, Chemistry, Education, Jobs, Books, Logs, Sports).
🏆 Overall Best Presentation · Class Vote

Two-Phase Agentic Lateral-Movement Detection on LANL

Shane Thomas

Cybersecurity · Agentic · Statistical Eval

"The agent never sees ground truth. The evaluator never sees the agent's internal state. The separation is the whole point."

Shane built a two-phase agent to find adversary lateral movement in the LANL authentication logs. Phase one scans the graph blind and surfaces anomalies. Phase two interrogates each anomaly entity by entity. The cohort voted this the strongest presentation of the term because the experimental design carries the result: thirty balanced cases across three independent GPT-OSS-120b runs put mean F1 at 0.747, against a Cypher-rules baseline at 0.732, with the false-positive distribution broken down by failure mode rather than averaged away.

Watch the demo

The thirteen projects below cover a wider sweep of the architectural space, from purely deterministic ETL baselines to fully agentic detection loops. Several read as practitioner case studies. Vatsal’s cross-city permit work, Rukaiya’s F1 race-strategy paper, and the two HoloClean cleaning projects each pit several architectures against one another on a real dataset and report the disagreement breakdown in full. The Books, PubChem, and skill-extraction projects work at the schema-mapping end of the spectrum, where the LLM’s job is to land a record into the right slot of a known taxonomy. Each card below is worth reading for the methodological choice the author defended in their final paper, not the headline number alone.

Cross-City Permit Integration with the Valentine Matcher Stack

Vatsal Harish Shah

Civic Data · Entity Resolution · Schema Matching

Two cities, two permit ontologies, and a question. Can an LLM line them up better than the established schema-matching toolkit can? Vatsal pitted five Valentine matchers (Cupid, Similarity Flooding, distribution-based, Jaccard, embedding-based) against five LLM endpoints on Gainesville and San Francisco permit data. Neither approach won outright. The strongest signal came when he intersected the matchers' correct predictions with one LLM's predictions, producing P=1 and F1=2/3 from a hybrid consensus neither method could reach alone.
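
The consensus numbers follow directly from the definitions: with precision at 1, an F1 of 2/3 implies a recall of 1/2. The sketch below shows the arithmetic on invented data. The column pairs are hypothetical, and intersecting two raw prediction sets is a simplification of the procedure described in the paper.

```python
def precision_recall_f1(predicted, truth):
    """Score a set of predicted column matches against the true matches."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical ground truth: four correct column correspondences.
truth = {
    ("permit_no", "permit_number"),
    ("issue_dt", "issued_date"),
    ("addr", "street_address"),
    ("desc", "description"),
}

# Hypothetical predictions: each method finds two true pairs plus one error.
matcher_pred = {
    ("permit_no", "permit_number"),
    ("issue_dt", "issued_date"),
    ("addr", "description"),
}
llm_pred = {
    ("permit_no", "permit_number"),
    ("issue_dt", "issued_date"),
    ("desc", "street_address"),
}

# Keep only the matches both methods agree on.
consensus = matcher_pred & llm_pred
p, r, f1 = precision_recall_f1(consensus, truth)
```

In this toy case the two methods make different mistakes, so the intersection drops both errors. Precision rises to 1 while recall stays at 1/2, which gives F1 = 2/3.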

Watch the demo
🥈 Group Finalist · Group D Winner

LLM-Guided Query Rewriting for PubChem

Zachary Allen

RAG · Chemistry

PubChem is one of the world's largest chemistry databases, and asking it the right question is half the battle. Zachary built a system that rewrites a user's vague chemistry query into a verified PubChem REST call, with vector retrieval grounding the rewrite. To keep the test fair, he had Claude generate the vague queries, sidestepping prompt-answer leakage. On 500 compounds and 120 evaluation queries, the rewriter hit 70% accuracy against a 34.17% baseline, a 35.83-point lift. The paper is candid about what it does not solve, listing six limitations including the "re-run until identified" confound.

Watch the demo

Drift-Aware ETL with diff_manifests Audit Trail

Sai Meghana Barla

ETL · Schema Drift · Tiered Routing · Civic Data

Schemas drift in production, and most pipelines silently mishandle the drift until something downstream breaks. Sai Meghana built an ETL pipeline that watches its own schema, derived from a 27-column auto-manifest off a Gainesville 311 snapshot. Her diff_manifests audit trail separates real production drift from synthetic test drift. A Jensen-Shannon-based recovery-stability index sits beside the usual mapping F1. The hybrid-triage policy reaches the same 0.8333 schema-recovery score as the LLM-only baseline at 73.6% fewer tokens.
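
The building block behind a Jensen-Shannon-based index is the divergence itself. The sketch below computes it for two categorical distributions in plain Python. How the paper turns this into a recovery-stability index is not described here, so the code stops at the divergence, and the function names are hypothetical.

```python
import math
from collections import Counter


def distribution(values):
    """Turn a list of categorical values into a probability distribution."""
    counts = Counter(values)
    total = sum(counts.values())
    return {key: count / total for key, count in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    keys = set(p) | set(q)
    # Mixture distribution M = (P + Q) / 2.
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl_to_mixture(a):
        # KL(A || M); terms with zero probability contribute nothing.
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)
```

Comparing a column's value distribution before and after a schema change gives a number between 0 and 1, which is one plausible way to flag drift in column content that a pure name-matching check would miss.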

Demo video hosted privately.

Synthetic-vs-Real Benchmark Gap on MIMIC-III/IV Vital Signs

Ian Arnold

Healthcare · Hybrid LLM+Rule · Statistical Eval

Clinical NLP papers often validate on synthetic templates, then ship to production EHR text and learn the hard way. Ian quantified the gap. He kept everything constant (the same code path, the same prompts, the same patterns) and swapped only the data, running three backends (regex, Claude Sonnet, hybrid) over synthetic MIMIC-IV templates and real MIMIC-III nursing and physician notes. Macro F1 fell by a factor of 8.2 moving from the synthetic templates to the real notes. On the real data, the hybrid backend holds 88% of the LLM's macro F1 at 38% of the API cost.

Demo video hosted privately.

MCP vs Non-MCP Route Optimization

Vittal Chintamaneni

MCP · Routing

Vittal pitted Google's base directions, an LLM with no tool calls, and an LLM with MCP-orchestrated context against each other on five routes spanning short and long durations. The non-MCP LLM averaged +46.39 minutes off the Google baseline. The MCP-augmented LLM averaged near zero. A three-variant ablation (no checkpoints, no weather, no LLM) shows which inputs the agent leaned on when it got close.

Watch the demo
🥈 Group Finalist · Group B Winner

LLM vs Hand-Coded ETL on Three Gainesville Open-Data Sources

Kevin Tran

Civic Data · Hybrid LLM+Rule · ETL

Kevin asked a question every working data engineer has stared at. When does an LLM save time over handwritten ETL? He compared four pipelines (handwritten ETL, string similarity, few-shot LLM, zero-shot LLM) on three Gainesville open-data sources (Socrata IDs p798-x3nx, vu9p-a5f7, gvua-xt9q). The answer depends on field type: the per-field breakdown across numeric, date, categorical, and free-text fields reports where each pipeline wins and loses. Sample size is small at 75 held-out records.

Watch the demo

Honest Negative-Finding Job-Extraction Benchmark

Sai Teja Appani

Jobs · Hybrid LLM+Rule · Negative Result

Sai Teja studied structured-field extraction from 100 LinkedIn job postings. Four target fields and a five-config ablation (no requeue, no validation, no few-shot, threshold 0.4, threshold 0.8) map out where the LLM helps and where it does not. The LLM wins on salary, where regex scores F1=0.00 and the LLM scores F1=0.49. The LLM loses on closed-vocabulary fields where the regex grammar is well-defined. The proposed hybrid architecture costs $14.88 to process 124,000 records.

Watch the demo

Controlled-Corruption Cleaning Comparison on HoloClean Hospital

Sanya Chaturvedi

Healthcare · Data Cleaning · MCP

Sanya took the canonical HoloClean hospital dataset and broke it on purpose, at both 5% and 20% severity, across three different error categories. A tool-bounded LLM orchestrator then tried to fix the damage, alongside the rule-based baseline. The per-category breakdown shows where the LLM earns its cost. Typo correction is the clearest case, with the LLM at 0.74 accuracy against the rule's 0.39. For other categories the gap narrows or reverses, with the per-category numbers reported in full.
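
Controlled corruption of this kind is straightforward to reproduce. The sketch below injects single-character typos into a chosen fraction of values with a fixed seed, so that every cleaning method sees the same damage. It is an illustration under assumptions: `inject_typos` is a hypothetical helper, and the paper's three error categories and exact corruption procedure are not reproduced here.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def inject_typos(values, rate, seed=0):
    """Corrupt a fraction `rate` of the values with one substituted character.

    Returns (corrupted_values, corrupted_indices). The seed makes the
    corruption repeatable across runs and across cleaning methods.
    """
    rng = random.Random(seed)
    n_targets = round(len(values) * rate)
    targets = set(rng.sample(range(len(values)), n_targets))

    corrupted = []
    for i, value in enumerate(values):
        if i in targets and value:  # empty strings are left untouched
            j = rng.randrange(len(value))
            # Pick a replacement that differs from the original character.
            choices = [c for c in ALPHABET if c != value[j].lower()]
            value = value[:j] + rng.choice(choices) + value[j + 1:]
        corrupted.append(value)
    return corrupted, targets
```

Because the corrupted indices are returned alongside the data, repair accuracy can be scored only on the cells that were actually damaged.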

Watch the demo

Skill Extraction with LLM+MCP+ESCO

Palavalli Shyam

MCP · Jobs

Palavalli built a skill-extractor for Hacker News job postings that decouples the probabilistic part (an LLM proposes candidate skills) from the deterministic part (an MCP tool validates them against the ESCO taxonomy). On 200 postings with silver-standard ground truth, the system reaches 87.8% recall against the regex baseline's 38.0%. The adversarial test added five typos per posting. Recall fell to 50%, still above the baseline.
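
The split between proposing and validating can be sketched as follows. This is a simplified illustration, not Palavalli's code: the `TAXONOMY` dictionary is a tiny hypothetical stand-in for ESCO, the identifiers in it are invented, and the candidate list plays the role of the LLM's output.

```python
# Hypothetical stand-in for a taxonomy lookup: normalized label -> invented ID.
TAXONOMY = {
    "python": "skill:python",
    "apache kafka": "skill:kafka",
    "sql": "skill:sql",
}


def normalize(label):
    """Lowercase and collapse whitespace so lookups are deterministic."""
    return " ".join(label.lower().split())


def validate_skills(candidates, taxonomy=TAXONOMY):
    """Keep only candidates that exist in the taxonomy; report the rest."""
    accepted, rejected = [], []
    for candidate in candidates:
        skill_id = taxonomy.get(normalize(candidate))
        if skill_id is None:
            rejected.append(candidate)
        elif skill_id not in accepted:  # drop duplicate proposals
            accepted.append(skill_id)
    return accepted, rejected


def recall(found, gold):
    """Fraction of gold-standard skills that were found."""
    return len(set(found) & set(gold)) / len(gold) if gold else 0.0
```

Exact lookup also suggests why typos hurt recall: a misspelled candidate such as "pyhton" is rejected outright unless the validator adds fuzzy matching.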

Watch the demo

Four-Method Entity Resolution on News and Tweets

Sri Ashritha Appalchity

Entity Resolution

Sri Ashritha ran four entity-resolution methods head-to-head on 204K All-the-News-2.0 articles and 2,483 labeled CLEF RepLab tweet clusters. The four were rule-based matching, Magellan Random Forest, Ditto S-BERT, and GPT-4o zero-shot. MinHash LSH blocking was validated at a recall of 0.92 on news and 0.87 on tweets. The paper's cost-breakeven analysis lands at roughly 4,700 pairs. At that volume the GPT-4o approach costs the same as Magellan. Below it, the LLM is cheaper.
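
A break-even point like this usually comes from comparing a fixed setup cost against a per-pair cost. The sketch below assumes that simple model: the trained matcher pays a one-time cost for labeling and training, and the LLM pays per pair. Both the model and every number in it are invented for illustration and are not the paper's figures.

```python
def break_even_pairs(fixed_cost, llm_per_pair, ml_per_pair=0.0):
    """Number of pairs at which the two approaches cost the same.

    Assumed cost model (not taken from the paper):
      LLM total     = n * llm_per_pair
      trained total = fixed_cost + n * ml_per_pair
    """
    if llm_per_pair <= ml_per_pair:
        raise ValueError("the LLM must cost more per pair for a break-even to exist")
    return fixed_cost / (llm_per_pair - ml_per_pair)


def cheaper_method(n_pairs, fixed_cost, llm_per_pair, ml_per_pair=0.0):
    """Return which approach is cheaper at a given volume."""
    llm_total = n_pairs * llm_per_pair
    ml_total = fixed_cost + n_pairs * ml_per_pair
    return "llm" if llm_total < ml_total else "trained"


# Invented example: $50 of labeling effort versus $0.01 per LLM pair.
threshold = break_even_pairs(fixed_cost=50.0, llm_per_pair=0.01)
```

With these made-up inputs the threshold is about 5,000 pairs. The shape of the result matches the paper's conclusion: below the threshold the LLM avoids the setup cost and wins, and above it the trained matcher amortizes that cost.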

Watch the demo

Book Metadata Integration with FRBR Schema

Harris Barton

Books · Entity Resolution

Harris wired OpenLibrary and GoogleBooks together end-to-end into the FRBR bibliographic schema. He curated a 200-book ground-truth set by hand, split it 140/60 for train and test, and reported three-run mean and standard deviation across six fields. The prompt-fidelity ablation reports a null result. Low-, mid-, and high-fidelity prompts give F1 of 0.925, 0.921, and 0.928. Prompt engineering did not move the metric on this task, and the paper reports the finding rather than hiding it.

Watch the demo

Reproducibility-First Deterministic ETL Baseline

Siyuan Pan

ETL · Civic Data

Siyuan's project pivoted mid-semester. The proposal was for an LLM-orchestrated pipeline. The final deliverable is a fully deterministic ETL integrating NYC 311 with restaurant inspection data into a 22-column unified schema. The pivot is disclosed up front in the paper rather than buried. The deliverable runs all 19 unit tests in 1.07 seconds at $0.00 marginal cost.

Watch the demo

Negative-Orchestration on F1 Race Strategy

Rukaiya Khan

Sports · RAG · Negative Result · Agentic

On Formula 1 race-strategy decisions, the "no_multi_agent" configuration scored 0.900 and the "full_rag" configuration scored 0.520. Five-seed runs per variant guard against a lucky seed. RAGAS metrics (faithfulness, answer-relevancy, context-precision, context-recall) trace the failure to a collapse in answer-relevancy under multi-agent decomposition. The paper documents where orchestration hurt rather than helped.

Demo video hosted privately.

Hybrid Permit Normalization with Token-Budget Controller

Jiangwei Wang

Civic Data · Hybrid LLM+Rule

Jiangwei normalized 1,000 permit records with a hybrid rule-plus-LLM pipeline under a 20,000-token budget cap. The disagreement matrix is the clearest signal: the hybrid corrected 260 records the rules missed, the rules corrected only 2 records the hybrid missed, both got 683 right, and both missed 55. The marginal contribution of the LLM is the difference between those two counts.
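
The four counts form a 2×2 disagreement table, and the headline quantities fall out of simple arithmetic. The sketch below uses only the counts quoted above. The accuracies and the net gain are derived here rather than quoted from the paper, reading "got right" as a correct normalization.

```python
# Counts from the disagreement matrix described above.
both_right = 683    # hybrid correct, rules correct
hybrid_only = 260   # hybrid correct, rules wrong
rules_only = 2      # rules correct, hybrid wrong
both_wrong = 55     # neither correct

total = both_right + hybrid_only + rules_only + both_wrong

# Accuracy of each pipeline on its own (derived, not quoted).
hybrid_accuracy = (both_right + hybrid_only) / total
rules_accuracy = (both_right + rules_only) / total

# Net records gained by adding the LLM: the difference between the
# two off-diagonal cells.
net_gain = hybrid_only - rules_only
```

The cells sum to the 1,000 records in the study, the net gain is 258 records, and the derived accuracies are 94.3% for the hybrid against 68.5% for the rules alone.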

Demo video hosted privately.

Acknowledgments

The work above represents fifteen weeks of student effort. The students chose their own problems, defended their methodological choices in conference-style peer review, and shipped artifacts that hold up against the published literature in their respective subareas. They are the reason this term’s record looks the way it does.

If you are a researcher whose work appears in any of the cohort’s reference lists and you would like to be in touch with the student who cited you, please reach out.

