Data Engineering at the University of Florida
Fifteen weeks. Twenty-four graduate students. Real datasets, real systems, real evaluations. The showcase below collects what they built: data pipelines that route among deterministic rules, retrieval over past fixes, and LLM reasoning, each on a real dataset with measured cost and accuracy.
This is a data engineering course. The discipline has its own canon of pipelines and primitives: extract from sources, integrate heterogeneous schemas, clean and label, evaluate, audit. The course covered the same canon in the Spring 2025 offering and in earlier semesters. What shifted this year is that large language models started reshaping how each stage of the pipeline gets built, from extraction to entity resolution to data cleaning to evaluation. The Spring 2026 cohort set out to examine that shift, asking at every stage where in the pipeline an LLM belongs and, once it is let in, what has to surround it for its contribution to remain auditable. The twenty-four student projects below are the cohort’s experiments testing that question on real data.
An LLM call costs roughly 10,000× more than a regex match and 100× more than a SQL lookup, so the answer is rarely “everywhere.” Each project worked out a different piece of the answer on a real dataset, against a real baseline, with the cost arithmetic written down. Some projects built LLM-orchestrated pipelines from scratch. Some pitted an LLM-augmented approach against a traditional baseline. Some proposed novel architectures that the literature had not yet tried.
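The cost arithmetic can be sketched as a blended per-record cost. The snippet below is a minimal illustration, not any project's accounting: the unit costs are taken from the ratios quoted above (regex match = 1, SQL lookup = 100, LLM call = 10,000), while the function name and the routing fractions are hypothetical.

```python
# Relative unit costs implied by the ratios in the text: an LLM call is
# ~10,000x a regex match and ~100x a SQL lookup, so a SQL lookup is ~100x a regex.
RELATIVE_COST = {"regex": 1, "sql": 100, "llm": 10_000}


def blended_cost(fractions: dict[str, float]) -> float:
    """Expected relative cost per record when each record is handled by one tier."""
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        raise ValueError("routing fractions must sum to 1")
    return sum(RELATIVE_COST[tier] * share for tier, share in fractions.items())


# Hypothetical routing mixes, for illustration only (not from any project).
llm_everywhere = blended_cost({"llm": 1.0})
mostly_rules = blended_cost({"regex": 0.80, "sql": 0.15, "llm": 0.05})

print(llm_everywhere)
print(mostly_rules)
```

Under these assumed fractions the blended cost is roughly 5% of the LLM-everywhere cost, and almost all of what remains still comes from the 5% of records that reach the LLM.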
Two industry engineers visited the cohort in February. Dr. Shiree Hughes (General Motors, Big Data) came on Monday, February 9 to walk through the Hadoop, Spark, and Kafka stack that ingests telemetry from millions of connected vehicles, schema-evolution practices, and how data quality is handled at automotive scale. Mikhail Sinanan (Spotify, Global Head of Engineering for Audiobooks Supply Chain) joined on Friday, February 13 for a fireside on Spotify’s data platform, where Kafka handles 50 million events per second and the platform serves 1.4 trillion events per day. The visits covered streaming architectures, platform ownership across teams, and the cost arithmetic that governs what ships versus what does not.
Designs and operates the Hadoop, Spark, and Kafka pipelines that ingest telemetry from connected vehicles. Former President of ACM-Women. Joined an in-class discussion on streaming, schema evolution, and the practical realities of running data infrastructure at automotive scale.
See the visit page →
Architect of foundational systems behind Spotify Wrapped. Scaled the music platform 4× while cutting cloud costs 40%. Joined a fireside chat on processing 1.4 trillion data points a day, Kafka at 50M events/second, and what platform engineering looks like at global scale.
See the visit page →
The skill-building track moves students through six stages over the first eight weeks. Each stage introduces one architectural primitive the cohort then uses in the final-paper projects shown later on this page.
Every student in the cohort gets access to two UF-operated platforms that this course relies on. NaviGator is the AI gateway. HiPerGator is the research computing cluster. Both are free to UF students, faculty, and staff.
UF's AI gateway. Provides API access to commercial and open-weight models through a single credential, plus hosted chat and notebook interfaces.
UF's research computing cluster. Runs the cohort's MCP servers, model evaluations, and any workload that needs more than a laptop.
Assignment 1.5 walks students through deploying an MCP server on HiPerGator and pointing it at the NaviGator endpoints. The project track uses both platforms throughout.
The class held group-presentation finals at the end of the term. Five group winners advanced, and the class voted Shane Thomas’s lateral-movement detection work the overall best presentation. The four other group finalists (Zachary Zeng, Nikhitha Nagabhyru, Kevin Tran, Zachary Allen) carry the silver-medal badge below. The featured set also includes four papers that did not advance to finals but stood out on methodological grounds: Sanjeev’s pre-registered volatility study, Xiaomeng’s mixed-effects regression across nine essay scorers, Atul’s hallucination-as-metric framing on NYC 311 data, and Adnan’s MCP-orchestrated comparison against four canonical HoloClean baselines. Click any card to watch the student walk through the work.
"The agent never sees ground truth. The evaluator never sees the agent's internal state. The separation is the whole point."
Shane built a two-phase agent to find adversary lateral movement in the LANL authentication logs. Phase one scans the graph blind and surfaces anomalies. Phase two interrogates each anomaly entity by entity. The cohort voted this the strongest presentation of the term because the experimental design carries the result: thirty balanced cases across three independent gpt-oss-120b runs put mean F1 at 0.747, against a Cypher-rules baseline at 0.732, with the false-positive distribution broken down by failure mode rather than averaged away.
"Most of the F1 gain prior multi-agent ER papers credited to decomposition was actually hiding inside the retrieval step."
Multi-agent entity resolution papers had been claiming that breaking a matching task into specialist agents lifts F1. Zachary held the base model fixed at gpt-oss-120b and stripped retrieval and external knowledge from the pipeline, so decomposition could be measured alone. Five independent runs on the Abt-Buy benchmark and 115 discordant pooled pairs later, the McNemar p-value sits at 0.26 and the Cohen's kappa at 0.942 ± 0.005. Decomposition on its own does not move the needle. The gain prior work credited to it was hiding inside the retrieval step his study removed.
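A McNemar test looks only at the discordant pairs, the ones where the two systems disagree. The sketch below is a minimal exact (binomial) version using the standard library. The source does not report how the 115 discordant pairs split between the two systems, or whether the exact or chi-squared variant was used, so the counts in the example are hypothetical.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b = pairs system A got right and system B got wrong,
    c = pairs system B got right and system A got wrong.
    Concordant pairs (both right or both wrong) do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical split of discordant pairs, for illustration only.
print(mcnemar_exact(12, 8))
```

A large p-value means the disagreements fall about evenly on both sides, which is how a "decomposition alone does not help" result shows up in this test.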
"Hypothesis registered before any data was collected. Hypothesis rejected by the data."
Sanjeev wrote down his hypothesis before he collected a single data point. Schema validation should cut LLM output volatility in half, the pre-registration said. Six hundred baseline calls across thirty countries and two model scales (gpt-oss-120b and llama-3.1-8b) later, the data said otherwise. SLMs gained 11.9% and the gain is statistically significant. LLMs gained 6.3% and the gain is not. The residual volatility that schema validation could not catch is what the paper names the canonicalization gap.
"Encoders go up with rewriting complexity. GPTs go down. The split was non-obvious until mixed-effects regression made it visible."
If a student rewrites the same essay answer at three escalating levels of linguistic complexity while saying exactly the same thing, do automated scorers grade them the same? Xiaomeng built 743 four-version sets from 1,672 ASAP-SAS responses, kept the semantic content identical, and ran nine scorers across the variations. Encoders like BERT and DeBERTa-v3 score the rewrites higher as the prose gets more complex. GPT-4.1, GPT-5.1, and Claude Sonnet 4.6 score them lower. Mixed-effects regression with response-level random intercepts surfaced the split.
"What does an LLM-augmented cleaning pipeline actually risk? Atul put hallucination rate alongside precision and recall."
Atul wanted to know what an LLM-augmented data-cleaning pipeline puts at risk. He took 1,000 NYC 311 records, injected six controlled corruption families across 542 fields, and watched three cleaning strategies handle the wreckage. Drop-and-log, regex repair, and an MCP+LLM agent with thresholded retrieval (sim=0.25) each saw the same test data. The result reads not as a leaderboard but as a map of where each method earns its cost. The 0.1671 hallucination rate sits alongside precision and recall instead of being buried in an appendix.
"A 2× reduction in LLM calls at unchanged accuracy, because the router remembers what it has already fixed."
Nikhitha built a self-healing ETL router that learns from its own history. Deterministic rules try first. A RAG memory of past repair decisions tries second. Only the genuinely novel inputs reach the LLM. On a cold cache the router fires the LLM for 31.2% of the Magellan Walmart-Amazon records. On a warm cache, after the system has seen its first test set, the rate drops to 15.6%. F1 stays at 0.995 against the LLM-only baseline, at six times lower cost and five times lower latency. A five-action constrained vocabulary (rename, cast, fillna, drop, add_default) keeps the LLM from ever writing directly into the data.
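The tiered routing described above can be sketched in a few lines. This is an illustration under stated assumptions, not Nikhitha's implementation: the function names, the record fields, and the exact-match dictionary standing in for the RAG memory are all hypothetical. Only the three-tier order and the five-action vocabulary come from the text.

```python
from typing import Callable, Optional

# The five-action constrained vocabulary named in the text.
ALLOWED_ACTIONS = {"rename", "cast", "fillna", "drop", "add_default"}


def _checked(action: str) -> str:
    """Reject anything outside the constrained vocabulary."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"action {action!r} is outside the constrained vocabulary")
    return action


def route(
    record: dict,
    rules: Callable[[dict], Optional[str]],
    memory: Callable[[dict], Optional[str]],
    llm: Callable[[dict], str],
) -> tuple[str, str]:
    """Return (tier, action). Cheaper tiers get the first chance to answer."""
    for tier, handler in (("rules", rules), ("memory", memory)):
        action = handler(record)
        if action is not None:
            return tier, _checked(action)
    return "llm", _checked(llm(record))


# Hypothetical stand-ins for the three tiers.
past_repairs: dict[str, str] = {}  # exact-match dict in place of a real RAG store


def rules(record: dict) -> Optional[str]:
    return "fillna" if record.get("price") is None else None


def memory(record: dict) -> Optional[str]:
    return past_repairs.get(record["issue"])


def llm(record: dict) -> str:
    action = "cast"  # stub: a real call would go to a model endpoint
    past_repairs[record["issue"]] = action  # remember the repair for next time
    return action


novel = {"price": "12.99", "issue": "price_is_string"}
print(route(novel, rules, memory, llm))  # first sighting reaches the LLM tier
print(route(novel, rules, memory, llm))  # the same issue is now served from memory
```

Because the LLM can only return one of the named actions, a malformed or unexpected response fails loudly instead of being written into the data.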
Demo video hosted privately.
"Cheap regex finds everything that might be a vulnerability. The LLM only has to recognize the real ones."
Static security analyzers drown teams in false positives. Juan flipped the usual architecture. Cheap regex sweeps first and flags every candidate. Llama-3.3-70B then triages each candidate and discards the noise. The LLM never has to find vulnerabilities; it only has to recognize them. On 1,000 BigCode Stack files at 30% injection rate with seed 42, the architecture cuts false positives 31% at matched recall, for $4.87 per thousand files. The four-configuration ablation (no few-shot, no schema normalization, no context window, threshold sweep) isolates which prompt choices move the needle.
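The regex-first, classifier-second shape can be sketched as follows. Everything specific here is an assumption for illustration: the two patterns, the function names, and the stub that stands in for the Llama-3.3-70B triage call are hypothetical and are not Juan's rules or prompts.

```python
import re
from typing import Callable

# Hypothetical high-recall patterns; deliberately broad so they over-flag.
CANDIDATE_PATTERNS = {
    "eval_call": re.compile(r"\beval\s*\("),
    "shell_true": re.compile(r"shell\s*=\s*True"),
}

Hit = tuple[int, str, str]  # (line number, pattern name, stripped line text)


def find_candidates(source: str) -> list[Hit]:
    """Cheap sweep: flag every line that matches any pattern."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for name, pattern in CANDIDATE_PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, name, line.strip()))
    return hits


def triage(source: str, is_real: Callable[[Hit], bool]) -> list[Hit]:
    """Expensive step: keep only the candidates the classifier accepts."""
    return [hit for hit in find_candidates(source) if is_real(hit)]


def not_in_comment(hit: Hit) -> bool:
    # Stub classifier standing in for an LLM call: treats commented-out hits as noise.
    return not hit[2].startswith("#")


sample = """# eval(user_input) was removed in v2
result = eval(user_input)
subprocess.run(cmd, shell=True)
"""

print(find_candidates(sample))
print(triage(sample, not_in_comment))
```

The design choice is that recall is owned by the cheap stage and precision by the expensive one, so the classifier is only ever asked a yes-or-no question about a candidate it has already been handed.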
"Strip the user's demographic profile from the prompt and LLM accuracy collapses 69×."
Kanishka built a frequency-based RAG (not vector RAG) to predict the next track in a music session, over 19M Last.fm events from 992 users. The baseline win is real and verifiable: McNemar's chi-squared 16.12, p=3.76e-5. The more interesting question came from the ablation. When Kanishka stripped the user's demographic profile from the prompt, LLM accuracy collapsed 69-fold, from 14% down to 0.2%. The number quantifies what practitioners feel when deciding whether to spend on LLMs for cold-start cases or stick with frequency.
"An LLM cannot read 1.5 GB of logs. Vivek built it a way to navigate them instead."
No LLM can read 1.5 GB of logs. Vivek built it a way to navigate them. Three MCP tools (search_logs, get_error_counts, get_context) let the agent start from a cluster-summary view of 6.8 million HDFS rows and drill in step by step toward the specific traces that matter, at a 170,000:1 compression ratio between corpus and answer. A subsection called "why grep scores 1.0" documents an evaluation artifact that affected the comparison, with the methodology and the correction reported in detail.
"An MCP-based cleaner, benchmarked head-to-head against the four canonical probabilistic systems."
The HoloClean hospital dataset is the canonical proving ground for probabilistic-inference cleaners. Adnan built an MCP-based architecture for the same task. One server extracts records, a second discovers candidate violations, a third proposes repairs. The paper benchmarks the system against HoloClean, Holistic, KATARA, and SCARE in a single comparison table, then breaks the result down by error type with full TP/FP/FN counts and a token-cost column. The cost-accuracy tradeoff against probabilistic inference is reported across the comparison.
The thirteen projects below cover a wider sweep of the architectural space, from purely deterministic ETL baselines to fully agentic detection loops. Several read as practitioner case studies. Vatsal’s cross-city permit work, Rukaiya’s F1 race-strategy paper, and the two HoloClean cleaning projects each pit several architectures against one another on a real dataset and report the disagreement breakdown in full. The Books, PubChem, and skill-extraction projects work at the schema-mapping end of the spectrum, where the LLM’s job is to land a record into the right slot of a known taxonomy. Each card below is worth reading for the methodological choice the author defended in their final paper, not the headline number alone.
Two cities, two permit ontologies, and a question. Can an LLM line them up better than the established schema-matching toolkit can? Vatsal pitted five Valentine matchers (Cupid, Similarity Flooding, distribution-based, Jaccard, embedding-based) against five LLM endpoints on Gainesville and San Francisco permit data. Neither approach won outright. The strongest signal came when he intersected the matchers' correct predictions with one LLM's predictions, producing P=1 and F1=2/3 from a hybrid consensus neither method could reach alone.
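A precision of 1 together with an F1 of 2/3 implies a recall of 1/2, since F1 = 2PR / (P + R). The sketch below works that arithmetic through on a toy consensus of two prediction sets. The column names and correspondences are hypothetical, chosen only so that the consensus lands on the same P and F1; they are not Vatsal's data, and his consensus was built from the matchers' correct predictions rather than from raw predictions as here.

```python
def precision_recall_f1(predicted: set, truth: set) -> tuple[float, float, float]:
    """Set-based precision, recall, and F1 for predicted column correspondences."""
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical correspondences between two permit schemas (illustration only).
truth = {
    ("permit_no", "permit_number"),
    ("issue_dt", "issued_date"),
    ("addr", "street_address"),
    ("val", "estimated_cost"),
}
matcher_pred = {
    ("permit_no", "permit_number"),
    ("issue_dt", "issued_date"),
    ("addr", "estimated_cost"),  # wrong
}
llm_pred = {
    ("permit_no", "permit_number"),
    ("issue_dt", "issued_date"),
    ("val", "street_address"),  # wrong, but a different mistake
}

# Keep only the correspondences both methods agree on.
consensus = matcher_pred & llm_pred

for name, pred in [("matcher", matcher_pred), ("llm", llm_pred), ("consensus", consensus)]:
    print(name, precision_recall_f1(pred, truth))
```

In this toy case the two methods make different mistakes, so the intersection drops every error and keeps only the pairs both found, trading recall for precision.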
PubChem is one of the world's largest chemistry databases, and asking it the right question is half the battle. Zachary built a system that rewrites a user's vague chemistry query into a verified PubChem REST call, with vector retrieval grounding the rewrite. To keep the test fair, he had Claude generate the vague queries, sidestepping prompt-answer leakage. On 500 compounds and 120 evaluation queries, the rewriter hit 70% accuracy against a 34.17% base, a 35.83-point lift. The paper is honest about what it does not solve, listing six limitations including the "re-run until identified" confound.
Schemas drift in production, and most pipelines silently mishandle the drift until something downstream breaks. Sai Meghana built an ETL pipeline that watches its own schema, derived from a 27-column auto-manifest off a Gainesville 311 snapshot. Her diff_manifests audit trail separates real production drift from synthetic test drift. A Jensen-Shannon-based recovery-stability index sits beside the usual mapping F1. The hybrid-triage policy reaches the same 0.8333 schema-recovery score as the LLM-only baseline at 73.6% fewer tokens.
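The Jensen-Shannon divergence underlying such an index can be computed with the standard library alone. The sketch below shows only the divergence between two discrete distributions; how the paper turns it into a recovery-stability index is not stated in the text, and the example distributions are hypothetical.

```python
from math import log2


def js_divergence(p: list[float], q: list[float]) -> float:
    """Jensen-Shannon divergence in bits.

    Symmetric and bounded: 0 for identical distributions, 1 for disjoint ones.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x: list[float], y: list[float]) -> float:
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical category frequencies for one column, before and after drift.
before = [0.70, 0.20, 0.10]
after = [0.40, 0.35, 0.25]
print(js_divergence(before, after))
```

Unlike plain KL divergence, the value stays finite when a category appears in only one of the two snapshots, which is the common case when a schema drifts.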
Demo video hosted privately.
Clinical NLP papers often validate on synthetic templates, then ship to production EHR text and learn the hard way. Ian quantified the gap. He kept everything constant (the same code path, the same prompts, the same patterns) and swapped only the data, running three backends (regex, Claude Sonnet, hybrid) over synthetic MIMIC-IV templates and real MIMIC-III nursing and physician notes. Macro F1 collapsed 8.2× moving from one to the other. On the real data, the hybrid backend holds 88% of the LLM's macro F1 at 38% of the API cost.
Demo video hosted privately.
Vittal pitted Google's base directions, an LLM with no tool calls, and an LLM with MCP-orchestrated context against each other on five routes spanning short and long durations. The non-MCP LLM averaged +46.39 minutes off the Google baseline. The MCP-augmented LLM averaged near zero. A three-variant ablation (no checkpoints, no weather, no LLM) shows which inputs the agent leaned on when it got close.
Kevin asked a question every working data engineer has stared at. When does an LLM save time over handwritten ETL? He compared four pipelines (handwritten ETL, string similarity, few-shot LLM, zero-shot LLM) on three Gainesville open-data sources (Socrata IDs p798-x3nx, vu9p-a5f7, gvua-xt9q). The answer depends on field type: the per-field breakdown across numeric, date, categorical, and free-text fields reports where each pipeline wins and loses. The sample is small, at 75 held-out records.
Sai Teja studied structured-field extraction from 100 LinkedIn job postings. Four target fields and a five-config ablation (no requeue, no validation, no few-shot, threshold 0.4, threshold 0.8) map out where the LLM helps and where it does not. The LLM wins on salary, where regex scores F1=0.00 and the LLM scores F1=0.49. The LLM loses on closed-vocabulary fields where the regex grammar is well-defined. The proposed hybrid architecture costs $14.88 to process 124,000 records.
Sanya took the canonical HoloClean hospital dataset and broke it on purpose, at both 5% and 20% severity, across three different error categories. A tool-bounded LLM orchestrator then tried to fix the damage, alongside the rule-based baseline. The per-category breakdown shows where the LLM earns its cost. Typo correction is the clearest case, with the LLM at 0.74 accuracy against the rule's 0.39. For other categories the gap narrows or reverses, with the per-category numbers reported in full.
Palavalli built a skill-extractor for Hacker News job postings that decouples the probabilistic part (an LLM proposes candidate skills) from the deterministic part (an MCP tool validates them against the ESCO taxonomy). On 200 postings with silver-standard ground truth, the system reaches 87.8% recall against the regex baseline's 38.0%. The adversarial test added five typos per posting. Recall fell to 50%, still above the baseline.
Sri Ashritha ran four entity-resolution methods head-to-head on 204K All-the-News-2.0 articles and 2,483 labeled CLEF RepLab tweet clusters. The four were rule-based matching, Magellan Random Forest, Ditto S-BERT, and GPT-4o zero-shot. MinHash LSH blocking was validated at a recall of 0.92 on news and 0.87 on tweets. The paper's cost-breakeven analysis lands at roughly 4,700 pairs. At that volume the GPT-4o approach costs the same as Magellan. Below it, the LLM is cheaper.
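A breakeven of this kind falls out of a simple linear cost model. The sketch below assumes, without confirmation from the source, that the trained matcher carries a one-time setup cost (labeling and training) plus a small per-pair cost, while the LLM has no setup cost and a higher per-pair cost. The function names and every dollar figure are hypothetical and do not reproduce the paper's inputs.

```python
def breakeven_pairs(setup_cost: float, trained_per_pair: float, llm_per_pair: float) -> float:
    """Number of pairs at which the two linear cost curves cross.

    trained cost(n) = setup_cost + trained_per_pair * n
    LLM cost(n)     = llm_per_pair * n
    """
    if llm_per_pair <= trained_per_pair:
        raise ValueError("no breakeven: the LLM is never more expensive per pair")
    return setup_cost / (llm_per_pair - trained_per_pair)


def total_costs(n: float, setup_cost: float, trained_per_pair: float, llm_per_pair: float):
    """Return (trained_cost, llm_cost) at a volume of n pairs."""
    return setup_cost + trained_per_pair * n, llm_per_pair * n


# Hypothetical figures, for illustration only.
SETUP, TRAINED, LLM = 50.0, 0.0001, 0.01

n_star = breakeven_pairs(SETUP, TRAINED, LLM)
print(round(n_star))
print(total_costs(1_000, SETUP, TRAINED, LLM))    # below breakeven: LLM is cheaper
print(total_costs(20_000, SETUP, TRAINED, LLM))   # above breakeven: trained matcher is cheaper
```

Under this model the LLM wins on small jobs because it has nothing to amortize, and loses once the volume is large enough to spread the setup cost thin.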
Harris wired OpenLibrary and GoogleBooks together end-to-end into the FRBR bibliographic schema. He curated a 200-book ground-truth set by hand, split it 140/60 for train and test, and reported three-run mean and standard deviation across six fields. The prompt-fidelity ablation reports the null result. Low, mid, and high fidelity prompts give F1 of 0.925, 0.921, and 0.928. Prompt engineering did not move the metric on this task, and the paper reports the finding rather than hiding it.
Siyuan's project pivoted mid-semester. The proposal was for an LLM-orchestrated pipeline. The final deliverable is a fully deterministic ETL integrating NYC 311 with restaurant inspection data into a 22-column unified schema. The pivot is disclosed up front in the paper rather than buried. The deliverable runs all 19 unit tests in 1.07 seconds at $0.00 marginal cost.
In Rukaiya's paper on Formula 1 race-strategy decisions, the "no_multi_agent" configuration scored 0.900 and the "full_rag" configuration scored 0.520. Five-seed runs per variant rule out luck. RAGAS metrics (faithfulness, answer-relevancy, context-precision, recall) trace the failure to a collapse in answer-relevancy under multi-agent decomposition. The paper documents where orchestration hurt rather than helped.
Demo video hosted privately.
Jiangwei normalized 1,000 permit records with a hybrid rule-plus-LLM pipeline under a 20,000-token budget cap. The disagreement matrix is the clearest signal: the hybrid corrected 260 records the rules missed, the rules corrected only 2 records the hybrid missed, both got 683 right, and both missed 55. The marginal contribution of the LLM is the difference between those two counts.
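The four cells of the disagreement matrix are enough to recover both accuracies and the net gain. The sketch below uses the counts quoted above and assumes the cells partition the 1,000 records, which their sum supports; the variable names are illustrative.

```python
# Counts from the disagreement matrix in the text.
both_right = 683
hybrid_only = 260   # hybrid right, rules wrong
rules_only = 2      # rules right, hybrid wrong
both_wrong = 55

total = both_right + hybrid_only + rules_only + both_wrong

hybrid_accuracy = (both_right + hybrid_only) / total
rules_accuracy = (both_right + rules_only) / total

# Net records gained by adding the LLM: wins minus losses on the discordant cells.
net_gain = hybrid_only - rules_only

print(total)
print(hybrid_accuracy, rules_accuracy)
print(net_gain)
```

Only the two discordant cells separate the methods. The 683 and 55 records where both agree affect the accuracy levels but not the gap between them.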
Demo video hosted privately.
The work above represents fifteen weeks of student effort. The students chose their own problems, defended their methodological choices in conference-style peer review, and shipped artifacts that hold up against the published literature in their respective subareas. They are the reason this term’s record looks the way it does.
If you are a researcher whose work appears in any of the cohort’s reference lists and you would like to be in touch with the student who cited you, please reach out.