Data Engineering at the University of Florida
Due: Monday, March 30, 2026 at 11:59 PM
Points: 100
Submission: Push to the cis6930sp26-project repository, in the paper/ directory
The draft paper is a complete first version of your research paper. All sections should be present with substantive content. Classmates will peer-review this draft, and you will address their feedback in the final paper.
Submit a complete draft paper (6-8 pages) using the following repository layout:
cis6930sp26-project/
├── paper/
│ ├── paper.tex (or paper.md)
│ ├── figures/
│ │ ├── architecture.pdf
│ │ └── results.pdf
│ └── references.bib
└── ...
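Before pushing, it can help to confirm that the layout above is in place. The following is a minimal sketch, not an official checker: the function name `missing_entries` is hypothetical, and it assumes that either paper.tex or paper.md is acceptable and that only the figures/ directory (not specific figure files) is required.

```python
from pathlib import Path

# Entries taken from the layout above. The individual figure files are treated
# as examples, so only the figures/ directory itself is checked (assumption).
REQUIRED = ["figures", "references.bib"]
# paper.tex and paper.md are alternatives; either one is enough.
PAPER_SOURCES = ["paper.tex", "paper.md"]


def missing_entries(repo_root):
    """Return the expected entries under paper/ that are absent."""
    paper_dir = Path(repo_root) / "paper"
    missing = [name for name in REQUIRED if not (paper_dir / name).exists()]
    if not any((paper_dir / name).exists() for name in PAPER_SOURCES):
        missing.append("paper.tex (or paper.md)")
    return missing


if __name__ == "__main__":
    # Run from the root of the cis6930sp26-project checkout.
    problems = missing_entries(".")
    if problems:
        print("Missing from paper/:", ", ".join(problems))
    else:
        print("paper/ layout looks complete")
```

Run it from the repository root; an empty result means every expected entry was found.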
The abstract should:
For the draft, reviewers evaluate completeness and direction. The final paper uses the full conference-style rubric.
| Criterion | Weight | Description |
|---|---|---|
| Completeness | 30% | Are all sections present with substantive content? |
| Technical Soundness | 25% | Is the methodology appropriate and evaluation reasonable? |
| Clarity | 25% | Is the writing clear and well-organized? |
| Progress | 20% | Does the paper reflect significant project progress? |
| Score | Meaning |
|---|---|
| 5 | Excellent - Ready for final polish |
| 4 | Good - On track with minor gaps |
| 3 | Satisfactory - Needs work but salvageable |
| 2 | Below Average - Significant sections incomplete |
| 1 | Incomplete - Major revision needed |
We present TransitLLM, an LLM-augmented data pipeline for integrating heterogeneous smart city transit data. Current approaches to transit data integration require extensive manual schema mapping and custom ETL code for each data source. Our system uses MCP servers to expose transit APIs and an LLM orchestrator to perform automatic schema mapping and data validation. We evaluate TransitLLM on three Gainesville data portals, comparing it against hand-coded baseline pipelines. Results show that our approach achieves 94% schema mapping accuracy while reducing development effort by 60%. Our findings suggest that LLM-orchestrated pipelines offer a promising alternative for data integration tasks of moderate complexity.
This paper makes the following contributions:
- A system architecture for LLM-orchestrated data integration using MCP servers as a modular abstraction layer
- An empirical comparison of LLM-based schema mapping against manual approaches on real smart city data
- A cost-benefit analysis examining the trade-off between LLM token cost and development time savings
- An open-source implementation with MCP servers for three Gainesville data portals
| System | Precision | Recall | F1 | Tokens | Time (s) |
|---|---|---|---|---|---|
| Baseline (hand-coded) | 0.96 | 0.94 | 0.95 | - | 0.2 |
| TransitLLM (GPT-4) | 0.94 | 0.92 | 0.93 | 1,240 | 3.8 |
| TransitLLM (Claude) | 0.93 | 0.91 | 0.92 | 1,180 | 4.1 |
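When you report precision, recall, and F1 together, reviewers may check that the three columns agree. The sketch below recomputes F1 as the harmonic mean of precision and recall for the rows of the example table; the helper name `f1_score` and the 0.005 tolerance (half of the last reported digit) are illustrative choices, not part of the assignment.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the example table above.
ROWS = {
    "Baseline (hand-coded)": (0.96, 0.94, 0.95),
    "TransitLLM (GPT-4)": (0.94, 0.92, 0.93),
    "TransitLLM (Claude)": (0.93, 0.91, 0.92),
}

for system, (precision, recall, reported) in ROWS.items():
    computed = f1_score(precision, recall)
    # Reported values have two decimals, so allow half a unit in the last place.
    status = "consistent" if abs(computed - reported) < 0.005 else "check this row"
    print(f"{system}: computed F1 = {computed:.3f}, reported = {reported:.2f} ({status})")
```

The same check can be applied to your own results table before submitting the draft.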
We examined the 14 schema mapping errors made by TransitLLM. The majority (9/14) occurred when source fields had ambiguous names. For example, the field “datetime” in the 311 API was incorrectly mapped to “created_date” instead of “resolved_date” because the LLM lacked context about the data semantics. Three errors occurred with nested JSON structures where the LLM failed to flatten arrays correctly. The remaining two errors were due to inconsistent date formats that the LLM did not detect.
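An error analysis like the one above is easier to trust when the category counts add up to the stated total. As a small sketch, the counts from the example paragraph can be tallied as follows; the category labels and the function name `error_shares` are paraphrased or hypothetical, chosen only for illustration.

```python
# Counts copied from the example error analysis above; labels are paraphrased.
ERROR_COUNTS = {
    "ambiguous field names": 9,
    "nested JSON structures": 3,
    "inconsistent date formats": 2,
}
REPORTED_TOTAL = 14


def error_shares(counts):
    """Return each category's share of all errors as a fraction of the total."""
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


# The categories should account for every reported error.
assert sum(ERROR_COUNTS.values()) == REPORTED_TOTAL

for category, share in error_shares(ERROR_COUNTS).items():
    print(f"{category}: {ERROR_COUNTS[category]}/{REPORTED_TOTAL} ({share:.0%})")
```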
Your draft will receive reviews from 2-3 classmates using the paper rubric.
When you receive reviews:
Use a standard conference format. Example using the article class:
\documentclass[11pt]{article}
\usepackage[margin=1in]{geometry}
\usepackage{graphicx}
\usepackage{booktabs}
\usepackage{hyperref}
\usepackage{cleveref}
\title{Your Paper Title}
\author{Your Name}
\date{}
\begin{document}
\maketitle
\begin{abstract}
Your abstract here.
\end{abstract}
\section{Introduction}
Your introduction here.
\section{Related Work}
Prior work discussion.
\section{Methodology}
System description.
\section{Evaluation}
Experiments and results.
\section{Conclusion}
Summary and future work.
\bibliographystyle{plain}
\bibliography{references}
\end{document}