Data Engineering at the University of Florida
The course project explores the intersection of data engineering and large language models. You will design, implement, and evaluate an LLM-augmented data pipeline, producing a research paper that demonstrates rigorous evaluation.
Repository: `cis6930sp26-project` (private; add `cegme` as Admin)

| Milestone | Due Date | Points | Deliverable |
|---|---|---|---|
| Proposal | Feb 23, 11:59 PM | 100 | Research question + initial design |
| Design Review | Mar 2, 11:59 PM | 100 | Detailed architecture + evaluation plan |
| Code Checkpoint | Mar 23, 11:59 PM | 150 | Working prototype + initial results |
| Draft Paper | Mar 30, 11:59 PM | 100 | Complete draft for feedback |
| Final Paper | Apr 13, 11:59 PM | 400 | Polished paper with full evaluation |
| Presentation | Week of Apr 20 | 150 | 10-minute presentation + Q&A |
Choose one of the following directions:

1. **Build.** Design an LLM-orchestrated data pipeline for a dataset of your choice. Demonstrate how LLMs can handle real-world data engineering challenges.
   - Data sources: smart city portals, government data, scientific data, web APIs
2. **Replicate and compare.** Replicate a traditional data pipeline approach and compare it to an LLM-augmented version. This direction produces strong evaluation sections.
   - Comparisons: hand-coded ETL vs. MCP-orchestrated ETL; rule-based validation vs. LLM-based error detection
3. **Propose.** Propose and implement a new architecture combining MCP servers, LLM reasoning, and traditional tools.
   - Ideas: adaptive ETL, self-healing pipelines, multi-agent integration
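For the comparison direction, the two validation approaches can share one interface so results are directly comparable. The sketch below is illustrative only: the field names, rules, and the `call_llm` stub are all hypothetical placeholders, not a required design.

```python
import re

# Rule-based validation: fixed checks, one per field.
def rule_based_validate(record: dict) -> list[str]:
    """Return a list of validation errors for one record."""
    errors = []
    if not re.fullmatch(r"\d{5}", str(record.get("zip", ""))):
        errors.append("zip: expected 5 digits")
    if not record.get("city"):
        errors.append("city: missing")
    return errors

# Hypothetical stand-in for a real LLM client (e.g. an MCP tool call).
def call_llm(prompt: str) -> str:
    return "OK"

# LLM-based validation: same interface, but the check is delegated to a model.
def llm_validate(record: dict) -> list[str]:
    reply = call_llm(
        f"List data-quality problems in this record: {record!r}. "
        "Answer 'OK' if none."
    )
    return [] if reply.strip() == "OK" else [reply.strip()]

record = {"zip": "326O1", "city": "Gainesville"}  # note the letter O in the ZIP
print(rule_based_validate(record))  # -> ['zip: expected 5 digits']
```

Keeping the interfaces identical lets the evaluation section report both approaches on the same error-seeded test set.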
Your project must answer a research question involving data engineering and LLMs:
| Category | Example |
|---|---|
| Performance | Does LLM-orchestrated ETL achieve comparable accuracy to hand-coded transformations? |
| Robustness | How does an LLM handle schema drift compared to rule-based validation? |
| Efficiency | What is the trade-off between LLM token cost and development time? |
| Capability | Can LLMs perform entity resolution without domain-specific training? |
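A research question like the Performance example needs a concrete metric. One minimal sketch: score each pipeline variant against a gold-labeled sample. The data below is invented for illustration; since LLM outputs can vary between runs, report the metric over several runs in practice.

```python
def accuracy(predictions: list[str], gold: list[str]) -> float:
    """Fraction of records where the pipeline output matches the gold label."""
    assert len(predictions) == len(gold)
    matches = sum(p == g for p, g in zip(predictions, gold))
    return matches / len(gold)

# Hypothetical outputs from two pipeline variants on the same 5 records.
gold         = ["FL", "FL", "GA", "FL", "AL"]
hand_coded   = ["FL", "FL", "GA", "FL", "FL"]
llm_pipeline = ["FL", "FL", "GA", "AL", "AL"]

print(f"hand-coded: {accuracy(hand_coded, gold):.2f}")  # 0.80
print(f"LLM:        {accuracy(llm_pipeline, gold):.2f}")  # 0.80
```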
Evaluation is critical: your grade depends heavily on the rigor of your evaluation. The final paper is graded by section:
| Section | Points |
|---|---|
| Introduction & Motivation | 50 |
| Related Work | 50 |
| Methodology | 100 |
| Evaluation | 150 |
| Conclusion | 50 |

| Code Criterion | Points |
|---|---|
| Working MCP servers | 100 |
| LLM orchestration | 100 |
| Tests and documentation | 50 |

| Presentation Criterion | Points |
|---|---|
| Clear communication | 50 |
| Demo of working system | 50 |
| Handling questions | 50 |
```
cis6930sp26-project/
├── README.md
├── proposal/
│   └── proposal.md (or .pdf)
├── src/
│   ├── servers/
│   │   └── (MCP server code)
│   └── pipeline/
│       └── (orchestration code)
├── tests/
├── data/
│   └── .gitkeep
├── paper/
│   └── paper.tex (or .md)
├── docs/
├── COLLABORATORS.md
└── pyproject.toml
```