CIS 6930 Spring 26


Data Engineering at the University of Florida

Course Project: LLM-Augmented Data Pipelines

The course project explores the intersection of data engineering and large language models. You will design, implement, and evaluate an LLM-augmented data pipeline, producing a research paper that demonstrates rigorous evaluation.

Project Structure

Timeline

| Milestone | Due Date | Points | Deliverable |
| --- | --- | --- | --- |
| Proposal | Feb 23, 11:59 PM | 100 | Research question + initial design |
| Design Review | Mar 2, 11:59 PM | 100 | Detailed architecture + evaluation plan |
| Code Checkpoint | Mar 23, 11:59 PM | 150 | Working prototype + initial results |
| Draft Paper | Mar 30, 11:59 PM | 100 | Complete draft for feedback |
| Final Paper | Apr 13, 11:59 PM | 400 | Polished paper with full evaluation |
| Presentation | Week of Apr 20 | 150 | 10-minute presentation + Q&A |

Project Directions

Choose one of the following directions:

Direction A: LLM Pipeline over a New Data Source

Design an LLM-orchestrated data pipeline for a dataset of your choice. Demonstrate how LLMs can handle real-world data engineering challenges.

Data Sources: Smart city portals, government data, scientific data, web APIs

Direction B: Compare LLM vs Traditional Technologies

Replicate a traditional data pipeline approach and compare it to an LLM-augmented version. This direction lends itself naturally to a strong evaluation section.

Compare: Hand-coded ETL vs. MCP-orchestrated pipelines; rule-based validation vs. LLM-based detection
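For the validation comparison in Direction B, a hand-coded baseline can be quite small. The sketch below is one illustrative shape: a rule-based validator next to a stubbed-out LLM check. The record fields, the specific rules, and the `llm_validate` interface are all assumptions for illustration, not requirements of the assignment.

```python
import re

# Rule-based baseline: a hand-coded validator for a toy sensor record.
# Field names, rules, and the llm_validate interface are illustrative.

DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def rule_based_validate(record: dict) -> list[str]:
    """Return a list of rule violations for one record."""
    errors = []
    if not DATE_RE.match(record.get("date", "")):
        errors.append("date: expected YYYY-MM-DD")
    if not isinstance(record.get("reading"), (int, float)):
        errors.append("reading: expected a number")
    return errors

def llm_validate(record: dict) -> list[str]:
    """Stub for the LLM-augmented check; swap in your model client."""
    raise NotImplementedError

records = [
    {"date": "2026-03-01", "reading": 42.0},
    {"date": "03/01/2026", "reading": "n/a"},
]
print([rule_based_validate(r) for r in records])
# [[], ['date: expected YYYY-MM-DD', 'reading: expected a number']]
```

Running both validators over the same labeled records gives you the paired results your evaluation section needs.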

Direction C: Novel Data Orchestration Architecture

Propose and implement a new architecture combining MCP servers, LLM reasoning, and traditional tools.

Ideas: Adaptive ETL, self-healing pipelines, multi-agent integration


Research Questions

Your project must answer a research question involving data engineering and LLMs:

| Category | Example |
| --- | --- |
| Performance | Does LLM-orchestrated ETL achieve comparable accuracy to hand-coded transformations? |
| Robustness | How does an LLM handle schema drift compared to rule-based validation? |
| Efficiency | What is the trade-off between LLM token cost and development time? |
| Capability | Can LLMs perform entity resolution without domain-specific training? |
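For a question like the entity-resolution one above, a non-LLM baseline can be as simple as string similarity. The sketch below uses `difflib.SequenceMatcher` from the standard library; the 0.8 threshold is an arbitrary illustration, not a recommended value.

```python
from difflib import SequenceMatcher

# Trivial string-similarity baseline for entity resolution.
# The threshold is an arbitrary illustration, not a recommendation.

def same_entity(a: str, b: str, threshold: float = 0.8) -> bool:
    a, b = a.lower().strip(), b.lower().strip()
    return SequenceMatcher(None, a, b).ratio() >= threshold

print(same_entity("university of florida", "University of Florida"))   # True
print(same_entity("University of Florida", "Florida State University"))  # False
```

Comparing an LLM's match decisions against a baseline like this (on a hand-labeled gold set) turns the capability question into a measurable one.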

Evaluation Requirements

Evaluation is critical; a large share of your grade depends on it:

1. Define Metrics

2. Establish Baselines

3. Report Results
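The three steps above can be made concrete with a small metric computation: compare your pipeline's output against a hand-labeled gold set. The record IDs and the choice of precision/recall here are illustrative.

```python
# Compare pipeline output against a hand-labeled gold set.
# Record IDs and metric choice are illustrative.

def precision_recall(predicted: set, gold: set) -> tuple[float, float]:
    tp = len(predicted & gold)                       # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

gold = {"rec-1", "rec-2", "rec-3", "rec-4"}          # hand-labeled answers
llm_flagged = {"rec-1", "rec-2", "rec-5"}            # pipeline output
p, r = precision_recall(llm_flagged, gold)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.67 recall=0.50
```

Report the same metrics for the baseline and the LLM-augmented version so the comparison is direct.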


Grading Rubric

Final Paper (400 points)

| Section | Points |
| --- | --- |
| Introduction & Motivation | 50 |
| Related Work | 50 |
| Methodology | 100 |
| Evaluation | 150 |
| Conclusion | 50 |

Implementation (250 points via checkpoints)

| Criterion | Points |
| --- | --- |
| Working MCP servers | 100 |
| LLM orchestration | 100 |
| Tests and documentation | 50 |

Presentation (150 points)

| Criterion | Points |
| --- | --- |
| Clear communication | 50 |
| Demo of working system | 50 |
| Handling questions | 50 |

Repository Structure

cis6930sp26-project/
├── README.md
├── proposal/
│   └── proposal.md (or .pdf)
├── src/
│   ├── servers/
│   │   └── (MCP server code)
│   └── pipeline/
│       └── (orchestration code)
├── tests/
├── data/
│   └── .gitkeep
├── paper/
│   └── paper.tex (or .md)
├── docs/
├── COLLABORATORS.md
└── pyproject.toml
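A minimal `pyproject.toml` matching this layout might look like the following; the project metadata and version values are placeholders to adapt, not required settings.

```toml
# Placeholder metadata -- adjust name, version, and dependencies.
[project]
name = "cis6930sp26-project"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = []

[tool.pytest.ini_options]
testpaths = ["tests"]
```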

Research Topics

LLM-Powered Data Engineering

Pipeline Architectures

Evaluation & Benchmarks


Resources

