CIS 6930 Spring 26


Data Engineering at the University of Florida

Course Project: LLM-Augmented Data Pipelines

The course project explores the intersection of data engineering and large language models. You will design, implement, and evaluate an LLM-augmented data pipeline, producing a research paper that demonstrates rigorous evaluation.

Project Structure

Timeline

| Milestone | Due Date | Points | Deliverable |
| --- | --- | --- | --- |
| Proposal | Feb 23, 11:59 PM | 100 | Research question + initial design |
| Design Review | Mar 2, 11:59 PM | 100 | Detailed architecture + evaluation plan |
| Code Checkpoint | Mar 23, 11:59 PM | 150 | Working prototype + initial results |
| Draft Paper | Mar 30, 11:59 PM | 100 | Complete draft for feedback |
| Final Paper | Apr 13, 11:59 PM | 400 | Polished paper with full evaluation |
| Presentation | Week of Apr 20 | 150 | 10-minute presentation + Q&A |

Project Directions

Choose one of the following directions:

Direction A: LLM Pipeline over a New Data Source

Design an LLM-orchestrated data pipeline for a dataset of your choice. Demonstrate how LLMs can handle real-world data engineering challenges.

Data Sources: Smart city portals, government data, scientific data, web APIs

Direction B: Compare LLM vs Traditional Technologies

Replicate a traditional data pipeline approach and compare it to an LLM-augmented version. This direction lends itself naturally to a strong evaluation section.

Compare: Hand-coded ETL vs. MCP-orchestrated pipelines; rule-based validation vs. LLM-based detection
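For the validation comparison in Direction B, a hand-coded baseline can be quite small. The sketch below is one illustrative shape: a rule-based validator next to a stubbed-out LLM check. The record fields, the specific rules, and the `llm_validate` interface are all assumptions for illustration, not requirements of the assignment.

```python
import re

# Rule-based baseline: a hand-coded validator for a toy sensor record.
# Field names, rules, and the llm_validate interface are illustrative.

DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def rule_based_validate(record: dict) -> list[str]:
    """Return a list of rule violations for one record."""
    errors = []
    if not DATE_RE.match(record.get("date", "")):
        errors.append("date: expected YYYY-MM-DD")
    if not isinstance(record.get("reading"), (int, float)):
        errors.append("reading: expected a number")
    return errors

def llm_validate(record: dict) -> list[str]:
    """Stub for the LLM-augmented check; swap in your model client."""
    raise NotImplementedError

records = [
    {"date": "2026-03-01", "reading": 42.0},
    {"date": "03/01/2026", "reading": "n/a"},
]
print([rule_based_validate(r) for r in records])
# [[], ['date: expected YYYY-MM-DD', 'reading: expected a number']]
```

Running both validators over the same labeled records gives you the paired results your evaluation section needs.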

Direction C: Novel Data Orchestration Architecture

Propose and implement a new architecture combining MCP servers, LLM reasoning, and traditional tools.

Ideas: Adaptive ETL, self-healing pipelines, multi-agent integration


Research Questions

Your project must answer a research question involving data engineering and LLMs:

| Category | Example |
| --- | --- |
| Performance | Does LLM-orchestrated ETL achieve comparable accuracy to hand-coded transformations? |
| Robustness | How does an LLM handle schema drift compared to rule-based validation? |
| Efficiency | What is the trade-off between LLM token cost and development time? |
| Capability | Can LLMs perform entity resolution without domain-specific training? |
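For a question like the entity-resolution one above, a non-LLM baseline can be as simple as string similarity. The sketch below uses `difflib.SequenceMatcher` from the standard library; the 0.8 threshold is an arbitrary illustration, not a recommended value.

```python
from difflib import SequenceMatcher

# Trivial string-similarity baseline for entity resolution.
# The threshold is an arbitrary illustration, not a recommendation.

def same_entity(a: str, b: str, threshold: float = 0.8) -> bool:
    a, b = a.lower().strip(), b.lower().strip()
    return SequenceMatcher(None, a, b).ratio() >= threshold

print(same_entity("university of florida", "University of Florida"))   # True
print(same_entity("University of Florida", "Florida State University"))  # False
```

Comparing an LLM's match decisions against a baseline like this (on a hand-labeled gold set) turns the capability question into a measurable one.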

Evaluation Requirements

Evaluation is critical; a large share of your grade depends on it:

1. Define Metrics

2. Establish Baselines

3. Report Results
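The three steps above can be made concrete with a small metric computation: compare your pipeline's output against a hand-labeled gold set. The record IDs and the choice of precision/recall here are illustrative.

```python
# Compare pipeline output against a hand-labeled gold set.
# Record IDs and metric choice are illustrative.

def precision_recall(predicted: set, gold: set) -> tuple[float, float]:
    tp = len(predicted & gold)                       # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

gold = {"rec-1", "rec-2", "rec-3", "rec-4"}          # hand-labeled answers
llm_flagged = {"rec-1", "rec-2", "rec-5"}            # pipeline output
p, r = precision_recall(llm_flagged, gold)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.67 recall=0.50
```

Report the same metrics for the baseline and the LLM-augmented version so the comparison is direct.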


Grading Rubric

Final Paper (400 points)

| Section | Points |
| --- | --- |
| Introduction & Motivation | 50 |
| Related Work | 50 |
| Methodology | 100 |
| Evaluation | 150 |
| Conclusion | 50 |

Implementation (250 points via checkpoints)

| Criterion | Points |
| --- | --- |
| Working MCP servers | 100 |
| LLM orchestration | 100 |
| Tests and documentation | 50 |

Presentation (150 points)

| Criterion | Points |
| --- | --- |
| Clear communication | 50 |
| Demo of working system | 50 |
| Handling questions | 50 |

Repository Structure

cis6930sp26-project/
├── README.md
├── proposal/
│   └── proposal.md (or .pdf)
├── src/
│   ├── servers/
│   │   └── (MCP server code)
│   └── pipeline/
│       └── (orchestration code)
├── tests/
├── data/
│   └── .gitkeep
├── paper/
│   └── paper.tex (or .md)
├── docs/
├── COLLABORATORS.md
└── pyproject.toml
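A minimal `pyproject.toml` matching this layout might look like the following; the project metadata and version values are placeholders to adapt, not required settings.

```toml
# Placeholder metadata -- adjust name, version, and dependencies.
[project]
name = "cis6930sp26-project"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = []

[tool.pytest.ini_options]
testpaths = ["tests"]
```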

Research Topics

LLM-Powered Data Engineering

Pipeline Architectures

Evaluation & Benchmarks


Resources

