CIS 6930 Spring 26

Data Engineering at the University of Florida

Course Project: LLM-Augmented Data Pipelines

Proposal Due: Monday, February 23, 2026 at 11:59 PM
Points: 100 (proposal) + 900 (remaining milestones) = 1000 total
Repository: cis6930sp26-project (private, add cegme as Admin)
Type: Individual project (no teams)


Overview

For your course project, you will design, implement, and evaluate an LLM-augmented data pipeline. Your project must include:

  1. A research question that explores the intersection of data engineering and LLMs
  2. An implementation using MCP servers and LLM orchestration
  3. A rigorous evaluation comparing your approach to baselines

This is an individual project. You will work independently throughout the semester.


Research Questions

Your project should answer a research question that involves data engineering and LLMs. Good research questions are interesting and measurable, as described below.

Why Research Questions Matter

A well-crafted research question is the foundation of a successful project. Your research question determines what you build, how you evaluate it, and what conclusions you can draw. Projects without clear research questions often result in unfocused implementations and weak evaluations.

Interesting questions motivate your work and engage your readers. Ask yourself: Would someone working in data engineering care about the answer? Does the question reveal something non-obvious about how LLMs interact with data pipelines? The best questions challenge assumptions or explore uncharted territory.

Measurable questions allow you to draw concrete conclusions. Avoid vague questions like “Are LLMs good at data cleaning?” Instead, specify what “good” means: accuracy, speed, cost, robustness to edge cases. Your question should naturally lead to experiments with quantifiable outcomes.

Consider these contrasts:

| Weak Question | Stronger Question |
|---|---|
| Can LLMs help with data quality? | Does GPT-4 detect more data anomalies than rule-based validation on sensor data with 5% injected errors? |
| Is MCP useful for ETL? | How does development time compare between hand-coded and MCP-orchestrated ETL for integrating 3 heterogeneous APIs? |
| Are LLMs better than traditional methods? | At what data volume does the token cost of LLM-based schema matching exceed the development cost of training a supervised matcher? |
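To make the last contrast concrete, here is a minimal sketch of the break-even arithmetic such a question implies. Every number below is a made-up assumption for illustration, not real pricing or a measured cost.

```python
# Hypothetical break-even analysis: LLM-based vs. supervised schema matching.
# All numbers are illustrative assumptions, not real prices or measurements.

TOKENS_PER_COLUMN_PAIR = 300      # prompt + completion tokens per comparison (assumed)
COST_PER_1K_TOKENS = 0.01         # USD per 1K tokens (assumed model pricing)
SUPERVISED_DEV_COST = 2_000.00    # USD, assumed one-time cost to build a trained matcher

llm_cost_per_pair = TOKENS_PER_COLUMN_PAIR / 1000 * COST_PER_1K_TOKENS

# Volume at which cumulative LLM token spend exceeds the one-time supervised cost.
break_even_pairs = SUPERVISED_DEV_COST / llm_cost_per_pair
print(f"Break-even at ~{break_even_pairs:,.0f} column pairs")  # ~666,667 pairs
```

An answer to the question then becomes a measurement of where your real dataset sits relative to that break-even point.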

Your research question will evolve as you work. Start with a hypothesis, and refine it as you learn more about your dataset and the capabilities of your pipeline.

Example Research Questions

| Category | Example Question |
|---|---|
| Performance | Does an LLM-orchestrated ETL pipeline achieve comparable accuracy to hand-coded transformations on [dataset]? |
| Robustness | How does an LLM handle schema drift compared to rule-based validation? |
| Efficiency | What is the trade-off between LLM token cost and development time for data cleaning tasks? |
| Capability | Can LLMs perform entity resolution on [domain] data without domain-specific training? |
| Architecture | How should MCP servers be designed to maximize reusability across different data sources? |

Project Directions

Choose one of the following directions:

Direction A: LLM Pipeline over a New Data Source

Design an LLM-orchestrated data pipeline for a dataset of your choice. Your system should demonstrate how LLMs can handle real-world data engineering challenges.

Suggested Data Sources:

| Source | Examples |
|---|---|
| Smart City Portals | Gainesville, NYC, Chicago, LA open data |
| Government Data | data.gov, census.gov, EPA datasets |
| Scientific Data | NASA, NOAA, genomics databases |
| Research Paper Datasets | Benchmarks from published papers (see Dataset Selection below) |
| Web Data | APIs, web scraping (with permission) |
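As a starting point, a minimal MCP server exposing one dataset-access tool might look like the sketch below. It assumes the official `mcp` Python SDK and a local file named `data.csv`; the tool name and logic are placeholders to adapt to your chosen source.

```python
# Minimal MCP server sketch using the official Python SDK (package: mcp).
# Assumes a local file data.csv; adapt the tool to your chosen data source.
import csv

from mcp.server.fastmcp import FastMCP

server = FastMCP("dataset-server")

@server.tool()
def sample_rows(n: int = 5) -> list[dict]:
    """Return the first n rows of the dataset as dictionaries."""
    with open("data.csv", newline="") as f:
        reader = csv.DictReader(f)
        return [row for _, row in zip(range(n), reader)]

if __name__ == "__main__":
    server.run()  # serves over stdio by default
```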

Focus Areas:


Direction B: Compare LLM vs Traditional Technologies

Replicate a traditional data pipeline approach and compare it to an LLM-augmented version. This direction produces strong evaluation sections.

Technologies to Compare:

| Traditional | LLM-Augmented |
|---|---|
| Hand-coded ETL scripts | MCP-orchestrated pipeline |
| Rule-based data validation | LLM-based anomaly detection |
| Regex/heuristic extraction | LLM extraction with prompts |
| Manual schema mapping | LLM-inferred mappings |
| Static documentation | LLM-generated docs |
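For example, a head-to-head comparison of regex extraction against LLM extraction could be scaffolded as below. The `openai` client is one possible choice, and the model name, prompt, and tiny gold set are all illustrative assumptions.

```python
# Sketch: compare a regex baseline against an LLM extractor on the same input.
# The model name, prompt, and gold set are illustrative assumptions.
import re

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

def regex_extract(text: str) -> set[str]:
    """Baseline: ISO-format dates via regex."""
    return set(DATE_RE.findall(text))

def llm_extract(text: str) -> set[str]:
    """LLM extractor: ask the model for ISO dates, then parse its reply."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{"role": "user",
                   "content": f"List every ISO date (YYYY-MM-DD) in:\n{text}"}],
    )
    return set(DATE_RE.findall(resp.choices[0].message.content))

gold = {"2026-02-23"}
text = "The proposal is due 2026-02-23 at 11:59 PM."
for name, pred in [("regex", regex_extract(text)), ("llm", llm_extract(text))]:
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold)
    print(f"{name}: precision={precision:.2f} recall={recall:.2f}")
```

The same harness generalizes to any extraction task: swap in your baseline, your LLM prompt, and a real gold set, and log runtime and token usage alongside accuracy.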

Suggested Papers to Replicate:

| Paper | Topic |
|---|---|
| Magellan (Konda et al.) | Entity matching pipeline |
| HoloClean (Rekatsinas et al.) | Data cleaning with ML |
| Valentine (Koutras et al.) | Schema matching benchmark |
| Snorkel (Ratner et al.) | Weak supervision for labeling |
| Can Foundation Models Wrangle Your Data? (Narayan et al.) | LLM data wrangling |

Direction C: Novel Data Orchestration Architecture

Propose and implement a new architecture that combines MCP servers, LLM reasoning, and traditional data engineering tools.
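One way to frame such an architecture is a loop in which an LLM plans, MCP-style tools execute, and a traditional engine does the bulk work. The sketch below is deliberately schematic: the placeholder planner, the tool registry, and the use of DuckDB are assumptions, not a prescribed design.

```python
# Schematic orchestration loop: a planner decides the next step, registered
# tools execute it, and a traditional engine (DuckDB, as one assumed choice)
# does the heavy lifting. Assumes a local file raw.csv exists.
import duckdb

def plan_next_step(state: dict) -> dict:
    """Placeholder planner; a real system would call an LLM here."""
    if "profiled" not in state:
        return {"tool": "profile", "args": {"table": "raw"}}
    return {"tool": "done", "args": {}}

TOOLS = {
    "profile": lambda table: duckdb.sql(
        f"SELECT COUNT(*) AS n FROM '{table}.csv'").fetchall(),
}

state: dict = {}
while True:
    step = plan_next_step(state)
    if step["tool"] == "done":
        break
    state["profiled"] = TOOLS[step["tool"]](**step["args"])  # feed results back
```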

Architecture Ideas:


Dataset Selection

You must select a source dataset for your project. This dataset will serve as the foundation for your LLM-augmented data pipeline. Choose a dataset that aligns with your research question and provides opportunities to demonstrate data engineering challenges.

Option 1: Smart City Data Portals

Smart city data provides real-world, messy datasets with integration challenges.

| City | Portal | Example Datasets |
|---|---|---|
| Gainesville | data.cityofgainesville.org | Transit, utilities, permits, 311 requests |
| New York City | opendata.cityofnewyork.us | 311 complaints, taxi trips, housing data |
| Chicago | data.cityofchicago.org | Crime data, food inspections, traffic |
| Los Angeles | data.lacity.org | Building permits, parking, business licenses |
| San Francisco | datasf.org | Film permits, energy usage, transit |
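Most of these portals run on Socrata, whose SODA API returns JSON over plain HTTP. A minimal pull might look like the sketch below; the dataset ID is a placeholder you would look up on the portal itself.

```python
# Sketch: pull rows from a Socrata-backed open-data portal via the SODA API.
# The dataset ID below is a placeholder; find real IDs on each portal.
import requests

BASE = "https://data.cityofchicago.org/resource"
DATASET_ID = "xxxx-xxxx"  # placeholder Socrata dataset ID

resp = requests.get(f"{BASE}/{DATASET_ID}.json", params={"$limit": 100})
resp.raise_for_status()
rows = resp.json()  # list of dicts, one per record
print(f"Fetched {len(rows)} rows; columns: {sorted(rows[0]) if rows else []}")
```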

Option 2: Hugging Face Datasets

Hugging Face hosts curated datasets suitable for ML and data engineering projects.

| Category | Example Datasets |
|---|---|
| Tabular | scikit-learn, openml |
| Text & NLP | wikipedia, common_voice |
| Multi-modal | imagenet, coco |
| Code | the-stack, github-code |
| Scientific | pubmed, arxiv |

Browse all datasets: huggingface.co/datasets
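Loading any of these is a single call with the `datasets` library. The sketch below streams records so nothing large lands on disk; the dataset and configuration names are illustrative, and some datasets (like wikipedia) require a specific config, so check the dataset card.

```python
# Sketch: stream records from a Hugging Face dataset without a full download.
# Dataset/config names are illustrative; check the dataset card for exact ones.
from datasets import load_dataset

ds = load_dataset("wikipedia", "20220301.en", split="train", streaming=True)
for i, record in enumerate(ds):
    print(record["title"])  # field names vary per dataset
    if i == 4:
        break
```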

Option 3: Datasets from Research Papers

Research papers often release benchmark datasets designed for specific data engineering tasks. Using these datasets allows you to compare your results directly against published baselines.

| Paper | Dataset | Task |
|---|---|---|
| Magellan (Konda et al.) | Multiple ER benchmarks | Entity matching |
| Valentine (Koutras et al.) | Schema matching benchmark | Schema matching |
| HoloClean (Rekatsinas et al.) | Hospital, Flights | Data cleaning |
| DITTO (Li et al.) | Structured ER datasets | Entity matching with pre-training |
| Rotom (Miao et al.) | Data augmentation benchmarks | Low-resource matching |

To find datasets from papers:

Option 4: Other Data Sources

| Source | Examples |
|---|---|
| Government | data.gov, census.gov, EPA |
| Scientific | NASA, NOAA, genomics databases |
| Web APIs | REST APIs with public access (document permissions) |

Dataset Requirements

Your chosen dataset must:

  1. Be publicly accessible or have documented permission for academic use
  2. Present data engineering challenges (e.g., schema heterogeneity, missing values, entity resolution needs)
  3. Support your research question with sufficient data volume and variety
  4. Include or allow creation of ground truth for evaluation

In your proposal, specify:


Evaluation Requirements

Evaluation is critical. Your project grade depends heavily on how well you evaluate your pipeline. You must:

1. Define Metrics

Choose appropriate metrics for your task:

| Task | Metrics |
|---|---|
| Data extraction | Precision, recall, F1 |
| Data cleaning | Error detection rate, false positive rate |
| Entity resolution | Precision, recall, F1 at pair/cluster level |
| Schema matching | Accuracy, MRR (mean reciprocal rank) |
| End-to-end pipeline | Throughput, latency, cost |
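For set-valued predictions (extracted values, matched entity pairs), precision, recall, and F1 reduce to a few lines. The sketch below computes them directly; the prediction and gold sets are made-up examples.

```python
# Precision/recall/F1 for set-valued predictions (e.g., matched entity pairs).
# The pred/gold sets here are made-up examples.
def prf1(pred: set, gold: set) -> tuple[float, float, float]:
    tp = len(pred & gold)                              # true positives
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

pred = {("a1", "b1"), ("a2", "b3"), ("a4", "b4")}
gold = {("a1", "b1"), ("a2", "b2"), ("a4", "b4")}
print(prf1(pred, gold))  # (0.666..., 0.666..., 0.666...)
```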

2. Establish Baselines

Compare your LLM approach to at least one baseline:

3. Report Results

Your evaluation section should include:


Proposal Requirements (Due Feb 23)

Submit an initial design proposal (2-3 pages) that includes:

1. Research Question

2. Project Direction

3. Initial Design

4. Evaluation Plan

5. Timeline


Repository Setup

Create a private GitHub repository:

Repository name: cis6930sp26-project

Add cegme as an Admin collaborator.

Initial structure:

cis6930sp26-project/
├── README.md
├── proposal/
│   └── proposal.md (or .pdf)
├── src/
│   └── (your code)
├── tests/
├── data/
│   └── .gitkeep
├── docs/
├── COLLABORATORS.md
└── pyproject.toml

Timeline

| Milestone | Date | Points | Deliverable |
|---|---|---|---|
| Project Proposal | Mon, Feb 23 | 100 | Initial design with research question |
| Design Review | Mon, Mar 2 | 100 | Detailed architecture, evaluation plan |
| Code Checkpoint | Mon, Mar 23 | 150 | Working prototype, initial results |
| Draft Paper | Mon, Mar 30 | 100 | Complete draft for feedback |
| Final Paper | Mon, Apr 13 | 400 | Polished paper with full evaluation |
| Presentation | Week of Apr 20 | 150 | 10-minute presentation + Q&A |
| Total | | 1000 | |

Grading Rubric

Final Paper (400 points)

| Section | Points | Criteria |
|---|---|---|
| Introduction & Motivation | 50 | Clear problem statement, why it matters |
| Related Work | 50 | Relevant papers cited and discussed |
| Methodology | 100 | Clear description of pipeline, reproducible |
| Evaluation | 150 | Rigorous metrics, baselines, analysis |
| Conclusion | 50 | Summary, limitations, future work |

Implementation (250 points via checkpoints)

| Criterion | Points |
|---|---|
| Working MCP servers | 100 |
| LLM orchestration | 100 |
| Tests and documentation | 50 |

Presentation (150 points)

| Criterion | Points |
|---|---|
| Clear communication | 50 |
| Demo of working system | 50 |
| Handling questions | 50 |

Tips for Success

  1. Start with evaluation: Define how you will measure success before building
  2. Build incrementally: Get a simple pipeline working first, then add complexity
  3. Document everything: Keep a log of LLM decisions and failures
  4. Test continuously: Write tests as you build, not at the end
  5. Manage scope: A focused project with strong evaluation beats an ambitious project with weak results
  6. Ask early: Come to office hours if you’re stuck on research direction

Finding Papers

Search Strategies

  1. Semantic Scholar (semanticscholar.org) - Filter by venue (VLDB, SIGMOD, ICDE)
  2. Papers With Code (paperswithcode.com) - Find papers with available code
  3. ACL Anthology (aclanthology.org) - NLP and text processing papers
  4. arXiv (arxiv.org) - Latest preprints in cs.DB, cs.CL, cs.LG

What Makes a Good Paper to Build On


Resources

Data Sources

Tools


Questions?

Post questions on the course discussion board or attend office hours.

