CIS 6930 Spring 26

Data Engineering at the University of Florida

Course Project: LLM-Augmented Data Pipelines

Proposal Due: Monday, February 23, 2026 at 11:59 PM
Points: 100 (proposal) + 900 (remaining milestones) = 1000 total
Repository: cis6930sp26-project (private, add cegme as Admin)
Type: Individual project (no teams)


Overview

For your course project, you will design, implement, and evaluate an LLM-augmented data pipeline. Your project must include:

  1. A research question that explores the intersection of data engineering and LLMs
  2. An implementation using MCP servers and LLM orchestration
  3. A rigorous evaluation comparing your approach to baselines

This is an individual project. You will work independently throughout the semester.


Research Questions

Your project should answer a research question that involves data engineering and LLMs. Good research questions are interesting and measurable, as described below.

Why Research Questions Matter

A well-crafted research question is the foundation of a successful project. Your research question determines what you build, how you evaluate it, and what conclusions you can draw. Projects without clear research questions often result in unfocused implementations and weak evaluations.

Interesting questions motivate your work and engage your readers. Ask yourself: Would someone working in data engineering care about the answer? Does the question reveal something non-obvious about how LLMs interact with data pipelines? The best questions challenge assumptions or explore uncharted territory.

Measurable questions allow you to draw concrete conclusions. Avoid vague questions like “Are LLMs good at data cleaning?” Instead, specify what “good” means: accuracy, speed, cost, robustness to edge cases. Your question should naturally lead to experiments with quantifiable outcomes.

Consider these contrasts:

| Weak Question | Stronger Question |
|---|---|
| Can LLMs help with data quality? | Does GPT-4 detect more data anomalies than rule-based validation on sensor data with 5% injected errors? |
| Is MCP useful for ETL? | How does development time compare between hand-coded and MCP-orchestrated ETL for integrating 3 heterogeneous APIs? |
| Are LLMs better than traditional methods? | At what data volume does the token cost of LLM-based schema matching exceed the development cost of training a supervised matcher? |
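To make the last contrast concrete, here is a minimal sketch of the break-even arithmetic such a question implies. Every number below is a made-up assumption for illustration, not real pricing or a measured cost.

```python
# Hypothetical break-even analysis: LLM-based vs. supervised schema matching.
# All numbers are illustrative assumptions, not real prices or measurements.

TOKENS_PER_COLUMN_PAIR = 300      # prompt + completion tokens per comparison (assumed)
COST_PER_1K_TOKENS = 0.01         # USD per 1K tokens (assumed model pricing)
SUPERVISED_DEV_COST = 2_000.00    # USD, assumed one-time cost to build a trained matcher

llm_cost_per_pair = TOKENS_PER_COLUMN_PAIR / 1000 * COST_PER_1K_TOKENS

# Volume at which cumulative LLM token spend exceeds the one-time supervised cost.
break_even_pairs = SUPERVISED_DEV_COST / llm_cost_per_pair
print(f"Break-even at ~{break_even_pairs:,.0f} column pairs")  # ~666,667 pairs
```

An answer to the question then becomes a measurement of where your real dataset sits relative to that break-even point.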

Your research question will evolve as you work. Start with a hypothesis, and refine it as you learn more about your dataset and the capabilities of your pipeline.

Example Research Questions

| Category | Example Question |
|---|---|
| Performance | Does an LLM-orchestrated ETL pipeline achieve comparable accuracy to hand-coded transformations on [dataset]? |
| Robustness | How does an LLM handle schema drift compared to rule-based validation? |
| Efficiency | What is the trade-off between LLM token cost and development time for data cleaning tasks? |
| Capability | Can LLMs perform entity resolution on [domain] data without domain-specific training? |
| Architecture | How should MCP servers be designed to maximize reusability across different data sources? |

Project Directions

Choose one of the following directions:

Direction A: LLM Pipeline over a New Data Source

Design an LLM-orchestrated data pipeline for a dataset of your choice. Your system should demonstrate how LLMs can handle real-world data engineering challenges.

Suggested Data Sources:

| Source | Examples |
|---|---|
| Smart City Portals | Gainesville, NYC, Chicago, LA open data |
| Government Data | data.gov, census.gov, EPA datasets |
| Scientific Data | NASA, NOAA, genomics databases |
| Research Paper Datasets | Benchmarks from published papers (see Dataset Selection below) |
| Web Data | APIs, web scraping (with permission) |
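As a starting point, a minimal MCP server exposing one dataset-access tool might look like the sketch below. It assumes the official `mcp` Python SDK and a local file named `data.csv`; the tool name and logic are placeholders to adapt to your chosen source.

```python
# Minimal MCP server sketch using the official Python SDK (package: mcp).
# Assumes a local file data.csv; adapt the tool to your chosen data source.
import csv

from mcp.server.fastmcp import FastMCP

server = FastMCP("dataset-server")

@server.tool()
def sample_rows(n: int = 5) -> list[dict]:
    """Return the first n rows of the dataset as dictionaries."""
    with open("data.csv", newline="") as f:
        reader = csv.DictReader(f)
        return [row for _, row in zip(range(n), reader)]

if __name__ == "__main__":
    server.run()  # serves over stdio by default
```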

Focus Areas:


Direction B: Compare LLM vs Traditional Technologies

Replicate a traditional data pipeline approach and compare it to an LLM-augmented version. This direction produces strong evaluation sections.

Technologies to Compare:

| Traditional | LLM-Augmented |
|---|---|
| Hand-coded ETL scripts | MCP-orchestrated pipeline |
| Rule-based data validation | LLM-based anomaly detection |
| Regex/heuristic extraction | LLM extraction with prompts |
| Manual schema mapping | LLM-inferred mappings |
| Static documentation | LLM-generated docs |
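For example, a head-to-head comparison of regex extraction against LLM extraction could be scaffolded as below. The `openai` client is one possible choice, and the model name, prompt, and tiny gold set are all illustrative assumptions.

```python
# Sketch: compare a regex baseline against an LLM extractor on the same input.
# The model name, prompt, and gold set are illustrative assumptions.
import re

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

def regex_extract(text: str) -> set[str]:
    """Baseline: ISO-format dates via regex."""
    return set(DATE_RE.findall(text))

def llm_extract(text: str) -> set[str]:
    """LLM extractor: ask the model for ISO dates, then parse its reply."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{"role": "user",
                   "content": f"List every ISO date (YYYY-MM-DD) in:\n{text}"}],
    )
    return set(DATE_RE.findall(resp.choices[0].message.content))

gold = {"2026-02-23"}
text = "The proposal is due 2026-02-23 at 11:59 PM."
for name, pred in [("regex", regex_extract(text)), ("llm", llm_extract(text))]:
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold)
    print(f"{name}: precision={precision:.2f} recall={recall:.2f}")
```

The same harness generalizes to any extraction task: swap in your baseline, your LLM prompt, and a real gold set, and log runtime and token usage alongside accuracy.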

Suggested Papers to Replicate:

| Paper | Topic |
|---|---|
| Magellan (Konda et al.) | Entity matching pipeline |
| HoloClean (Rekatsinas et al.) | Data cleaning with ML |
| Valentine (Koutras et al.) | Schema matching benchmark |
| Snorkel (Ratner et al.) | Weak supervision for labeling |
| Can Foundation Models Wrangle Your Data? (Narayan et al.) | LLM data wrangling |

Direction C: Novel Data Orchestration Architecture

Propose and implement a new architecture that combines MCP servers, LLM reasoning, and traditional data engineering tools.
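One way to frame such an architecture is a loop in which an LLM plans, MCP-style tools execute, and a traditional engine does the bulk work. The sketch below is deliberately schematic: the placeholder planner, the tool registry, and the use of DuckDB are assumptions, not a prescribed design.

```python
# Schematic orchestration loop: a planner decides the next step, registered
# tools execute it, and a traditional engine (DuckDB, as one assumed choice)
# does the heavy lifting. Assumes a local file raw.csv exists.
import duckdb

def plan_next_step(state: dict) -> dict:
    """Placeholder planner; a real system would call an LLM here."""
    if "profiled" not in state:
        return {"tool": "profile", "args": {"table": "raw"}}
    return {"tool": "done", "args": {}}

TOOLS = {
    "profile": lambda table: duckdb.sql(
        f"SELECT COUNT(*) AS n FROM '{table}.csv'").fetchall(),
}

state: dict = {}
while True:
    step = plan_next_step(state)
    if step["tool"] == "done":
        break
    state["profiled"] = TOOLS[step["tool"]](**step["args"])  # feed results back
```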

Architecture Ideas:


Dataset Selection

You must select a source dataset for your project. This dataset will serve as the foundation for your LLM-augmented data pipeline. Choose a dataset that aligns with your research question and provides opportunities to demonstrate data engineering challenges.

Option 1: Smart City Data Portals

Smart city data provides real-world, messy datasets with integration challenges.

| City | Portal | Example Datasets |
|---|---|---|
| Gainesville | data.cityofgainesville.org | Transit, utilities, permits, 311 requests |
| New York City | opendata.cityofnewyork.us | 311 complaints, taxi trips, housing data |
| Chicago | data.cityofchicago.org | Crime data, food inspections, traffic |
| Los Angeles | data.lacity.org | Building permits, parking, business licenses |
| San Francisco | datasf.org | Film permits, energy usage, transit |
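Most of these portals run on Socrata, whose SODA API returns JSON over plain HTTP. A minimal pull might look like the sketch below; the dataset ID is a placeholder you would look up on the portal itself.

```python
# Sketch: pull rows from a Socrata-backed open-data portal via the SODA API.
# The dataset ID below is a placeholder; find real IDs on each portal.
import requests

BASE = "https://data.cityofchicago.org/resource"
DATASET_ID = "xxxx-xxxx"  # placeholder Socrata dataset ID

resp = requests.get(f"{BASE}/{DATASET_ID}.json", params={"$limit": 100})
resp.raise_for_status()
rows = resp.json()  # list of dicts, one per record
print(f"Fetched {len(rows)} rows; columns: {sorted(rows[0]) if rows else []}")
```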

Option 2: Hugging Face Datasets

Hugging Face hosts curated datasets suitable for ML and data engineering projects.

| Category | Example Datasets |
|---|---|
| Tabular | scikit-learn, openml |
| Text & NLP | wikipedia, common_voice |
| Multi-modal | imagenet, coco |
| Code | the-stack, github-code |
| Scientific | pubmed, arxiv |

Browse all datasets: huggingface.co/datasets
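Loading any of these is a single call with the `datasets` library. The sketch below streams records so nothing large lands on disk; the dataset and configuration names are illustrative, and some datasets (like wikipedia) require a specific config, so check the dataset card.

```python
# Sketch: stream records from a Hugging Face dataset without a full download.
# Dataset/config names are illustrative; check the dataset card for exact ones.
from datasets import load_dataset

ds = load_dataset("wikipedia", "20220301.en", split="train", streaming=True)
for i, record in enumerate(ds):
    print(record["title"])  # field names vary per dataset
    if i == 4:
        break
```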

Option 3: Datasets from Research Papers

Research papers often release benchmark datasets designed for specific data engineering tasks. Using these datasets allows you to compare your results directly against published baselines.

| Paper | Dataset | Task |
|---|---|---|
| Magellan (Konda et al.) | Multiple ER benchmarks | Entity matching |
| Valentine (Koutras et al.) | Schema matching benchmark | Schema matching |
| HoloClean (Rekatsinas et al.) | Hospital, Flights | Data cleaning |
| DITTO (Li et al.) | Structured ER datasets | Entity matching with pre-training |
| Rotom (Miao et al.) | Data augmentation benchmarks | Low-resource matching |

To find datasets from papers:

Option 4: Other Data Sources

| Source | Examples |
|---|---|
| Government | data.gov, census.gov, EPA |
| Scientific | NASA, NOAA, genomics databases |
| Web APIs | REST APIs with public access (document permissions) |

Dataset Requirements

Your chosen dataset must:

  1. Be publicly accessible or have documented permission for academic use
  2. Present data engineering challenges (e.g., schema heterogeneity, missing values, entity resolution needs)
  3. Support your research question with sufficient data volume and variety
  4. Include or allow creation of ground truth for evaluation

In your proposal, specify:


Evaluation Requirements

Evaluation is critical. Your project grade depends heavily on how well you evaluate your pipeline. You must:

1. Define Metrics

Choose appropriate metrics for your task:

| Task | Metrics |
|---|---|
| Data extraction | Precision, recall, F1 |
| Data cleaning | Error detection rate, false positive rate |
| Entity resolution | Precision, recall, F1 at pair/cluster level |
| Schema matching | Accuracy, MRR (mean reciprocal rank) |
| End-to-end pipeline | Throughput, latency, cost |
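For set-valued predictions (extracted values, matched entity pairs), precision, recall, and F1 reduce to a few lines. The sketch below computes them directly; the prediction and gold sets are made-up examples.

```python
# Precision/recall/F1 for set-valued predictions (e.g., matched entity pairs).
# The pred/gold sets here are made-up examples.
def prf1(pred: set, gold: set) -> tuple[float, float, float]:
    tp = len(pred & gold)                              # true positives
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

pred = {("a1", "b1"), ("a2", "b3"), ("a4", "b4")}
gold = {("a1", "b1"), ("a2", "b2"), ("a4", "b4")}
print(prf1(pred, gold))  # (0.666..., 0.666..., 0.666...)
```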

2. Establish Baselines

Compare your LLM approach to at least one baseline:

3. Report Results

Your evaluation section should include:


Proposal Requirements (Due Feb 23)

Submit an initial design proposal (2-3 pages) that includes:

1. Research Question

2. Project Direction

3. Initial Design

4. Evaluation Plan

5. Timeline


Repository Setup

Create a private GitHub repository:

Repository name: cis6930sp26-project

Add cegme as an Admin collaborator.

Initial structure:

cis6930sp26-project/
├── README.md
├── proposal/
│   └── proposal.md (or .pdf)
├── src/
│   └── (your code)
├── tests/
├── data/
│   └── .gitkeep
├── docs/
├── COLLABORATORS.md
└── pyproject.toml

Timeline

| Milestone | Date | Points | Deliverable |
|---|---|---|---|
| Project Proposal | Mon, Feb 23 | 100 | Initial design with research question |
| Design Review | Mon, Mar 2 | 100 | Detailed architecture, evaluation plan |
| Code Checkpoint | Mon, Mar 23 | 150 | Working prototype, initial results |
| Draft Paper | Mon, Mar 30 | 100 | Complete draft for feedback |
| Final Paper | Mon, Apr 13 | 400 | Polished paper with full evaluation |
| Presentation | Week of Apr 20 | 150 | 10-minute presentation + Q&A |
| Total | | 1000 | |

Grading Rubric

Final Paper (400 points)

| Section | Points | Criteria |
|---|---|---|
| Introduction & Motivation | 50 | Clear problem statement, why it matters |
| Related Work | 50 | Relevant papers cited and discussed |
| Methodology | 100 | Clear description of pipeline, reproducible |
| Evaluation | 150 | Rigorous metrics, baselines, analysis |
| Conclusion | 50 | Summary, limitations, future work |

Implementation (250 points via checkpoints)

| Criterion | Points |
|---|---|
| Working MCP servers | 100 |
| LLM orchestration | 100 |
| Tests and documentation | 50 |

Presentation (150 points)

| Criterion | Points |
|---|---|
| Clear communication | 50 |
| Demo of working system | 50 |
| Handling questions | 50 |

Tips for Success

  1. Start with evaluation: Define how you will measure success before building
  2. Build incrementally: Get a simple pipeline working first, then add complexity
  3. Document everything: Keep a log of LLM decisions and failures
  4. Test continuously: Write tests as you build, not at the end
  5. Manage scope: A focused project with strong evaluation beats an ambitious project with weak results
  6. Ask early: Come to office hours if you’re stuck on research direction

Finding Papers

Search Strategies

  1. Semantic Scholar (semanticscholar.org) - Filter by venue (VLDB, SIGMOD, ICDE)
  2. Papers With Code (paperswithcode.com) - Find papers with available code
  3. ACL Anthology (aclanthology.org) - NLP and text processing papers
  4. arXiv (arxiv.org) - Latest preprints in cs.DB, cs.CL, cs.LG

What Makes a Good Paper to Build On


Resources

Data Sources

Tools


Questions?

Post questions on the course discussion board or attend office hours.

