Data Engineering at the University of Florida
Proposal Due: Monday, February 23, 2026 at 11:59 PM
Points: 100 (proposal) + 900 (remaining milestones) = 1000 total
Repository: cis6930sp26-project (private, add cegme as Admin)
Type: Individual project (no teams)
For your course project, you will design, implement, and evaluate an LLM-augmented data pipeline. Your project must include a clearly stated research question, a chosen dataset, working MCP servers with LLM orchestration, and a rigorous evaluation.
This is an individual project. You will work independently throughout the semester.
Your project should answer a research question that involves data engineering and LLMs. Good research questions are interesting and measurable.
A well-crafted research question is the foundation of a successful project. Your research question determines what you build, how you evaluate it, and what conclusions you can draw. Projects without clear research questions often result in unfocused implementations and weak evaluations.
Interesting questions motivate your work and engage your readers. Ask yourself: Would someone working in data engineering care about the answer? Does the question reveal something non-obvious about how LLMs interact with data pipelines? The best questions challenge assumptions or explore uncharted territory.
Measurable questions allow you to draw concrete conclusions. Avoid vague questions like “Are LLMs good at data cleaning?” Instead, specify what “good” means: accuracy, speed, cost, robustness to edge cases. Your question should naturally lead to experiments with quantifiable outcomes.
Consider these contrasts:
| Weak Question | Stronger Question |
|---|---|
| Can LLMs help with data quality? | Does GPT-4 detect more data anomalies than rule-based validation on sensor data with 5% injected errors? |
| Is MCP useful for ETL? | How does development time compare between hand-coded and MCP-orchestrated ETL for integrating 3 heterogeneous APIs? |
| Are LLMs better than traditional methods? | At what data volume does the token cost of LLM-based schema matching exceed the development cost of training a supervised matcher? |
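A question like the anomaly-detection example above implies a concrete experiment: corrupt a known fraction of the data, then measure what each approach detects against the known ground truth. A minimal sketch of the error-injection step (the reading values, error types, and fixed seed are illustrative choices, not requirements):

```python
import random

def inject_errors(readings, error_rate=0.05, seed=42):
    """Corrupt a fraction of sensor readings so that detection
    accuracy can be scored against known ground truth."""
    rng = random.Random(seed)
    corrupted = list(readings)
    n_errors = max(1, int(len(readings) * error_rate))
    error_idx = rng.sample(range(len(readings)), n_errors)
    for i in error_idx:
        # multiply by an implausible factor: spike, sign flip, or zero-out
        corrupted[i] = corrupted[i] * rng.choice([10, -1, 0])
    return corrupted, set(error_idx)

readings = [20.1, 19.8, 21.0, 20.5] * 25   # 100 clean temperature readings
corrupted, truth = inject_errors(readings)
print(len(truth))                           # 5 injected errors at a 5% rate
```

With the corrupted indices recorded in `truth`, detection rate and false positives follow directly from comparing each detector's flagged rows against that set.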
Your research question will evolve as you work. Start with a hypothesis, and refine it as you learn more about your dataset and the capabilities of your pipeline. The following table lists example questions by category:
| Category | Example Question |
|---|---|
| Performance | Does an LLM-orchestrated ETL pipeline achieve comparable accuracy to hand-coded transformations on [dataset]? |
| Robustness | How does an LLM handle schema drift compared to rule-based validation? |
| Efficiency | What is the trade-off between LLM token cost and development time for data cleaning tasks? |
| Capability | Can LLMs perform entity resolution on [domain] data without domain-specific training? |
| Architecture | How should MCP servers be designed to maximize reusability across different data sources? |
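As a concrete anchor for the capability row, here is the kind of non-LLM entity-resolution baseline such a question would be measured against: an all-pairs fuzzy string matcher built on Python's standard library. The company names and the 0.8 threshold are illustrative, not prescribed:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Normalized string similarity in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_entities(left, right, threshold=0.8):
    """Naive all-pairs fuzzy matcher: the traditional baseline an
    LLM-based resolver would be compared against."""
    return [(a, b) for a in left for b in right
            if similarity(a, b) >= threshold]

pairs = match_entities(["Acme Corp.", "Globex LLC"],
                       ["ACME Corp", "Initech Inc"])
print(pairs)   # [('Acme Corp.', 'ACME Corp')]
```

Scoring the LLM's matched pairs and this baseline's matched pairs against the same gold standard gives the precision/recall comparison the research question asks for.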
Choose one of the following directions:
Direction 1: LLM-Orchestrated Pipeline. Design an LLM-orchestrated data pipeline for a dataset of your choice. Your system should demonstrate how LLMs can handle real-world data engineering challenges.
Suggested Data Sources:
| Source | Examples |
|---|---|
| Smart City Portals | Gainesville, NYC, Chicago, LA open data |
| Government Data | data.gov, census.gov, EPA datasets |
| Scientific Data | NASA, NOAA, genomics databases |
| Research Paper Datasets | Benchmarks from published papers (see Dataset Selection below) |
| Web Data | APIs, web scraping (with permission) |
Focus Areas:
Direction 2: Replication and Comparison. Replicate a traditional data pipeline approach and compare it to an LLM-augmented version. This direction produces strong evaluation sections.
Technologies to Compare:
| Traditional | LLM-Augmented |
|---|---|
| Hand-coded ETL scripts | MCP-orchestrated pipeline |
| Rule-based data validation | LLM-based anomaly detection |
| Regex/heuristic extraction | LLM extraction with prompts |
| Manual schema mapping | LLM-inferred mappings |
| Static documentation | LLM-generated docs |
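To make the validation row concrete, here is a hedged sketch of both columns: a hand-written rule checker, and the prompt you might send to an LLM for the same record. The field names (`temp_c`, `timestamp`), the physical range, and the prompt wording are placeholders for your own schema:

```python
def rule_based_validate(row):
    """Traditional approach: hand-written checks, one per known failure mode."""
    issues = []
    if not (-40 <= row["temp_c"] <= 60):
        issues.append("temp_c out of physical range")
    if row["timestamp"] is None:
        issues.append("missing timestamp")
    return issues

def llm_validation_prompt(row):
    """LLM-augmented counterpart: describe the record and ask the model to
    flag anomalies. Send this string to whichever chat API you are using."""
    return (
        "You are validating sensor records. Flag any anomalous fields "
        f"and explain why, or reply 'OK'.\nRecord: {row}"
    )

row = {"temp_c": 999.0, "timestamp": "2026-02-01T12:00:00"}
print(rule_based_validate(row))   # ['temp_c out of physical range']
```

The contrast is the point: the rule checker only catches failure modes someone anticipated, while the LLM side trades that precision for open-ended coverage (and token cost), which is exactly what your evaluation should quantify.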
Suggested Papers to Replicate:
| Paper | Topic |
|---|---|
| Magellan (Konda et al.) | Entity matching pipeline |
| HoloClean (Rekatsinas et al.) | Data cleaning with ML |
| Valentine (Koutras et al.) | Schema matching benchmark |
| Snorkel (Ratner et al.) | Weak supervision for labeling |
| Can Foundation Models Wrangle Your Data? (Narayan et al.) | LLM data wrangling |
Direction 3: Novel Architecture. Propose and implement a new architecture that combines MCP servers, LLM reasoning, and traditional data engineering tools.
Architecture Ideas:
You must select a source dataset for your project. This dataset will serve as the foundation for your LLM-augmented data pipeline. Choose a dataset that aligns with your research question and provides opportunities to demonstrate data engineering challenges.
Smart city data provides real-world, messy datasets with integration challenges.
| City | Portal | Example Datasets |
|---|---|---|
| Gainesville | data.cityofgainesville.org | Transit, utilities, permits, 311 requests |
| New York City | opendata.cityofnewyork.us | 311 complaints, taxi trips, housing data |
| Chicago | data.cityofchicago.org | Crime data, food inspections, traffic |
| Los Angeles | data.lacity.org | Building permits, parking, business licenses |
| San Francisco | datasf.org | Film permits, energy usage, transit |
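Most of these portals (NYC, Chicago, LA, and others built on Socrata) expose the SODA API, where each dataset is addressed by a short ID at `/resource/<id>.json` and filtered with `$limit` / `$where` query parameters. A small helper for building such query URLs; the dataset ID below is a made-up placeholder to replace with a real one from the portal:

```python
from urllib.parse import urlencode

def soda_query_url(domain, dataset_id, limit=1000, where=None):
    """Build a Socrata SODA API query URL for an open-data portal.
    Dataset IDs are the short codes shown on each dataset's page."""
    params = {"$limit": limit}
    if where:
        params["$where"] = where
    return f"https://{domain}/resource/{dataset_id}.json?{urlencode(params)}"

# 'abcd-1234' is a placeholder -- look up a real dataset ID on the portal.
url = soda_query_url("data.cityofchicago.org", "abcd-1234",
                     limit=500, where="date > '2026-01-01'")
print(url)
```

Fetching the resulting URL (e.g. with `requests.get`) returns JSON rows; paging with `$limit` and `$offset` is how you pull a full dataset.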
Hugging Face hosts curated datasets suitable for ML and data engineering projects.
| Category | Example Datasets |
|---|---|
| Tabular | scikit-learn, openml |
| Text & NLP | wikipedia, common_voice |
| Multi-modal | imagenet, coco |
| Code | the-stack, github-code |
| Scientific | pubmed, arxiv |
Browse all datasets: huggingface.co/datasets
Research papers often release benchmark datasets designed for specific data engineering tasks. Using these datasets allows you to compare your results directly against published baselines.
| Paper | Dataset | Task |
|---|---|---|
| Magellan (Konda et al.) | Multiple ER benchmarks | Entity matching |
| Valentine (Koutras et al.) | Schema matching benchmark | Schema matching |
| HoloClean (Rekatsinas et al.) | Hospital, Flights | Data cleaning |
| DITTO (Li et al.) | Structured ER datasets | Entity matching with pre-training |
| Rotom (Miao et al.) | Data augmentation benchmarks | Low-resource matching |
Datasets released with papers are typically linked from the paper's project page or code repository; aggregators such as Papers with Code also index many benchmarks. Other public data sources:
| Source | Examples |
|---|---|
| Government | data.gov, census.gov, EPA |
| Scientific | NASA, NOAA, genomics databases |
| Web APIs | REST APIs with public access (document permissions) |
Your chosen dataset must align with your research question and present genuine data engineering challenges (messy values, heterogeneous schemas, meaningful scale). In your proposal, specify which dataset you chose, where it comes from, its size and format, and the challenges you plan to address.
Evaluation is critical: your project grade depends heavily on how well you evaluate your pipeline. You must choose task-appropriate metrics, compare against at least one baseline, and analyze your results.
Choose appropriate metrics for your task:
| Task | Metrics |
|---|---|
| Data extraction | Precision, recall, F1 |
| Data cleaning | Error detection rate, false positive rate |
| Entity resolution | Precision, recall, F1 at pair/cluster level |
| Schema matching | Accuracy, MRR (mean reciprocal rank) |
| End-to-end pipeline | Throughput, latency, cost |
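Several of these metrics are only a few lines of Python, and computing them yourself keeps the evaluation transparent. A sketch of pair-level precision/recall/F1 and MRR; the set and dict shapes used here are one reasonable choice, not a requirement:

```python
def precision_recall_f1(predicted, actual):
    """Pair-level metrics for extraction / entity-resolution tasks:
    `predicted` and `actual` are sets of items (e.g. matched pairs)."""
    tp = len(predicted & actual)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def mean_reciprocal_rank(ranked_lists, correct):
    """MRR for schema matching: for each source column, score 1/rank of
    its true match in the ranked candidate list (0 if absent)."""
    total = 0.0
    for col, candidates in ranked_lists.items():
        if correct[col] in candidates:
            total += 1.0 / (candidates.index(correct[col]) + 1)
    return total / len(ranked_lists)

p, r, f = precision_recall_f1({("a", "x"), ("b", "y")},
                              {("a", "x"), ("c", "z")})
print(p, r, f)   # 0.5 0.5 0.5 -- one true positive out of two on each side
```

Report these per configuration (LLM vs. baseline) so readers can see exactly where each approach wins or loses.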
Compare your LLM approach to at least one non-LLM baseline, such as one of the traditional techniques listed under Technologies to Compare (hand-coded ETL, rule-based validation, regex extraction).
Your evaluation section should report your metrics, the baseline comparison, and an analysis of where and why the two approaches differ.
Submit an initial design proposal (2-3 pages) that includes your research question, your chosen direction and dataset, a sketch of the pipeline architecture, and an evaluation plan.
Create a private GitHub repository:
Repository name: cis6930sp26-project
Add cegme as an Admin collaborator.
Initial structure:
cis6930sp26-project/
├── README.md
├── proposal/
│ └── proposal.md (or .pdf)
├── src/
│ └── (your code)
├── tests/
├── data/
│ └── .gitkeep
├── docs/
├── COLLABORATORS.md
└── pyproject.toml
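A minimal `pyproject.toml` sketch matching the structure above; the version, Python requirement, and dependency list are placeholders for whatever your pipeline actually uses:

```toml
[project]
name = "cis6930sp26-project"
version = "0.1.0"
requires-python = ">=3.11"
# placeholder -- list your pipeline's real dependencies here
dependencies = ["requests"]

[tool.pytest.ini_options]
testpaths = ["tests"]
```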
| Milestone | Date | Points | Deliverable |
|---|---|---|---|
| Project Proposal | Mon, Feb 23 | 100 | Initial design with research question |
| Design Review | Mon, Mar 2 | 100 | Detailed architecture, evaluation plan |
| Code Checkpoint | Mon, Mar 23 | 150 | Working prototype, initial results |
| Draft Paper | Mon, Mar 30 | 100 | Complete draft for feedback |
| Final Paper | Mon, Apr 13 | 400 | Polished paper with full evaluation |
| Presentation | Week of Apr 20 | 150 | 10-minute presentation + Q&A |
| Total | | 1000 | |
Final paper grading (400 points):
| Section | Points | Criteria |
|---|---|---|
| Introduction & Motivation | 50 | Clear problem statement, why it matters |
| Related Work | 50 | Relevant papers cited and discussed |
| Methodology | 100 | Clear description of pipeline, reproducible |
| Evaluation | 150 | Rigorous metrics, baselines, analysis |
| Conclusion | 50 | Summary, limitations, future work |
Implementation grading (250 points):
| Criterion | Points |
|---|---|
| Working MCP servers | 100 |
| LLM orchestration | 100 |
| Tests and documentation | 50 |
Presentation grading (150 points):
| Criterion | Points |
|---|---|
| Clear communication | 50 |
| Demo of working system | 50 |
| Handling questions | 50 |
Post questions on the course discussion board or attend office hours.