Data Engineering at the University of Florida
Due: Monday, February 23, 2026 at 11:59 PM
Points: 100
Submission: Push to the cis6930sp26-project repository in the proposal/ directory
The proposal establishes the foundation for your project. You will define a research question, select a direction, and outline your approach. A strong proposal demonstrates that you understand the problem space and have a feasible plan.
Submit a proposal document (2-3 pages) that includes:
```
cis6930sp26-project/
├── proposal/
│   └── proposal.md (or proposal.pdf)
└── ...
```
| Criterion | Weight | Description |
|---|---|---|
| Problem Statement | 20% | Is the research problem clearly defined and well-motivated? |
| Related Work | 20% | Does the proposal demonstrate knowledge of prior work? |
| Proposed Approach | 25% | Is the methodology feasible and technically sound? |
| Evaluation Plan | 20% | Are the proposed metrics and baselines appropriate? |
| Writing Quality | 15% | Is the proposal well-written and organized? |
| Score | Meaning |
|---|---|
| 5 | Excellent - Ready to proceed |
| 4 | Good - Strong with minor improvements needed |
| 3 | Satisfactory - Acceptable but needs refinement |
| 2 | Needs Work - Significant gaps or issues |
| 1 | Incomplete - Major revision required |
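To make the grading concrete, here is a small sketch of how the criterion weights and the 1-5 scale might combine into a final percentage. The mapping of a 5 to 100% is an assumption for illustration, not a statement of course policy.

```python
# Hypothetical sketch: combining per-criterion rubric scores (1-5) with the
# criterion weights from the table above into a 0-100 percentage.
# The example scores are illustrative.

WEIGHTS = {
    "Problem Statement": 0.20,
    "Related Work": 0.20,
    "Proposed Approach": 0.25,
    "Evaluation Plan": 0.20,
    "Writing Quality": 0.15,
}

def weighted_percentage(scores: dict) -> float:
    """Map per-criterion 1-5 scores to a 0-100 percentage.

    Assumes a 5 on every criterion corresponds to 100%.
    """
    return 100 * sum(WEIGHTS[c] * (scores[c] / 5) for c in WEIGHTS)

example = {
    "Problem Statement": 5,
    "Related Work": 4,
    "Proposed Approach": 4,
    "Evaluation Plan": 3,
    "Writing Quality": 5,
}
print(round(weighted_percentage(example), 1))  # 83.0
```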
**Problem Statement (20%)**

| Score | Description |
|---|---|
| 5 | Crystal clear problem; compelling motivation; significance well-argued |
| 4 | Clear problem statement; good motivation |
| 3 | Problem understandable but motivation could be stronger |
| 2 | Problem vague or poorly motivated |
| 1 | No clear problem statement |
Guiding Questions:
**Related Work (20%)**

| Score | Description |
|---|---|
| 5 | Comprehensive survey; clear positioning relative to prior work |
| 4 | Good coverage of relevant work; identifies gaps |
| 3 | Some relevant work cited; positioning could be clearer |
| 2 | Limited related work; missing key references |
| 1 | No related work or completely irrelevant citations |
Guiding Questions:
**Proposed Approach (25%)**

| Score | Description |
|---|---|
| 5 | Innovative approach; clearly feasible; well-justified choices |
| 4 | Sound methodology; reasonable approach |
| 3 | Approach understandable but some details unclear |
| 2 | Methodology vague or potentially infeasible |
| 1 | No clear approach or fundamentally flawed |
Guiding Questions:
**Evaluation Plan (20%)**

| Score | Description |
|---|---|
| 5 | Comprehensive evaluation; appropriate metrics; strong baselines |
| 4 | Good evaluation plan; reasonable metrics and baselines |
| 3 | Basic evaluation outlined; some gaps |
| 2 | Evaluation unclear or inappropriate metrics |
| 1 | No evaluation plan |
Guiding Questions:
**Writing Quality (15%)**

| Score | Description |
|---|---|
| 5 | Exceptionally clear; well-organized; no errors |
| 4 | Clear writing; good organization; minor errors |
| 3 | Understandable but could be clearer; some disorganization |
| 2 | Hard to follow; significant writing issues |
| 1 | Incomprehensible or severely disorganized |
Research Question: Can LLM-orchestrated MCP servers achieve comparable accuracy to hand-coded ETL scripts when integrating heterogeneous smart city data sources?
Abstract: This project develops an LLM-augmented data pipeline for integrating data from multiple smart city portals. The system uses MCP servers to expose APIs for Gainesville’s transit, utilities, and 311 request data. An LLM orchestrator coordinates data extraction, schema mapping, and quality validation. I evaluate the approach by comparing extraction accuracy and development effort against equivalent hand-coded Python scripts.
Architecture:
```
┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐
│   Transit API   │   │  Utilities API  │   │     311 API     │
│   MCP Server    │   │   MCP Server    │   │   MCP Server    │
└────────┬────────┘   └────────┬────────┘   └────────┬────────┘
         │                     │                     │
         └──────────┬──────────┴─────────────────────┘
                    │
             ┌──────▼───────┐
             │     LLM      │
             │ Orchestrator │
             └──────┬───────┘
                    │
             ┌──────▼───────┐
             │  Integrated  │
             │   Database   │
             └──────────────┘
```
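As a rough illustration of the flow in the diagram, the sketch below stubs each MCP server as a plain function and merges its records under a common schema. All server names, fields, and values are hypothetical stand-ins; a real system would call MCP tools and use the LLM for schema mapping.

```python
# Hypothetical sketch of the orchestration flow in the diagram above.
# Each "server" is stubbed as a function returning sample records;
# all names and fields are illustrative, not real city APIs.

def transit_server() -> list:
    return [{"stop_id": "S1", "delay_min": 4}]

def utilities_server() -> list:
    return [{"meter": "M7", "kwh": 12.5}]

def three11_server() -> list:
    return [{"request_id": "R9", "category": "pothole"}]

def orchestrate() -> list:
    """Pull from each source and tag records with a 'source' field,
    standing in for LLM-driven schema mapping into one database."""
    integrated = []
    for name, fetch in [("transit", transit_server),
                        ("utilities", utilities_server),
                        ("311", three11_server)]:
        for record in fetch():
            integrated.append({"source": name, **record})
    return integrated

print(len(orchestrate()))  # 3 records, one per source
```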
Evaluation Plan:
Research Question: How does GPT-4 entity resolution performance compare to Magellan on structured product catalogs, and at what scale does token cost exceed traditional ML training cost?
Abstract: This project compares LLM-based entity resolution against Magellan, a traditional ML-based entity matching system. Using the Abt-Buy product matching benchmark, I implement both approaches and evaluate matching accuracy, runtime, and cost. The project produces a cost-performance tradeoff analysis to guide practitioners in choosing between approaches.
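The head-to-head comparison would score both systems' predicted match pairs against the benchmark's gold labels. Below is a minimal sketch of that scoring; the pair IDs are made up, while Abt-Buy supplies the real labeled pairs.

```python
# Illustrative sketch: scoring entity-resolution output against gold
# match pairs, as comparing the LLM and Magellan would require.
# The pair IDs below are invented for the example.

def match_f1(predicted: set, gold: set) -> float:
    """Precision/recall/F1 over sets of (left_id, right_id) match pairs."""
    tp = len(predicted & gold)
    if not predicted or not gold or tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = {("abt_1", "buy_1"), ("abt_2", "buy_5"), ("abt_3", "buy_9")}
predicted = {("abt_1", "buy_1"), ("abt_2", "buy_5"), ("abt_4", "buy_2")}
print(round(match_f1(predicted, gold), 3))
```

The same predicted-pair sets can then be annotated with runtime and token cost to build the tradeoff analysis.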
Evaluation Plan:
Research Question: Can an LLM-based diagnostic agent reduce data pipeline downtime by automatically detecting and suggesting fixes for common failures?
Abstract: This project develops a self-healing data pipeline architecture where an LLM agent monitors pipeline health, diagnoses failures, and suggests or applies fixes. The system uses MCP servers to expose pipeline metadata, logs, and configuration. I evaluate the approach by injecting common failures (schema drift, API rate limits, data quality issues) and measuring detection accuracy and fix appropriateness.
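One of the injected failures, schema drift, can be detected by diffing observed columns against an expected schema. The sketch below shows that check; column names are hypothetical, and a real diagnostic agent would pass such a diff to the LLM to propose a fix.

```python
# Hypothetical sketch of detecting one injected failure mode: schema drift.
# Column names are made up; a real agent would read the pipeline's
# actual schema from metadata exposed via an MCP server.

EXPECTED_COLUMNS = {"timestamp", "sensor_id", "reading"}

def detect_schema_drift(batch: list) -> dict:
    """Return missing and unexpected columns for a batch of records."""
    observed = set()
    for record in batch:
        observed |= set(record.keys())
    return {
        "missing": sorted(EXPECTED_COLUMNS - observed),
        "unexpected": sorted(observed - EXPECTED_COLUMNS),
    }

# A drifted batch: the upstream source renamed "sensor_id" to "sensor".
drifted = [{"timestamp": "2026-01-01", "sensor": "A3", "reading": 1.2}]
print(detect_schema_drift(drifted))
```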
Evaluation Plan:
- **Be specific** - Vague proposals receive lower scores. Specify exact datasets, metrics, and methods.
- **Scope appropriately** - A focused project with strong evaluation beats an ambitious project you cannot complete.
- **Start with evaluation** - Define how you will measure success before designing the system.
- **Cite relevant work** - Show you understand the landscape. Include 5-10 relevant papers.
- **Include an architecture diagram** - A picture clarifies your design better than paragraphs of text.
- **Address feasibility** - Acknowledge risks and explain how you will mitigate them.
Push your proposal to the proposal/ directory and add cegme as an Admin collaborator on the repository.