CIS 6930 Spring 26


Data Engineering at the University of Florida

Project Proposal

Due: Monday, February 23, 2026 at 11:59 PM
Points: 100
Submission: Push to the cis6930sp26-project repository in the proposal/ directory


Overview

The proposal establishes the foundation for your project. You will define a research question, select a direction, and outline your approach. A strong proposal demonstrates that you understand the problem space and have a feasible plan.

Deliverables

Submit a proposal document (2-3 pages) that includes:

  1. Research Question - A specific, measurable question about LLM-augmented data pipelines
  2. Project Direction - Which direction (A, B, or C) you are pursuing
  3. Dataset Selection - What data source you will use and why
  4. Initial Design - Architecture diagram and component description
  5. Evaluation Plan - Metrics, baselines, and ground truth strategy
  6. Timeline - Weekly milestones through the semester

File Structure

cis6930sp26-project/
├── proposal/
│   └── proposal.md (or proposal.pdf)
└── ...

Rubric

| Criterion | Weight | Description |
|---|---|---|
| Problem Statement | 20% | Is the research problem clearly defined and well-motivated? |
| Related Work | 20% | Does the proposal demonstrate knowledge of prior work? |
| Proposed Approach | 25% | Is the methodology feasible and technically sound? |
| Evaluation Plan | 20% | Are the proposed metrics and baselines appropriate? |
| Writing Quality | 15% | Is the proposal well-written and organized? |

Scoring Scale

| Score | Meaning |
|---|---|
| 5 | Excellent - Ready to proceed |
| 4 | Good - Strong with minor improvements needed |
| 3 | Satisfactory - Acceptable but needs refinement |
| 2 | Needs Work - Significant gaps or issues |
| 1 | Incomplete - Major revision required |

Detailed Criteria

Problem Statement (20%)

| Score | Description |
|---|---|
| 5 | Crystal clear problem; compelling motivation; significance well-argued |
| 4 | Clear problem statement; good motivation |
| 3 | Problem understandable but motivation could be stronger |
| 2 | Problem vague or poorly motivated |
| 1 | No clear problem statement |

Guiding Questions:

Related Work (20%)

| Score | Description |
|---|---|
| 5 | Comprehensive survey; clear positioning relative to prior work |
| 4 | Good coverage of relevant work; identifies gaps |
| 3 | Some relevant work cited; positioning could be clearer |
| 2 | Limited related work; missing key references |
| 1 | No related work or completely irrelevant citations |

Guiding Questions:

Proposed Approach (25%)

| Score | Description |
|---|---|
| 5 | Innovative approach; clearly feasible; well-justified choices |
| 4 | Sound methodology; reasonable approach |
| 3 | Approach understandable but some details unclear |
| 2 | Methodology vague or potentially infeasible |
| 1 | No clear approach or fundamentally flawed |

Guiding Questions:

Evaluation Plan (20%)

| Score | Description |
|---|---|
| 5 | Comprehensive evaluation; appropriate metrics; strong baselines |
| 4 | Good evaluation plan; reasonable metrics and baselines |
| 3 | Basic evaluation outlined; some gaps |
| 2 | Evaluation unclear or inappropriate metrics |
| 1 | No evaluation plan |

Guiding Questions:

Writing Quality (15%)

| Score | Description |
|---|---|
| 5 | Exceptionally clear; well-organized; no errors |
| 4 | Clear writing; good organization; minor errors |
| 3 | Understandable but could be clearer; some disorganization |
| 2 | Hard to follow; significant writing issues |
| 1 | Incomprehensible or severely disorganized |

Example Proposals

Example 1: Direction A - Smart City Data Pipeline

Research Question: Can LLM-orchestrated MCP servers achieve comparable accuracy to hand-coded ETL scripts when integrating heterogeneous smart city data sources?

Abstract: This project develops an LLM-augmented data pipeline for integrating data from multiple smart city portals. The system uses MCP servers to expose APIs for Gainesville’s transit, utilities, and 311 request data. An LLM orchestrator coordinates data extraction, schema mapping, and quality validation. I evaluate the approach by comparing extraction accuracy and development effort against equivalent hand-coded Python scripts.

Architecture:

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Transit API    │     │  Utilities API  │     │  311 API        │
│  MCP Server     │     │  MCP Server     │     │  MCP Server     │
└────────┬────────┘     └────────┬────────┘     └────────┬────────┘
         │                       │                       │
         └───────────────┬───────┴───────────────────────┘
                         │
                   ┌──────▼──────┐
                   │     LLM     │
                   │Orchestrator │
                   └──────┬──────┘
                          │
                   ┌──────▼──────┐
                   │ Integrated  │
                   │  Database   │
                   └─────────────┘

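The orchestration pattern in the diagram can be sketched in plain Python. This is a hypothetical illustration, not a real MCP implementation: the fetch functions stand in for MCP servers, the fixed `SCHEMA_MAPS` dictionary stands in for schema mappings the LLM orchestrator would propose, and all field names are invented.

```python
def fetch_transit():
    # Stub standing in for the Transit API MCP server
    return [{"route_id": "5", "stop": "Reitz Union", "delay_min": 3}]

def fetch_311():
    # Stub standing in for the 311 API MCP server (different field names on purpose)
    return [{"request_type": "pothole", "location": "W University Ave"}]

SCHEMA_MAPS = {
    # Stand-in for the LLM orchestrator's schema-mapping step: here the
    # mappings are fixed; in the project they would be model-proposed
    # and then validated.
    "transit": lambda r: {"source": "transit", "entity": r["stop"], "value": r["delay_min"]},
    "311": lambda r: {"source": "311", "entity": r["location"], "value": r["request_type"]},
}

def integrate(sources):
    """Pull every source and normalize records into one integrated table."""
    rows = []
    for name, fetch in sources.items():
        for record in fetch():
            rows.append(SCHEMA_MAPS[name](record))
    return rows

rows = integrate({"transit": fetch_transit, "311": fetch_311})
# rows now holds uniformly shaped records from both sources
```

A hand-coded baseline would replace `SCHEMA_MAPS` with explicit per-source ETL code, which is what the evaluation compares against.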
Evaluation Plan:


Example 2: Direction B - LLM vs Traditional Entity Resolution

Research Question: How does GPT-4 entity resolution performance compare to Magellan on structured product catalogs, and at what scale does token cost exceed traditional ML training cost?

Abstract: This project compares LLM-based entity resolution against Magellan, a traditional ML-based entity matching system. Using the Abt-Buy product matching benchmark, I implement both approaches and evaluate matching accuracy, runtime, and cost. The project produces a cost-performance tradeoff analysis to guide practitioners in choosing between approaches.
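Matching accuracy for both systems can be scored the same way: treat each system's output as a set of predicted match pairs and compare against gold labels. A minimal sketch (the pair IDs below are invented for illustration, not real Abt-Buy identifiers):

```python
def match_scores(predicted, gold):
    """predicted, gold: sets of (left_id, right_id) match pairs."""
    tp = len(predicted & gold)  # true positives: pairs both sets agree on
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# Illustrative pair IDs only
gold = {("abt_1", "buy_9"), ("abt_2", "buy_4"), ("abt_3", "buy_7")}
llm_pred = {("abt_1", "buy_9"), ("abt_2", "buy_4"), ("abt_5", "buy_1")}
scores = match_scores(llm_pred, gold)  # precision, recall, f1 all 2/3 here
```

Running the same scorer over the Magellan output makes the accuracy comparison apples-to-apples; cost and runtime are tracked separately.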

Evaluation Plan:


Example 3: Direction C - Self-Healing Data Pipeline

Research Question: Can an LLM-based diagnostic agent reduce data pipeline downtime by automatically detecting and suggesting fixes for common failures?

Abstract: This project develops a self-healing data pipeline architecture where an LLM agent monitors pipeline health, diagnoses failures, and suggests or applies fixes. The system uses MCP servers to expose pipeline metadata, logs, and configuration. I evaluate the approach by injecting common failures (schema drift, API rate limits, data quality issues) and measuring detection accuracy and fix appropriateness.
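One of the injected failure classes, schema drift, can be detected with a simple structural check whose report is then handed to the diagnostic agent. A hypothetical sketch (the expected column names are made up; a real pipeline would read them from its metadata):

```python
EXPECTED_COLUMNS = {"id", "timestamp", "value"}  # assumed pipeline contract

def detect_schema_drift(batch_columns):
    """Return a diagnostic dict the LLM agent could be prompted with."""
    observed = set(batch_columns)
    missing = EXPECTED_COLUMNS - observed
    unexpected = observed - EXPECTED_COLUMNS
    return {
        "drifted": bool(missing or unexpected),
        "missing": sorted(missing),
        "unexpected": sorted(unexpected),
    }

# A drifted batch: 'timestamp' was renamed to 'ts' upstream
report = detect_schema_drift(["id", "ts", "value"])
```

Detection accuracy is then the fraction of injected failures for which the check (and the agent's diagnosis built on it) fires correctly.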

Evaluation Plan:


Tips for a Strong Proposal

  1. Be specific - Vague proposals receive lower scores. Specify exact datasets, metrics, and methods.

  2. Scope appropriately - A focused project with strong evaluation beats an ambitious project you cannot complete.

  3. Start with evaluation - Define how you will measure success before designing the system.

  4. Cite relevant work - Show you understand the landscape. Include 5-10 relevant papers.

  5. Include an architecture diagram - A picture clarifies your design better than paragraphs of text.

  6. Address feasibility - Acknowledge risks and explain how you will mitigate them.


Submission Checklist


Resources

