Data Engineering at the University of Florida
Due: Wednesday, February 18, 2026 at 11:59 PM
Points: 100
Submission: GitHub repository + Canvas link
Peer Review: Due February 18, 2026 at 11:59 PM
In this assignment, you will build a multi-agent data processing pipeline using the Model Context Protocol (MCP). You will create MCP servers that expose data engineering tools, connect them to an LLM, and use the LLM to orchestrate a complete ETL (Extract, Transform, Load) workflow.
This assignment demonstrates how LLMs can serve as intelligent orchestrators for data pipelines, making decisions about data quality, transformation strategies, and error handling.
By completing this assignment, you will gain hands-on experience building MCP servers, connecting them to an LLM client, and using the LLM to orchestrate a complete ETL workflow.
You will work with the City of Gainesville Crime Responses dataset, available through the City of Gainesville Open Data Portal.
The City of Gainesville publishes a variety of public datasets through its Open Data Portal at https://data.cityofgainesville.org/. This portal provides access to datasets across public safety, transportation, utilities, and more. Each dataset can be explored in the browser, downloaded in multiple formats, or accessed programmatically through a Socrata Open Data API (SODA) endpoint.
For this assignment, you will use the Crime Responses dataset. You can browse and explore the dataset here: https://data.cityofgainesville.org/d/gvua-xt9q
API Endpoint: https://data.cityofgainesville.org/resource/gvua-xt9q.json
This dataset contains incident reports from the Gainesville Police Department, including incident types and report dates.
Build an MCP-powered data pipeline consisting of three MCP servers and an LLM orchestration script:
Extract Server (extract_server.py)

Create an MCP server that exposes tools for data extraction:
```python
@mcp.tool()
def fetch_incidents(limit: int = 100, offset: int = 0) -> str:
    """Fetch crime incident data from the Gainesville API."""
    # Implementation

@mcp.tool()
def get_incident_types() -> list[str]:
    """Get a list of unique incident types in the dataset."""
    # Implementation

@mcp.resource("schema://incidents")
def get_schema() -> str:
    """Return the schema of the incidents data."""
    # Implementation
```
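As a sketch of what `fetch_incidents` might do under the hood: SODA endpoints page through results with the `$limit` and `$offset` query parameters. A minimal URL builder using only the standard library (the validation rules here are my assumption, not a requirement):

```python
from urllib.parse import urlencode

SODA_URL = "https://data.cityofgainesville.org/resource/gvua-xt9q.json"

def build_fetch_url(limit: int = 100, offset: int = 0) -> str:
    """Build a paginated SODA request URL for the incidents endpoint."""
    if limit < 1 or offset < 0:
        raise ValueError("limit must be >= 1 and offset must be >= 0")
    # SODA APIs page results via the $limit and $offset query parameters.
    return f"{SODA_URL}?{urlencode({'$limit': limit, '$offset': offset})}"
```

`fetch_incidents` could then GET this URL (for example with `requests`) and return the response body as a JSON string.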
Transform Server (transform_server.py)

Create an MCP server that exposes tools for data cleaning and transformation:
```python
@mcp.tool()
def clean_dates(data: str) -> str:
    """Parse and standardize date fields."""
    # Implementation

@mcp.tool()
def categorize_incidents(data: str, categories: list[str]) -> str:
    """Group incidents into broader categories."""
    # Implementation

@mcp.tool()
def detect_anomalies(data: str) -> str:
    """Identify potential data quality issues."""
    # Implementation
```
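One possible shape for `clean_dates`, assuming records arrive as a JSON array and the timestamp field is named `report_date` (the field name is an assumption; check the actual schema via your extract server's resource):

```python
import json
from datetime import datetime

def clean_dates(data: str, field: str = "report_date") -> str:
    """Normalize a timestamp field to an ISO-8601 date (YYYY-MM-DD)."""
    records = json.loads(data)
    for record in records:
        raw = record.get(field)
        if raw:
            # SODA floating timestamps look like "2026-01-15T08:30:00.000".
            record[field] = datetime.fromisoformat(raw).date().isoformat()
    return json.dumps(records)
```

Records that lack the field pass through unchanged, which keeps the tool safe to call on partially populated data.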
Load Server (load_server.py)

Create an MCP server for storage and analysis:
```python
@mcp.tool()
def save_to_sqlite(data: str, table_name: str) -> str:
    """Save processed data to SQLite database."""
    # Implementation

@mcp.tool()
def query_database(sql: str) -> str:
    """Execute a SQL query on the processed data."""
    # Implementation

@mcp.tool()
def generate_summary(table_name: str) -> str:
    """Generate summary statistics for a table."""
    # Implementation
```
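One way `save_to_sqlite` could work with only the standard library, inferring columns from the records' keys (storing every column as TEXT is a simplifying assumption, not a requirement):

```python
import json
import sqlite3

def save_to_sqlite(data: str, table_name: str, db_path: str = ":memory:") -> str:
    """Create the table from the records' keys and bulk-insert the rows."""
    records = json.loads(data)
    if not records:
        return "no records to save"
    columns = sorted({key for record in records for key in record})
    column_defs = ", ".join(f'"{c}" TEXT' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    rows = [tuple(record.get(c) for c in columns) for record in records]
    conn = sqlite3.connect(db_path)
    try:
        # Identifiers cannot be bound as SQL parameters, so the table name
        # is double-quoted directly; only pass trusted table names here.
        conn.execute(f'CREATE TABLE IF NOT EXISTS "{table_name}" ({column_defs})')
        conn.executemany(f'INSERT INTO "{table_name}" VALUES ({placeholders})', rows)
        conn.commit()
        return f"saved {len(rows)} rows to {table_name}"
    finally:
        conn.close()
```

Returning a human-readable status string (rather than raising) gives the orchestrating LLM something concrete to reason about on the next step.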
LLM Orchestration (pipeline.py)

Create a script that connects to all three MCP servers and uses an LLM to orchestrate the pipeline end-to-end.
Example LLM prompt:
```
You are a data engineer assistant with access to MCP tools for data extraction,
transformation, and loading. Process the Gainesville crime data by:

1. First, check the schema and fetch a sample of data
2. Identify any data quality issues
3. Clean and transform the data appropriately
4. Load it into the database
5. Generate a summary report

Use the available tools to complete each step. Explain your decisions.
```
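Whichever LLM client you use, the heart of `pipeline.py` is a loop that relays the model's tool calls to the matching MCP tool and feeds the results back. A minimal dispatcher sketch (the registry contents below are placeholders, not real tools):

```python
import json

def dispatch_tool_call(name: str, arguments_json: str, registry: dict) -> str:
    """Invoke the named tool with JSON-decoded keyword arguments.

    Errors come back as JSON payloads so the LLM can see failures and
    adjust its plan instead of the whole pipeline crashing.
    """
    if name not in registry:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        result = registry[name](**json.loads(arguments_json))
        return result if isinstance(result, str) else json.dumps(result)
    except Exception as exc:
        return json.dumps({"error": str(exc)})

# Placeholder registry; in the real pipeline these would proxy MCP tools.
registry = {"get_incident_types": lambda: ["Theft", "Burglary", "Assault"]}
```

The same dispatcher serves all three servers if you merge their tool listings into one registry keyed by tool name.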
MCP Servers (45 points)

| Requirement | Points |
|---|---|
| Extract server with 3+ tools | 15 |
| Transform server with 3+ tools | 15 |
| Load server with 3+ tools | 10 |
| Proper error handling in all tools | 5 |
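For the error-handling requirement, one pattern (a sketch, not the only acceptable approach) is a decorator that converts exceptions into JSON error strings so a misbehaving tool never crashes the server:

```python
import functools
import json

def safe_tool(func):
    """Return errors as JSON instead of letting the tool raise."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            return json.dumps({"error": type(exc).__name__, "detail": str(exc)})
    return wrapper

@safe_tool
def parse_incident_id(raw: str) -> str:
    """Toy example: extract the id field from a JSON record."""
    return json.loads(raw)["id"]
```

Applied under `@mcp.tool()`, this keeps every tool's failure mode uniform and machine-readable.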
LLM Orchestration (25 points)

| Requirement | Points |
|---|---|
| Successfully connects to MCP servers | 10 |
| LLM makes appropriate tool calls | 10 |
| Pipeline completes end-to-end | 5 |
Peer Review (10 points)

| Requirement | Points |
|---|---|
| Completion of assigned peer reviews | 10 |
Documentation & Testing (20 points)

| Requirement | Points |
|---|---|
| README with setup, usage, and pipeline comparison | 5 |
| Tests for each MCP server | 10 |
| COLLABORATORS.md | 5 |
Your repository must include a well-structured README.md that serves as the primary documentation for your project. The README should contain the following sections:
Your README must enable someone unfamiliar with your project to set it up and run it from scratch. Include:
Compare your MCP pipeline to a traditional (non-LLM) implementation, discussing the trade-offs you observed.
Document any known bugs, limitations, or assumptions made during development.
Your repository must include a COLLABORATORS.md file documenting all collaboration and assistance you received, including discussions with classmates and any AI tools used.
This file is required for academic integrity. Be thorough and honest.
Here is an example project structure:
```
cis6930sp26-assignment1/
├── .github/
│   └── workflows/
│       └── pytest.yml
├── servers/
│   ├── extract_server.py
│   ├── transform_server.py
│   └── load_server.py
├── tests/
│   ├── test_extract.py
│   ├── test_transform.py
│   └── test_load.py
├── data/
│   └── incidents.db (generated)
├── .env.example
├── COLLABORATORS.md
├── LICENSE
├── README.md
├── pipeline.py
└── pyproject.toml
```
An example pyproject.toml:

```toml
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "assignment1"
version = "1.0.0"
description = "CIS 6930 Assignment 1 - MCP Data Pipeline"
authors = [{name = "Your Name", email = "your.email@ufl.edu"}]
requires-python = ">=3.11"
dependencies = [
    "mcp>=1.0",
    "requests>=2.31",
    "pandas>=2.0",
    "pytest>=8.0",
]

[tool.pytest.ini_options]
testpaths = ["tests"]
```
Your README should include instructions like:
```sh
# Install dependencies
uv sync

# Start the MCP servers (in separate terminals or as background processes)
uv run python servers/extract_server.py
uv run python servers/transform_server.py
uv run python servers/load_server.py

# Run the LLM-orchestrated pipeline
uv run python pipeline.py

# Or run tests
uv run pytest -v
```
To submit your assignment:

1. Create a GitHub repository named cis6930sp26-assignment1
2. Add cegme as an Admin collaborator on your repository
3. Tag your submission and push the tag:

```sh
git tag v1.0
git push origin v1.0
```

4. Submit the repository link on Canvas
Due date: Wednesday, February 18, 2026 at 11:59 PM
You may submit after the due date until grading begins (typically 1-3 days after due date). The exact grading start time will not be announced. No submissions accepted after grading begins.
I strongly encourage submitting by the due date, since grading may begin at any unannounced time afterward and no submissions are accepted once it starts.
You will review 2 classmates' submissions; completing your assigned peer reviews is worth 10 points. Evaluate each submission against the requirements and rubric above.
| Component | Points | Graded By |
|---|---|---|
| MCP Servers | 45 | Automated + Peers |
| LLM Orchestration | 25 | Peers |
| Documentation & Testing | 20 | Peers |
| Peer Review | 10 | Instructor |
| Total | 100 | |
This is an individual assignment. You may discuss concepts with classmates, but all code must be your own. Document all collaboration and AI assistance in COLLABORATORS.md (see requirements above).