CIS 6930 Spring 26

Data Engineering at the University of Florida

Assignment 1: MCP Data Pipeline

Due: Wednesday, February 18, 2026 at 11:59 PM
Points: 100
Submission: GitHub repository + Canvas link
Peer Review: Due February 18, 2026 at 11:59 PM


Overview

In this assignment, you will build a multi-agent data processing pipeline using the Model Context Protocol (MCP). You will create MCP servers that expose data engineering tools, connect them to an LLM, and use the LLM to orchestrate a complete ETL (Extract, Transform, Load) workflow.

This assignment demonstrates how LLMs can serve as intelligent orchestrators for data pipelines, making decisions about data quality, transformation strategies, and error handling.


Learning Objectives

By completing this assignment, you will:

  1. Build MCP servers using Python’s FastMCP framework
  2. Design tools with proper input/output schemas
  3. Connect MCP servers to an LLM client
  4. Implement an LLM-orchestrated data pipeline
  5. Compare LLM-orchestrated vs. traditional pipeline approaches

Dataset

You will work with the City of Gainesville Crime Responses dataset, available through the City of Gainesville Open Data Portal.

The City of Gainesville publishes a variety of public datasets through its Open Data Portal at https://data.cityofgainesville.org/. This portal provides access to datasets across public safety, transportation, utilities, and more. Each dataset can be explored in the browser, downloaded in multiple formats, or accessed programmatically through a Socrata Open Data API (SODA) endpoint.

For this assignment, you will use the Crime Responses dataset. You can browse and explore the dataset here: https://data.cityofgainesville.org/d/gvua-xt9q

API Endpoint: https://data.cityofgainesville.org/resource/gvua-xt9q.json
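The endpoint is a standard Socrata (SODA) resource, so paging is done with the `$limit` and `$offset` query parameters. A minimal fetch sketch using `requests` (which appears in the dependency list later in this document); the helper names are placeholders:

```python
import requests

BASE_URL = "https://data.cityofgainesville.org/resource/gvua-xt9q.json"

def soda_params(limit: int = 100, offset: int = 0) -> dict:
    # SODA paging: $limit caps the page size, $offset skips past earlier rows
    return {"$limit": limit, "$offset": offset}

def fetch_page(limit: int = 100, offset: int = 0) -> list[dict]:
    # Fetch one page of incident records as a list of dicts
    resp = requests.get(BASE_URL, params=soda_params(limit, offset), timeout=30)
    resp.raise_for_status()
    return resp.json()
```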

This dataset contains incident reports from the Gainesville Police Department.


Task

Build an MCP-powered data pipeline with three components:

1. MCP Server: Data Extraction (extract_server.py)

Create an MCP server that exposes tools for data extraction:

@mcp.tool()
def fetch_incidents(limit: int = 100, offset: int = 0) -> str:
    """Fetch crime incident data from the Gainesville API."""
    # Implementation

@mcp.tool()
def get_incident_types() -> list[str]:
    """Get a list of unique incident types in the dataset."""
    # Implementation

@mcp.resource("schema://incidents")
def get_schema() -> str:
    """Return the schema of the incidents data."""
    # Implementation
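One pattern that keeps these tools unit-testable is to put the logic in a plain function and register it with `@mcp.tool()`. A sketch of the core of `get_incident_types`, assuming the type field is named `incident_type` (an assumption — verify it against your `schema://incidents` resource):

```python
import json

def unique_incident_types(data: str, field: str = "incident_type") -> list[str]:
    """Return the sorted unique values of the incident-type field.

    `data` is a JSON array of records, the string format MCP tools exchange.
    The field name "incident_type" is an assumption -- check the real schema.
    """
    records = json.loads(data)
    # Skip records where the field is missing or empty
    return sorted({r[field] for r in records if r.get(field)})
```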

2. MCP Server: Data Transformation (transform_server.py)

Create an MCP server that exposes tools for data cleaning and transformation:

@mcp.tool()
def clean_dates(data: str) -> str:
    """Parse and standardize date fields."""
    # Implementation

@mcp.tool()
def categorize_incidents(data: str, categories: list[str]) -> str:
    """Group incidents into broader categories."""
    # Implementation

@mcp.tool()
def detect_anomalies(data: str) -> str:
    """Identify potential data quality issues."""
    # Implementation

3. MCP Server: Data Loading & Analysis (load_server.py)

Create an MCP server for storage and analysis:

@mcp.tool()
def save_to_sqlite(data: str, table_name: str) -> str:
    """Save processed data to SQLite database."""
    # Implementation

@mcp.tool()
def query_database(sql: str) -> str:
    """Execute a SQL query on the processed data."""
    # Implementation

@mcp.tool()
def generate_summary(table_name: str) -> str:
    """Generate summary statistics for a table."""
    # Implementation
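A sketch of `save_to_sqlite` using the standard-library `sqlite3` module. The default `db_path` assumes the `data/` layout shown later in this document; storing every column as TEXT is a deliberate simplification, and a real tool should validate `table_name` before interpolating it into SQL:

```python
import json
import sqlite3

def save_to_sqlite(data: str, table_name: str,
                   db_path: str = "data/incidents.db") -> str:
    """Create the table from the records' keys and insert all rows."""
    records = json.loads(data)
    if not records:
        return "No records to save."
    # Union of keys across records, so sparse fields still get a column
    cols = sorted({k for r in records for k in r})
    conn = sqlite3.connect(db_path)
    try:
        # Quoted identifiers guard against awkward column names; all values
        # are stored as TEXT in this sketch.
        col_defs = ", ".join(f'"{c}" TEXT' for c in cols)
        conn.execute(f'CREATE TABLE IF NOT EXISTS "{table_name}" ({col_defs})')
        placeholders = ", ".join("?" for _ in cols)
        rows = [tuple(str(r.get(c, "")) for c in cols) for r in records]
        conn.executemany(
            f'INSERT INTO "{table_name}" VALUES ({placeholders})', rows)
        conn.commit()
        return f"Saved {len(rows)} rows to {table_name}."
    finally:
        conn.close()
```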

4. LLM Orchestration (pipeline.py)

Create a script that:

  1. Connects to your MCP servers
  2. Uses an LLM (via NavigatorAI) to orchestrate the pipeline
  3. Lets the LLM decide:
    • How much data to fetch
    • Which transformations to apply based on data quality
    • What summary statistics to generate

Example LLM prompt:

You are a data engineer assistant with access to MCP tools for data extraction,
transformation, and loading. Process the Gainesville crime data by:
1. First, check the schema and fetch a sample of data
2. Identify any data quality issues
3. Clean and transform the data appropriately
4. Load it into the database
5. Generate a summary report

Use the available tools to complete each step. Explain your decisions.
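The control flow of the orchestration script can be sketched independently of the transport. In the sketch below, `call_llm` and the `tools` registry are placeholders for your NavigatorAI client and the MCP client's tool-call mechanism; the action dict format is an assumption, not part of any API:

```python
import json
from typing import Callable

def run_pipeline(call_llm: Callable[[list[dict]], dict],
                 tools: dict[str, Callable[..., str]],
                 system_prompt: str, max_steps: int = 20) -> list[dict]:
    """Loop: ask the LLM for the next action, execute it, feed back the result.

    `call_llm` (hypothetical) returns {"tool": name, "args": {...}} to request
    a tool call, or {"done": summary} to finish.
    """
    messages = [{"role": "system", "content": system_prompt}]
    for _ in range(max_steps):
        action = call_llm(messages)
        if "done" in action:
            messages.append({"role": "assistant", "content": action["done"]})
            break
        name, args = action["tool"], action.get("args", {})
        if name not in tools:                 # validate: the LLM may hallucinate tools
            result = f"Error: unknown tool {name!r}"
        else:
            try:
                result = tools[name](**args)
            except Exception as exc:          # surface failures back to the LLM
                result = f"Error: {exc}"
        messages.append({"role": "tool",
                         "content": json.dumps({"tool": name, "result": result})})
    return messages
```

Capping the loop with `max_steps` and echoing tool errors back as messages (rather than crashing) are the two guards that keep a misbehaving LLM from stalling or killing the pipeline.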

Requirements

MCP Servers (45 points)

Requirement Points
Extract server with 3+ tools 15
Transform server with 3+ tools 15
Load server with 3+ tools 10
Proper error handling in all tools 5

LLM Orchestration (25 points)

Requirement Points
Successfully connects to MCP servers 10
LLM makes appropriate tool calls 10
Pipeline completes end-to-end 5

Peer Review (10 points)

Requirement Points
Completion of assigned peer reviews 10

Documentation & Testing (20 points)

Requirement Points
README with setup, usage, and pipeline comparison 5
Tests for each MCP server 10
COLLABORATORS.md 5

README.md

Your repository must include a well-structured README.md that serves as the primary documentation for your project. The README should contain the following sections:

Setup and Usage Instructions

Your README must enable someone unfamiliar with your project to set it up and run it from scratch, including installation steps and the commands to start the servers and run the pipeline (see Running the Pipeline below).

Pipeline Comparison

Compare your MCP pipeline to a traditional implementation by addressing:

  1. Flexibility: How does the LLM handle unexpected data quality issues?
  2. Transparency: Can you understand why the LLM made certain decisions?
  3. Reliability: Did the LLM ever make mistakes? How did you handle them?
  4. Performance: Compare execution time and token usage

Bugs and Assumptions

Document any known bugs, limitations, or assumptions made during development.

COLLABORATORS.md

Your repository must include a COLLABORATORS.md file documenting all collaboration and assistance you received, including discussions with classmates and any AI tools you used.

This file is required for academic integrity. Be thorough and honest.


Project Structure

Here is an example project structure:

cis6930sp26-assignment1/
├── .github/
│   └── workflows/
│       └── pytest.yml
├── servers/
│   ├── extract_server.py
│   ├── transform_server.py
│   └── load_server.py
├── tests/
│   ├── test_extract.py
│   ├── test_transform.py
│   └── test_load.py
├── data/
│   └── incidents.db (generated)
├── .env.example
├── COLLABORATORS.md
├── LICENSE
├── README.md
├── pipeline.py
└── pyproject.toml

pyproject.toml

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "assignment1"
version = "1.0.0"
description = "CIS 6930 Assignment 1 - MCP Data Pipeline"
authors = [{name = "Your Name", email = "your.email@ufl.edu"}]
requires-python = ">=3.11"
dependencies = [
    "mcp>=1.0",
    "requests>=2.31",
    "pandas>=2.0",
    "pytest>=8.0",
]

[tool.pytest.ini_options]
testpaths = ["tests"]

Running the Pipeline

Your README should include instructions like:

# Install dependencies
uv sync

# Start the MCP servers (in separate terminals or as background processes)
uv run python servers/extract_server.py
uv run python servers/transform_server.py
uv run python servers/load_server.py

# Run the LLM-orchestrated pipeline
uv run python pipeline.py

# Or run tests
uv run pytest -v

Submission

  1. Create a private repository named cis6930sp26-assignment1
  2. Add cegme as an Admin collaborator on your repository
  3. Tag your final submission:
    git tag v1.0
    git push origin v1.0
    
  4. Submit the repository URL to Canvas

Late Policy

Due date: Wednesday, February 18, 2026 at 11:59 PM

You may submit after the due date until grading begins (typically 1-3 days after due date). The exact grading start time will not be announced. No submissions accepted after grading begins.

I strongly encourage submitting by the due date: the grading start time is not announced, and a late submission risks missing the cutoff entirely.


Peer Review (10 points)

You will review 2 classmates’ submissions. Completing your assigned peer reviews is worth 10 points. Evaluate:

  1. Do the MCP servers start and respond correctly?
  2. Does the LLM successfully orchestrate the pipeline?
  3. Is the code well-organized and documented?
  4. Does the pipeline comparison in the README provide meaningful insights?

Tips

  1. Start with one server - Get the extract server working before building the others
  2. Test tools independently - Use the MCP Inspector to test tools before connecting to the LLM
  3. Handle JSON carefully - MCP tools exchange data as strings; use JSON serialization
  4. Log LLM decisions - Print the LLM’s reasoning to understand its choices
  5. Plan for failures - The LLM may call tools incorrectly; add validation

Resources


Grading

Component Points Graded By
MCP Servers 45 Automated + Peers
LLM Orchestration 25 Peers
Documentation & Testing 20 Peers
Peer Review 10 Instructor
Total 100  

Academic Integrity

This is an individual assignment. You may discuss concepts with classmates, but all code must be your own. Document all collaboration and AI assistance in COLLABORATORS.md (see requirements above).

