CIS 6930 Spring 26

Data Engineering at the University of Florida

Assignment 1: MCP Data Pipeline

Due: Wednesday, February 18, 2026 at 11:59 PM
Points: 100
Submission: GitHub repository + Canvas link
Peer Review: Due February 18, 2026 at 11:59 PM


Overview

In this assignment, you will build a multi-agent data processing pipeline using the Model Context Protocol (MCP). You will create MCP servers that expose data engineering tools, connect them to an LLM, and use the LLM to orchestrate a complete ETL (Extract, Transform, Load) workflow.

This assignment demonstrates how LLMs can serve as intelligent orchestrators for data pipelines, making decisions about data quality, transformation strategies, and error handling.


Learning Objectives

By completing this assignment, you will:

  1. Build MCP servers using Python’s FastMCP framework
  2. Design tools with proper input/output schemas
  3. Connect MCP servers to an LLM client
  4. Implement an LLM-orchestrated data pipeline
  5. Compare LLM-orchestrated vs. traditional pipeline approaches

Dataset

You will work with the City of Gainesville Crime Responses dataset, available through the City of Gainesville Open Data Portal.

The City of Gainesville publishes a variety of public datasets through its Open Data Portal at https://data.cityofgainesville.org/. This portal provides access to datasets across public safety, transportation, utilities, and more. Each dataset can be explored in the browser, downloaded in multiple formats, or accessed programmatically through a Socrata Open Data API (SODA) endpoint.

For this assignment, you will use the Crime Responses dataset. You can browse and explore the dataset here: https://data.cityofgainesville.org/d/gvua-xt9q

API Endpoint: https://data.cityofgainesville.org/resource/gvua-xt9q.json
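The endpoint is a standard Socrata (SODA) resource, so paging is done with the `$limit` and `$offset` query parameters. A minimal fetch sketch using `requests` (which appears in the dependency list later in this document); the helper names are placeholders:

```python
import requests

BASE_URL = "https://data.cityofgainesville.org/resource/gvua-xt9q.json"

def soda_params(limit: int = 100, offset: int = 0) -> dict:
    # SODA paging: $limit caps the page size, $offset skips past earlier rows
    return {"$limit": limit, "$offset": offset}

def fetch_page(limit: int = 100, offset: int = 0) -> list[dict]:
    # Fetch one page of incident records as a list of dicts
    resp = requests.get(BASE_URL, params=soda_params(limit, offset), timeout=30)
    resp.raise_for_status()
    return resp.json()
```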

This dataset contains incident reports from the Gainesville Police Department.


Task

Build an MCP-powered data pipeline with three components:

1. MCP Server: Data Extraction (extract_server.py)

Create an MCP server that exposes tools for data extraction:

@mcp.tool()
def fetch_incidents(limit: int = 100, offset: int = 0) -> str:
    """Fetch crime incident data from the Gainesville API."""
    # Implementation

@mcp.tool()
def get_incident_types() -> list[str]:
    """Get a list of unique incident types in the dataset."""
    # Implementation

@mcp.resource("schema://incidents")
def get_schema() -> str:
    """Return the schema of the incidents data."""
    # Implementation
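One pattern that keeps these tools unit-testable is to put the logic in a plain function and register it with `@mcp.tool()`. A sketch of the core of `get_incident_types`, assuming the type field is named `incident_type` (an assumption — verify it against your `schema://incidents` resource):

```python
import json

def unique_incident_types(data: str, field: str = "incident_type") -> list[str]:
    """Return the sorted unique values of the incident-type field.

    `data` is a JSON array of records, the string format MCP tools exchange.
    The field name "incident_type" is an assumption -- check the real schema.
    """
    records = json.loads(data)
    # Skip records where the field is missing or empty
    return sorted({r[field] for r in records if r.get(field)})
```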

2. MCP Server: Data Transformation (transform_server.py)

Create an MCP server that exposes tools for data cleaning and transformation:

@mcp.tool()
def clean_dates(data: str) -> str:
    """Parse and standardize date fields."""
    # Implementation

@mcp.tool()
def categorize_incidents(data: str, categories: list[str]) -> str:
    """Group incidents into broader categories."""
    # Implementation

@mcp.tool()
def detect_anomalies(data: str) -> str:
    """Identify potential data quality issues."""
    # Implementation

3. MCP Server: Data Loading & Analysis (load_server.py)

Create an MCP server for storage and analysis:

@mcp.tool()
def save_to_sqlite(data: str, table_name: str) -> str:
    """Save processed data to SQLite database."""
    # Implementation

@mcp.tool()
def query_database(sql: str) -> str:
    """Execute a SQL query on the processed data."""
    # Implementation

@mcp.tool()
def generate_summary(table_name: str) -> str:
    """Generate summary statistics for a table."""
    # Implementation
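A sketch of `save_to_sqlite` using the standard-library `sqlite3` module. The default `db_path` assumes the `data/` layout shown later in this document; storing every column as TEXT is a deliberate simplification, and a real tool should validate `table_name` before interpolating it into SQL:

```python
import json
import sqlite3

def save_to_sqlite(data: str, table_name: str,
                   db_path: str = "data/incidents.db") -> str:
    """Create the table from the records' keys and insert all rows."""
    records = json.loads(data)
    if not records:
        return "No records to save."
    # Union of keys across records, so sparse fields still get a column
    cols = sorted({k for r in records for k in r})
    conn = sqlite3.connect(db_path)
    try:
        # Quoted identifiers guard against awkward column names; all values
        # are stored as TEXT in this sketch.
        col_defs = ", ".join(f'"{c}" TEXT' for c in cols)
        conn.execute(f'CREATE TABLE IF NOT EXISTS "{table_name}" ({col_defs})')
        placeholders = ", ".join("?" for _ in cols)
        rows = [tuple(str(r.get(c, "")) for c in cols) for r in records]
        conn.executemany(
            f'INSERT INTO "{table_name}" VALUES ({placeholders})', rows)
        conn.commit()
        return f"Saved {len(rows)} rows to {table_name}."
    finally:
        conn.close()
```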

4. LLM Orchestration (pipeline.py)

Create a script that:

  1. Connects to your MCP servers
  2. Uses an LLM (via NavigatorAI) to orchestrate the pipeline
  3. Lets the LLM decide:
    • How much data to fetch
    • Which transformations to apply based on data quality
    • What summary statistics to generate

Example LLM prompt:

You are a data engineer assistant with access to MCP tools for data extraction,
transformation, and loading. Process the Gainesville crime data by:
1. First, check the schema and fetch a sample of data
2. Identify any data quality issues
3. Clean and transform the data appropriately
4. Load it into the database
5. Generate a summary report

Use the available tools to complete each step. Explain your decisions.
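The control flow of the orchestration script can be sketched independently of the transport. In the sketch below, `call_llm` and the `tools` registry are placeholders for your NavigatorAI client and the MCP client's tool-call mechanism; the action dict format is an assumption, not part of any API:

```python
import json
from typing import Callable

def run_pipeline(call_llm: Callable[[list[dict]], dict],
                 tools: dict[str, Callable[..., str]],
                 system_prompt: str, max_steps: int = 20) -> list[dict]:
    """Loop: ask the LLM for the next action, execute it, feed back the result.

    `call_llm` (hypothetical) returns {"tool": name, "args": {...}} to request
    a tool call, or {"done": summary} to finish.
    """
    messages = [{"role": "system", "content": system_prompt}]
    for _ in range(max_steps):
        action = call_llm(messages)
        if "done" in action:
            messages.append({"role": "assistant", "content": action["done"]})
            break
        name, args = action["tool"], action.get("args", {})
        if name not in tools:                 # validate: the LLM may hallucinate tools
            result = f"Error: unknown tool {name!r}"
        else:
            try:
                result = tools[name](**args)
            except Exception as exc:          # surface failures back to the LLM
                result = f"Error: {exc}"
        messages.append({"role": "tool",
                         "content": json.dumps({"tool": name, "result": result})})
    return messages
```

Capping the loop with `max_steps` and echoing tool errors back as messages (rather than crashing) are the two guards that keep a misbehaving LLM from stalling or killing the pipeline.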

Requirements

MCP Servers (45 points)

Requirement Points
Extract server with 3+ tools 15
Transform server with 3+ tools 15
Load server with 3+ tools 10
Proper error handling in all tools 5

LLM Orchestration (25 points)

Requirement Points
Successfully connects to MCP servers 10
LLM makes appropriate tool calls 10
Pipeline completes end-to-end 5

Peer Review (10 points)

Requirement Points
Completion of assigned peer reviews 10

Documentation & Testing (20 points)

Requirement Points
README with setup, usage, and pipeline comparison 5
Tests for each MCP server 10
COLLABORATORS.md 5

README.md

Your repository must include a well-structured README.md that serves as the primary documentation for your project. The README should contain the following sections:

Setup and Usage Instructions

Your README must enable someone unfamiliar with your project to set it up and run it from scratch, including installation steps and the commands to start the servers and run the pipeline (see Running the Pipeline below).

Pipeline Comparison

Compare your MCP pipeline to a traditional implementation by addressing:

  1. Flexibility: How does the LLM handle unexpected data quality issues?
  2. Transparency: Can you understand why the LLM made certain decisions?
  3. Reliability: Did the LLM ever make mistakes? How did you handle them?
  4. Performance: Compare execution time and token usage

Bugs and Assumptions

Document any known bugs, limitations, or assumptions made during development.

COLLABORATORS.md

Your repository must include a COLLABORATORS.md file documenting all collaboration and assistance you received, including discussions with classmates and any AI tools you used.

This file is required for academic integrity. Be thorough and honest.


Project Structure

Here is an example project structure:

cis6930sp26-assignment1/
├── .github/
│   └── workflows/
│       └── pytest.yml
├── servers/
│   ├── extract_server.py
│   ├── transform_server.py
│   └── load_server.py
├── tests/
│   ├── test_extract.py
│   ├── test_transform.py
│   └── test_load.py
├── data/
│   └── incidents.db (generated)
├── .env.example
├── COLLABORATORS.md
├── LICENSE
├── README.md
├── pipeline.py
└── pyproject.toml

pyproject.toml

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "assignment1"
version = "1.0.0"
description = "CIS 6930 Assignment 1 - MCP Data Pipeline"
authors = [{name = "Your Name", email = "your.email@ufl.edu"}]
requires-python = ">=3.11"
dependencies = [
    "mcp>=1.0",
    "requests>=2.31",
    "pandas>=2.0",
    "pytest>=8.0",
]

[tool.pytest.ini_options]
testpaths = ["tests"]

Running the Pipeline

Your README should include instructions like:

# Install dependencies
uv sync

# Start the MCP servers (in separate terminals or as background processes)
uv run python servers/extract_server.py
uv run python servers/transform_server.py
uv run python servers/load_server.py

# Run the LLM-orchestrated pipeline
uv run python pipeline.py

# Or run tests
uv run pytest -v

Submission

  1. Create a private repository named cis6930sp26-assignment1
  2. Add cegme as an Admin collaborator on your repository
  3. Tag your final submission:
    git tag v1.0
    git push origin v1.0
    
  4. Submit the repository URL to Canvas

Late Policy

Due date: Wednesday, February 18, 2026 at 11:59 PM

You may submit after the due date until grading begins (typically 1-3 days after due date). The exact grading start time will not be announced. No submissions accepted after grading begins.

I strongly encourage submitting by the due date: the grading start time is not announced, and a late submission risks missing the cutoff entirely.


Peer Review (10 points)

You will review 2 classmates’ submissions. Completing your assigned peer reviews is worth 10 points. Evaluate:

  1. Do the MCP servers start and respond correctly?
  2. Does the LLM successfully orchestrate the pipeline?
  3. Is the code well-organized and documented?
  4. Does the pipeline comparison in the README provide meaningful insights?

Tips

  1. Start with one server - Get the extract server working before building the others
  2. Test tools independently - Use the MCP Inspector to test tools before connecting to the LLM
  3. Handle JSON carefully - MCP tools exchange data as strings; use JSON serialization
  4. Log LLM decisions - Print the LLM’s reasoning to understand its choices
  5. Plan for failures - The LLM may call tools incorrectly; add validation

Resources


Grading

Component Points Graded By
MCP Servers 45 Automated + Peers
LLM Orchestration 25 Peers
Documentation & Testing 20 Peers
Peer Review 10 Instructor
Total 100  

Academic Integrity

This is an individual assignment. You may discuss concepts with classmates, but all code must be your own. Document all collaboration and AI assistance in COLLABORATORS.md (see requirements above).

