Data Engineering at the University of Florida
- **Due:** Sunday, March 9, 2026 at 11:59 PM
- **Points:** 60 (50 implementation + 10 peer review)
- **Peer Review Due:** Thursday, March 12, 2026 at 11:59 PM
- **Submission:** GitHub repository + Canvas link
In this assignment, you will build a Retrieval-Augmented Generation (RAG) system that answers questions about research papers from this course. You will implement document chunking, vector storage, retrieval, and answer generation with citations.
Starter code is provided. Your task is to complete the implementation of key functions.
By completing this assignment, you will:

- Implement document chunking strategies
- Build a vector store and retrieve relevant chunks
- Generate answers grounded in retrieved context, with citations
- Evaluate retrieval quality with standard metrics

Clone your `cis6930sp26-assignment2` repository and install dependencies:

```bash
git clone https://github.com/YOUR_USERNAME/cis6930sp26-assignment2.git
cd cis6930sp26-assignment2

# Install dependencies
uv sync
```
Download at least 5 PDF documents (research papers, documentation, etc.) and place them in the `papers/` directory:

```
papers/
├── lewis2020rag.pdf    # Required: Original RAG paper
├── wei2022cot.pdf      # Required: Chain-of-Thought paper
├── paper3.pdf
├── paper4.pdf
└── paper5.pdf
```
```bash
cp .env.example .env
# Edit .env with your API keys
```
The starter code provides the structure. You need to implement the functions marked with `# TODO`.
### `chunker.py` (10 points)

Implement document chunking:

```python
def chunk_document(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """
    Split a document into overlapping chunks.

    Args:
        text: The document text to chunk
        chunk_size: Maximum characters per chunk
        overlap: Number of characters to overlap between chunks

    Returns:
        List of text chunks

    TODO: Implement this function.
    - Split on sentence boundaries when possible
    - Ensure chunks don't exceed chunk_size
    - Include overlap characters from previous chunk
    """
    raise NotImplementedError("Implement chunk_document")
```
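As a hint, the character-window core of this function (ignoring sentence boundaries, which your solution must add) can be sketched as follows. `naive_chunk` is an illustrative name, not part of the starter code:

```python
def naive_chunk(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Fixed-size character windows with overlap; no sentence-boundary logic."""
    step = chunk_size - overlap  # each window starts this far past the previous one
    chunks = []
    # Stop before a start position that would yield only already-seen overlap text
    for start in range(0, max(len(text) - overlap, 1), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks
```

Your real `chunk_document` should prefer to cut at sentence boundaries rather than at arbitrary character positions.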
```python
def chunk_by_paragraphs(text: str, max_chunk_size: int = 1000) -> list[str]:
    """
    Split a document by paragraphs, merging small paragraphs.

    Args:
        text: The document text to chunk
        max_chunk_size: Maximum characters per chunk

    Returns:
        List of text chunks (each containing one or more paragraphs)

    TODO: Implement this function.
    - Split on double newlines (paragraphs)
    - Merge consecutive small paragraphs until max_chunk_size
    - Don't split paragraphs mid-text
    """
    raise NotImplementedError("Implement chunk_by_paragraphs")
```
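One greedy way to do the merge step, shown here as a sketch under illustrative names (a single oversized paragraph is kept whole, matching the "don't split paragraphs" rule):

```python
def merge_paragraphs(text: str, max_chunk_size: int = 1000) -> list[str]:
    """Greedily pack consecutive paragraphs until the size cap is reached."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chunk_size or not current:
            current = candidate      # still fits, or a lone oversized paragraph
        else:
            chunks.append(current)   # flush the full chunk, start a new one
            current = para
    if current:
        chunks.append(current)
    return chunks
```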
### `vectorstore.py` (15 points)

Implement vector storage and retrieval:

```python
def create_vectorstore(chunks: list[str], metadatas: list[dict]) -> Chroma:
    """
    Create a Chroma vector store from document chunks.

    Args:
        chunks: List of text chunks
        metadatas: List of metadata dicts (one per chunk)

    Returns:
        Chroma vector store instance

    TODO: Implement this function.
    - Use sentence-transformers for embeddings (all-MiniLM-L6-v2)
    - Store chunks with their metadata
    - Persist to ./chroma_db directory
    """
    raise NotImplementedError("Implement create_vectorstore")
```
```python
def retrieve(vectorstore: Chroma, query: str, k: int = 3) -> list[Document]:
    """
    Retrieve the top-k most relevant chunks for a query.

    Args:
        vectorstore: The Chroma vector store
        query: The search query
        k: Number of documents to retrieve

    Returns:
        List of Document objects with page_content and metadata

    TODO: Implement this function.
    - Use similarity search
    - Return top k results
    """
    raise NotImplementedError("Implement retrieve")
```
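Conceptually, similarity search embeds the query and ranks the stored chunk embeddings by similarity. Stripped of Chroma and real embeddings, the ranking step looks like this (the toy 2-D vectors below stand in for sentence-transformer embeddings; all names are illustrative):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 3) -> list[str]:
    """Return the k chunk IDs most similar to the query vector."""
    ranked = sorted(chunk_vecs, key=lambda cid: cosine(query_vec, chunk_vecs[cid]), reverse=True)
    return ranked[:k]
```

In the assignment itself, Chroma does this ranking for you; your job is just to call its search API correctly.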
```python
def retrieve_with_scores(vectorstore: Chroma, query: str, k: int = 3) -> list[tuple[Document, float]]:
    """
    Retrieve top-k chunks with their similarity scores.

    Args:
        vectorstore: The Chroma vector store
        query: The search query
        k: Number of documents to retrieve

    Returns:
        List of (Document, score) tuples, sorted by relevance

    TODO: Implement this function.
    - Use similarity_search_with_score
    - Return documents with their scores
    """
    raise NotImplementedError("Implement retrieve_with_scores")
```
### `generator.py` (15 points)

Implement answer generation:

```python
def generate_answer(query: str, context_docs: list[Document], llm) -> str:
    """
    Generate an answer based on retrieved context.

    Args:
        query: The user's question
        context_docs: Retrieved documents to use as context
        llm: The language model to use

    Returns:
        Generated answer string

    TODO: Implement this function.
    - Format the context documents into the prompt
    - Include source attribution instructions
    - Call the LLM and return the response
    """
    raise NotImplementedError("Implement generate_answer")
```
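One common shape for the "format the context into the prompt" step is sketched below. The function name and prompt wording are illustrative, not part of the starter API; your version should pass the resulting string to your LLM:

```python
def build_prompt(query: str, contexts: list[str]) -> str:
    """Concatenate retrieved chunks into a grounded-answer prompt with numbered sources."""
    context_block = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(contexts, start=1))
    return (
        "Answer the question using ONLY the sources below. "
        "Cite sources by their bracketed number.\n\n"
        f"Sources:\n{context_block}\n\n"
        f"Question: {query}\nAnswer:"
    )
```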
```python
def generate_answer_with_citations(query: str, context_docs: list[Document], llm) -> dict:
    """
    Generate an answer with explicit citations to source documents.

    Args:
        query: The user's question
        context_docs: Retrieved documents to use as context
        llm: The language model to use

    Returns:
        Dictionary with:
        - "answer": The generated answer text
        - "citations": List of cited source documents

    TODO: Implement this function.
    - Number each source in the prompt (e.g., [1], [2])
    - Instruct the LLM to cite sources by number
    - Parse citations from the response
    - Return structured output
    """
    raise NotImplementedError("Implement generate_answer_with_citations")
```
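For the "parse citations from the response" step, a regular expression over bracketed numbers is usually enough. The helper below is a sketch (its name is not part of the starter code); note that it discards citation numbers outside the valid source range, since LLMs sometimes hallucinate them:

```python
import re

def extract_citations(answer: str, num_sources: int) -> list[int]:
    """Pull bracketed citation numbers like [1], [2] out of an LLM answer."""
    cited = {int(m) for m in re.findall(r"\[(\d+)\]", answer)}
    return sorted(n for n in cited if 1 <= n <= num_sources)  # drop out-of-range numbers
```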
### `evaluate.py` (10 points)

Implement evaluation metrics:

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: list[str], k: int) -> float:
    """
    Calculate precision@k for retrieval evaluation.

    Args:
        retrieved_ids: List of retrieved document IDs (in order)
        relevant_ids: List of actually relevant document IDs
        k: Number of top results to consider

    Returns:
        Precision@k score (0.0 to 1.0)

    TODO: Implement this function.
    - Consider only the top k retrieved documents
    - Calculate: (relevant docs in top k) / k
    """
    raise NotImplementedError("Implement precision_at_k")
```
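For intuition: if the top-3 retrieved IDs are `["d1", "d7", "d2"]` and the relevant set is `{"d2", "d9"}`, only one of the top 3 is relevant, so precision@3 = 1/3. A direct reference implementation of the formula above, which you can use to sanity-check your own version:

```python
def precision_at_k_ref(retrieved_ids: list[str], relevant_ids: list[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    relevant = set(relevant_ids)               # set membership check is O(1)
    top_k = retrieved_ids[:k]                  # consider only the first k results
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    return hits / k
```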
```python
def mean_reciprocal_rank(queries_results: list[tuple[list[str], str]]) -> float:
    """
    Calculate Mean Reciprocal Rank (MRR) across multiple queries.

    Args:
        queries_results: List of (retrieved_ids, first_relevant_id) tuples

    Returns:
        MRR score (0.0 to 1.0)

    TODO: Implement this function.
    - For each query, find rank of first relevant document
    - Reciprocal rank = 1/rank (or 0 if not found)
    - Return mean across all queries
    """
    raise NotImplementedError("Implement mean_reciprocal_rank")
```
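Worked example: if query 1 finds its relevant document at rank 1 (RR = 1.0), query 2 at rank 3 (RR = 1/3), and query 3 never finds it (RR = 0), then MRR = (1 + 1/3 + 0) / 3 = 4/9 ≈ 0.444. A reference implementation for sanity-checking:

```python
def mrr_ref(queries_results: list[tuple[list[str], str]]) -> float:
    """Mean over queries of 1/rank of the first relevant document (0 if absent)."""
    if not queries_results:
        return 0.0
    total = 0.0
    for retrieved_ids, first_relevant_id in queries_results:
        if first_relevant_id in retrieved_ids:
            # list.index is 0-based; ranks are 1-based
            total += 1.0 / (retrieved_ids.index(first_relevant_id) + 1)
    return total / len(queries_results)
```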
After implementing the functions:
```bash
# Run tests (do this first!)
uv run pytest

# Index the papers
uv run python index.py

# Query the system
uv run python query.py "What is retrieval augmented generation?"

# Run evaluation
uv run python run_evaluation.py
```
**Implementation (40 points)**

| Component | Points | Description |
|---|---|---|
| Chunking (`chunker.py`) | 10 | Both chunking functions work correctly |
| Vector Store (`vectorstore.py`) | 15 | Vector store creation and retrieval work |
| Generation (`generator.py`) | 15 | Answer generation with citations works |

**Evaluation (10 points)**

| Component | Points | Description |
|---|---|---|
| Metrics (`evaluate.py`) | 5 | Metrics implemented correctly |
| Evaluation Results | 5 | Run evaluation on 5+ test queries, report results in README |

**Peer Review (10 points)**

| Component | Points | Description |
|---|---|---|
| Complete 2 peer reviews | 10 | Submit reviews as GitHub Issues by March 12 |
Your README must include:

1. **Setup:** How to install dependencies and run the system.
2. **Evaluation results:** Report your retrieval evaluation results:

   ```markdown
   ## Evaluation Results

   | Metric | Score |
   |--------|-------|
   | Precision@3 | 0.XX |
   | Precision@5 | 0.XX |
   | MRR | 0.XX |

   ### Test Queries Used

   1. "What is chain-of-thought prompting?"
   2. "How does RAG reduce hallucination?"
   3. ...
   ```

3. **Example queries:** Show 2-3 example queries with the system's answers and citations.
4. **Chunking discussion:** Briefly explain your chunking approach and any experiments you tried.
5. **Collaboration:** Document all collaboration and AI assistance (required).
```
cis6930sp26-assignment2/
├── papers/              # Your PDF papers (not committed)
│   ├── lewis2020rag.pdf
│   └── ...
├── tests/
│   ├── test_chunker.py
│   ├── test_vectorstore.py
│   ├── test_generator.py
│   └── test_evaluate.py
├── chroma_db/           # Generated vector store
├── chunker.py           # TODO: Implement chunking
├── vectorstore.py       # TODO: Implement vector store
├── generator.py         # TODO: Implement generation
├── evaluate.py          # TODO: Implement evaluation
├── index.py             # Provided: Indexing script
├── query.py             # Provided: Query script
├── run_evaluation.py    # Provided: Evaluation script
├── .env.example
├── .gitignore
├── COLLABORATORS.md
├── README.md
└── pyproject.toml
```
To submit, push your work to your `cis6930sp26-assignment2` repository, add `cegme` as an Admin collaborator, and tag your release:

```bash
git tag v1.0
git push origin v1.0
```

Then submit the repository link on Canvas.
You will review 2 classmates’ submissions. Peer reviews are assigned in Canvas after the submission deadline.
Due: Thursday, March 12, 2026 at 11:59 PM
For each submission you review, check:

- Does the project install and pass tests? (`uv sync` and `uv run pytest`)
- Does indexing complete? (`uv run python index.py`)
- Does querying return a relevant, cited answer? (`uv run python query.py "test question"`)

Submit your review as a GitHub Issue on the repository you're reviewing. Use this template:
```markdown
## Peer Review

**Reviewer:** [Your Name]

### Functionality
- [ ] Tests pass
- [ ] Indexing works
- [ ] Querying returns relevant answers with citations

### Code Quality
- [ ] Chunking implemented correctly
- [ ] Vector store operations work
- [ ] Answer generation includes citations

### Documentation
- [ ] README has evaluation results
- [ ] Example queries shown
- [ ] Chunking strategy explained

### Comments
[Your constructive feedback here]

### Score Suggestion
[X/50] - Brief justification
```
| Component | Points |
|---|---|
| Implementation | 40 |
| Evaluation | 10 |
| Peer Review Completion | 10 |
| Total | 60 |
**Tip:** Work incrementally. Implement and test one module at a time; for example, confirm `pytest tests/test_chunker.py` passes before moving on.

**Academic honesty:** This is an individual assignment. You may discuss concepts with classmates, but all code must be your own. Document all collaboration in `COLLABORATORS.md`.