Data Engineering at the University of Florida
This document maps required and optional readings to each lecture in the course.
No readings - infrastructure focus
Lecture: MCP Fundamentals, Building MCP Servers, Multi-agent Pipelines
| Type | Paper | Link | Summary |
|---|---|---|---|
| Required | Model Context Protocol Specification | MCP Docs | Official specification for MCP, covering core concepts, architecture, and protocol design. Essential for understanding how MCP enables communication between LLMs and external tools/data sources. |
| Required | MCP Quickstart Guide | MCP Quickstart | Hands-on guide to building your first MCP server. Covers server creation, tool registration, and client integration. |
| Optional | Building MCP Servers Tutorial | MCP Servers | Detailed tutorial on implementing custom MCP servers with examples. |
| Optional | Multi-Agent Orchestration Patterns | MCP Patterns | Architectural patterns for building multi-agent systems with MCP. |
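MCP layers tool invocation over JSON-RPC 2.0. As a rough illustration of the protocol framing (not the official SDK — the `add` tool and the registry below are hypothetical, and real servers register tools through the SDK), a `tools/call` request can be dispatched like this:

```python
import json

# Hypothetical tool registry: name -> callable. Real MCP servers
# register tools via the SDK rather than a plain dict.
TOOLS = {"add": lambda a, b: a + b}

def handle_tools_call(request_json: str) -> str:
    """Dispatch a JSON-RPC 2.0 'tools/call' request to a registered tool."""
    req = json.loads(request_json)
    params = req["params"]
    result = TOOLS[params["name"]](**params["arguments"])
    # MCP tool results come back as typed content blocks.
    response = {
        "jsonrpc": "2.0",
        "id": req["id"],
        "result": {"content": [{"type": "text", "text": str(result)}]},
    }
    return json.dumps(response)

request = json.dumps({
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "add", "arguments": {"a": 2, "b": 3}},
})
reply = json.loads(handle_tools_call(request))
```

The quickstart readings cover the real server/client plumbing; this sketch only shows the request/response shape a server ultimately answers.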
Lecture: Prompt engineering fundamentals, Chain-of-Thought, Structured Outputs
| Type | Paper | Link | Summary |
|---|---|---|---|
| Required | Chain-of-Thought Prompting Elicits Reasoning (Wei et al., 2022) | arXiv:2201.11903 | Demonstrates that including reasoning steps in prompts enables LLMs to solve complex arithmetic, commonsense, and symbolic reasoning tasks. A 540B-parameter model with 8 CoT examples achieved SOTA on math word problems. Foundational work for understanding prompting techniques. |
| Optional | The Prompt Report: A Systematic Survey of Prompting Techniques | arXiv:2406.06608 | Comprehensive taxonomy of 58 prompting techniques and 33 vocabulary terms. Use as a reference guide. If reading, focus on Sections 1-3 (Introduction, Taxonomy, Core Techniques) only - the full survey is extensive. |
| Optional | Large Language Models are Zero-Shot Reasoners (Kojima et al., 2022) | arXiv:2205.11916 | Shows that simply adding “Let’s think step by step” improves reasoning performance dramatically (+61 percentage points on MultiArith). |
| Optional | Tree of Thoughts: Deliberate Problem Solving with LLMs | arXiv:2305.10601 | Extends CoT by allowing exploration of multiple reasoning paths with backtracking. Achieved 74% success on Game of 24 compared to 4% with standard CoT prompting. |
| Optional | Graph of Thoughts: Solving Elaborate Problems with LLMs | arXiv:2308.09687 | Models reasoning as an arbitrary graph with aggregation and refinement. Achieves 62% better quality than ToT on sorting while reducing costs by 31%. |
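The two prompting styles compared in these readings differ only in how the prompt is assembled. A minimal sketch, using the tennis-ball exemplar popularized by Wei et al. and the zero-shot trigger from Kojima et al.:

```python
# Few-shot CoT prepends worked examples that show explicit reasoning.
COT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
    "How many balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
)

def few_shot_cot(question: str) -> str:
    """Chain-of-Thought prompting: exemplars with reasoning steps."""
    return COT_EXEMPLAR + f"Q: {question}\nA:"

def zero_shot_cot(question: str) -> str:
    """Zero-shot CoT: append a reasoning trigger instead of exemplars."""
    return f"Q: {question}\nA: Let's think step by step."

prompt = zero_shot_cot("If I have 3 apples and eat 1, how many remain?")
```

Tree of Thoughts and Graph of Thoughts then generalize this from a single reasoning chain to branching and graph-structured search over such prompts.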
Discussion: How to read research papers
| Type | Paper | Link | Summary |
|---|---|---|---|
| Required | How to Read a Paper (Keshav) | | Classic 3-pass method for reading research papers efficiently: first pass for overview (5 min), second for understanding (1 hour), third for deep comprehension. |
| Required | LinkTransformer: A Unified Package for Record Linkage with Transformer Language Models (Arora & Dell, 2024) | ACL 2024 Demo | Open-source package making transformer-based record linkage accessible without deep learning expertise. Treats linkage as text retrieval using sentence embeddings. Used as the in-class 3-pass reading exercise. |
Lecture: Data integration (reuse lecture7)
Lecture: Entity resolution (reuse lecture9)
| Type | Paper | Link | Summary |
|---|---|---|---|
| Required | LinkTransformer: Record Linkage with Transformer LMs | ACL 2024 Demo | Open-source package making transformer-based record linkage accessible without deep learning expertise. Outperforms string matching methods by a wide margin and supports multiple languages. |
| Optional | Entity Resolution in Voice Interfaces (EMNLP 2024 Industry) | EMNLP 2024 | Industry application of entity resolution in voice assistant systems. |
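LinkTransformer's core move is treating record linkage as nearest-neighbor retrieval in embedding space. A runnable toy sketch of that retrieval logic, with a character-trigram vector standing in for the transformer sentence embedding (that substitution is mine; the package uses learned embeddings):

```python
import math
from collections import Counter

def embed(s: str) -> Counter:
    """Toy character-trigram 'embedding' standing in for a sentence
    transformer; only the retrieval step mirrors the real approach."""
    s = f"  {s.lower()}  "
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[k] * v[k] for k in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def link(query: str, candidates: list[str]) -> str:
    """Linkage as nearest-neighbor retrieval in embedding space."""
    return max(candidates, key=lambda c: cosine(embed(query), embed(c)))

match = link("Intl. Business Machines",
             ["International Business Machines", "Burger King", "Boeing Co."])
```

With real transformer embeddings, semantically equivalent records match even without surface overlap, which is where the gains over string matching come from.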
Guest Speaker: Dr. Shiree Hughes (Monday, Feb 9) - Big Data at General Motors
Guest Speaker: Mikhail Sinanan (Friday, Feb 13) - Data Engineering at Spotify
| Type | Resource | Link | Summary |
|---|---|---|---|
| Required | Apache Kafka Introduction | Kafka Intro | Official introduction to Kafka’s distributed event streaming platform. Covers producers, consumers, topics, and how Kafka handles real-time data feeds at scale. |
| Required | Apache Spark Quick Start | Spark Quick Start | Official getting started guide for Spark. Introduces the API through the interactive shell and shows how Spark's in-memory processing can run up to 100x faster than MapReduce. |
| Required | Hadoop Tutorial Overview | GeeksforGeeks Hadoop | Overview of Hadoop ecosystem: HDFS for distributed storage, MapReduce for processing, and YARN for resource management. |
| Optional | Apache Kafka for Beginners | DataCamp Kafka | Comprehensive beginner guide covering Kafka architecture, brokers, partitions, and consumer groups. |
| Optional | Spark By Examples | Spark By Examples | Hands-on Spark tutorials with code examples in Scala and PySpark. |
| Optional | Building Blocks of Hadoop | Pluralsight | Deep dive into HDFS, MapReduce, and YARN architecture. |
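Kafka's per-key ordering guarantee comes from routing every record with the same key to the same partition. A stdlib-only sketch of that assignment (real Kafka producers hash keys with murmur2; `crc32` stands in here to keep the example dependency-free):

```python
import zlib

NUM_PARTITIONS = 3

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Deterministic key -> partition mapping. Kafka's default
    partitioner uses a murmur2 hash; crc32 is a stand-in."""
    return zlib.crc32(key.encode()) % num_partitions

# All records with the same key land in the same partition, which is
# what gives Kafka its per-key ordering guarantee.
events = [("user-42", "login"), ("user-7", "click"), ("user-42", "logout")]
partitions: dict[int, list] = {}
for key, value in events:
    partitions.setdefault(partition_for(key), []).append((key, value))
```

Consumer groups then divide these partitions among consumers, which is how Kafka parallelizes consumption while preserving order within each key.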
| Type | Resource | Link | Summary |
|---|---|---|---|
| Required | Spotify’s Data Platform Explained | Spotify Engineering | Overview of Spotify’s data infrastructure processing 1.4 trillion data points daily. Covers data collection, processing, and management architecture. |
| Required | NerdOut@Spotify: A Trillion Events | Spotify Podcast | 38-minute podcast on handling 50M events/second, Kafka to cloud transition, and data quality at scale. |
Lecture: ML fundamentals (reuse lecture11 - PyTorch)
Lecture: Embeddings (reuse lecture14)
| Type | Paper | Link | Summary |
|---|---|---|---|
| Optional | Word2Vec (Mikolov et al., 2013) | arXiv:1301.3781 | Introduces efficient architectures for learning word vectors from large datasets. Achieved SOTA on syntactic and semantic word similarity while training on 1.6B words in less than a day. Foundational work for modern embeddings. |
| Optional | MTEB: Massive Text Embedding Benchmark | arXiv:2210.07316 | Comprehensive benchmark spanning 8 embedding tasks, 58 datasets, and 112 languages. Reveals that no single embedding method dominates across all tasks. Essential for understanding embedding evaluation. |
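Word2Vec's best-known property is that analogies become vector arithmetic. The 2-D vectors below are hand-made purely for illustration (real learned embeddings have hundreds of dimensions), but the arithmetic is the same:

```python
import math

# Hand-made toy vectors, invented for illustration only.
VECS = {
    "king":  [0.9, 0.8],
    "queen": [0.9, 0.2],
    "man":   [0.1, 0.8],
    "woman": [0.1, 0.2],
    "apple": [0.5, 0.5],
}

def nearest(target: list[float], exclude: set[str]) -> str:
    """Nearest word to `target` by cosine similarity."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.hypot(*u) * math.hypot(*v))
    return max((w for w in VECS if w not in exclude),
               key=lambda w: cos(VECS[w], target))

# The analogy "king - man + woman" should land near "queen".
target = [k - m + w for k, m, w in
          zip(VECS["king"], VECS["man"], VECS["woman"])]
answer = nearest(target, exclude={"king", "man", "woman"})
```

MTEB's finding that no single embedding dominates is worth keeping in mind here: which geometric regularities an embedding captures depends on how and on what it was trained.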
Lecture: RAG architecture overview (Day 18)
Demo: Vector databases (Chroma, FAISS) (Day 19)
| Type | Paper | Link | Summary |
|---|---|---|---|
| Required | RAG: Retrieval-Augmented Generation for Knowledge-Intensive NLP (Lewis et al., 2020) | arXiv:2005.11401 | Foundational RAG paper combining parametric and non-parametric memory using Wikipedia retrieval. Achieved SOTA on open-domain QA and generates more factual, specific content than pure parametric models. |
| Required | Enhancing RAG: A Study of Best Practices | COLING 2025 | Systematic study of RAG best practices and optimization strategies. Covers chunking, retrieval, and generation optimization. |
| Recommended | GraphRAG (Microsoft) | arXiv:2404.16130 | Addresses RAG limitations for global queries by building entity knowledge graphs with community summaries. Shows substantial improvements in answer comprehensiveness and diversity for analytical questions. Read if interested in advanced RAG. |
| Optional | RAG Survey: Comprehensive Survey of Architectures (2025) | arXiv:2506.00054 | Comprehensive survey covering granularity-aware retrieval, robustness, and RAG frontiers. Good reference for understanding the RAG landscape. |
| Optional | Agentic RAG Survey | arXiv:2501.09136 | Survey on autonomous agents in RAG pipelines. Covers reflection, planning, tool use, and multi-agent collaboration patterns. |
| Optional | Introduction to Information Retrieval (Ch. 6-7: Vector Space Model) | Stanford IR Book | Classic textbook chapters on vector space models and similarity search. |
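The retrieve-then-generate loop at the heart of RAG fits in a few lines. A minimal sketch with a bag-of-words vector standing in for a learned embedding model (the corpus and query are invented; Chroma and FAISS replace the linear scan below with approximate nearest-neighbor indexes):

```python
import math
import re
from collections import Counter

DOCS = [
    "The capital of France is Paris.",
    "Spark processes data in memory.",
    "Kafka is a distributed event streaming platform.",
]

def embed(text: str) -> Counter:
    """Bag-of-words stand-in for a learned embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[t] * v[t] for t in u)
    norm = lambda w: math.sqrt(sum(c * c for c in w.values()))
    return dot / (norm(u) * norm(v))

def retrieve(query: str, k: int = 1) -> list[str]:
    """Rank the corpus by similarity to the query; keep the top k."""
    q = embed(query)
    return sorted(DOCS, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query: str) -> str:
    """Retrieve-then-generate: stuff the top passages into the prompt."""
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

prompt = build_prompt("What is the capital of France?")
```

The best-practices and survey readings are largely about tuning each stage of this loop: how documents are chunked, how many passages to retrieve, and how the context is presented to the generator.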
Tutorials & Implementation Guides:
Additional RAG Variants (Reference): For students interested in exploring more RAG approaches, see these papers:
Lecture: LLM Evaluation Fundamentals (Day 20)
Demo: RAGAS Evaluation (Day 21)
| Type | Paper | Link | Summary |
|---|---|---|---|
| Required | RAGAS: Automated Evaluation of Retrieval Augmented Generation | EACL 2024 | Reference-free framework for evaluating RAG systems across three dimensions: retrieval relevance, LLM faithfulness, and generation quality. Enables faster evaluation cycles without ground truth annotations. |
| Required | In Benchmarks We Trust… Or Not? | EMNLP 2025 | Critical examination of benchmark reliability and limitations. |
| Recommended | Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (Zheng et al., 2023) | arXiv:2306.05685 | Introduces the LLM-as-Judge paradigm for scalable evaluation. GPT-4 as judge achieves >80% agreement with human preferences. Foundational work for automated LLM evaluation. |
| Optional | Chatbot Arena: An Open Platform for Evaluating LLMs (Chiang et al., 2024) | arXiv:2403.04132 | Crowd-sourced pairwise comparison platform with 1M+ votes. Demonstrates that human pairwise comparisons correlate strongly with expert judgments. |
| Optional | MMLU: Measuring Massive Multitask Language Understanding | arXiv:2009.03300 | Comprehensive benchmark covering 57 academic and professional domains. Found that even the best models fall short of expert-level performance, especially on socially critical subjects like law and morality. |
| Optional | MMLU-Pro: A More Robust and Challenging Multi-Task Benchmark (Wang et al., 2024) | arXiv:2406.01574 | Harder MMLU with 10 answer choices instead of 4. Accuracy drops 16-33% vs original MMLU; reduces prompt sensitivity and benchmark saturation. |
| Optional | BERTScore: Evaluating Text Generation with BERT (Zhang et al., 2020) | arXiv:1904.09675 | Embedding-based evaluation that captures paraphrases and synonyms. Correlates better with human judgment than n-gram metrics like BLEU and ROUGE. |
| Optional | Evaluation of LLMs Should Not Ignore Non-Determinism | NAACL 2025 | Examines the impact of LLM non-determinism on evaluation reliability. |
| Optional | Examining Robustness of LLM Evaluation | ACL 2024 | Studies robustness issues in LLM evaluation methodologies. |
| Optional | Large Language Models are not Fair Evaluators (Wang et al., 2023) | arXiv:2305.17926 | Documents biases in LLM-as-Judge including position bias, verbosity bias, and self-enhancement bias. |
| Optional | NLP Evaluation in trouble (Sainz et al., 2023) | EMNLP Findings | Evidence of benchmark contamination where test data appears in training. Published benchmark scores may be inflated due to memorization. |
| Optional | Detecting Pretraining Data from LLMs (Shi et al., 2024) | arXiv:2310.16789 | Methods for detecting training data contamination including Min-k% Prob. Helps identify when models have memorized benchmark answers. |
| Optional | HalluLens: LLM Hallucination Benchmark | ACL 2025 | Benchmark specifically designed for measuring LLM hallucinations. |
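One common mitigation for the position bias documented by Wang et al. is to run the judge twice with the answer order swapped and count a win only when the two verdicts agree. A sketch with a deterministic stub in place of a real LLM judge (the stub and its tie-breaking bias are invented to make the effect visible):

```python
def stub_judge(first: str, second: str) -> str:
    """Stand-in judge that deliberately mimics position bias:
    ties between equally long answers go to the first slot."""
    if len(first) > len(second):
        return "first"
    if len(second) > len(first):
        return "second"
    return "first"  # position bias on ties

def debiased_winner(a: str, b: str) -> str:
    """Swap answer order between two judge calls; only consistent
    verdicts count as a win, otherwise declare a tie."""
    v1 = stub_judge(a, b)  # a in the first slot
    v2 = stub_judge(b, a)  # b in the first slot
    if v1 == "first" and v2 == "second":
        return "a"
    if v1 == "second" and v2 == "first":
        return "b"
    return "tie"           # inconsistent verdicts -> no winner

result = debiased_winner("short", "a much longer answer")
tied = debiased_winner("same!", "equal")
```

The order swap converts the stub's positional tie-break into a detected inconsistency rather than a spurious win, which is the point of the calibration.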
Classic N-gram Metrics (Reference):
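These metrics score surface overlap between candidate and reference text. A minimal clipped unigram precision, the core ingredient of BLEU (the example sentences are invented):

```python
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision: the fraction of candidate tokens that
    appear in the reference, crediting each reference token at most as
    many times as it occurs there."""
    cand = candidate.lower().split()
    ref = Counter(reference.lower().split())
    matched = sum(min(c, ref[t]) for t, c in Counter(cand).items())
    return matched / len(cand) if cand else 0.0

score = unigram_precision("the cat sat on the mat",
                          "the cat is on the mat")
```

Note that the synonym pair "sat"/"is" earns no credit here — exactly the limitation that motivates embedding-based metrics like BERTScore above.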
Focus: Assignment 2 completion, RAGAS demo, project work
No new readings - focus on applying Week 8 evaluation concepts to Assignment 2
⏸️ Spring Break follows (March 14-21)
Lecture: AI fairness and bias
Lecture: Human-in-the-loop systems
| Type | Paper | Link | Summary |
|---|---|---|---|
| Required | Bias and Fairness in Large Language Models: A Survey | Computational Linguistics 2024 | Comprehensive 80+ page review proposing taxonomies for evaluation metrics, datasets, and mitigation techniques for LLM bias. Organizes mitigation by intervention stage: pre-processing, in-training, intra-processing, and post-processing. |
| Required | Building Effective Agents (Anthropic, 2024) | Anthropic Research | Practical guide distinguishing workflows (predefined orchestration) from agents (dynamic LLM control). Presents 6 design patterns from augmented LLM to orchestrator-workers. Emphasizes simplicity, transparency, and careful tool design. |
| Optional | A Trip Towards Fairness: Bias and De-Biasing in LLMs | *SEM 2024 | Explores approaches to identifying and mitigating bias in LLMs. |
| Optional | Addressing Statistical and Causal Gender Fairness in NLP | NAACL 2024 Findings | Examines gender fairness from statistical and causal perspectives. |
| Optional | ReAct: Synergizing Reasoning and Acting in LLMs | arXiv:2210.03629 | Interleaves reasoning traces with actions for better task-solving. Overcomes hallucination by grounding in external knowledge, achieving 34% and 10% absolute improvements on decision-making benchmarks. |
| Optional | Efficient Agents: Building Effective Agents While Reducing Cost | arXiv:2508.02694 | Strategies for building cost-efficient agentic systems. |
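ReAct's contribution is interleaving reasoning traces with tool calls so each step is grounded in an observation. A sketch of that thought → action → observation loop, with a scripted "model" and a stub lookup tool (the script, the facts, and the tool are invented; a real agent samples each step from an LLM):

```python
def lookup(entity: str) -> str:
    """Stub knowledge tool with invented facts."""
    facts = {"Eiffel Tower": "located in Paris, 330 m tall"}
    return facts.get(entity, "no result")

# Scripted model outputs: (thought, (action, argument)) per step.
SCRIPT = [
    ("Thought: I need facts about the Eiffel Tower.",
     ("lookup", "Eiffel Tower")),
    ("Thought: I have the height, so I can answer.",
     ("finish", "330 m")),
]

def react_agent() -> str:
    """Thought -> action -> observation loop until a 'finish' action."""
    transcript = []
    for thought, (action, arg) in SCRIPT:
        transcript.append(thought)
        if action == "finish":
            return arg
        observation = lookup(arg)  # act, then ground on the result
        transcript.append(f"Observation: {observation}")
    return "no answer"

answer = react_agent()
```

Anthropic's workflow/agent distinction maps onto this loop directly: a workflow fixes the sequence of steps in advance, while an agent lets the model choose the next action from each observation.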
Lecture: Paper writing workshop
| Type | Paper | Link | Summary |
|---|---|---|---|
| Required | How to Write a Great Research Paper (Simon Peyton Jones) | Video | Classic talk on research paper writing: start with the idea, write early and often, structure for clarity. Emphasizes that writing is thinking. |
| Optional | The Science of Scientific Writing | | Cognitive principles for clear scientific writing: put old information before new, keep subjects and verbs close together. |
Lecture: Crowdsourcing and annotation (reuse lecture15)
| Type | Paper | Link | Summary |
|---|---|---|---|
| Required | HumEval 2024 Workshop Proceedings | ACL Anthology | Collection of papers on human evaluation of NLP systems. |
| Required | Capturing Perspectives of Crowdsourced Annotators | NAACL 2024 | Proposes AART to learn individual annotator representations rather than majority voting. Addresses fairness concerns for underrepresented perspectives in subjective classification tasks. |
| Optional | On Crowdsourcing Task Design for Discourse Annotation | COLING 2025 | Best practices for designing crowdsourcing annotation tasks. |
| Optional | Evaluating Saliency Explanations by Crowdsourcing | LREC-COLING 2024 | Using crowdsourcing to evaluate model explanations. |
Lecture: How to write good reviews
| Type | Paper | Link | Summary |
|---|---|---|---|
| Required | Advice for Peer Reviewers (ACL) | ACL Reviewer Guidelines | Official ACL guidelines for writing constructive, fair peer reviews. |
| Optional | NIPS Experiment (reviewing consistency) | arXiv:2109.09774 | Famous study showing 25% of papers had inconsistent accept/reject decisions when reviewed by different committees. Highlights subjectivity in peer review. |
Presentations and finals
Focus on project completion - minimal new readings
These are classic papers students should be aware of, organized by topic:
| Paper | Link | Summary |
|---|---|---|
| GPT-3: Language Models are Few-Shot Learners | arXiv:2005.14165 | Landmark paper demonstrating that scaling to 175B parameters enables strong few-shot performance without fine-tuning. Introduced the in-context learning paradigm. |
| Scaling Laws for Neural Language Models | arXiv:2001.08361 | Discovered predictable power-law relationships between model size, data, compute, and loss. Showed optimal training uses large models on modest data with early stopping. |
| LLaMA: Open and Efficient Foundation Language Models | arXiv:2302.13971 | Demonstrated SOTA models can be trained using only public data. LLaMA-13B outperforms GPT-3. Made models available to researchers, catalyzing open-source LLM development. |
| Paper | Link | Summary |
|---|---|---|
| MMLU Pro | arXiv:2406.01574 | More challenging MMLU with 10 options instead of 4 and reasoning-focused questions. Accuracy drops 16-33% vs original; reduced prompt sensitivity. Better tracks AI progress. |
| SWE-Bench: Evaluating LLMs on Real-World Software Issues | arXiv:2310.06770 | 2,294 real GitHub issues requiring multi-file code changes. Best model (Claude 2) solved only 1.96%, revealing the gap between code generation and software engineering capabilities. |
| IFEval: Instruction-Following Evaluation | arXiv:2311.07911 | Benchmark using verifiable instructions (word count, keywords) for objective evaluation. 25 instruction types across 500 prompts. Avoids biases of LLM-based evaluation. |
| Paper | Link | Summary |
|---|---|---|
| The Stack: Code Dataset | arXiv:2211.15533 | 3.1 TB of permissively licensed code in 30 languages. Shows deduplication improves model performance. Includes opt-out mechanism for developers. |
| HumanEval: Evaluating Code Generation | arXiv:2107.03374 | Introduces Codex and the HumanEval benchmark for code synthesis from docstrings. Codex achieved 28.8% pass@1; repeated sampling solves 70% of problems. |
| Paper | Link | Summary |
|---|---|---|
| LoRA: Low-Rank Adaptation of LLMs | arXiv:2106.09685 | Efficient fine-tuning by freezing base weights and training low-rank matrices. Reduces trainable parameters 10,000x and GPU memory 3x while matching or exceeding full fine-tuning. |
| DPO: Direct Preference Optimization | arXiv:2305.18290 | Simplifies RLHF to a classification loss by deriving the optimal policy in closed form. More stable and efficient training while achieving comparable or better alignment than RLHF. |
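The parameter savings behind LoRA are simple arithmetic: instead of training a full d x d weight update, train two low-rank factors B (d x r) and A (r x d). A quick sketch with a hidden size and rank chosen for illustration:

```python
# Illustrative sizes: d is a typical transformer hidden size, r a
# common low rank; both are assumptions for this example.
d, r = 4096, 8

full_params = d * d        # dense delta-W for one weight matrix
lora_params = 2 * d * r    # B (d x r) plus A (r x d)
reduction = full_params / lora_params
```

At these sizes the dense update has 16.8M parameters versus 65K for the LoRA factors, a 256x reduction for this single matrix; the paper's 10,000x figure is for the whole model, where most weight matrices are not adapted at all.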
Last updated: March 2026
Source: latent.space 2025 reading list + ACL Anthology 2024-2025