Data Engineering at the University of Florida
This document maps required and optional readings to each lecture in the course.
No readings - infrastructure focus
Lecture: MCP Fundamentals, Building MCP Servers, Multi-agent Pipelines
| Type | Paper | Link | Summary |
|---|---|---|---|
| Required | Model Context Protocol Specification | MCP Docs | Official specification for MCP, covering core concepts, architecture, and protocol design. Essential for understanding how MCP enables communication between LLMs and external tools/data sources. |
| Required | MCP Quickstart Guide | MCP Quickstart | Hands-on guide to building your first MCP server. Covers server creation, tool registration, and client integration. |
| Optional | Building MCP Servers Tutorial | MCP Servers | Detailed tutorial on implementing custom MCP servers with examples. |
| Optional | Multi-Agent Orchestration Patterns | MCP Patterns | Architectural patterns for building multi-agent systems with MCP. |
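To make the protocol concrete before reading the spec: MCP messages are JSON-RPC 2.0, and a client invokes a server tool via a `tools/call` request. The sketch below shows the message shape as I understand it from the specification; the tool name `get_weather` and its arguments are illustrative placeholders, not part of MCP itself.

```python
import json

# Sketch of the JSON-RPC 2.0 request an MCP client sends to invoke a
# server-registered tool. The tool name and arguments are hypothetical.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "get_weather",                 # hypothetical tool on the server
        "arguments": {"city": "Gainesville"},
    },
}

# MCP transports (e.g. stdio) carry these messages as serialized JSON.
wire = json.dumps(request)
decoded = json.loads(wire)
print(decoded["method"])   # → tools/call
```

The quickstart readings cover registering such tools on a real server; the point here is only that "tools" are ordinary JSON-RPC methods with named, typed arguments.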
Lecture: Prompt engineering fundamentals, Chain-of-Thought, Structured Outputs
| Type | Paper | Link | Summary |
|---|---|---|---|
| Required | Chain-of-Thought Prompting Elicits Reasoning (Wei et al., 2022) | arXiv:2201.11903 | Demonstrates that including reasoning steps in prompts enables LLMs to solve complex arithmetic, commonsense, and symbolic reasoning tasks. A 540B-parameter model with 8 CoT examples achieved SOTA on math word problems. Foundational work for understanding prompting techniques. |
| Optional | The Prompt Report: A Systematic Survey of Prompting Techniques | arXiv:2406.06608 | Comprehensive taxonomy of 58 prompting techniques and 33 vocabulary terms. Use as a reference guide. If reading, focus on Sections 1-3 (Introduction, Taxonomy, Core Techniques) only - the full survey is extensive. |
| Optional | Large Language Models are Zero-Shot Reasoners (Kojima et al., 2022) | arXiv:2205.11916 | Shows that simply adding “Let’s think step by step” improves reasoning performance dramatically (+61 percentage points on MultiArith). |
| Optional | Tree of Thoughts: Deliberate Problem Solving with LLMs | arXiv:2305.10601 | Extends CoT by allowing exploration of multiple reasoning paths with backtracking. Achieved 74% success on Game of 24 compared to 4% with standard CoT prompting. |
| Optional | Graph of Thoughts: Solving Elaborate Problems with LLMs | arXiv:2308.09687 | Models reasoning as an arbitrary graph with aggregation and refinement. Achieves 62% better quality than ToT on sorting while reducing costs by 31%. |
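The core intervention in the Kojima et al. paper is small enough to show inline: zero-shot chain-of-thought is literally a string appended to the prompt. A minimal sketch (no model call, just prompt construction):

```python
question = ("A juggler has 16 balls. Half are golf balls, and half of the "
            "golf balls are blue. How many blue golf balls are there?")

# Standard zero-shot prompt: just the question.
standard_prompt = f"Q: {question}\nA:"

# Zero-shot chain-of-thought (Kojima et al., 2022): append the trigger
# phrase so the model emits intermediate reasoning before its answer.
cot_prompt = f"Q: {question}\nA: Let's think step by step."

print(cot_prompt)
```

Few-shot CoT (Wei et al.) works the same way, except the prompt is prefixed with worked examples that include their reasoning steps.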
Discussion: How to read research papers
| Type | Paper | Link | Summary |
|---|---|---|---|
| Required | How to Read a Paper (Keshav) | | Classic 3-pass method for reading research papers efficiently: first pass for overview (5 min), second for understanding (1 hour), third for deep comprehension. |
| Required | LinkTransformer: A Unified Package for Record Linkage with Transformer Language Models (Arora & Dell, 2024) | ACL 2024 Demo | Open-source package making transformer-based record linkage accessible without deep learning expertise. Treats linkage as text retrieval using sentence embeddings. Used as the in-class 3-pass reading exercise. |
Lecture: Data integration (reuse lecture7)
Lecture: Entity resolution (reuse lecture9)
| Type | Paper | Link | Summary |
|---|---|---|---|
| Required | LinkTransformer: Record Linkage with Transformer LMs | ACL 2024 Demo | Open-source package making transformer-based record linkage accessible without deep learning expertise. Outperforms string matching methods by a wide margin and supports multiple languages. |
| Optional | Entity Resolution in Voice Interfaces (EMNLP 2024 Industry) | EMNLP 2024 | Industry application of entity resolution in voice assistant systems. |
Guest Speaker: Dr. Shiree Hughes (Monday, Feb 9) - Big Data at General Motors
Guest Speaker: Mikhail Sinanan (Friday, Feb 13) - Data Engineering at Spotify
| Type | Resource | Link | Summary |
|---|---|---|---|
| Required | Apache Kafka Introduction | Kafka Intro | Official introduction to Kafka’s distributed event streaming platform. Covers producers, consumers, topics, and how Kafka handles real-time data feeds at scale. |
| Required | Apache Spark Quick Start | Spark Quick Start | Official getting started guide for Spark. Introduces the API through the interactive shell and shows how Spark's in-memory processing can run workloads up to 100x faster than MapReduce. |
| Required | Hadoop Tutorial Overview | GeeksforGeeks Hadoop | Overview of Hadoop ecosystem: HDFS for distributed storage, MapReduce for processing, and YARN for resource management. |
| Optional | Apache Kafka for Beginners | DataCamp Kafka | Comprehensive beginner guide covering Kafka architecture, brokers, partitions, and consumer groups. |
| Optional | Spark By Examples | Spark By Examples | Hands-on Spark tutorials with code examples in Scala and PySpark. |
| Optional | Building Blocks of Hadoop | Pluralsight | Deep dive into HDFS, MapReduce, and YARN architecture. |
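One idea from the Kafka readings worth internalizing before the talks: keyed records are routed to partitions by hashing the key, which is what gives Kafka per-key ordering at scale. Below is a simplified stand-in for that partitioner (real Kafka uses murmur2; md5 here is just for illustration):

```python
import hashlib

def assign_partition(key: str, num_partitions: int) -> int:
    """Simplified sketch of Kafka's default partitioner: hash the record
    key, take it modulo the partition count. (Kafka itself uses murmur2,
    but the routing idea is identical.)"""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# All records with the same key land in the same partition, so consumers
# in a consumer group see each key's events in order.
p1 = assign_partition("user-42", 6)
p2 = assign_partition("user-42", 6)
assert p1 == p2
```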
| Type | Resource | Link | Summary |
|---|---|---|---|
| Required | Spotify’s Data Platform Explained | Spotify Engineering | Overview of Spotify’s data infrastructure processing 1.4 trillion data points daily. Covers data collection, processing, and management architecture. |
| Required | NerdOut@Spotify: A Trillion Events | Spotify Podcast | 38-minute podcast on handling 50M events/second, Kafka to cloud transition, and data quality at scale. |
Lecture: ML fundamentals (reuse lecture11 - PyTorch)
Lecture: Embeddings (reuse lecture14)
| Type | Paper | Link | Summary |
|---|---|---|---|
| Optional | Word2Vec (Mikolov et al., 2013) | arXiv:1301.3781 | Introduces efficient architectures for learning word vectors from large datasets. Achieved SOTA on syntactic and semantic word similarity while training on 1.6B words in less than a day. Foundational work for modern embeddings. |
| Optional | MTEB: Massive Text Embedding Benchmark | arXiv:2210.07316 | Comprehensive benchmark spanning 8 embedding tasks, 58 datasets, and 112 languages. Reveals that no single embedding method dominates across all tasks. Essential for understanding embedding evaluation. |
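Both readings lean on one operation: cosine similarity between embedding vectors. A stdlib-only sketch with toy 3-dimensional vectors (real embeddings have hundreds of dimensions, and the values below are made up for illustration):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: the standard similarity
    measure for word and sentence embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors: semantically related words should point in similar directions.
king = [0.9, 0.8, 0.1]
queen = [0.85, 0.82, 0.12]
banana = [0.1, 0.05, 0.95]
assert cosine_similarity(king, queen) > cosine_similarity(king, banana)
```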
NEW Lecture: RAG architecture overview
| Type | Paper | Link | Summary |
|---|---|---|---|
| Required | RAG: Retrieval-Augmented Generation for Knowledge-Intensive NLP (Lewis et al., 2020) | arXiv:2005.11401 | Foundational RAG paper combining parametric and non-parametric memory using Wikipedia retrieval. Achieved SOTA on open-domain QA and generates more factual, specific content than pure parametric models. |
| Required | Enhancing RAG: A Study of Best Practices | COLING 2025 | Systematic study of RAG best practices and optimization strategies. |
| Optional | Knowledge Graph-Guided RAG (KG2RAG) | NAACL 2025 | Integrates knowledge graphs with RAG for improved retrieval. |
| Optional | GRAG: Graph Retrieval-Augmented Generation | NAACL 2025 Findings | Graph-based approach to retrieval-augmented generation. |
| Optional | GraphRAG (Microsoft) | arXiv:2404.16130 | Addresses RAG limitations for global queries by building entity knowledge graphs with community summaries. Shows substantial improvements in answer comprehensiveness and diversity for analytical questions. |
| Optional | Multi-Agent Filtering RAG (MAIN-RAG) | ACL 2025 | Multi-agent approach to filtering and refining RAG outputs. |
| Optional | Towards Omni-RAG | ACL 2025 | Explores unified RAG approaches across modalities. |
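The retrieve-then-generate loop from Lewis et al. fits in a few lines. This sketch uses a toy bag-of-words "embedding" so it runs with the stdlib; real RAG systems use dense sentence encoders and a vector store, and the corpus here is invented for illustration:

```python
from collections import Counter
import math

corpus = [
    "Kafka is a distributed event streaming platform.",
    "Spark processes data in memory across a cluster.",
    "FAISS performs fast similarity search over dense vectors.",
]

def embed(text):
    """Toy bag-of-words 'embedding' (real RAG uses dense encoders)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def retrieve(query, k=1):
    """Rank documents by similarity to the query; return the top k."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

# Retrieve-then-generate: prepend retrieved passages so the LLM answers
# from non-parametric memory rather than its weights alone.
query = "What does FAISS do?"
context = "\n".join(retrieve(query))
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
print(prompt)
```

The graph-based papers above change the `retrieve` step (walking a knowledge graph instead of ranking flat passages); the generate step stays the same.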
NEW Demo: Vector databases (Chroma, FAISS)
| Type | Paper | Link | Summary |
|---|---|---|---|
| Optional | Introduction to Information Retrieval (Ch. 6-7: Vector Space Model) | Stanford IR Book | Classic textbook chapters on vector space models and similarity search. |
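Before the demo, it helps to see what a flat vector index does underneath: exhaustive k-nearest-neighbor search by L2 distance. This is conceptually what FAISS's `IndexFlatL2` computes (FAISS adds SIMD and approximate indexes for scale); the vectors below are made up:

```python
import math

def l2(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def search(index, query, k=2):
    """Brute-force k-nearest-neighbor search: return the ids of the k
    stored vectors closest to the query, nearest first."""
    ranked = sorted(range(len(index)), key=lambda i: l2(index[i], query))
    return ranked[:k]

index = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
print(search(index, [0.9, 1.1], k=2))   # → [1, 0]
```

Chroma and FAISS wrap the same idea behind an API, add persistence and metadata filtering, and swap in approximate algorithms once collections grow past brute-force scale.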
NEW Lecture: LLM evaluation fundamentals
| Type | Paper | Link | Summary |
|---|---|---|---|
| Required | RAGAS: Automated Evaluation of Retrieval Augmented Generation | EACL 2024 | Reference-free framework for evaluating RAG systems across three dimensions: retrieval relevance, LLM faithfulness, and generation quality. Enables faster evaluation cycles without ground truth annotations. |
| Required | In Benchmarks We Trust… Or Not? | EMNLP 2025 | Critical examination of benchmark reliability and limitations. |
| Optional | MMLU: Measuring Massive Multitask Language Understanding | arXiv:2009.03300 | Comprehensive benchmark covering 57 academic and professional domains. Found that even the best models fall short of expert-level performance, especially on socially critical subjects like law and morality. |
| Optional | Evaluation of LLMs Should Not Ignore Non-Determinism | NAACL 2025 | Examines the impact of LLM non-determinism on evaluation reliability. |
| Optional | Examining Robustness of LLM Evaluation | ACL 2024 | Studies robustness issues in LLM evaluation methodologies. |
| Optional | HalluLens: LLM Hallucination Benchmark | ACL 2025 | Benchmark specifically designed for measuring LLM hallucinations. |
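The benchmark papers above (MMLU in particular) ultimately reduce to a simple scoring rule: exact-match accuracy over multiple-choice answers, no partial credit. A sketch with invented model outputs:

```python
def accuracy(predictions, gold):
    """Exact-match accuracy, the scoring rule behind multiple-choice
    benchmarks like MMLU: one answer per question, no partial credit."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

gold = ["A", "C", "B", "D"]
predictions = ["A", "C", "D", "D"]   # hypothetical model outputs
print(accuracy(predictions, gold))   # → 0.75
```

The non-determinism and robustness papers above ask what happens when `predictions` changes run to run, or when the same model is scored under slightly different prompts; a single accuracy number hides both effects.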
No readings
Lecture: AI fairness and bias (reuse lecture18)
| Type | Paper | Link | Summary |
|---|---|---|---|
| Required | Bias and Fairness in Large Language Models: A Survey | Computational Linguistics 2024 | Comprehensive 80+ page review proposing taxonomies for evaluation metrics, datasets, and mitigation techniques for LLM bias. Organizes mitigation by intervention stage: pre-processing, in-training, intra-processing, and post-processing. |
| Optional | A Trip Towards Fairness: Bias and De-Biasing in LLMs | *SEM 2024 | Explores approaches to identifying and mitigating bias in LLMs. |
| Optional | Addressing Statistical and Causal Gender Fairness in NLP | NAACL 2024 Findings | Examines gender fairness from statistical and causal perspectives. |
NEW Lecture: Human-in-the-loop systems
| Type | Paper | Link | Summary |
|---|---|---|---|
| Required | Building Effective Agents (Anthropic, 2024) | Anthropic Research | Practical guide distinguishing workflows (predefined orchestration) from agents (dynamic LLM control). Presents 6 design patterns from augmented LLM to orchestrator-workers. Emphasizes simplicity, transparency, and careful tool design. |
| Optional | ReAct: Synergizing Reasoning and Acting in LLMs | arXiv:2210.03629 | Interleaves reasoning traces with actions for better task-solving. Overcomes hallucination by grounding in external knowledge, achieving 34% and 10% absolute improvements on decision-making benchmarks. |
| Optional | Efficient Agents: Building Effective Agents While Reducing Cost | arXiv:2508.02694 | Strategies for building cost-efficient agentic systems. |
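The ReAct pattern is a control loop: the model alternates reasoning with tool calls, and each observation is appended to the context before the next step. A minimal sketch, where `fake_llm` and `lookup` are stubs standing in for a real model and retrieval tool:

```python
def fake_llm(context):
    """Stub LLM: decides to look something up, then answers once an
    observation is in context."""
    if "Observation" not in context:
        return "Action: lookup[capital of France]"
    return "Answer: Paris"

def lookup(query):
    """Hypothetical retrieval tool grounding the agent in external knowledge."""
    return {"capital of France": "Paris"}.get(query, "unknown")

context = "Question: What is the capital of France?"
answer = None
for _ in range(5):                       # cap iterations to avoid runaways
    step = fake_llm(context)
    if step.startswith("Answer:"):
        answer = step.removeprefix("Answer: ")
        break
    query = step[len("Action: lookup["):-1]
    context += f"\nObservation: {lookup(query)}"

print(answer)   # → Paris
```

Human-in-the-loop systems typically insert an approval gate between the `Action` step and its execution; the loop structure is otherwise the same.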
NEW Lecture: Paper writing workshop
| Type | Paper | Link | Summary |
|---|---|---|---|
| Required | How to Write a Great Research Paper (Simon Peyton Jones) | Video | Classic talk on research paper writing: start with the idea, write early and often, structure for clarity. Emphasizes that writing is thinking. |
| Optional | The Science of Scientific Writing | | Cognitive principles for clear scientific writing: put old information before new, keep subjects and verbs close together. |
Lecture: Crowdsourcing and annotation (reuse lecture15)
| Type | Paper | Link | Summary |
|---|---|---|---|
| Required | HumEval 2024 Workshop Proceedings | ACL Anthology | Collection of papers on human evaluation of NLP systems. |
| Required | Capturing Perspectives of Crowdsourced Annotators | NAACL 2024 | Proposes AART to learn individual annotator representations rather than majority voting. Addresses fairness concerns for underrepresented perspectives in subjective classification tasks. |
| Optional | On Crowdsourcing Task Design for Discourse Annotation | COLING 2025 | Best practices for designing crowdsourcing annotation tasks. |
| Optional | Evaluating Saliency Explanations by Crowdsourcing | LREC-COLING 2024 | Using crowdsourcing to evaluate model explanations. |
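The baseline the AART paper pushes against is easy to state in code: majority vote collapses each item's annotations to a single label, discarding minority perspectives entirely. A sketch with invented annotations:

```python
from collections import Counter

def majority_vote(labels):
    """Aggregate crowd annotations by majority vote — the standard baseline
    that erases disagreeing (often minority) annotator perspectives."""
    return Counter(labels).most_common(1)[0][0]

# Three annotators disagree on whether a post is offensive; the dissenting
# judgment vanishes from the aggregated dataset.
annotations = ["offensive", "not_offensive", "offensive"]
print(majority_vote(annotations))   # → offensive
```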
NEW Lecture: How to write good reviews
| Type | Paper | Link | Summary |
|---|---|---|---|
| Required | Advice for Peer Reviewers (ACL) | ACL Reviewer Guidelines | Official ACL guidelines for writing constructive, fair peer reviews. |
| Optional | NIPS Experiment (reviewing consistency) | arXiv:2109.09774 | Famous study showing 25% of papers had inconsistent accept/reject decisions when reviewed by different committees. Highlights subjectivity in peer review. |
Lecture: Visualizations (reuse lecture19)
Focus on project completion - minimal new readings
These are classic papers students should be aware of, organized by topic:
| Paper | Link | Summary |
|---|---|---|
| GPT-3: Language Models are Few-Shot Learners | arXiv:2005.14165 | Landmark paper demonstrating that scaling to 175B parameters enables strong few-shot performance without fine-tuning. Introduced the in-context learning paradigm. |
| Scaling Laws for Neural Language Models | arXiv:2001.08361 | Discovered predictable power-law relationships between model size, data, compute, and loss. Showed optimal training uses large models on modest data with early stopping. |
| LLaMA: Open and Efficient Foundation Language Models | arXiv:2302.13971 | Demonstrated SOTA models can be trained using only public data. LLaMA-13B outperforms GPT-3. Made models available to researchers, catalyzing open-source LLM development. |
| Paper | Link | Summary |
|---|---|---|
| MMLU Pro | arXiv:2406.01574 | More challenging MMLU with 10 options instead of 4 and reasoning-focused questions. Accuracy drops 16-33% vs original; reduced prompt sensitivity. Better tracks AI progress. |
| SWE-Bench: Evaluating LLMs on Real-World Software Issues | arXiv:2310.06770 | 2,294 real GitHub issues requiring multi-file code changes. Best model (Claude 2) solved only 1.96%, revealing gap between code generation and software engineering capabilities. |
| IFEval: Instruction-Following Evaluation | arXiv:2311.07911 | Benchmark using verifiable instructions (word count, keywords) for objective evaluation. 25 instruction types across 500 prompts. Avoids biases of LLM-based evaluation. |
| Paper | Link | Summary |
|---|---|---|
| The Stack: Code Dataset | arXiv:2211.15533 | 3.1 TB of permissively licensed code in 30 languages. Shows deduplication improves model performance. Includes opt-out mechanism for developers. |
| HumanEval: Evaluating Code Generation | arXiv:2107.03374 | Introduces Codex and HumanEval benchmark for code synthesis from docstrings. Codex achieved 28.8% pass@1; repeated sampling solves 70% of problems. |
| Paper | Link | Summary |
|---|---|---|
| LoRA: Low-Rank Adaptation of LLMs | arXiv:2106.09685 | Efficient fine-tuning by freezing base weights and training low-rank matrices. Reduces trainable parameters 10,000x and GPU memory 3x while matching or exceeding full fine-tuning. |
| DPO: Direct Preference Optimization | arXiv:2305.18290 | Simplifies RLHF to a classification loss by deriving optimal policy in closed form. More stable and efficient training while achieving comparable or better alignment than RLHF. |
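The arithmetic behind LoRA's efficiency claim is worth working once. LoRA replaces a full weight update dW (shape d_out × d_in) with a low-rank product B·A, where B is d_out × r and A is r × d_in, and only A and B are trained. The dimensions below are illustrative (a 4096-wide layer at rank 8), and the tiny merged-update example uses made-up numbers:

```python
# Parameter count: full update vs. LoRA's low-rank factors.
d_in, d_out, r = 4096, 4096, 8          # illustrative layer size, rank 8

full_update_params = d_out * d_in        # 16,777,216
lora_params = r * (d_in + d_out)         # 65,536
print(full_update_params // lora_params) # → 256  (per-layer reduction)

def matmul(X, Y):
    """Plain-Python matrix product, to show the merged update W' = W + B·A."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

# Tiny example: rank-1 update of a 2x2 weight matrix.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[0.5], [1.0]]                       # d_out x r
A = [[2.0, 0.0]]                         # r x d_in
BA = matmul(B, A)
W_merged = [[W[i][j] + BA[i][j] for j in range(2)] for i in range(2)]
print(W_merged)                          # → [[2.0, 0.0], [2.0, 1.0]]
```

The paper's headline 10,000x figure comes from applying this across GPT-3's full parameter budget, not from a single layer; the per-layer ratio d_in·d_out / (r·(d_in+d_out)) is the building block.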
Last updated: January 2026
Source: latent.space 2025 reading list + ACL Anthology 2024-2025