Data Engineering at the University of Florida
This document maps required and optional readings to each lecture in the course.
No readings - infrastructure focus
Lecture: MCP Fundamentals, Building MCP Servers, Multi-agent Pipelines
| Type | Paper | Link | Summary |
|---|---|---|---|
| Required | Model Context Protocol Specification | MCP Docs | Official specification for MCP, covering core concepts, architecture, and protocol design. Essential for understanding how MCP enables communication between LLMs and external tools/data sources. |
| Required | MCP Quickstart Guide | MCP Quickstart | Hands-on guide to building your first MCP server. Covers server creation, tool registration, and client integration. |
| Optional | Building MCP Servers Tutorial | MCP Servers | Detailed tutorial on implementing custom MCP servers with examples. |
| Optional | Multi-Agent Orchestration Patterns | MCP Patterns | Architectural patterns for building multi-agent systems with MCP. |
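MCP layers tool invocation over JSON-RPC 2.0. As a rough illustration of the protocol framing (not the official SDK — the `add` tool and the registry below are hypothetical, and real servers register tools through the SDK), a `tools/call` request can be dispatched like this:

```python
import json

# Hypothetical tool registry: name -> callable. Real MCP servers
# register tools via the SDK rather than a plain dict.
TOOLS = {"add": lambda a, b: a + b}

def handle_tools_call(request_json: str) -> str:
    """Dispatch a JSON-RPC 2.0 'tools/call' request to a registered tool."""
    req = json.loads(request_json)
    params = req["params"]
    result = TOOLS[params["name"]](**params["arguments"])
    # MCP tool results come back as typed content blocks.
    response = {
        "jsonrpc": "2.0",
        "id": req["id"],
        "result": {"content": [{"type": "text", "text": str(result)}]},
    }
    return json.dumps(response)

request = json.dumps({
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "add", "arguments": {"a": 2, "b": 3}},
})
reply = json.loads(handle_tools_call(request))
```

The quickstart readings cover the real server/client plumbing; this sketch only shows the request/response shape a server ultimately answers.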
Lecture: Prompt engineering fundamentals, Chain-of-Thought, Structured Outputs
| Type | Paper | Link | Summary |
|---|---|---|---|
| Required | Chain-of-Thought Prompting Elicits Reasoning (Wei et al., 2022) | arXiv:2201.11903 | Demonstrates that including reasoning steps in prompts enables LLMs to solve complex arithmetic, commonsense, and symbolic reasoning tasks. A 540B-parameter model with 8 CoT examples achieved SOTA on math word problems. Foundational work for understanding prompting techniques. |
| Optional | The Prompt Report: A Systematic Survey of Prompting Techniques | arXiv:2406.06608 | Comprehensive taxonomy of 58 prompting techniques and 33 vocabulary terms. Use as a reference guide. If reading, focus on Sections 1-3 (Introduction, Taxonomy, Core Techniques) only - the full survey is extensive. |
| Optional | Large Language Models are Zero-Shot Reasoners (Kojima et al., 2022) | arXiv:2205.11916 | Shows that simply adding “Let’s think step by step” improves reasoning performance dramatically (+61 percentage points on MultiArith). |
| Optional | Tree of Thoughts: Deliberate Problem Solving with LLMs | arXiv:2305.10601 | Extends CoT by allowing exploration of multiple reasoning paths with backtracking. Achieved 74% success on Game of 24 compared to 4% with standard CoT prompting. |
| Optional | Graph of Thoughts: Solving Elaborate Problems with LLMs | arXiv:2308.09687 | Models reasoning as an arbitrary graph with aggregation and refinement. Achieves 62% better quality than ToT on sorting while reducing costs by 31%. |
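The two prompting styles compared in these readings differ only in how the prompt is assembled. A minimal sketch, using the tennis-ball exemplar popularized by Wei et al. and the zero-shot trigger from Kojima et al.:

```python
# Few-shot CoT prepends worked examples that show explicit reasoning.
COT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
    "How many balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
)

def few_shot_cot(question: str) -> str:
    """Chain-of-Thought prompting: exemplars with reasoning steps."""
    return COT_EXEMPLAR + f"Q: {question}\nA:"

def zero_shot_cot(question: str) -> str:
    """Zero-shot CoT: append a reasoning trigger instead of exemplars."""
    return f"Q: {question}\nA: Let's think step by step."

prompt = zero_shot_cot("If I have 3 apples and eat 1, how many remain?")
```

Tree of Thoughts and Graph of Thoughts then generalize this from a single reasoning chain to branching and graph-structured search over such prompts.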
Discussion: How to read research papers
| Type | Paper | Link | Summary |
|---|---|---|---|
| Required | How to Read a Paper (Keshav) | | Classic 3-pass method for reading research papers efficiently: first pass for overview (5 min), second for understanding (1 hour), third for deep comprehension. |
| Required | LinkTransformer: A Unified Package for Record Linkage with Transformer Language Models (Arora & Dell, 2024) | ACL 2024 Demo | Open-source package making transformer-based record linkage accessible without deep learning expertise. Treats linkage as text retrieval using sentence embeddings. Used as the in-class 3-pass reading exercise. |
Lecture: Data integration (reuse lecture7)
Lecture: Entity resolution (reuse lecture9)
| Type | Paper | Link | Summary |
|---|---|---|---|
| Required | LinkTransformer: Record Linkage with Transformer LMs | ACL 2024 Demo | Open-source package making transformer-based record linkage accessible without deep learning expertise. Outperforms string matching methods by a wide margin and supports multiple languages. |
| Optional | Entity Resolution in Voice Interfaces (EMNLP 2024 Industry) | EMNLP 2024 | Industry application of entity resolution in voice assistant systems. |
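LinkTransformer's core move is treating record linkage as nearest-neighbor retrieval in embedding space. A runnable toy sketch of that retrieval logic, with a character-trigram vector standing in for the transformer sentence embedding (that substitution is mine; the package uses learned embeddings):

```python
import math
from collections import Counter

def embed(s: str) -> Counter:
    """Toy character-trigram 'embedding' standing in for a sentence
    transformer; only the retrieval step mirrors the real approach."""
    s = f"  {s.lower()}  "
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[k] * v[k] for k in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def link(query: str, candidates: list[str]) -> str:
    """Linkage as nearest-neighbor retrieval in embedding space."""
    return max(candidates, key=lambda c: cosine(embed(query), embed(c)))

match = link("Intl. Business Machines",
             ["International Business Machines", "Burger King", "Boeing Co."])
```

With real transformer embeddings, semantically equivalent records match even without surface overlap, which is where the gains over string matching come from.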
Guest Speaker: Dr. Shiree Hughes (Monday, Feb 9) - Big Data at General Motors
Guest Speaker: Mikhail Sinanan (Friday, Feb 13) - Data Engineering at Spotify
| Type | Resource | Link | Summary |
|---|---|---|---|
| Required | Apache Kafka Introduction | Kafka Intro | Official introduction to Kafka’s distributed event streaming platform. Covers producers, consumers, topics, and how Kafka handles real-time data feeds at scale. |
| Required | Apache Spark Quick Start | Spark Quick Start | Official getting started guide for Spark. Introduces the API through the interactive shell and shows how Spark's in-memory processing can run up to 100x faster than MapReduce. |
| Required | Hadoop Tutorial Overview | GeeksforGeeks Hadoop | Overview of Hadoop ecosystem: HDFS for distributed storage, MapReduce for processing, and YARN for resource management. |
| Optional | Apache Kafka for Beginners | DataCamp Kafka | Comprehensive beginner guide covering Kafka architecture, brokers, partitions, and consumer groups. |
| Optional | Spark By Examples | Spark By Examples | Hands-on Spark tutorials with code examples in Scala and PySpark. |
| Optional | Building Blocks of Hadoop | Pluralsight | Deep dive into HDFS, MapReduce, and YARN architecture. |
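Kafka's per-key ordering guarantee comes from routing every record with the same key to the same partition. A stdlib-only sketch of that assignment (real Kafka producers hash keys with murmur2; `crc32` stands in here to keep the example dependency-free):

```python
import zlib

NUM_PARTITIONS = 3

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Deterministic key -> partition mapping. Kafka's default
    partitioner uses a murmur2 hash; crc32 is a stand-in."""
    return zlib.crc32(key.encode()) % num_partitions

# All records with the same key land in the same partition, which is
# what gives Kafka its per-key ordering guarantee.
events = [("user-42", "login"), ("user-7", "click"), ("user-42", "logout")]
partitions: dict[int, list] = {}
for key, value in events:
    partitions.setdefault(partition_for(key), []).append((key, value))
```

Consumer groups then divide these partitions among consumers, which is how Kafka parallelizes consumption while preserving order within each key.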
| Type | Resource | Link | Summary |
|---|---|---|---|
| Required | Spotify’s Data Platform Explained | Spotify Engineering | Overview of Spotify’s data infrastructure processing 1.4 trillion data points daily. Covers data collection, processing, and management architecture. |
| Required | NerdOut@Spotify: A Trillion Events | Spotify Podcast | 38-minute podcast on handling 50M events/second, Kafka to cloud transition, and data quality at scale. |
Lecture: ML fundamentals (reuse lecture11 - PyTorch)
Lecture: Embeddings (reuse lecture14)
| Type | Paper | Link | Summary |
|---|---|---|---|
| Optional | Word2Vec (Mikolov et al., 2013) | arXiv:1301.3781 | Introduces efficient architectures for learning word vectors from large datasets. Achieved SOTA on syntactic and semantic word similarity while training on 1.6B words in less than a day. Foundational work for modern embeddings. |
| Optional | MTEB: Massive Text Embedding Benchmark | arXiv:2210.07316 | Comprehensive benchmark spanning 8 embedding tasks, 58 datasets, and 112 languages. Reveals that no single embedding method dominates across all tasks. Essential for understanding embedding evaluation. |
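Word2Vec's best-known property is that analogies become vector arithmetic. The 2-D vectors below are hand-made purely for illustration (real learned embeddings have hundreds of dimensions), but the arithmetic is the same:

```python
import math

# Hand-made toy vectors, invented for illustration only.
VECS = {
    "king":  [0.9, 0.8],
    "queen": [0.9, 0.2],
    "man":   [0.1, 0.8],
    "woman": [0.1, 0.2],
    "apple": [0.5, 0.5],
}

def nearest(target: list[float], exclude: set[str]) -> str:
    """Nearest word to `target` by cosine similarity."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.hypot(*u) * math.hypot(*v))
    return max((w for w in VECS if w not in exclude),
               key=lambda w: cos(VECS[w], target))

# The analogy "king - man + woman" should land near "queen".
target = [k - m + w for k, m, w in
          zip(VECS["king"], VECS["man"], VECS["woman"])]
answer = nearest(target, exclude={"king", "man", "woman"})
```

MTEB's finding that no single embedding dominates is worth keeping in mind here: which geometric regularities an embedding captures depends on how and on what it was trained.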
Lecture: RAG architecture overview (Day 18)
Demo: Vector databases (Chroma, FAISS) (Day 19)
| Type | Paper | Link | Summary |
|---|---|---|---|
| Required | RAG: Retrieval-Augmented Generation for Knowledge-Intensive NLP (Lewis et al., 2020) | arXiv:2005.11401 | Foundational RAG paper combining parametric and non-parametric memory using Wikipedia retrieval. Achieved SOTA on open-domain QA and generates more factual, specific content than pure parametric models. |
| Required | Enhancing RAG: A Study of Best Practices | COLING 2025 | Systematic study of RAG best practices and optimization strategies. Covers chunking, retrieval, and generation optimization. |
| Recommended | GraphRAG (Microsoft) | arXiv:2404.16130 | Addresses RAG limitations for global queries by building entity knowledge graphs with community summaries. Shows substantial improvements in answer comprehensiveness and diversity for analytical questions. Read if interested in advanced RAG. |
| Optional | RAG Survey: Comprehensive Survey of Architectures (2025) | arXiv:2506.00054 | Comprehensive survey covering granularity-aware retrieval, robustness, and RAG frontiers. Good reference for understanding the RAG landscape. |
| Optional | Agentic RAG Survey | arXiv:2501.09136 | Survey on autonomous agents in RAG pipelines. Covers reflection, planning, tool use, and multi-agent collaboration patterns. |
| Optional | Introduction to Information Retrieval (Ch. 6-7: Vector Space Model) | Stanford IR Book | Classic textbook chapters on vector space models and similarity search. |
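The retrieve-then-generate loop at the heart of RAG fits in a few lines. A minimal sketch with a bag-of-words vector standing in for a learned embedding model (the corpus and query are invented; Chroma and FAISS replace the linear scan below with approximate nearest-neighbor indexes):

```python
import math
import re
from collections import Counter

DOCS = [
    "The capital of France is Paris.",
    "Spark processes data in memory.",
    "Kafka is a distributed event streaming platform.",
]

def embed(text: str) -> Counter:
    """Bag-of-words stand-in for a learned embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[t] * v[t] for t in u)
    norm = lambda w: math.sqrt(sum(c * c for c in w.values()))
    return dot / (norm(u) * norm(v))

def retrieve(query: str, k: int = 1) -> list[str]:
    """Rank the corpus by similarity to the query; keep the top k."""
    q = embed(query)
    return sorted(DOCS, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query: str) -> str:
    """Retrieve-then-generate: stuff the top passages into the prompt."""
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

prompt = build_prompt("What is the capital of France?")
```

The best-practices and survey readings are largely about tuning each stage of this loop: how documents are chunked, how many passages to retrieve, and how the context is presented to the generator.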
Tutorials & Implementation Guides:
Additional RAG Variants (Reference): For students interested in exploring more RAG approaches, see these papers:
Lecture: LLM Evaluation Fundamentals (Day 20)
Demo: RAGAS Evaluation (Day 21)
| Type | Paper | Link | Summary |
|---|---|---|---|
| Required | RAGAS: Automated Evaluation of Retrieval Augmented Generation | EACL 2024 | Reference-free framework for evaluating RAG systems across three dimensions: retrieval relevance, LLM faithfulness, and generation quality. Enables faster evaluation cycles without ground truth annotations. |
| Required | In Benchmarks We Trust… Or Not? | EMNLP 2025 | Critical examination of benchmark reliability and limitations. |
| Recommended | Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (Zheng et al., 2023) | arXiv:2306.05685 | Introduces the LLM-as-Judge paradigm for scalable evaluation. GPT-4 as judge achieves >80% agreement with human preferences. Foundational work for automated LLM evaluation. |
| Optional | Chatbot Arena: An Open Platform for Evaluating LLMs (Chiang et al., 2024) | arXiv:2403.04132 | Crowd-sourced pairwise comparison platform with 1M+ votes. Demonstrates that human pairwise comparisons correlate strongly with expert judgments. |
| Optional | MMLU: Measuring Massive Multitask Language Understanding | arXiv:2009.03300 | Comprehensive benchmark covering 57 academic and professional domains. Found that even the best models fall short of expert-level performance, especially on socially critical subjects like law and morality. |
| Optional | MMLU-Pro: A More Robust and Challenging Multi-Task Benchmark (Wang et al., 2024) | arXiv:2406.01574 | Harder MMLU with 10 answer choices instead of 4. Accuracy drops 16-33% vs original MMLU; reduces prompt sensitivity and benchmark saturation. |
| Optional | BERTScore: Evaluating Text Generation with BERT (Zhang et al., 2020) | arXiv:1904.09675 | Embedding-based evaluation that captures paraphrases and synonyms. Correlates better with human judgment than n-gram metrics like BLEU and ROUGE. |
| Optional | Evaluation of LLMs Should Not Ignore Non-Determinism | NAACL 2025 | Examines the impact of LLM non-determinism on evaluation reliability. |
| Optional | Examining Robustness of LLM Evaluation | ACL 2024 | Studies robustness issues in LLM evaluation methodologies. |
| Optional | Large Language Models are not Fair Evaluators (Wang et al., 2023) | arXiv:2305.17926 | Documents biases in LLM-as-Judge including position bias, verbosity bias, and self-enhancement bias. |
| Optional | NLP Evaluation in trouble (Sainz et al., 2023) | EMNLP Findings | Evidence of benchmark contamination where test data appears in training. Published benchmark scores may be inflated due to memorization. |
| Optional | Detecting Pretraining Data from LLMs (Shi et al., 2024) | arXiv:2310.16789 | Methods for detecting training data contamination including Min-k% Prob. Helps identify when models have memorized benchmark answers. |
| Optional | HalluLens: LLM Hallucination Benchmark | ACL 2025 | Benchmark specifically designed for measuring LLM hallucinations. |
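One common mitigation for the position bias documented by Wang et al. is to run the judge twice with the answer order swapped and count a win only when the two verdicts agree. A sketch with a deterministic stub in place of a real LLM judge (the stub and its tie-breaking bias are invented to make the effect visible):

```python
def stub_judge(first: str, second: str) -> str:
    """Stand-in judge that deliberately mimics position bias:
    ties between equally long answers go to the first slot."""
    if len(first) > len(second):
        return "first"
    if len(second) > len(first):
        return "second"
    return "first"  # position bias on ties

def debiased_winner(a: str, b: str) -> str:
    """Swap answer order between two judge calls; only consistent
    verdicts count as a win, otherwise declare a tie."""
    v1 = stub_judge(a, b)  # a in the first slot
    v2 = stub_judge(b, a)  # b in the first slot
    if v1 == "first" and v2 == "second":
        return "a"
    if v1 == "second" and v2 == "first":
        return "b"
    return "tie"           # inconsistent verdicts -> no winner

result = debiased_winner("short", "a much longer answer")
tied = debiased_winner("same!", "equal")
```

The order swap converts the stub's positional tie-break into a detected inconsistency rather than a spurious win, which is the point of the calibration.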
Classic N-gram Metrics (Reference):
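These metrics score surface overlap between candidate and reference text. A minimal clipped unigram precision, the core ingredient of BLEU (the example sentences are invented):

```python
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision: the fraction of candidate tokens that
    appear in the reference, crediting each reference token at most as
    many times as it occurs there."""
    cand = candidate.lower().split()
    ref = Counter(reference.lower().split())
    matched = sum(min(c, ref[t]) for t, c in Counter(cand).items())
    return matched / len(cand) if cand else 0.0

score = unigram_precision("the cat sat on the mat",
                          "the cat is on the mat")
```

Note that the synonym pair "sat"/"is" earns no credit here — exactly the limitation that motivates embedding-based metrics like BERTScore above.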
Focus: Assignment 2 completion, RAGAS demo, project work
No new readings - focus on applying Week 8 evaluation concepts to Assignment 2
⏸️ Spring Break follows (March 14-21)
Lecture: AI fairness and bias
Lecture: Human-in-the-loop systems
| Type | Paper | Link | Summary |
|---|---|---|---|
| Required | Bias and Fairness in Large Language Models: A Survey | Computational Linguistics 2024 | Comprehensive 80+ page review proposing taxonomies for evaluation metrics, datasets, and mitigation techniques for LLM bias. Organizes mitigation by intervention stage: pre-processing, in-training, intra-processing, and post-processing. |
| Required | Building Effective Agents (Anthropic, 2024) | Anthropic Research | Practical guide distinguishing workflows (predefined orchestration) from agents (dynamic LLM control). Presents 6 design patterns from augmented LLM to orchestrator-workers. Emphasizes simplicity, transparency, and careful tool design. |
| Optional | A Trip Towards Fairness: Bias and De-Biasing in LLMs | *SEM 2024 | Explores approaches to identifying and mitigating bias in LLMs. |
| Optional | Addressing Statistical and Causal Gender Fairness in NLP | NAACL 2024 Findings | Examines gender fairness from statistical and causal perspectives. |
| Optional | ReAct: Synergizing Reasoning and Acting in LLMs | arXiv:2210.03629 | Interleaves reasoning traces with actions for better task-solving. Overcomes hallucination by grounding in external knowledge, achieving 34% and 10% absolute improvements on decision-making benchmarks. |
| Optional | Efficient Agents: Building Effective Agents While Reducing Cost | arXiv:2508.02694 | Strategies for building cost-efficient agentic systems. |
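ReAct's contribution is interleaving reasoning traces with tool calls so each step is grounded in an observation. A sketch of that thought → action → observation loop, with a scripted "model" and a stub lookup tool (the script, the facts, and the tool are invented; a real agent samples each step from an LLM):

```python
def lookup(entity: str) -> str:
    """Stub knowledge tool with invented facts."""
    facts = {"Eiffel Tower": "located in Paris, 330 m tall"}
    return facts.get(entity, "no result")

# Scripted model outputs: (thought, (action, argument)) per step.
SCRIPT = [
    ("Thought: I need facts about the Eiffel Tower.",
     ("lookup", "Eiffel Tower")),
    ("Thought: I have the height, so I can answer.",
     ("finish", "330 m")),
]

def react_agent() -> str:
    """Thought -> action -> observation loop until a 'finish' action."""
    transcript = []
    for thought, (action, arg) in SCRIPT:
        transcript.append(thought)
        if action == "finish":
            return arg
        observation = lookup(arg)  # act, then ground on the result
        transcript.append(f"Observation: {observation}")
    return "no answer"

answer = react_agent()
```

Anthropic's workflow/agent distinction maps onto this loop directly: a workflow fixes the sequence of steps in advance, while an agent lets the model choose the next action from each observation.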
Lecture: Paper writing workshop
| Type | Paper | Link | Summary |
|---|---|---|---|
| Required | How to Write a Great Research Paper (Simon Peyton Jones) | Video | Classic talk on research paper writing: start with the idea, write early and often, structure for clarity. Emphasizes that writing is thinking. |
| Optional | The Science of Scientific Writing | | Cognitive principles for clear scientific writing: put old information before new, keep subjects and verbs close together. |
Lecture: Crowdsourcing and annotation (reuse lecture15)
| Type | Paper | Link | Summary |
|---|---|---|---|
| Required | HumEval 2024 Workshop Proceedings | ACL Anthology | Collection of papers on human evaluation of NLP systems. |
| Required | Capturing Perspectives of Crowdsourced Annotators | NAACL 2024 | Proposes AART to learn individual annotator representations rather than majority voting. Addresses fairness concerns for underrepresented perspectives in subjective classification tasks. |
| Optional | On Crowdsourcing Task Design for Discourse Annotation | COLING 2025 | Best practices for designing crowdsourcing annotation tasks. |
| Optional | Evaluating Saliency Explanations by Crowdsourcing | LREC-COLING 2024 | Using crowdsourcing to evaluate model explanations. |
Lecture: How to write good reviews
| Type | Paper | Link | Summary |
|---|---|---|---|
| Required | Advice for Peer Reviewers (ACL) | ACL Reviewer Guidelines | Official ACL guidelines for writing constructive, fair peer reviews. |
| Optional | NIPS Experiment (reviewing consistency) | arXiv:2109.09774 | Famous study showing 25% of papers had inconsistent accept/reject decisions when reviewed by different committees. Highlights subjectivity in peer review. |
Presentations and finals
Focus on project completion - minimal new readings
These are classic papers students should be aware of, organized by topic:
| Paper | Link | Summary |
|---|---|---|
| GPT-3: Language Models are Few-Shot Learners | arXiv:2005.14165 | Landmark paper demonstrating that scaling to 175B parameters enables strong few-shot performance without fine-tuning. Introduced the in-context learning paradigm. |
| Scaling Laws for Neural Language Models | arXiv:2001.08361 | Discovered predictable power-law relationships between model size, data, compute, and loss. Showed optimal training uses large models on modest data with early stopping. |
| LLaMA: Open and Efficient Foundation Language Models | arXiv:2302.13971 | Demonstrated SOTA models can be trained using only public data. LLaMA-13B outperforms GPT-3. Made models available to researchers, catalyzing open-source LLM development. |
| Paper | Link | Summary |
|---|---|---|
| MMLU Pro | arXiv:2406.01574 | More challenging MMLU with 10 options instead of 4 and reasoning-focused questions. Accuracy drops 16-33% vs original; reduced prompt sensitivity. Better tracks AI progress. |
| SWE-Bench: Evaluating LLMs on Real-World Software Issues | arXiv:2310.06770 | 2,294 real GitHub issues requiring multi-file code changes. Best model (Claude 2) solved only 1.96%, revealing the gap between code generation and software engineering capabilities. |
| IFEval: Instruction-Following Evaluation | arXiv:2311.07911 | Benchmark using verifiable instructions (word count, keywords) for objective evaluation. 25 instruction types across 500 prompts. Avoids biases of LLM-based evaluation. |
| Paper | Link | Summary |
|---|---|---|
| The Stack: Code Dataset | arXiv:2211.15533 | 3.1 TB of permissively licensed code in 30 languages. Shows deduplication improves model performance. Includes opt-out mechanism for developers. |
| HumanEval: Evaluating Code Generation | arXiv:2107.03374 | Introduces Codex and the HumanEval benchmark for code synthesis from docstrings. Codex achieved 28.8% pass@1; repeated sampling solves 70% of problems. |
| Paper | Link | Summary |
|---|---|---|
| LoRA: Low-Rank Adaptation of LLMs | arXiv:2106.09685 | Efficient fine-tuning by freezing base weights and training low-rank matrices. Reduces trainable parameters 10,000x and GPU memory 3x while matching or exceeding full fine-tuning. |
| DPO: Direct Preference Optimization | arXiv:2305.18290 | Simplifies RLHF to a classification loss by deriving the optimal policy in closed form. More stable and efficient training while achieving comparable or better alignment than RLHF. |
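The parameter savings behind LoRA are simple arithmetic: instead of training a full d x d weight update, train two low-rank factors B (d x r) and A (r x d). A quick sketch with a hidden size and rank chosen for illustration:

```python
# Illustrative sizes: d is a typical transformer hidden size, r a
# common low rank; both are assumptions for this example.
d, r = 4096, 8

full_params = d * d        # dense delta-W for one weight matrix
lora_params = 2 * d * r    # B (d x r) plus A (r x d)
reduction = full_params / lora_params
```

At these sizes the dense update has 16.8M parameters versus 65K for the LoRA factors, a 256x reduction for this single matrix; the paper's 10,000x figure is for the whole model, where most weight matrices are not adapted at all.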
Last updated: March 2026
Source: latent.space 2025 reading list + ACL Anthology 2024-2025