1/29 |
Introduction |
What is AI safety? Class intro, logistics, foundations, etc.
Readings: N/A
|
2/5 |
Specification and Robustness |
Adversarial Robustness and Misuse: If an adversary is trying to misuse a model for some purpose, how do we prevent this? How do we evaluate the limits of robustness?
Week 2 pre-debate form
Suggested Readings to Choose From for Debate (or suggest your own!):
- Zou, A., et al. (2023). Universal and Transferable Adversarial Attacks on Aligned Language Models [link]
- Anthropic (2023). Frontier Threats Red Teaming for AI Safety [link]
- Li et al. (2024). LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet [link]
- Zou, A., et al. (2024). Improving Alignment and Robustness with Circuit Breakers [link]
- Andriushchenko, M., & Flammarion, N. (2024). Does Refusal Training in LLMs Generalize to the Past Tense? [link]
- Lee, D., & Tiwari, M. (2023). Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems [link]
- Andriushchenko et al. (2024). Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks [link]
- Anil et al. (2024). Many-shot Jailbreaking [link]
- Liu, Y., Jia, Y., Geng, R., Jia, J., & Gong, N. Z. (2024). Formalizing and Benchmarking Prompt Injection Attacks and Defenses [link]
|
2/12 |
Specification and Robustness |
Alignment: How do we specify the right rewards? What happens when rewards are misspecified or underspecified? How do we reliably align models to human values?
Week 3 pre-debate form
Suggested Readings to Choose From for Debate (or suggest your own!):
Reward Specification and Learning
- Krakovna et al. (2020). Specification gaming: the flip side of AI ingenuity
[link]
- Christiano et al. (2017). Deep reinforcement learning from human preferences
[link]
- Casper et al. (2023). Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
[link]
- Leike et al. (2018). Scalable agent alignment via reward modeling: a research direction
[link] [blog]
- Gao et al. (2022). Scaling Laws for Reward Model Overoptimization
[link]
- Pan, Bhatia, Steinhardt (2022). The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models
[link]
- Hadfield-Menell et al. (2016). Cooperative Inverse Reinforcement Learning
[link]
- Hadfield-Menell, Hadfield (2018). Incomplete Contracting and AI Alignment
[link]
- Li et al. (2023). Survival Instinct in Offline Reinforcement Learning
[link]
Goal Misgeneralization
- Shah et al. (2022). Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals
[link] [blog]
- Langosco et al. (2021). Goal Misgeneralization in Deep Reinforcement Learning
[link]
- Shah (2023). Categorizing failures as "outer" or "inner" misalignment is often confused
[link]
- Hubinger et al. (2019). Risks from Learned Optimization
[link]
- Pan et al. (2024). Spontaneous Reward Hacking in Iterative Self-Refinement
[link]
Human Reward Design
- Ouyang et al. (2022). Training language models to follow instructions with human feedback
[link]
- Askell et al. (2022). A General Language Assistant as a Laboratory for Alignment
[link]
- Bai et al. (2022). Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
[link]
- Lanham et al. (2023). Measuring Faithfulness in Chain-of-Thought Reasoning
[link]
|
2/19 |
Specification and Robustness |
Scalable Oversight: What do we do when it's really difficult to monitor and steer models with human oversight? What if most humans can't even accomplish the task? How can we align to human values at scale? Whose values?
Week 4 pre-debate form
Suggested Readings to Choose From for Debate (or suggest your own!):
Scalable Oversight
- Burns et al. (2023). Weak-to-strong generalization: Eliciting Strong Capabilities With Weak Supervision
[link]
- Christiano et al. (2018). Supervising strong learners by amplifying weak experts
[link]
- Bowman et al. (2022). Measuring Progress on Scalable Oversight for Large Language Models
[link]
- Wen et al. (2024). Language Models Learn to Mislead Humans via RLHF
[link]
- Wen et al. (2024). Learning Task Decomposition to Assist Humans in Competitive Programming
[link]
- Haupt et al. (2022). Formal Contracts Mitigate Social Dilemmas in Multi-Agent RL
[link]
- Fluri et al. (2023). Evaluating Superhuman Models with Consistency Checks
[link]
Specifying Values and Rules at Scale
- Bai et al. (2022). Constitutional AI: Harmlessness from AI Feedback
[link]
- Hendrycks et al. (2021). Aligning AI With Shared Human Values
[link]
- Hendrycks et al. (2023). Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark
[link]
- Hendrycks et al. (2023). What Would Jiminy Cricket Do? Towards Agents That Behave Morally
[link]
- Scherrer et al. (2023). Evaluating the Moral Beliefs Encoded in LLMs
[link]
- Ma et al. (2023). Let's Do a Thought Experiment: Using Counterfactuals to Improve Moral Reasoning
[link]
- Sorensen et al. (2024). A Roadmap to Pluralistic Alignment
[link]
- Kirk et al. (2024). The PRISM Alignment Project: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models
[link]
|
2/26 |
Assurance |
Mechanistic Interpretability: How can we leverage an understanding of model internals to improve alignment and safety? How do we verify correctness of these explanations?
Week 5 pre-debate form
Suggested Readings to Choose From for Debate (or suggest your own!):
Foundational Work
- Sharkey et al. Open Problems in Mechanistic Interpretability
[link]
- Elhage et al. A Mathematical Framework for Transformer Circuits
[link]
Superposition
- Elhage et al. Toy Models of Superposition
[link]
- Gurnee et al. Finding Neurons in a Haystack
[link]
- Nanda & Rajamanoharan et al. Fact Finding
[link]
Sparse Autoencoders (SAEs)
- Bricken et al. Towards Monosemanticity
[link]
- Marks et al. Sparse Feature Circuits
[link]
- Dunefsky, Chlenski et al. Transcoders Find Interpretable LLM Feature Circuits
[link]
- Kissane & Krzyzanowski et al. Interpreting Attention Layer Outputs with Sparse Autoencoders
[link]
- Makelov & Lange et al. Towards Principled Evaluations of Sparse Autoencoders
[link]
- Rajamanoharan et al. Gated SAEs
[link]
- Gao et al. Scaling and Evaluating Sparse Autoencoders
[link]
- Templeton et al. Scaling Monosemanticity
[link]
Activation Patching
- Heimersheim & Nanda. How to Use and Interpret Activation Patching
[link]
- Redwood. Causal Scrubbing
[link]
- Nanda. Attribution Patching
[link]
- Conmy et al. Automated Circuit Discovery
[link]
- Geiger et al. Distributed Alignment Search
[link]
- Makelov & Lange. An Interpretability Illusion for Subspace Activation Patching
[link]
Narrow Circuits
- Wang et al. Indirect Object Identification
[link]
- Hanna et al. A Greater-Than Circuit
[link]
- Lieberum et al. Does Circuit Analysis Interpretability Scale?
[link]
Others
- Olsson et al. The Induction Heads Paper
[link]
- Nanda et al. Progress Measures for Grokking via Mechanistic Interpretability
[link]
- Nostalgebraist. The Logit Lens
[link]
- McGrath et al. The Hydra Effect
[link]
- McDougall, Conmy & Rushing et al. Copy Suppression
[link]
- Tigges & Hollinsworth et al. Linear Representations of Sentiment
[link]
- Elhage et al. Softmax Linear Units
[link]
- Bills et al. Language Models Can Explain Neurons in Language Models
[link]
- Bolukbasi et al. An Interpretability Illusion for BERT
[link]
- Goh et al. Multimodal Neurons in Artificial Neural Networks
[link]
|
3/5 |
Assurance |
Part I: Guest Speaker
Week 6 pre-debate form and question for guest speaker
Guest speaker: Boaz Barak
Title: AI Safety via Inference-Time Compute
Abstract: Ensuring AI models reliably follow human intent, even in situations outside their training distribution, is a challenging problem. In this talk, we will discuss how spending more computation at inference time can be used to improve robust adherence to human-specified policies, specifically using reasoning AI models such as OpenAI’s o1-preview, o1-mini, and o1.
In particular, I will present Deliberative Alignment, a new safety training paradigm that directly teaches the model safety specifications and trains it to explicitly recall and accurately reason over those specifications before answering. Deliberative alignment simultaneously increases robustness to jailbreaks, decreases over-refusal rates, and improves out-of-distribution generalization. I will also discuss results showing that, unlike the case with scaling pretraining compute, adversarial robustness improves with inference-time compute.
The talk is based on the arXiv preprints 2412.16339 and 2501.18841.
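As a rough illustration of the inference-time pattern described in the abstract (not the training procedure from the preprints), the toy Python sketch below has a model quote and reason over an explicit safety specification before producing its final answer. The `generate` function and the `SAFETY_SPEC` text are hypothetical placeholders introduced for illustration, not part of any real API.

```python
# Toy sketch of a deliberative-alignment-style inference loop: the model first
# reasons over an explicit safety specification, then answers conditioned on
# that reasoning. All names here are illustrative placeholders.

SAFETY_SPEC = """\
1. Refuse requests that facilitate serious harm.
2. For dual-use topics, give high-level information only.
3. Otherwise, answer helpfully and completely."""


def generate(prompt: str) -> str:
    """Hypothetical model call; replace with a real chat-completion API."""
    return "<model output would appear here>"


def deliberative_answer(user_request: str) -> str:
    # Step 1: ask the model to recall the spec and reason over it explicitly.
    deliberation = generate(
        f"Safety specification:\n{SAFETY_SPEC}\n\n"
        f"User request:\n{user_request}\n\n"
        "Before answering, quote the relevant clauses of the specification and "
        "reason step by step about whether and how to comply."
    )
    # Step 2: produce the final answer conditioned on that deliberation.
    return generate(
        f"Deliberation:\n{deliberation}\n\n"
        f"User request:\n{user_request}\n\n"
        "Now give the final answer, following the conclusion of the deliberation."
    )


if __name__ == "__main__":
    print(deliberative_answer("How do I pick a lock?"))
```

In the method described in the abstract, this kind of spec-referencing reasoning is trained into the model rather than injected into the prompt at inference time; the sketch only shows the shape of the resulting behavior.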
Part II: Brief Lecture & Debate
Leveraging Sociotechnical Understanding to Mitigate Harmful Failures: How can we make sure we understand the broader context of deployments? Which deployments are risky? How do we think about government uses of AI versus private-sector uses? Is there a difference? How should sociotechnical understanding change algorithm design?
Suggested Readings to Choose From for Debate (or suggest your own!):
- Rivera et al. (2024). Escalation Risks from Language Models in Military and Diplomatic Decision-Making [link]
- Emory (2020). Probabilities towards death: bugsplat, algorithmic assassinations, and ethical due care, Critical Military Studies, 1(1): 1-20 [link]
- Magesh et al. (2024). Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools [link]
- Chouldechova et al. (2018). A case study of algorithm-assisted decision making in child maltreatment hotline screening decisions [link]
- Marcus Williams*, Micah Carroll*, Adhyyan Narang, Constantin Weisser, Brendan Murphy, Anca Dragan. On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback [link]
- Micah Carroll, Davis Foote, Anand Siththaranjan, Stuart Russell, Anca Dragan. AI Alignment with Changing and Influenceable Reward Functions [link]
- Micah Carroll*, Alan Chan*, Henry Ashton, David Krueger. Characterizing Manipulation from AI Systems [link]
- Sabour et al. (2025). Human Decision-making is Susceptible to AI-driven Manipulation [link]
- Haduong, N., & Smith, N. A. (2024). How Performance Pressure Influences AI-Assisted Decision Making [link]
- Halawi, D., Zhang, F., Yueh-Han, C., & Steinhardt, J. (2024). Approaching Human-Level Forecasting with Language Models [link]
|
3/12 |
Spring Recess |
No class!
|
3/19 |
Assurance |
Monitoring and Control
Week 7 pre-debate form
Guest speaker: Jaime Fernández Fisac
Suggested Readings to Choose From for Debate (or suggest your own!):
- K. C. Hsu, H. Hu, J. F. Fisac. The Safety Filter: A Unified View of Safety-Enforcing Control in Autonomous Systems [link]
- Erik Jones, Meg Tong, Jesse Mu, Mohammed Mahfoud, Jan Leike, Roger Grosse, Jared Kaplan, William Fithian, Ethan Perez, Mrinank Sharma. Forecasting Rare Language Model Behaviors [link]
- Sydney M. Katz, Anthony L. Corso, Esen Yel, Mykel J. Kochenderfer. Efficient Determination of Safety Requirements for Perception Systems [link]
- ShengYun Peng, Pin-Yu Chen, Matthew Hull, Duen Horng Chau. Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models [link]
- Anthony Corso, David Karamadian, Romeo Valentin, Mary Cooper, Mykel J. Kochenderfer. A Holistic Assessment of the Reliability of Machine Learning Systems [link]
- Heidy Khlaaf. Toward Comprehensive Risk Assessments and Assurance of AI-Based Systems [link]
- Benjamin Hilton, Marie Davidsen Buhl, Tomek Korbak, Geoffrey Irving. Safety Cases: A Scalable Approach to Frontier AI Safety [link]
- Daniel C.H. Tan, Fernando Acero, Robert McCarthy, Dimitrios Kanoulas, Zhibin Li. Your Value Function is a Control Barrier Function [link]
- Sarah Dean, Andrew J. Taylor, Ryan K. Cosner, Benjamin Recht, Aaron D. Ames. Guaranteeing Safety of Learned Perception Modules via Measurement-Robust Control Barrier Functions [link]
- Hanjiang Hu, Alexander Robey, Changliu Liu. Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks [link]
- Aaron D. Ames, Samuel Coogan, Magnus Egerstedt, Gennaro Notomista, Koushil Sreenath, Paulo Tabuada. Control Barrier Functions: Theory and Applications [link] (don't use this one for debate, but it might be useful background)
|
3/26 |
Assurance + Alignment |
Multi-Agent Systems, Simulations, and Alignment
Suggested Readings to Choose From for Debate (or suggest your own!):
- Gordon Dai, Weijia Zhang, Jinhan Li, Siqi Yang, Chidera Onochie Ibe, Srihas Rao, Arthur Caetano, Misha Sra. Artificial Leviathan: Exploring Social Evolution of LLM Agents Through the Lens of Hobbesian Social Contract Theory [link]
- Lukas Aichberger, Alasdair Paren, Yarin Gal, Philip Torr, Adel Bibi. Attacking Multimodal OS Agents with Malicious Image Patches [link]
- Lewis Hammond et al. Multi-Agent Risks from Advanced AI [link]
- Zengyi Qin, Kaiqing Zhang, Yuxiao Chen, Jingkai Chen, Chuchu Fan. Learning Safe Multi-Agent Control with Decentralized Neural Barrier Certificates [link]
- Ma et al. ELIGN: Expectation Alignment as a Multi-Agent Intrinsic Reward [link]
- Graesser, Cho, Kiela. Emergent Linguistic Phenomena in Multi-Agent Communication Games [link]
- Nayebi et al. Barriers and Pathways to Human-AI Alignment: A Game-Theoretic Approach [link]
- Baker et al. Emergent Tool Use from Multi-Agent Autocurricula [link]
- Chaabouni et al. Emergent Communication at Scale [link]
- Wang et al. Large language models cannot replace human participants because they cannot portray identity groups [link]
- Louie et al. Roleplay-doh: Enabling Domain-Experts to Create LLM-simulated Patients via Eliciting and Adhering to Principles [link]
|
4/2 |
Assurance |
Evaluation and Red Teaming: What can go wrong with evaluation for safety? What about evaluation more broadly? How do we properly evaluate models?
Guest Speaker: Nicholas Carlini
Suggested Readings to Choose From for Debate (or suggest your own!):
- Erik Jones, Meg Tong, Jesse Mu, Mohammed Mahfoud, Jan Leike, Roger Grosse, Jared Kaplan, William Fithian, Ethan Perez, Mrinank Sharma. Forecasting Rare Language Model Behaviors [link]
- Evan Miller. Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations [link]
- Joshua Clymer, Nick Gabrieli, David Krueger, Thomas Larsen. Safety Cases: How to Justify the Safety of Advanced AI Systems [link]
- Benjamin Hilton, Marie Davidsen Buhl, Tomek Korbak, Geoffrey Irving. Safety Cases: A Scalable Approach to Frontier AI Safety [link]
- Zora Che, Stephen Casper, Robert Kirk, Anirudh Satheesh, Stewart Slocum, Lev E McKinney, Rohit Gandikota, Aidan Ewart, Domenic Rosati, Zichu Wu, Zikui Cai, Bilal Chughtai, Yarin Gal, Furong Huang, Dylan Hadfield-Menell. Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities [link]
- Javier Rando, Jie Zhang, Nicholas Carlini, Florian Tramèr. Adversarial ML Problems Are Getting Harder to Solve and to Evaluate [link] (useful background, but not recommended for debate)
- Sydney M. Katz, Anthony L. Corso, Esen Yel, Mykel J. Kochenderfer. Efficient Determination of Safety Requirements for Perception Systems [link]
- Anthony Corso, David Karamadian, Romeo Valentin, Mary Cooper, Mykel J. Kochenderfer. A Holistic Assessment of the Reliability of Machine Learning Systems [link]
|
4/9 |
Governance |
Concentration of Power, Liability, and Regulation
Suggested Readings to Choose From for Debate (or suggest your own!):
- Ketan Ramakrishnan et al., U.S. Tort Liability for Large-Scale Artificial Intelligence Damages: A Primer for Developers and Policymakers, RAND Corp. (Aug. 21, 2024) [link]
- Cynthia Estlund, What Should We Do After Work? Automation and Employment Law, 128 Yale L.J. 254 (2018)
- Lina M. Khan, Amazon's Antitrust Paradox, 126 Yale L.J. 710 (2016)
- United States v. RealPage [pdf]
- Ganesh Sitaraman, Tejas N. Narechania, An Antimonopoly Approach to Governing Artificial Intelligence, 43 Yale Law and Policy Review 95 (2024) [link]
- Gabriel Weil, Tort Law as a Tool for Mitigating Catastrophic Risk from Artificial Intelligence [link]
- Noam Kolt, Governing AI Agents, Notre Dame Law Review, Vol. 101, Forthcoming (2025) [link]
- Martin Beraja, Andrew Kao, David Y. Yang, Noam Yuchtman. AI-tocracy. The Quarterly Journal of Economics, Volume 138, Issue 3, August 2023, Pages 1349–1402 [link]
|
4/16 |
Governance |
Economic Impacts
Suggested Readings to Choose From for Debate (or suggest your own!):
- Daron Acemoglu, Asuman Ozdaglar, James Siderius. Artificial Intelligence and Political Economy [link]
- Daron Acemoglu, Todd Lensman. Regulating Transformative Technologies. American Economic Review: Insights, vol. 6, no. 3, September 2024 (pp. 359–76) [link]
- Kunal Handa, Alex Tamkin, Miles McCain, Saffron Huang, Esin Durmus, Sarah Heck, Jared Mueller, Jerry Hong, Stuart Ritchie, Tim Belonax, Kevin K. Troy, Dario Amodei, Jared Kaplan, Jack Clark, Deep Ganguli. The Anthropic Economic Index [link]
- Erik Brynjolfsson, Danielle Li, Lindsey Raymond. Generative AI at Work. The Quarterly Journal of Economics, qjae044, February 2025 [link]
- Daron Acemoglu, Simon Johnson. Learning From Ricardo and Thompson: Machinery and Labor in the Early Industrial Revolution and in the Age of Artificial Intelligence. Annual Review of Economics, Vol. 16: 597-621, August 2024 [link]
|
4/23 |
Student Presentations |
Final Project Presentations and Wrap-Up
|