1/29 |
Introduction |
What is AI safety? Class intro, logistics, foundations, etc.
Readings: N/A
|
2/5 |
Specification and Robustness |
Adversarial Robustness and Misuse: If an adversary is trying to misuse a model for some purpose, how do we prevent this? How do we evaluate the limits of robustness?
Week 2 pre-debate form
Suggested Readings to Choose From for Debate (or suggest your own!):
- Zou, A., et al. (2023). Universal and Transferable Adversarial Attacks on Aligned Language Models [link]
- Anthropic (2023). Frontier Threats Red Teaming for AI Safety [link]
- Li et al. (2024). LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet [link]
- Zou, A., et al. (2024). Improving Alignment and Robustness with Circuit Breakers [link]
- Andriushchenko, M., & Flammarion, N. (2024). Does Refusal Training in LLMs Generalize to the Past Tense? [link]
- Lee, D., & Tiwari, M. (2023). Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems [link]
- Andriushchenko et al. (2024). Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks [link]
- Anil et al. (2024). Many-shot Jailbreaking [link]
- Liu, Y., Jia, Y., Geng, R., Jia, J., & Gong, N. Z. (2024). Formalizing and Benchmarking Prompt Injection Attacks and Defenses [link]
|
2/12 |
Specification and Robustness |
Alignment: How do we specify the right rewards? What happens when rewards are misspecified or underspecified? How do we reliably align models to human values?
Week 3 pre-debate form
Suggested Readings to Choose From for Debate (or suggest your own!):
Reward Specification and Learning
- Krakovna et al. (2020). Specification gaming: the flip side of AI ingenuity
[link]
- Christiano et al. (2017). Deep reinforcement learning from human preferences
[link]
- Casper et al. (2023). Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
[link]
- Leike et al. (2018). Scalable agent alignment via reward modeling: a research direction
[link] [blog]
- Gao et al. (2022). Scaling Laws for Reward Model Overoptimization
[link]
- Pan, Bhatia, Steinhardt (2022). The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models
[link]
- Hadfield-Menell et al. (2016). Cooperative Inverse Reinforcement Learning
[link]
- Hadfield-Menell, Hadfield (2018). Incomplete Contracting and AI Alignment
[link]
- Li et al. (2023). Survival Instinct in Offline Reinforcement Learning
[link]
Goal Misgeneralization
- Shah et al. (2022). Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals
[link] [blog]
- Langosco et al. (2021). Goal Misgeneralization in Deep Reinforcement Learning
[link]
- Shah (2023). Categorizing failures as "outer" or "inner" misalignment is often confused
[link]
- Hubinger et al. (2019). Risks from Learned Optimization
[link]
- Pan et al. (2024). Spontaneous Reward Hacking in Iterative Self-Refinement
[link]
Human Reward Design
- Ouyang et al. (2022). Training language models to follow instructions with human feedback
[link]
- Askell et al. (2022). A General Language Assistant as a Laboratory for Alignment
[link]
- Bai et al. (2022). Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
[link]
- Lanham et al. (2023). Measuring Faithfulness in Chain-of-Thought Reasoning
[link]
|
2/19 |
Specification and Robustness |
Scalable Oversight: What do we do when it's really difficult to monitor and steer models with human oversight? What if most humans can't even accomplish the task? How can we align to human values at scale? Whose values?
Week 4 pre-debate form
Suggested Readings to Choose From for Debate (or suggest your own!):
Scalable Oversight
- Burns et al. (2023). Weak-to-strong generalization: Eliciting Strong Capabilities With Weak Supervision
[link]
- Christiano et al. (2018). Supervising strong learners by amplifying weak experts
[link]
- Bowman et al. (2022). Measuring Progress on Scalable Oversight for Large Language Models
[link]
- Wen et al. (2024). Language Models Learn to Mislead Humans via RLHF
[link]
- Wen et al. (2024). Learning Task Decomposition to Assist Humans in Competitive Programming
[link]
- Haupt et al. (2022). Formal Contracts Mitigate Social Dilemmas in Multi-Agent RL
[link]
- Fluri et al. (2023). Evaluating Superhuman Models with Consistency Checks
[link]
Specifying Values and Rules at Scale
- Bai et al. (2022). Constitutional AI: Harmlessness from AI Feedback
[link]
- Hendrycks et al. (2021). Aligning AI With Shared Human Values
[link]
- Hendrycks et al. (2023). Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark
[link]
- Hendrycks et al. (2023). What Would Jiminy Cricket Do? Towards Agents That Behave Morally
[link]
- Scherrer et al. (2023). Evaluating the Moral Beliefs Encoded in LLMs
[link]
- Ma et al. (2023). Let's Do a Thought Experiment: Using Counterfactuals to Improve Moral Reasoning
[link]
- Sorensen et al. (2024). A Roadmap to Pluralistic Alignment
[link]
- Kirk et al. (2024). The PRISM Alignment Project: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models
[link]
|
2/26 |
Assurance |
Mechanistic Interpretability: How can we leverage an understanding of model internals to improve alignment and safety? How do we verify correctness of these explanations?
Week 5 pre-debate form
Suggested Readings to Choose From for Debate (or suggest your own!):
Foundational Work
- Sharkey et al. Open Problems in Mechanistic Interpretability
[link]
- Elhage et al. A Mathematical Framework for Transformer Circuits
[link]
Superposition
- Elhage et al. Toy Models of Superposition
[link]
- Gurnee et al. Finding Neurons in a Haystack
[link]
- Nanda & Rajamanoharan et al. Fact Finding
[link]
Sparse Autoencoders (SAEs)
- Bricken et al. Towards Monosemanticity
[link]
- Marks et al. Sparse Feature Circuits
[link]
- Dunefsky, Chlenski et al. Transcoders Find Interpretable LLM Feature Circuits
[link]
- Kissane & Krzyzanowski et al. Interpreting Attention Layer Outputs with Sparse Autoencoders
[link]
- Makelov & Lange et al. Towards Principled Evaluations of Sparse Autoencoders
[link]
- Rajamanoharan et al. Gated SAEs
[link]
- Gao et al. Scaling and Evaluating Sparse Autoencoders
[link]
- Templeton et al. Scaling Monosemanticity
[link]
Activation Patching
- Heimersheim & Nanda. How to Use and Interpret Activation Patching
[link]
- Redwood. Causal Scrubbing
[link]
- Nanda. Attribution Patching
[link]
- Conmy et al. Automated Circuit Discovery
[link]
- Geiger et al. Distributed Alignment Search
[link]
- Makelov & Lange. An Interpretability Illusion for Subspace Activation Patching
[link]
Narrow Circuits
- Wang et al. Indirect Object Identification
[link]
- Hanna et al. A Greater-Than Circuit
[link]
- Lieberum et al. Does Circuit Analysis Interpretability Scale?
[link]
Others
- Olsson et al. The Induction Heads Paper
[link]
- Nanda et al. Progress Measures for Grokking via Mechanistic Interpretability
[link]
- Nostalgebraist. The Logit Lens
[link]
- McGrath et al. The Hydra Effect
[link]
- McDougall, Conmy & Rushing et al. Copy Suppression
[link]
- Tigges & Hollinsworth et al. Linear Representations of Sentiment
[link]
- Elhage et al. Softmax Linear Units
[link]
- Bills et al. Language Models Can Explain Neurons in Language Models
[link]
- Bolukbasi et al. An Interpretability Illusion for BERT
[link]
- Goh et al. Multimodal Neurons in Artificial Neural Networks
[link]
|
3/5 |
Assurance |
Part I: Guest Speaker
Week 6 pre-debate form and question for guest speaker
Guest speaker: Boaz Barak
Title: AI Safety via Inference-Time Compute
Abstract: Ensuring AI models reliably follow human intent, even in situations outside their training distribution, is a challenging problem. In this talk, we will discuss how spending more computation at inference time can be used to improve robust adherence to human-specified policies, specifically using reasoning AI models such as OpenAI’s o1-preview, o1-mini, and o1.
In particular, I will present Deliberative Alignment, a new safety training paradigm that directly teaches the model safety specifications and trains it to explicitly recall and accurately reason over those specifications before answering. Deliberative alignment simultaneously increases robustness to jailbreaks, decreases over-refusal rates, and improves out-of-distribution generalization. I will also discuss results showing that, unlike the case with scaling pretraining compute, adversarial robustness improves with inference-time compute.
The talk is based on the arXiv preprints 2412.16339 and 2501.18841.
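As a rough illustration of the inference-time pattern described in the abstract (not the training procedure from the preprints), the toy Python sketch below has a model quote and reason over an explicit safety specification before producing its final answer. The `generate` function and the `SAFETY_SPEC` text are hypothetical placeholders introduced for illustration, not part of any real API.

```python
# Toy sketch of a deliberative-alignment-style inference loop: the model first
# reasons over an explicit safety specification, then answers conditioned on
# that reasoning. All names here are illustrative placeholders.

SAFETY_SPEC = """\
1. Refuse requests that facilitate serious harm.
2. For dual-use topics, give high-level information only.
3. Otherwise, answer helpfully and completely."""


def generate(prompt: str) -> str:
    """Hypothetical model call; replace with a real chat-completion API."""
    return "<model output would appear here>"


def deliberative_answer(user_request: str) -> str:
    # Step 1: ask the model to recall the spec and reason over it explicitly.
    deliberation = generate(
        f"Safety specification:\n{SAFETY_SPEC}\n\n"
        f"User request:\n{user_request}\n\n"
        "Before answering, quote the relevant clauses of the specification and "
        "reason step by step about whether and how to comply."
    )
    # Step 2: produce the final answer conditioned on that deliberation.
    return generate(
        f"Deliberation:\n{deliberation}\n\n"
        f"User request:\n{user_request}\n\n"
        "Now give the final answer, following the conclusion of the deliberation."
    )


if __name__ == "__main__":
    print(deliberative_answer("How do I pick a lock?"))
```

In the method described in the abstract, this kind of spec-referencing reasoning is trained into the model rather than injected into the prompt at inference time; the sketch only shows the shape of the resulting behavior.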
Part II: Brief Lecture & Debate
Leveraging Sociotechnical Understanding to Mitigate Harmful Failures: How can we make sure we understand the broader context of deployments? Which deployments are risky? How do we think about government uses of AI versus private-sector uses? Is there a difference? How should sociotechnical understanding change algorithm design?
Suggested Readings to Choose From for Debate (or suggest your own!):
- Rivera et al. (2024). Escalation Risks from Language Models in Military and Diplomatic Decision-Making [link]
- Emory (2020). Probabilities towards death: bugsplat, algorithmic assassinations, and ethical due care, Critical Military Studies, 1(1): 1-20 [link]
- Magesh et al. (2024). Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools [link]
- Chouldechova et al. (2018). A case study of algorithm-assisted decision making in child maltreatment hotline screening decisions [link]
- Marcus Williams*, Micah Carroll*, Adhyyan Narang, Constantin Weisser, Brendan Murphy, Anca Dragan. On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback [link]
- Micah Carroll, Davis Foote, Anand Siththaranjan, Stuart Russell, Anca Dragan. AI Alignment with Changing and Influenceable Reward Functions [link]
- Micah Carroll*, Alan Chan*, Henry Ashton, David Krueger. Characterizing Manipulation from AI Systems [link]
- Sabour et al. (2025). Human Decision-making is Susceptible to AI-driven Manipulation [link]
- Haduong, N., & Smith, N. A. (2024). How Performance Pressure Influences AI-Assisted Decision Making [link]
- Halawi, D., Zhang, F., Yueh-Han, C., & Steinhardt, J. (2024). Approaching Human-Level Forecasting with Language Models [link]
|
3/12 |
Spring Recess |
No class!
|
3/19 |
Assurance |
Monitoring and Control
Week 7 pre-debate form
Guest speaker: Jaime Fernández Fisac
Suggested Readings to Choose From for Debate (or suggest your own!):
- K. C. Hsu, H. Hu, J. F. Fisac. The Safety Filter: A Unified View of Safety-Enforcing Control in Autonomous Systems [link]
- Erik Jones, Meg Tong, Jesse Mu, Mohammed Mahfoud, Jan Leike, Roger Grosse, Jared Kaplan, William Fithian, Ethan Perez, Mrinank Sharma. Forecasting Rare Language Model Behaviors [link]
- Sydney M. Katz, Anthony L. Corso, Esen Yel, Mykel J. Kochenderfer. Efficient Determination of Safety Requirements for Perception Systems [link]
- ShengYun Peng, Pin-Yu Chen, Matthew Hull, Duen Horng Chau. Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models [link]
- Anthony Corso, David Karamadian, Romeo Valentin, Mary Cooper, Mykel J. Kochenderfer. A Holistic Assessment of the Reliability of Machine Learning Systems [link]
- Heidy Khlaaf. Toward Comprehensive Risk Assessments and Assurance of AI-Based Systems [link]
- Benjamin Hilton, Marie Davidsen Buhl, Tomek Korbak, Geoffrey Irving. Safety Cases: A Scalable Approach to Frontier AI Safety [link]
- Daniel C.H. Tan, Fernando Acero, Robert McCarthy, Dimitrios Kanoulas, Zhibin Li. Your Value Function is a Control Barrier Function [link]
- Sarah Dean, Andrew J. Taylor, Ryan K. Cosner, Benjamin Recht, Aaron D. Ames. Guaranteeing Safety of Learned Perception Modules via Measurement-Robust Control Barrier Functions [link]
- Hanjiang Hu, Alexander Robey, Changliu Liu. Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks [link]
- Aaron D. Ames, Samuel Coogan, Magnus Egerstedt, Gennaro Notomista, Koushil Sreenath, Paulo Tabuada. Control Barrier Functions: Theory and Applications [link] (don't use this one for debate, but it might be useful background)
|
3/26 |
Assurance + Alignment |
Multi-Agent Systems, Simulations, and Alignment
Suggested Readings to Choose From for Debate (or suggest your own!):
- Gordon Dai, Weijia Zhang, Jinhan Li, Siqi Yang, Chidera Onochie Ibe, Srihas Rao, Arthur Caetano, Misha Sra. Artificial Leviathan: Exploring Social Evolution of LLM Agents Through the Lens of Hobbesian Social Contract Theory [link]
- Lukas Aichberger, Alasdair Paren, Yarin Gal, Philip Torr, Adel Bibi. Attacking Multimodal OS Agents with Malicious Image Patches [link]
- Lewis Hammond et al. Multi-Agent Risks from Advanced AI [link]
- Zengyi Qin, Kaiqing Zhang, Yuxiao Chen, Jingkai Chen, Chuchu Fan. Learning Safe Multi-Agent Control with Decentralized Neural Barrier Certificates [link]
- Ma et al. ELIGN: Expectation Alignment as a Multi-Agent Intrinsic Reward [link]
- Graesser, Cho, Kiela. Emergent Linguistic Phenomena in Multi-Agent Communication Games [link]
- Nayebi et al. Barriers and Pathways to Human-AI Alignment: A Game-Theoretic Approach [link]
- Baker et al. Emergent Tool Use from Multi-Agent Autocurricula [link]
- Chaabouni et al. Emergent Communication at Scale [link]
- Wang et al. Large language models cannot replace human participants because they cannot portray identity groups [link]
- Louie et al. Roleplay-doh: Enabling Domain-Experts to Create LLM-simulated Patients via Eliciting and Adhering to Principles [link]
|
4/2 |
Assurance |
Evaluation and Red Teaming: What can go wrong with evaluation for safety? What about evaluation more broadly? How do we properly evaluate models?
Guest Speaker: Nicholas Carlini
Suggested Readings to Choose From for Debate (or suggest your own!):
- Erik Jones, Meg Tong, Jesse Mu, Mohammed Mahfoud, Jan Leike, Roger Grosse, Jared Kaplan, William Fithian, Ethan Perez, Mrinank Sharma. Forecasting Rare Language Model Behaviors [link]
- Evan Miller. Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations [link]
- Joshua Clymer, Nick Gabrieli, David Krueger, Thomas Larsen. Safety Cases: How to Justify the Safety of Advanced AI Systems [link]
- Benjamin Hilton, Marie Davidsen Buhl, Tomek Korbak, Geoffrey Irving. Safety Cases: A Scalable Approach to Frontier AI Safety [link]
- Zora Che, Stephen Casper, Robert Kirk, Anirudh Satheesh, Stewart Slocum, Lev E McKinney, Rohit Gandikota, Aidan Ewart, Domenic Rosati, Zichu Wu, Zikui Cai, Bilal Chughtai, Yarin Gal, Furong Huang, Dylan Hadfield-Menell. Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities [link]
- Javier Rando, Jie Zhang, Nicholas Carlini, Florian Tramèr. Adversarial ML Problems Are Getting Harder to Solve and to Evaluate [link] (useful background, but not recommended for debate)
- Sydney M. Katz, Anthony L. Corso, Esen Yel, Mykel J. Kochenderfer. Efficient Determination of Safety Requirements for Perception Systems [link]
- Anthony Corso, David Karamadian, Romeo Valentin, Mary Cooper, Mykel J. Kochenderfer. A Holistic Assessment of the Reliability of Machine Learning Systems [link]
|
4/9 |
Governance |
Concentration of Power, Liability, and Regulation
Suggested Readings to Choose From for Debate (or suggest your own!):
- Ketan Ramakrishnan et al., U.S. Tort Liability for Large-Scale Artificial Intelligence Damages: A Primer for Developers and Policymakers, RAND Corp. (Aug. 21, 2024) [link]
- Cynthia Estlund, What Should We Do After Work? Automation and Employment Law, 128 Yale L.J. 254 (2018)
- Lina M. Khan, Amazon's Antitrust Paradox, 126 Yale L.J. 710 (2016)
- United States v. RealPage [pdf]
- Ganesh Sitaraman, Tejas N. Narechania, An Antimonopoly Approach to Governing Artificial Intelligence, 43 Yale Law and Policy Review 95 (2024) [link]
- Gabriel Weil, Tort Law as a Tool for Mitigating Catastrophic Risk from Artificial Intelligence [link]
- Noam Kolt, Governing AI Agents, Notre Dame Law Review, Vol. 101, Forthcoming (2025) [link]
- Martin Beraja, Andrew Kao, David Y. Yang, Noam Yuchtman. AI-tocracy. The Quarterly Journal of Economics, Volume 138, Issue 3, August 2023, Pages 1349–1402 [link]
|
4/16 |
Governance |
Economic Impacts
Suggested Readings to Choose From for Debate (or suggest your own!):
- Daron Acemoglu, Asuman Ozdaglar, James Siderius. Artificial Intelligence and Political Economy [link]
- Daron Acemoglu, Todd Lensman. Regulating Transformative Technologies. American Economic Review: Insights, vol. 6, no. 3, September 2024 (pp. 359–76) [link]
- Kunal Handa, Alex Tamkin, Miles McCain, Saffron Huang, Esin Durmus, Sarah Heck, Jared Mueller, Jerry Hong, Stuart Ritchie, Tim Belonax, Kevin K. Troy, Dario Amodei, Jared Kaplan, Jack Clark, Deep Ganguli. The Anthropic Economic Index [link]
- Erik Brynjolfsson, Danielle Li, Lindsey Raymond. Generative AI at Work. The Quarterly Journal of Economics, qjae044, February 2025 [link]
- Daron Acemoglu, Simon Johnson. Learning From Ricardo and Thompson: Machinery and Labor in the Early Industrial Revolution and in the Age of Artificial Intelligence. Annual Review of Economics, Vol. 16: 597-621, August 2024 [link]
|
4/23 |
Student Presentations |
Final Project Presentations and Wrap-Up
|