Details

NOTE: The speaker will join online, but the audience is encouraged to join the moderators and COS 598 students to view the talk in person in Friend Center 006.
ABSTRACT: Ensuring that AI models reliably follow human intent, even in situations outside their training distribution, is a challenging problem. In this talk, we will discuss how spending more computation at inference time can improve robust adherence to human-specified policies, specifically using reasoning AI models such as OpenAI’s o1-preview, o1-mini, and o1. In particular, I will present Deliberative Alignment: a new safety training paradigm that directly teaches the model safety specifications and trains it to explicitly recall and accurately reason over those specifications before answering. Deliberative Alignment simultaneously increases robustness to jailbreaks, decreases over-refusal rates, and improves out-of-distribution generalization. I will also discuss results showing that, unlike with scaling pretraining compute, adversarial robustness improves with inference-time compute.
The talk will be based on the arXiv preprints 2412.16339 and 2501.18841.
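To make the idea concrete, below is a minimal conceptual sketch (in Python) of the inference-time pattern the abstract describes: the model is prompted to recall the relevant safety specification and reason over it before producing a final answer. This is only an illustration under assumed names; the SAFETY_SPEC text, call_model stub, and prompt format are hypothetical and are not the training pipeline or API from the preprints.

# Conceptual sketch only: not the Deliberative Alignment training pipeline,
# just the inference-time pattern of "recall the spec, reason over it, then answer".
# SAFETY_SPEC and call_model are hypothetical placeholders.

SAFETY_SPEC = """\
1. Refuse requests that facilitate serious harm.
2. Answer sensitive but legitimate questions helpfully and with care.
3. Do not over-refuse: answer benign requests directly."""


def call_model(prompt: str) -> str:
    """Placeholder for a call to a reasoning model (e.g., an o1-style model)."""
    return "<reasoning over the specification>\n---\n<final answer>"


def deliberative_answer(user_request: str) -> str:
    # Ask the model to identify the applicable spec clauses and reason about
    # them explicitly before committing to an answer.
    prompt = (
        "Safety specification:\n" + SAFETY_SPEC + "\n\n"
        "User request:\n" + user_request + "\n\n"
        "First, identify which clauses of the specification apply and reason "
        "about them step by step. Then, after a line containing only '---', "
        "give a final answer that complies with the specification."
    )
    completion = call_model(prompt)
    # Only the text after the separator is returned to the user; the reasoning
    # stays internal, mirroring how reasoning models keep chain-of-thought hidden.
    _, _, final_answer = completion.partition("\n---\n")
    return final_answer or completion


if __name__ == "__main__":
    print(deliberative_answer("How do I choose a strong passphrase?"))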
Bio: Boaz Barak is a professor of Computer Science at Harvard University and a member of the Harvard SEAS Theory of Computing group. He was previously a principal researcher at Microsoft Research New England, and before that an associate professor (with tenure) in Princeton University’s Computer Science department. He holds a Ph.D. from the Weizmann Institute of Science and was a postdoctoral fellow at the Institute for Advanced Study in Princeton. He is a theoretical computer scientist interested in computational complexity, algorithms, cryptography, quantum computing, and particularly the foundations of machine learning.
Submit your questions for the speaker to the Princeton AI Alignment and Safety Seminar (PASS)! We will moderate the questions and pose them to the speaker during the discussion period: PASS Question Submission