As models grow more powerful and demonstrate new capabilities, it is increasingly important to study how to align them and ensure safe behavior. In particular, we want to prevent malicious misuse, align AI actions with ethical standards, and build public trust in AI technology. This seminar is dedicated to identifying, characterizing, and mitigating unsafe behaviors in large models.

The Princeton AI Alignment and Safety Seminar (PASS) serves as a collaborative platform for researchers to present their findings, engage in insightful conversations, and identify opportunities to collaborate on novel solutions to emerging alignment and safety challenges.

All seminar talks are virtual. Watch the series on our YouTube channel. Join our mailing list and follow us on X (Twitter) to get notified about speakers and livestream links!

Upcoming Events

No upcoming events are currently scheduled.

Past Events

Adversarial ML: harder than ever
Nicholas Carlini, Anthropic
Wed, Apr 2, 2025, 1:30 pm
AI Safety via Inference-Time Compute
Boaz Barak, Gordon McKay Professor of Computer Science, Harvard University
Wed, Mar 5, 2025
Open-Source and Science in the Era of Foundation Models
Percy Liang, Stanford University and Center for Research on Foundation Models
Tue, Dec 3, 2024
Tulu 3: Exploring Frontiers in Open Language Model Post-Training
Nathan Lambert, Allen Institute for AI
Mon, Nov 4, 2024, 2:00 pm
Professor David Krueger
View via Livestream: https://youtube.com/live/zoPkTHheC4Q?feature=share
Tue, Oct 15, 2024, 3:00 pm
Generative models memorize: what that means and why we care
Katherine Lee, Google DeepMind
Fri, Sep 27, 2024, 2:00 pm
Evaluating the Social Impact of Generative AI Systems in Systems and Society
View via Livestream: https://youtube.com/live/pBShTHNDO-w
Tue, May 14, 2024, 2:00 pm
Jacob Steinhardt, UC Berkeley
View via Livestream: https://youtube.com/live/nT_Bhs8Al-Y
Fri, May 3, 2024, 3:00 pm
An Overview of Catastrophic AI Risks
Dan Hendrycks, Center for AI Safety
View via Livestream: https://youtube.com/live/TGiAakrtkSA
Tue, Apr 9, 2024, 2:00 pm
AI Preparedness
Aleksander Madry, OpenAI, MIT EECS Department
Tue, Mar 26, 2024, 2:00 pm

Publications

(Bold indicates PLI author)

2024

What is in Your Safe Data? Identifying Benign Data that Breaks Safety

Luxi He, Mengzhou Xia, Peter Henderson

Paper


MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with Diverse Human Preferences

Souradip Chakraborty, Jiahao Qiu, Hui Yuan, Alec Koppel, Furong Huang, Dinesh Manocha, Amrit Singh Bedi, Mengdi Wang

Paper


Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates

Kaifeng Lyu, Haoyu Zhao, Xinran Gu, Dingli Yu, Anirudh Goyal, Sanjeev Arora

Paper Code


Personalized Language Modeling from Personalized Human Feedback

Xinyu Li, Zachary C. Lipton, Liu Leqi 

Paper


Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications

Boyi Wei, Kaixuan Huang, Yangsibo Huang, Tinghao Xie, Xiangyu Qi, Mengzhou Xia, Prateek Mittal, Mengdi Wang, Peter Henderson

Paper Code Website


2023

Trainable Transformer in Transformer

Abhishek Panigrahi, Sadhika Malladi, Mengzhou Xia, Sanjeev Arora

Paper Code


Playing Large Games with Oracles and AI Debate

Xinyi Chen, Angelica Chen, Dean Foster, Elad Hazan

Paper


Visual Adversarial Examples Jailbreak Aligned Large Language Models

Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Peter Henderson, Mengdi Wang, Prateek Mittal

Paper Code


Detecting Pretraining Data from Large Language Models

Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, Luke Zettlemoyer

Paper Code Website


Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, Peter Henderson

Paper Code Website


Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation

Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, Danqi Chen

Paper Code Website

Organizers