As models grow more powerful and demonstrate new capabilities, it is increasingly important to study how to align them and ensure safe behavior. In particular, we want to prevent malicious misuse, align AI actions with ethical standards, and build public trust in AI technology. This seminar is dedicated to identifying, characterizing, and mitigating unsafe behaviors in large models.

The Princeton AI Alignment and Safety Seminar (PASS) serves as a collaborative platform for researchers to present their findings, engage in insightful conversations, and identify opportunities to collaborate on novel solutions to emerging alignment and safety challenges.

All seminar talks will be virtual. Watch the series on our YouTube channel. Join our mailing list and follow us on X (Twitter) to get notified of speakers and livestream links!

Upcoming Events

Jacob Steinhardt, UC Berkeley
Tue, Apr 30, 2024, 3:00 pm – 4:00 pm
Irene Solaiman, Usman Gohar, and Jennifer Mickel
Tue, May 14, 2024, 2:00 pm – 3:00 pm

Past Events

Publications

(Bold indicates PLI author)

2024

MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with Diverse Human Preferences

Souradip Chakraborty, Jiahao Qiu, Hui Yuan, Alec Koppel, Furong Huang, Dinesh Manocha, Amrit Singh Bedi, Mengdi Wang

Paper


Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates

Kaifeng Lyu, Haoyu Zhao, Xinran Gu, Dingli Yu, Anirudh Goyal, Sanjeev Arora

Paper    Code


Personalized Language Modeling from Personalized Human Feedback

Xinyu Li, Zachary C. Lipton, Liu Leqi 

Paper


Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications

Boyi Wei, Kaixuan Huang, Yangsibo Huang, Tinghao Xie, Xiangyu Qi, Mengzhou Xia, Prateek Mittal, Mengdi Wang, Peter Henderson

Paper Code Website


2023

Trainable Transformer in Transformer

Abhishek Panigrahi, Sadhika Malladi, Mengzhou Xia, Sanjeev Arora

Paper Code


Playing Large Games with Oracles and AI Debate

Xinyi Chen, Angelica Chen, Dean Foster, Elad Hazan

Paper


Visual Adversarial Examples Jailbreak Aligned Large Language Models

Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Peter Henderson, Mengdi Wang, Prateek Mittal

Paper Code


Detecting Pretraining Data from Large Language Models

Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, Luke Zettlemoyer

Paper    Code    Website


Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, Peter Henderson

Paper    Code    Website


Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation

Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, Danqi Chen

Paper    Code    Website

Organizers

Sanjeev Arora
Director of Princeton Language and Intelligence
Charles C. Fitzmorris Professor of Computer Science
Danqi Chen
Associate Director of Princeton Language and Intelligence
Assistant Professor of Computer Science
Elad Hazan
Professor of Computer Science
Peter Henderson
Professor of Computer Science and the School of Public and International Affairs
Yangsibo Huang
PLI Graduate Student
Princeton University
Kai Li
Paul M. Wythes '55 P86 and Marcia R. Wythes P86 Professor in Computer Science
Sadhika Malladi
PLI Graduate Student
Princeton University
Prateek Mittal
Professor of Electrical and Computer Engineering
Arvind Narayanan
Professor of Computer Science
Director, Center for Information Technology Policy
Dawn Song
Professor in the Department of Electrical Engineering and Computer Science
UC Berkeley