As models grow more powerful and demonstrate new capabilities, it is increasingly important to study how to align them and ensure safe behavior. In particular, we want to prevent malicious misuse, align AI actions with ethical standards, and build public trust in AI technology. This seminar is dedicated to identifying, characterizing, and mitigating unsafe behaviors in large models.

The Princeton AI Alignment and Safety Seminar (PASS) serves as a collaborative platform where researchers present their findings, participate in insightful conversations, and identify opportunities to collaborate on novel solutions to these emerging alignment and safety challenges.

All seminar talks will be virtual. View our series on our YouTube channel. Join our mailing list and follow us on X (formerly Twitter) to be notified of speakers and livestream links!

Upcoming Events

No upcoming events are currently listed.

Publications

(Bold indicates PLI author)

2024

MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with Diverse Human Preferences

Souradip Chakraborty, Jiahao Qiu, Hui Yuan, Alec Koppel, Furong Huang, Dinesh Manocha, Amrit Singh Bedi, Mengdi Wang

Paper


Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates

Kaifeng Lyu, Haoyu Zhao, Xinran Gu, Dingli Yu, Anirudh Goyal, Sanjeev Arora

Paper Code


Personalized Language Modeling from Personalized Human Feedback

Xinyu Li, Zachary C. Lipton, Liu Leqi 

Paper


Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications

Boyi Wei, Kaixuan Huang, Yangsibo Huang, Tinghao Xie, Xiangyu Qi, Mengzhou Xia, Prateek Mittal, Mengdi Wang, Peter Henderson

Paper Code Website


2023

Trainable Transformer in Transformer

Abhishek Panigrahi, Sadhika Malladi, Mengzhou Xia, Sanjeev Arora

Paper Code


Playing Large Games with Oracles and AI Debate

Xinyi Chen, Angelica Chen, Dean Foster, Elad Hazan

Paper


Visual Adversarial Examples Jailbreak Aligned Large Language Models

Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Peter Henderson, Mengdi Wang, Prateek Mittal

Paper Code


Detecting Pretraining Data from Large Language Models

Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, Luke Zettlemoyer

Paper Code Website


Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, Peter Henderson

Paper Code Website


Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation

Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, Danqi Chen

Paper Code Website

Organizers