As models grow more powerful and demonstrate new capabilities, it is increasingly important to study how we can align these models and ensure safe behavior. In particular, we want to prevent malicious misuse, align AI actions with ethical standards, and build public trust in AI technology. This seminar is dedicated to identifying, characterizing, and mitigating unsafe behaviors in large models.
The Princeton AI Alignment and Safety Seminar (PASS) serves as a collaborative platform for researchers to present their findings, engage in insightful conversations, and identify opportunities to collaborate on novel solutions to emerging alignment and safety challenges.
All seminar talks are virtual. Watch the series on our YouTube channel. Join our mailing list and follow us on X (formerly Twitter) to be notified of speakers and livestream links!
Upcoming Events
No upcoming events are currently scheduled.
Past Events
Publications
(Bold indicates PLI author)
2024
MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with Diverse Human Preferences
Souradip Chakraborty, Jiahao Qiu, Hui Yuan, Alec Koppel, Furong Huang, Dinesh Manocha, Amrit Singh Bedi, Mengdi Wang
Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates
Kaifeng Lyu, Haoyu Zhao, Xinran Gu, Dingli Yu, Anirudh Goyal, Sanjeev Arora
Personalized Language Modeling from Personalized Human Feedback
Xinyu Li, Zachary C. Lipton, Liu Leqi
Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
Boyi Wei, Kaixuan Huang, Yangsibo Huang, Tinghao Xie, Xiangyu Qi, Mengzhou Xia, Prateek Mittal, Mengdi Wang, Peter Henderson
2023
Trainable Transformer in Transformer
Abhishek Panigrahi, Sadhika Malladi, Mengzhou Xia, Sanjeev Arora
Playing Large Games with Oracles and AI Debate
Xinyi Chen, Angelica Chen, Dean Foster, Elad Hazan
Visual Adversarial Examples Jailbreak Aligned Large Language Models
Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Peter Henderson, Mengdi Wang, Prateek Mittal
Detecting Pretraining Data from Large Language Models
Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, Luke Zettlemoyer
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, Peter Henderson
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, Danqi Chen