Written by
Allison Gasparini, Princeton Language and Intelligence
Feb. 13, 2024

Princeton Language and Intelligence (PLI) announced they’ve awarded seed grant funding to 14 research projects across the Princeton University campus. This first round of grant funding from PLI demonstrates the initiative’s objective to prioritize supporting campus research; this round of funding comes just after the initiative held its kick-off symposium in late Sept. 2023. The funds are meant to enable and aid the use of large AI models in an array of academic disciplines. 

The 14 research projects receiving funding represent 30 different investigators, 16 departments, and all four research divisions on campus — humanities, social sciences, natural sciences, and engineering. The winning proposals were chosen for their quality, originality, and potential impact, among other factors.

“PLI is thrilled to support these outstanding researchers in their endeavors to utilize large AI models for enhancing research across the spectrum of academic disciplines,” said Sanjeev Arora, Director of PLI. “We look forward to watching how each project progresses and to providing updates about the advancements on our PLI Blog.”

The funded projects include one that will analyze parliamentary speeches to track political polarization on the topic of immigration over time, one that aims to build AI models that simulate how children learn in the first years of life, and an initiative to create infrastructure for modeling handwritten historical texts in order to make them more accessible. In all, PLI distributed $750,000 to kick off the first year of grant funding.

AI for Humanists / Humanities for AI: Parameterizing Style: Dataset, Workshop, Notebook, Paper
Meredith Martin, Associate Professor of English. Director, Center for Digital Humanities
Wouter Haverals, Postdoctoral Research Associate, The Council of the Humanities. Lecturer in Computer Science

Large language models (LLMs) are revolutionary. They can generate text that is not only fluent and grammatically correct but also adheres to a consistent theme and maintains a general style. Additionally, they create vector embeddings that facilitate similarity measures. Yet, these capabilities force us to confront how complicated the terms “similarity” and “style” really are. While we know how to make AI systems talk, we aim to thoroughly explore, articulate, and confront the concepts of “similarity” and “style” from a humanistic perspective, distinguishing clearly between style, genre, and parody. Our goal is to provide a community discussion around the nuanced problem of how to identify and encode literary style and then to provide training data for evaluation and experimentation with open-source large language models.

2024 Princeton ML Theory Summer School
Boris Hanin, Assistant Professor of Operations Research and Financial Engineering

ML Theory Summer School is a program for PhD students from around the world that will showcase, through six main courses, a range of exciting recent developments in theoretical machine learning. The primary focus this year is on theoretical advances in deep learning, optimization/sampling, and interactive decision-making. An important secondary goal is to connect young researchers and foster community within theoretical machine learning.

Impacts of Social Media on Wildlife Conservation
Andrea Lea DiGiorgio, Lecturer in the Princeton Writing Program
Cathryn Freund, PhD, Phillip and Patricia Frost Museum of Science, Miami

Social media has the power to be a tremendous help to the efforts of wildlife conservation agencies and researchers. Unfortunately, it can often have unintended and extremely negative consequences for wildlife. This project will use AI to explore viewer responses to social media posts in order to determine best practices and craft policy for conservation agencies and researchers using wildlife in social media.

Infrastructure for African Languages: Culturally Diverse and Theoretically Sound Benchmarks for Automatic Language Processing
Christiane Fellbaum, Lecturer with the rank of Professor in the Council of the Humanities, the Program in Linguistics, Freshman Seminars, and Computer Science
Happy Buzaaba, Postdoctoral Research Associate, Princeton Institute for International and Regional Studies

The lack of high-quality data currently excludes the African language user community from benefiting from and participating in the impressive advancements in language technology and applications driven by LLMs. This research aims to create high-quality, human-annotated datasets with culturally diverse, theoretically sound, and consistent syntactic annotations for typologically diverse African languages, for use in research and numerous NLP applications.

The Polarization of the Immigration Debate: Evidence from 9 National Parliaments
Leah Boustan, Professor of Economics
Rafaela Dancygier, Professor of Politics and SPIA

This project will classify and analyze more than a million parliamentary speeches related to immigration from 9 national parliaments in Europe and the United States in order to track and compare the polarization of the immigration debate across political parties over time. It seeks to answer how parliamentary speech about immigration has changed across time and space, how European parliamentary speech maps onto the US case, and how mainstream parties adapt their speech when far-right, anti-immigrant parties enter parliament.

LLM-Mediated Human-Robot Collaboration in Construction
Arash Adel, Assistant Professor in the School of Architecture

The construction industry employs 300 million people worldwide every year. Worker safety, physically demanding labor resulting in early retirement, low productivity relative to other industries such as manufacturing, and worker shortages remain critical challenges. As a result, the adoption of robotics is gaining momentum in the construction industry. However, construction sites and tasks are highly unstructured compared to manufacturing plants, which complicates the integration of construction robotics. We see an opportunity to employ large language model (LLM)-mediated communication to facilitate the interaction between construction workers and robots. This research will enable experienced construction workers to collaborate with robotic systems by eliminating the need for them to program complex control algorithms, and will therefore increase the adoption of robotics in the construction industry at scale.

Human Level Sample-Efficiency of Abstraction and Generalization
Jonathan Cohen, Robert Bendheim and Lynn Bendheim Thoman Professor in Neuroscience. Professor of Psychology and Neuroscience
Nathaniel Daw, Huo Professor in Computational and Theoretical Neuroscience. Professor of Neuroscience and Psychology
Ken Norman, Huo Professor in Computational and Theoretical Neuroscience. Professor of Psychology and Neuroscience. Chair, Department of Psychology

Despite the demonstrated power of current large neural network models used for AI, these models continue to be limited in two ways relative to human capabilities: the amount of data required to achieve or exceed human competencies, and the range of competencies exhibited by any specific model (e.g., LLMs can’t drive cars, and AlphaGo can’t speak). This project explores the hypothesis that the greater sample efficiency and flexibility of human cognitive capabilities reflect inductive biases that favor abstraction, that is, the efficient learning and flexible use of low-dimensional structure that applies widely across application domains. Specifically, we will test whether a “relational bottleneck” (a simple structural bias), which has been shown to promote the discovery of abstract, low-dimensional representations and their use in symbolic processing in simple, small-scale models, can be brought “to scale” in larger, more complex neural network models.

Toward Foundation Models in Time Series
Vince Poor, Michael Henry Strater University Professor
Hao Huang, Postdoctoral Research Associate, Electrical and Computer Engineering
Yuqi Nie, Graduate Student, Electrical and Computer Engineering

Time series analysis has broad applications in fields including transportation, energy systems, weather prediction, and finance. This research endeavors to establish the first open-source, comprehensive foundation model tailored to a broad spectrum of time series applications. The model’s applicability across numerous scientific fields and business sectors could make it a focal point of interest within the community, and it will represent an important step in exploring the potential of multi-modal foundation models.

Making Police-Civilian Interactions Safer: Creation of Natural Language Narratives for the Large-Scale Computational Analysis of Police Body-Worn Camera Footage
Brandon Stewart, Associate Professor of Sociology
Jonathan Mummolo, Associate Professor of Politics and Public Affairs, Princeton School of Public and International Affairs
Olga Russakovsky, Associate Professor of Computer Science
Dean Knox, Wharton School of the University of Pennsylvania

This project aims to create an open-source and transparent technology for continuous, large-scale analysis of police-civilian interactions through automated analysis of police body-worn camera (BWC) footage. This technology would empower police agencies, policy makers, political science researchers, and the public to collaborate towards effective, safe, and equitable policing.

Leveraging LLMs for Environmental Research, Education, and Innovation
Jason Ren, Professor of Civil and Environmental Engineering and the Andlinger Center for Energy and the Environment. Associate Director for Research, Andlinger Center for Energy and the Environment
Junjie Zhu, Associate Research Scholar, Civil and Environmental Engineering

The research team is actively engaged in environmental data science research, especially natural language processing (NLP) research in the environmental domain. This project will work to establish a foundation cyberinfrastructure framework for environmental large language models (EnviroLLM) and will work with data scientists to advance this critical area of research and education.

Princeton Open HTR Initiative: Creating Infrastructure for Modeling Historical Texts
Marina Rustow, Khedouri A. Zilkha Professor of Jewish Civilization in the Near East. Professor of Near Eastern Studies and History
Helmut Reimitz, Shelby Cullom Davis '30 Professor of European History. Professor of History
Christine Roughan, Postdoctoral Research Associate, Near Eastern Studies

A substantial corpus of historical textual data exists in handwritten format, preserved in archives and libraries across the globe. Even when digitized, the overwhelming majority of this data is left in image format, leaving the text inaccessible to everything from simple search to large-scale analysis and computational work. The goal of this grant is to design and develop, in collaboration with the Center for Digital Humanities, local infrastructure and workflows for handwritten text recognition (HTR), in conversation with the team behind eScriptorium, the open-source leader for HTR technology for humanist scholarship.

The First 1000 Days Project: A Novel Framework for Modeling Child Development
Uri Hasson, Professor of Psychology and Neuroscience
Casey Lew-Williams, Professor of Psychology

The first 1000 days of a child’s life are crucial for their development. Children undergo significant changes during this time, transforming them from infants with limited cognitive and motor abilities to kids who can walk, talk, and reason. The primary goal of this research is to build new Actor-Critic Reinforcement Learning (RL) models. These models are intended to simulate how children learn language through sound, visual stimuli, actions on the environment, and feedback from their family. The dataset generated by this project is poised to become the foundational resource for constructing innovative computational models of child development.

Learning the Language of Mass Spectrometry to Revolutionize Proteomics Quantification
Martin Wühr, Associate Professor of Molecular Biology and the Lewis-Sigler Institute for Integrative Genomics
Michael Skinnider, Assistant Professor, Lewis-Sigler Institute for Integrative Genomics and Ludwig Princeton Branch

Proteins, the primary protagonists of life at the molecular level, remain challenging to identify and quantify. Mass spectrometry is the sole technique capable of analyzing a cell’s entire protein composition, encompassing over 10,000 proteins. Modeling the generative process of proteomics data will allow knowledge to be extracted more cheaply, and learning the language of mass spectrometry will revolutionize proteomics quantification and, thereby, biological research in health and disease.

Improving Reasoning and Argumentation in Legal-like Settings with Foundation Models and Reinforcement Learning
Peter Henderson, Assistant Professor of Computer Science and the Princeton School of Public and International Affairs

Attorneys regularly argue their cases before the court or engage in two-sided negotiations. In the context of oral arguments, past studies have suggested (with mixed findings) that these arguments can significantly sway some judges and justices. This research aims to improve and evaluate the abilities of foundation models to reason and argue in legal settings. This will serve as a stepping stone towards helping lawyers better prepare for arguments and negotiations. The project seeks to develop new techniques, powered by advances in reinforcement learning and foundation models, and will release new data and models for future iteration and research.