About Me
I am an AI researcher currently doing a residency at Anthropic. Before this, I completed my PhD at the University of North Carolina at Chapel Hill, where I was advised by Mohit Bansal. My work at UNC was supported by a Google PhD Fellowship and a Royster Fellowship.
My research focuses on AI safety and NLP. Below are some of the main areas I am interested in:
- Interpretability
- Model Editing & Unlearning
- Scalable Oversight
Broadly, I am interested in explaining and controlling the behavior of machine learning models. I see language models as a good object of study since we lack complete explanations for their behavior and human language provides a rich means of interaction with models. I find work on clarifying concepts and developing strong evaluation procedures especially valuable.
Email: peter@cs.unc.edu
News
- 2024 - Work on Open Problems and Fundamental Limitations of RLHF designated as an Outstanding Paper Finalist in TMLR
- 2024 - Two papers accepted to TMLR on (1) fundamental problems in model editing and (2) unlearning for multimodal models
- 2024 - Invited talk at TTIC’s Young Researcher Seminar Series, “AI Safety Through Interpretable and Controllable Language Models” [slides]
- 2024 - New paper on training LLMs to be persuadable only when appropriate
- 2024 - Paper accepted to NeurIPS on calibrating LLMs’ linguistic expressions of confidence
- 2024 - I will serve as a Senior Area Chair for ACL 2025 in the Interpretability and Analysis of Models for NLP track
- 2024 - I am starting a residency at Anthropic! I will be working with Sam Bowman on topics in AI safety.
- 2024 - We have several new papers on (1) controlling how much LLMs verbalize vs. internalize their reasoning, (2) calibrating explicit and implicit confidence markers in LLM outputs, and (3) defining a philosophical basis for epistemic rationality in LLMs.
- 2024 - My last PhD paper is out! “Fundamental Problems With Model Editing: How Should Rational Belief Revision Work in LLMs?” [pdf] [code]
- 2024 - I graduated! My thesis was on “Interpretable and Controllable Language Models”, and you can watch my defense here. I have to thank a lot of people for this, and hopefully most of them are mentioned in these acknowledgments.
- 2024 - Invited talk at Stanford NLP Seminar on “Controlling and Editing Knowledge in Large Language Models” [slides]
- 2024 - Invited talks at OpenAI and CHAI (UC Berkeley) on “The Unreasonable Effectiveness of Easy Training Data for Hard Tasks” [slides]
- 2024 - New paper out! “The Unreasonable Effectiveness of Easy Training Data for Hard Tasks” [pdf] [code]
- 2024 - Paper accepted to ICLR with a spotlight: “Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks” [pdf] [code]
- 2023 - Serving as an Area Chair for EACL 2024 in the Interpretability and Analysis of Models for NLP track
- 2023 - New paper out! “Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks” [pdf] [code]
- 2023 - Three papers accepted to NeurIPS 2023! They cover (1) localization and model editing, (2) mechanistic interpretability for vision models, and (3) LMs explaining tasks to weaker agents (teaching).
- 2023 - Named an Outstanding Area Chair at ACL 2023 (1-1.5% of the pool of reviewers and chairs)
- 2023 - New paper out! “Can Language Models Teach Weaker Agents? Teacher Explanations Improve Students via Theory of Mind” [pdf] [code]
- 2023 - New paper out! “Adaptive Contextual Perception: How to Generalize to New Backgrounds and Ambiguous Objects” [pdf] [code]
- 2023 - Started summer internship at AI2! Supervised by Sarah Wiegreffe and Peter Clark
- 2023 - New paper out! “Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models” [pdf] [code]
- 2022 - Serving as an Area Chair for ACL 2023 in the Interpretability and Analysis of Models for NLP track
- 2022 - Serving as an Area Chair for the AAAI 2023 Workshop on Representation Learning for Responsible Human-Centric AI
- 2022 - Work accepted to EMNLP 2022: “Are Hard Examples also Harder to Explain? A Study with Human and Model-Generated Explanations” [pdf]
- 2022 - Work accepted to NeurIPS 2022: “VisFIS: Visual Feature Importance Supervision with Right-for-the-Right-Reason Objectives” [pdf]
- 2022 - Serving as an Area Chair for EMNLP 2022 in the Interpretability, Interactivity and Analysis of Models for NLP track
- 2022 - Started summer internship at Google Research! Supervised by Asma Ghandeharioun and Been Kim
- 2022 - Invited talk at the University of Oxford on Explainable Machine Learning in NLP
- 2022 - Paper accepted to ACL 2022 Workshop on Natural Language Supervision! “When Can Models Learn From Explanations? A Formal Framework for Understanding the Roles of Explanation Data” [pdf] [code]
- 2022 - Invited talk at NEC Laboratories Europe, on Explainable Machine Learning in NLP
- 2022 - Invited talk at the National Institute of Standards and Technology, on Evaluating Explainable AI
- 2022 - Invited talk at the Allen Institute for AI, on Detecting, Updating, and Visualizing Language Model Beliefs
- 2022 - Invited talk at Uber AI, on The OOD Problem and Search Methods in Explainable ML
- 2021 - New preprint on arXiv! “Do Language Models Have Beliefs? Methods for Detecting, Updating, and Visualizing Model Beliefs” [pdf] [code]
- 2021 - Paper accepted to NeurIPS 2021! “The Out-of-Distribution Problem in Explainability and Search Methods for Feature Importance Explanations” [pdf] [code]
- 2021 - Awarded a Google PhD Fellowship for Natural Language Processing!
- 2021 - Invited talk at CHAI, UC Berkeley, on Evaluating Explainable AI
- 2021 - Paper accepted to EMNLP 2021: “FastIF: Scalable Influence Functions for Efficient Model Interpretation and Debugging” [pdf] [code]
- 2021 - Named as an outstanding reviewer for ACL-IJCNLP 2021
- 2021 - New paper on arXiv! “Search Methods for Sufficient, Socially-Aligned Feature Importance Explanations with In-Distribution Counterfactuals” [pdf] [code]
- 2021 - Started summer internship at FAIR, supervised by Srini Iyer.
- 2021 - New blog post on the Alignment Forum: “Opinions on Interpretable Machine Learning and 70 Summaries of Recent Papers” [link]
- 2021 - New preprint on arXiv: “When Can Models Learn From Explanations? A Formal Framework for Understanding the Roles of Explanation Data” [pdf] [code]
- 2020 - New preprint on arXiv! “FastIF: Scalable Influence Functions for Efficient Model Interpretation and Debugging” [pdf] [code]
- 2020 - Recognized as an Outstanding Reviewer for EMNLP 2020
- 2020 - Paper accepted into Findings of EMNLP, “Leakage-Adjusted Simulatability: Can Models Generate Non-Trivial Explanations of Their Behavior in Natural Language?” [pdf] [code]
- 2020 - Paper accepted into ACL 2020, “Evaluating Explainable AI: Which Algorithmic Explanations Help Users Predict Model Behavior?” [pdf] [code]
- 2019 - Paper accepted into AAAI-HCOMP 2019, “Interpretable Image Recognition with Hierarchical Prototypes” [pdf] [code]
- 2019 - Joined the UNC NLP lab
- 2019 - Graduated with a B.S. from the Department of Statistical Science at Duke University
- 2019 - Awarded a Royster PhD Fellowship from UNC Chapel Hill