About Me
I am an AI researcher currently doing a residency at Anthropic. Before this, I completed my PhD at the University of North Carolina at Chapel Hill, where I was advised by Mohit Bansal. My work at UNC was supported by a Google PhD Fellowship and previously by a Royster Fellowship.
My research focuses on AI safety and NLP. Below are some of the main areas I am interested in:
- Interpretability
- Model Editing
- Scalable Oversight
Broadly, I am interested in explaining and controlling the behavior of machine learning models. I see language models as a good object of study because we lack complete explanations for their behavior, and because human language provides a rich means of interacting with them. I find work on clarifying concepts and developing strong evaluation procedures especially valuable.
Email: peter@cs.unc.edu
News
- 2024 - I will serve as a Senior Area Chair for ACL 2025 in the Interpretability and Analysis of Models for NLP track
- 2024 - I am starting a residency at Anthropic! I will be working with Sam Bowman on topics in AI safety.
- 2024 - We have several new papers on (1) controlling how much LLMs verbalize vs. internalize their reasoning, (2) calibrating explicit and implicit confidence markers in LLM outputs, and (3) defining a philosophical basis for epistemic rationality in LLMs.
- 2024 - My last PhD paper is out! “Fundamental Problems With Model Editing: How Should Rational Belief Revision Work in LLMs?” [pdf] [code]
- 2024 - I graduated! My thesis was on “Interpretable and Controllable Language Models”, and you can watch my defense here. I have to thank a lot of people for this, and hopefully most of them are mentioned in these acknowledgments.
- 2024 - Invited talk at Stanford NLP Seminar on “Controlling and Editing Knowledge in Large Language Models” [slides]
- 2024 - Invited talks at OpenAI and CHAI (UC Berkeley) on “The Unreasonable Effectiveness of Easy Training Data for Hard Tasks” [slides]
- 2024 - New paper out! “The Unreasonable Effectiveness of Easy Training Data for Hard Tasks” [pdf] [code]
- 2024 - Paper accepted to ICLR with a spotlight: “Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks” [pdf] [code]
- 2023 - Serving as an Area Chair for EACL 2024 in the Interpretability and Analysis of Models for NLP track
- 2023 - New paper out! “Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks” [pdf] [code]
- 2023 - Three papers accepted to NeurIPS 2023! Our work on (1) localization and model editing, (2) mechanistic interpretability for vision models, and (3) LMs explaining tasks to weaker agents (teaching).
- 2023 - Named an Outstanding Area Chair at ACL 2023 (1-1.5% of the pool of reviewers and chairs)
- 2023 - New paper out! “Can Language Models Teach Weaker Agents? Teacher Explanations Improve Students via Theory of Mind” [pdf] [code]
- 2023 - New paper out! “Adaptive Contextual Perception: How to Generalize to New Backgrounds and Ambiguous Objects” [pdf] [code]
- 2023 - Started summer internship at AI2! Supervised by Sarah Wiegreffe and Peter Clark
- 2023 - New paper out! “Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models” [pdf] [code]
- 2022 - Serving as an Area Chair for ACL 2023 in the Interpretability and Analysis of Models for NLP track
- 2022 - Serving as an Area Chair for the AAAI 2023 Workshop on Representation Learning for Responsible Human-Centric AI
- 2022 - Work accepted to EMNLP 2022: “Are Hard Examples also Harder to Explain? A Study with Human and Model-Generated Explanations” [pdf]
- 2022 - Work accepted to NeurIPS 2022: “VisFIS: Visual Feature Importance Supervision with Right-for-the-Right-Reason Objectives” [pdf]
- 2022 - Serving as an Area Chair for EMNLP 2022 in the Interpretability, Interactivity and Analysis of Models for NLP track
- 2022 - Started summer internship at Google Research! Supervised by Asma Ghandeharioun and Been Kim
- 2022 - Invited talk at the University of Oxford on Explainable Machine Learning in NLP
- 2022 - Paper accepted to ACL 2022 Workshop on Natural Language Supervision! “When Can Models Learn From Explanations? A Formal Framework for Understanding the Roles of Explanation Data” [pdf] [code]
- 2022 - Invited talk at NEC Laboratories Europe, on Explainable Machine Learning in NLP
- 2022 - Invited talk at the National Institute for Standards and Technology, on Evaluating Explainable AI
- 2022 - Invited talk at the Allen Institute for AI, on Detecting, Updating, and Visualizing Language Model Beliefs
- 2022 - Invited talk at Uber AI, on The OOD Problem and Search Methods in Explainable ML
- 2021 - New preprint on arXiv! “Do Language Models Have Beliefs? Methods for Detecting, Updating, and Visualizing Model Beliefs” [pdf] [code]
- 2021 - Paper accepted to NeurIPS 2021! “The Out-of-Distribution Problem in Explainability and Search Methods for Feature Importance Explanations” [pdf] [code]
- 2021 - Awarded a Google PhD Fellowship for Natural Language Processing!
- 2021 - Invited talk at CHAI, UC Berkeley, on Evaluating Explainable AI
- 2021 - Paper accepted to EMNLP 2021: “FastIF: Scalable Influence Functions for Efficient Model Interpretation and Debugging” [pdf] [code]
- 2021 - Named as an Outstanding Reviewer for ACL-IJCNLP 2021
- 2021 - New paper on arXiv! “Search Methods for Sufficient, Socially-Aligned Feature Importance Explanations with In-Distribution Counterfactuals” [pdf] [code]
- 2021 - Started summer internship at FAIR, supervised by Srini Iyer.
- 2021 - New blog post on the Alignment Forum: “Opinions on Interpretable Machine Learning and 70 Summaries of Recent Papers” [link]
- 2021 - New preprint on arXiv: “When Can Models Learn From Explanations? A Formal Framework for Understanding the Roles of Explanation Data” [pdf] [code]
- 2020 - New preprint on arXiv! “FastIF: Scalable Influence Functions for Efficient Model Interpretation and Debugging” [pdf] [code]
- 2020 - Recognized as an Outstanding Reviewer for EMNLP 2020
- 2020 - Paper accepted into Findings of EMNLP, “Leakage-Adjusted Simulatability: Can Models Generate Non-Trivial Explanations of Their Behavior in Natural Language?” [pdf] [code]
- 2020 - Paper accepted into ACL 2020, “Evaluating Explainable AI: Which Algorithmic Explanations Help Users Predict Model Behavior?” [pdf] [code]
- 2019 - Paper accepted into AAAI-HCOMP 2019, “Interpretable Image Recognition with Hierarchical Prototypes” [pdf] [code]
- 2019 - Joined the UNC NLP lab
- 2019 - Graduated with a B.S. from the Department of Statistical Science at Duke University
- 2019 - Awarded a Royster PhD Fellowship from UNC Chapel Hill