Papers are grouped roughly by topic below. For a full list, see my Google Scholar page.

Mechanistic Interpretability

Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models
Peter Hase, Mohit Bansal, Been Kim, Asma Ghandeharioun
NeurIPS 2023 (Spotlight). [pdf] [code]

Adaptive Contextual Perception: How to Generalize to New Backgrounds and Ambiguous Objects
Zhuofan Ying, Peter Hase, Mohit Bansal
NeurIPS 2023. [pdf] [code]

Natural Language Explanations

Can Language Models Teach Weaker Agents? Teacher Explanations Improve Students via Personalization
Swarnadeep Saha, Peter Hase, Mohit Bansal
NeurIPS 2023. [pdf] [code]

Are Hard Examples also Harder to Explain? A Study with Human and Model-Generated Explanations
Swarnadeep Saha, Peter Hase, Nazneen Rajani, Mohit Bansal
EMNLP 2022. [pdf] [code]

Leakage-Adjusted Simulatability: Can Models Generate Non-Trivial Explanations of Their Behavior in Natural Language?
Peter Hase, Shiyue Zhang, Harry Xie, Mohit Bansal
Findings of EMNLP 2020. [pdf] [code]

Model Editing

Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks
Vaidehi Patil,* Peter Hase,* Mohit Bansal
ICLR 2024 (Spotlight). [pdf] [code]

Do Language Models Have Beliefs? Methods for Detecting, Updating, and Visualizing Model Beliefs
Peter Hase, Mona Diab, Asli Celikyilmaz, Xian Li, Zornitsa Kozareva, Veselin Stoyanov, Mohit Bansal, Srinivasan Iyer
EACL 2023. [pdf] [code]

Scalable Oversight

The Unreasonable Effectiveness of Easy Training Data for Hard Tasks
Peter Hase, Mohit Bansal, Peter Clark, Sarah Wiegreffe
Preprint on arXiv. [pdf] [code]

Supervised and Decomposable Reasoning

Summarization Programs: Interpretable Abstractive Summarization with Neural Modular Trees
Swarnadeep Saha, Shiyue Zhang, Peter Hase, Mohit Bansal
ICLR 2023. [pdf] [code]

VisFIS: Visual Feature Importance Supervision with Right-for-the-Right-Reason Objectives
Zhuofan Ying,* Peter Hase,* Mohit Bansal
NeurIPS 2022. [pdf] [code]

When Can Models Learn From Explanations? A Formal Framework for Understanding the Roles of Explanation Data
Peter Hase, Mohit Bansal
ACL 2022 Workshop on Natural Language Supervision (Spotlight). [pdf v2] [pdf v1] [code]

XAI Methods & Evaluation

The Out-of-Distribution Problem in Explainability and Search Methods for Feature Importance Explanations
Peter Hase, Harry Xie, Mohit Bansal
NeurIPS 2021. [pdf] [code]

FastIF: Scalable Influence Functions for Efficient Model Interpretation and Debugging
Han Guo, Nazneen Fatema Rajani, Peter Hase, Mohit Bansal, Caiming Xiong
EMNLP 2021. [pdf] [code]

Evaluating Explainable AI: Which Algorithmic Explanations Help Users Predict Model Behavior?
Peter Hase, Mohit Bansal
ACL 2020. [pdf] [code]

Interpretable Image Recognition with Hierarchical Prototypes
Peter Hase, Chaofan Chen, Oscar Li, Cynthia Rudin
AAAI-HCOMP 2019. [pdf] [code]

Additional Topics

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
Stephen Casper, Xander Davies, et al. (including Peter Hase)
TMLR 2023. [pdf]

GrIPS: Gradient-free, Edit-based Instruction Search for Prompting Large Language Models
Archiki Prasad, Peter Hase, Xiang Zhou, Mohit Bansal
EACL 2023. [pdf] [code]

Low-Cost Algorithmic Recourse for Users With Uncertain Cost Functions
Prateek Yadav, Peter Hase, Mohit Bansal
Preprint on arXiv. [pdf] [code]

Shall I Compare Thee to a Machine-Written Sonnet? An Approach to Algorithmic Sonnet Generation
John Benhardt,* Peter Hase,* Liuyi Zhu,* Cynthia Rudin
Preprint on arXiv. [pdf] [code]