Papers + Code

Papers are roughly grouped by topic below. For a full list, see my Google Scholar page.

Interpretability

System-1.x: Learning to Balance Fast and Slow Planning with Language Models
Swarnadeep Saha, Archiki Prasad, Justin Chih-Yao Chen, Peter Hase, Elias Stengel-Eskin, Mohit Bansal
Preprint on arXiv. [pdf] [code]

Foundational Challenges in Assuring Alignment and Safety of Large Language Models (Sec. 3.4)
Usman Anwar and 37 others, including Peter Hase
TMLR 2024. [pdf]

Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models
Peter Hase, Mohit Bansal, Been Kim, Asma Ghandeharioun
NeurIPS 2023 (Spotlight). [pdf] [code]

Adaptive Contextual Perception: How to Generalize to New Backgrounds and Ambiguous Objects
Zhuofan Ying, Peter Hase, Mohit Bansal
NeurIPS 2023. [pdf] [code]

Summarization Programs: Interpretable Abstractive Summarization with Neural Modular Trees
Swarnadeep Saha, Shiyue Zhang, Peter Hase, Mohit Bansal
ICLR 2023. [pdf] [code]

VisFIS: Visual Feature Importance Supervision with Right-for-the-Right-Reason Objectives
Zhuofan Ying*, Peter Hase*, Mohit Bansal
NeurIPS 2022. [pdf] [code]

Are Hard Examples also Harder to Explain? A Study with Human and Model-Generated Explanations
Swarnadeep Saha, Peter Hase, Nazneen Rajani, Mohit Bansal
EMNLP 2022. [pdf] [code]

When Can Models Learn From Explanations? A Formal Framework for Understanding the Roles of Explanation Data
Peter Hase, Mohit Bansal
ACL 2022 Workshop on Natural Language Supervision (Spotlight). [pdf v2] [pdf v1] [code]

The Out-of-Distribution Problem in Explainability and Search Methods for Feature Importance Explanations
Peter Hase, Harry Xie, Mohit Bansal
NeurIPS 2021. [pdf] [code]

FastIF: Scalable Influence Functions for Efficient Model Interpretation and Debugging
Han Guo, Nazneen Fatema Rajani, Peter Hase, Mohit Bansal, Caiming Xiong
EMNLP 2021. [pdf] [code]

Evaluating Explainable AI: Which Algorithmic Explanations Help Users Predict Model Behavior?
Peter Hase, Mohit Bansal
ACL 2020. [pdf] [code]

Leakage-Adjusted Simulatability: Can Models Generate Non-Trivial Explanations of Their Behavior in Natural Language?
Peter Hase, Shiyue Zhang, Harry Xie, Mohit Bansal
Findings of EMNLP 2020. [pdf] [code]

Interpretable Image Recognition with Hierarchical Prototypes
Peter Hase, Chaofan Chen, Oscar Li, Cynthia Rudin
AAAI-HCOMP 2019. [pdf] [code]

Model Editing

Fundamental Problems With Model Editing: How Should Rational Belief Revision Work in LLMs?
Peter Hase, Thomas Hofweber, Xiang Zhou, Elias Stengel-Eskin, Mohit Bansal
TMLR 2024. [pdf] [code]

Rethinking Machine Unlearning for Large Language Models
Sijia Liu, Yuanshun Yao, et al., including Peter Hase
Preprint on arXiv. [pdf]

Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks
Vaidehi Patil*, Peter Hase*, Mohit Bansal
ICLR 2024 (Spotlight). [pdf] [code]

Do Language Models Have Beliefs? Methods for Detecting, Updating, and Visualizing Model Beliefs
Peter Hase, Mona Diab, Asli Celikyilmaz, Xian Li, Zornitsa Kozareva, Veselin Stoyanov, Mohit Bansal, Srinivasan Iyer
EACL 2023. [pdf] [code]

Scalable Oversight

The Unreasonable Effectiveness of Easy Training Data for Hard Tasks
Peter Hase, Mohit Bansal, Peter Clark, Sarah Wiegreffe
ACL 2024. [pdf] [code]

Can Language Models Teach Weaker Agents? Teacher Explanations Improve Students via Personalization
Swarnadeep Saha, Peter Hase, Mohit Bansal
NeurIPS 2023. [pdf] [code]

Additional Topics

Teaching Models to Balance Resisting and Accepting Persuasion
Elias Stengel-Eskin, Peter Hase, Mohit Bansal
Preprint on arXiv. [pdf] [code]

LACIE: Listener-Aware Finetuning for Confidence Calibration in Large Language Models
Elias Stengel-Eskin, Peter Hase, Mohit Bansal
NeurIPS 2024. [pdf] [code]

Are Language Models Rational? The Case of Coherence Norms and Belief Revision
Thomas Hofweber, Peter Hase, Elias Stengel-Eskin, Mohit Bansal
Preprint on arXiv. [pdf]

INSPIRE: A Framework for Integrating Individual User Preferences in Recourse
Prateek Yadav, Peter Hase, Mohit Bansal
TMLR 2024. [pdf] [code]

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
Stephen Casper, Xander Davies, et al., including Peter Hase
TMLR 2023. [pdf]

GrIPS: Gradient-free, Edit-based Instruction Search for Prompting Large Language Models
Archiki Prasad, Peter Hase, Xiang Zhou, Mohit Bansal
EACL 2023. [pdf] [code]