Papers + Code
Papers are grouped roughly by topic below. For a full list, see my Google Scholar page.
Interpretability
System-1.x: Learning to Balance Fast and Slow Planning with Language Models
Swarnadeep Saha, Archiki Prasad, Justin Chih-Yao Chen, Peter Hase, Elias Stengel-Eskin, Mohit Bansal
Preprint on arXiv. [pdf] [code]
Foundational Challenges in Assuring Alignment and Safety of Large Language Models (Sec. 3.4)
Usman Anwar and 37 others including Peter Hase
TMLR 2024. [pdf]
Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models
Peter Hase, Mohit Bansal, Been Kim, Asma Ghandeharioun
NeurIPS 2023 (Spotlight). [pdf] [code]
Adaptive Contextual Perception: How to Generalize to New Backgrounds and Ambiguous Objects
Zhuofan Ying, Peter Hase, Mohit Bansal
NeurIPS 2023. [pdf] [code]
Summarization Programs: Interpretable Abstractive Summarization with Neural Modular Trees
Swarnadeep Saha, Shiyue Zhang, Peter Hase, Mohit Bansal
ICLR 2023. [pdf] [code]
VisFIS: Visual Feature Importance Supervision with Right-for-the-Right-Reason Objectives
Zhuofan Ying,* Peter Hase,* Mohit Bansal
NeurIPS 2022. [pdf] [code]
Are Hard Examples also Harder to Explain? A Study with Human and Model-Generated Explanations
Swarnadeep Saha, Peter Hase, Nazneen Rajani, Mohit Bansal
EMNLP 2022. [pdf] [code]
When Can Models Learn From Explanations? A Formal Framework for Understanding the Roles of Explanation Data
Peter Hase, Mohit Bansal
ACL 2022 Workshop on Natural Language Supervision (Spotlight). [pdf v2] [pdf v1] [code]
The Out-of-Distribution Problem in Explainability and Search Methods for Feature Importance Explanations
Peter Hase, Harry Xie, Mohit Bansal
NeurIPS 2021. [pdf] [code]
FastIF: Scalable Influence Functions for Efficient Model Interpretation and Debugging
Han Guo, Nazneen Fatema Rajani, Peter Hase, Mohit Bansal, Caiming Xiong
EMNLP 2021. [pdf] [code]
Evaluating Explainable AI: Which Algorithmic Explanations Help Users Predict Model Behavior?
Peter Hase, Mohit Bansal
ACL 2020. [pdf] [code]
Interpretable Image Recognition with Hierarchical Prototypes
Peter Hase, Chaofan Chen, Oscar Li, Cynthia Rudin
AAAI-HCOMP 2019. [pdf] [code]
Leakage-Adjusted Simulatability: Can Models Generate Non-Trivial Explanations of Their Behavior in Natural Language?
Peter Hase, Shiyue Zhang, Harry Xie, Mohit Bansal
Findings of EMNLP 2020. [pdf] [code]
Model Editing & Unlearning
Fundamental Problems With Model Editing: How Should Rational Belief Revision Work in LLMs?
Peter Hase, Thomas Hofweber, Xiang Zhou, Elias Stengel-Eskin, Mohit Bansal
TMLR 2024. [pdf] [code]
Unlearning Sensitive Information in Multimodal LLMs: Benchmark and Attack-Defense Evaluation
Vaidehi Patil, Yi-Lin Sung, Peter Hase, Jie Peng, Tianlong Chen, Mohit Bansal
TMLR 2024. [pdf] [code]
Rethinking Machine Unlearning for Large Language Models
Sijia Liu, Yuanshun Yao, et al. including Peter Hase
Nature Machine Intelligence. [pdf]
Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks
Vaidehi Patil,* Peter Hase,* Mohit Bansal
ICLR 2024 (Spotlight). [pdf] [code]
Do Language Models Have Beliefs? Methods for Detecting, Updating, and Visualizing Model Beliefs
Peter Hase, Mona Diab, Asli Celikyilmaz, Xian Li, Zornitsa Kozareva, Veselin Stoyanov, Mohit Bansal, Srinivasan Iyer
EACL 2023. [pdf] [code]
Scalable Oversight
The Unreasonable Effectiveness of Easy Training Data for Hard Tasks
Peter Hase, Mohit Bansal, Peter Clark, Sarah Wiegreffe
ACL 2024. [pdf] [code]
Can Language Models Teach Weaker Agents? Teacher Explanations Improve Students via Personalization
Swarnadeep Saha, Peter Hase, Mohit Bansal
NeurIPS 2023. [pdf] [code]
Additional Topics
Teaching Models to Balance Resisting and Accepting Persuasion
Elias Stengel-Eskin, Peter Hase, Mohit Bansal
Preprint on arXiv. [pdf] [code]
LACIE: Listener-Aware Finetuning for Confidence Calibration in Large Language Models
Elias Stengel-Eskin, Peter Hase, Mohit Bansal
NeurIPS 2024. [pdf] [code]
Are Language Models Rational? The Case of Coherence Norms and Belief Revision
Thomas Hofweber, Peter Hase, Elias Stengel-Eskin, Mohit Bansal
Preprint on arXiv. [pdf]
INSPIRE: A Framework for Integrating Individual User Preferences in Recourse
Prateek Yadav, Peter Hase, Mohit Bansal
TMLR 2024. [pdf] [code]
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
Stephen Casper, Xander Davies, et al. including Peter Hase
TMLR 2023 (Outstanding Paper Finalist). [pdf]
GrIPS: Gradient-free, Edit-based Instruction Search for Prompting Large Language Models
Archiki Prasad, Peter Hase, Xiang Zhou, Mohit Bansal
EACL 2023. [pdf] [code]