Generating stereotypes from implicitly hateful posts with Influence Functions

Abstract: Substantial progress has been made on detecting explicit forms of hate, while implicitly hateful posts containing, e.g., microaggressions and condescension, still pose a major challenge. In light of high error rates, explanations accompanying model decisions are especially important. Since implicit abuse cannot be attributed to an individual slur but arises from the wider sentence context, highlighting individual tokens as an explanation is of limited use. In this paper, we generate full-text verbalisations of stereotypes that underlie implicitly hateful posts. We test the hypothesis that providing more context to the model, such as a small set of related examples, will make it easier to generate the implied stereotype. For a given post, instance attribution methods, such as Influence Functions, are used to source similar examples from the training data. BART is then trained to generate the underlying stereotype from an original input and its most similar neighbours.
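The abstract's retrieval step, ranking training examples by their influence on a given test instance, can be illustrated with a minimal sketch. This uses a first-order gradient-similarity proxy (in the spirit of TracIn) rather than the full inverse-Hessian Influence Functions the paper names, and a toy logistic-regression model instead of BART; all function names and data here are hypothetical, not from the paper.

```python
import numpy as np

def per_example_grads(X, y, w):
    # Logistic-regression loss gradient for each example:
    # g_i = (sigmoid(x_i . w) - y_i) * x_i
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return (p - y)[:, None] * X

def top_influential(X_train, y_train, x_test, y_test, w, k=3):
    # Score each training example by the dot product of its loss
    # gradient with the test example's gradient -- a cheap first-order
    # stand-in for influence; return the k highest-scoring indices.
    g_train = per_example_grads(X_train, y_train, w)
    g_test = per_example_grads(x_test[None, :], np.array([y_test]), w)[0]
    scores = g_train @ g_test
    return np.argsort(-scores)[:k], scores

# Toy data: 2-D features, binary labels, arbitrary fixed weights
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
y = (X[:, 0] > 0).astype(float)
w = np.array([1.0, -0.5])

idx, scores = top_influential(X, y, X[0], y[0], w, k=3)
print(idx)  # indices of the 3 most influential training examples
```

In the paper's setting, the retrieved neighbours would then be concatenated with the original post as input to a sequence-to-sequence model such as BART for stereotype generation.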

Presenter: Alina Leidinger

Date: 2023-03-21 15:00 (CET)