Interpretation of HuggingFace’s model decision

Transformer-based models have taken a leading role in NLP. In most cases, using a pre-trained encoder architecture for a downstream task achieves very high scores. The main idea of this approach is to train a large model on a large amount of unlabeled data and then add a few layers on top of it for text classification, coreference resolution, question answering, and so on. However, although such models give strong results, they are still black boxes, while interpretability is very important for debugging a model and understanding how it makes a decision.

A couple of weeks ago I came across a demo from Allen NLP. It had a nice feature that highlighted the words that influenced the model’s decision. I wanted the same for my own models, but after spending some time looking for a tutorial on how to put a PyTorch model into the Allen NLP Predictor, I did not find anything useful. So I decided to rewrite the interpreter in PyTorch. It was not difficult, because Allen NLP is built on PyTorch, and some parts of the code were taken from Allen NLP itself.

I will focus mostly on the pieces of code that will help you integrate your own model, and I will skip the explanation of common, self-evident parts to keep this article short. You can find the full code in my GitHub repository.


Allen NLP Interpret covers three gradient-based algorithms. I will use Smooth Grad and a classification task as an example. The idea of the algorithm is simple: we first make a prediction, take the argmax (as we usually do in classification) and, treating it as the ground truth, run backpropagation. Then we sum the gradients for each embedding. As shown in the paper, these values can have outliers, so we compute the gradients several times, adding Gaussian noise each time, and take the average. Let’s start with the first step, making a prediction:

def saliency_interpret(self, test_dataset):
    # Convert inputs to labeled instances
    predictions = self._get_prediction(test_dataset)

In my case, test_dataset is an instance of a class that inherits from PyTorch’s Dataset class. I use it to build the dataloader for making predictions. If you feed examples to the model in another way, you can rewrite this method:

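import torch
from torch.nn.functional import softmax
from torch.utils.data import DataLoader

def _get_prediction(self, test_dataset, batch_size=1):
    # NOTE: several right-hand sides are cut off in the source; lines marked
    # "assumed" are reconstructed from context and may differ from the original.
    test_dataloader = DataLoader(test_dataset, batch_size=batch_size)  # assumed arguments
    model_inputs = []
    input_tokens = []
    predictions = torch.tensor([], dtype=torch.float)
    model = self.model.to(self.device)  # assumed; self.device is a hypothetical attribute
    for inputs, tokens in test_dataloader:
        # collect the inputs, as they will be used in _get_gradients,
        # and the tokens, to match them with the method output
        model_inputs.append(inputs)  # assumed
        input_tokens.append(tokens)  # assumed
        input_ids = inputs.get('input_ids')
        attention_mask = inputs.get("attention_mask")
        input_ids = input_ids.to(self.device)  # assumed
        attention_mask = attention_mask.to(self.device)  # assumed
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        # append the probabilities of this batch (assumed: torch.cat)
        predictions = torch.cat((predictions, softmax(outputs, dim=-1).cpu()))
    return predictions, model_inputs, input_tokens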

The model’s outputs are logits, so we apply softmax to get probabilities from them. I also collect model_inputs (token ids), which will be used in the next steps, as well as input_tokens (the tokenized text), which are returned by the dataloader. The reason I save the tokens is that transformer models use subword algorithms (such as BPE or WordPiece) to reduce the vocabulary size, and the tokenizers split some words into word pieces, so one word does not always equal one token. For example:

>>> tokenizer.tokenize('transformer model')
['transform', '##er', 'model']
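
Because the saliency scores are computed per token, it can help to fold word pieces back into whole words before reading them. The following is a minimal sketch, assuming BERT-style WordPiece tokens where continuation pieces start with `##`. The helper name `merge_wordpieces` and the scores are made up for illustration and are not part of the original code:

```python
def merge_wordpieces(tokens, scores):
    """Join '##' continuation pieces to the previous word and sum their scores."""
    words, word_scores = [], []
    for token, score in zip(tokens, scores):
        if token.startswith("##") and words:
            # Continuation piece: glue it to the previous word.
            words[-1] += token[2:]
            word_scores[-1] += score
        else:
            words.append(token)
            word_scores.append(score)
    return list(zip(words, word_scores))


tokens = ["transform", "##er", "model"]
scores = [0.25, 0.25, 0.5]  # illustrative values, not real model output
print(merge_wordpieces(tokens, scores))  # [('transformer', 0.5), ('model', 0.5)]
```

Summing is only one possible way to aggregate; taking the maximum over the pieces would be an equally reasonable choice.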

Then we apply Smooth Grad to each example (I have not investigated how to do it in batches):

    ...
    predictions = self._get_prediction(test_dataset)
    instances_with_grads = dict()
    for idx, (prob, inp, tokens) in enumerate(zip(*predictions)):
        # Run smoothgrad
        label = torch.argmax(prob, dim=0)
        grads = self._smooth_grads(label, inp)

where the _smooth_grads method implements the algorithm described above:

def _smooth_grads(self, label, inp):
    total_gradients = {}
    for _ in range(self.num_samples):
        # the forward hook adds Gaussian noise to the embeddings
        handle = self._register_forward_hook(self.stdev)
        grads = self._get_gradients(label, inp)
        # remove the hook so that hooks do not pile up across samples
        handle.remove()
        # Sum gradients
        if total_gradients == {}:
            total_gradients = grads
        else:
            for key in grads.keys():
                total_gradients[key] += grads[key]
    # Average the gradients
    for key in total_gradients.keys():
        total_gradients[key] /= self.num_samples
    return total_gradients
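
To see the averaging in isolation, here is a small self-contained sketch of the same idea on a toy model. It is not the code from the repository: the embedding layer, the classifier and the function names are stand-ins, and the noise is added directly to the embedded input instead of through a hook, which keeps the example short.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy stand-ins for a real encoder: an embedding layer and a linear classifier.
embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)
classifier = nn.Linear(4, 3)


def logits_from_embeddings(emb):
    # Mean-pool over the token dimension, then classify.
    return classifier(emb.mean(dim=1))


def smooth_grad(input_ids, stdev=0.01, num_samples=10):
    emb = embedding(input_ids).detach()
    with torch.no_grad():
        # Use the model's own prediction as the target label.
        label = logits_from_embeddings(emb).argmax(dim=-1)
    total = torch.zeros_like(emb)
    for _ in range(num_samples):
        # Perturb the embeddings with Gaussian noise and track gradients.
        noisy = (emb + torch.randn_like(emb) * stdev).requires_grad_(True)
        loss = F.cross_entropy(logits_from_embeddings(noisy), label)
        (grad,) = torch.autograd.grad(loss, noisy)
        total += grad
    # Average the gradients over all noisy samples.
    return total / num_samples


input_ids = torch.tensor([[1, 5, 7]])
grads = smooth_grad(input_ids)
print(grads.shape)  # torch.Size([1, 3, 4])
```

With `stdev=0.0` every sample is identical, so the result reduces to the plain gradient, which is a handy sanity check.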

There are some important points in _register_forward_hook and _get_gradients. In the first one, we have to locate the embedding layer. Allen NLP has a dedicated method for this (it covers GPT and BERT models), but I decided to specify the layer directly via keyword arguments, falling back to the bert attribute by default:

...
encoder = self.kwargs.get("encoder")
if encoder:
    embedding_layer = getattr(self.model, encoder).embeddings
else:
    embedding_layer = self.model.bert.embeddings

My model class looks like this:

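class DistilBertForSequenceClassification(nn.Module):
    def __init__(self, config, num_labels=2):
        super(DistilBertForSequenceClassification, self).__init__()
        self.num_labels = num_labels
        self.config = config
        # the arguments of from_pretrained are cut off in the source
        self.bert = DistilBertModel.from_pretrained(...)
        self.dropout = nn.Dropout(config.dropout)
        self.classifier = nn.Linear(config.hidden_size, num_labels)

    # the signature and the encoder call are cut off in the source; they are
    # reconstructed here from the way the model is called in _get_gradients
    def forward(self, input_ids, attention_mask=None):
        last_hidden = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # mean-pool the token embeddings of the last layer
        pooled_output = torch.mean(last_hidden[0], dim=1)
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)
        return logits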
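
Once the embedding layer is located, the noise can be injected with a forward hook. The sketch below shows the mechanism on a bare `nn.Embedding`, which stands in for `model.bert.embeddings`. The function name `register_noise_hook` is hypothetical, and the sketch assumes the layer returns a single tensor:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embedding_layer = nn.Embedding(10, 4)  # stand-in for model.bert.embeddings


def register_noise_hook(layer, stdev):
    def forward_hook(module, inputs, output):
        # Returning a tensor from a forward hook replaces the layer's output.
        return output + torch.randn_like(output) * stdev

    # The returned handle is what lets us switch the noise off again.
    return layer.register_forward_hook(forward_hook)


ids = torch.tensor([[1, 2, 3]])
clean = embedding_layer(ids)

handle = register_noise_hook(embedding_layer, stdev=0.1)
noisy = embedding_layer(ids)  # same ids, perturbed embeddings
handle.remove()

restored = embedding_layer(ids)  # hook removed, output is clean again
print(torch.equal(clean, restored), torch.equal(clean, noisy))
```

Removing the handle after each sample matters: a hook that is left registered keeps adding noise, and registering a new one per sample would stack several noise sources on top of each other.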

In the second one, we need to do the same as we do during training. In my case, I feed input_ids to the model together with attention_mask, which masks the padding:

...
embedding_gradients = []
hooks = self._register_embedding_gradient_hooks(embedding_gradients)
input_ids = inp.get('input_ids')
attention_mask = inp.get("attention_mask")
outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
batch_losses = self.criterion(outputs, label.unsqueeze(0))
loss = torch.mean(batch_losses)

In _register_embedding_gradient_hooks we also need to define the embedding as we did before.
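
The body of _register_embedding_gradient_hooks is not shown here, so the following is only one possible way to capture the gradient with respect to the embedding output. It uses a forward hook that attaches a tensor-level gradient hook, which is a substitution of mine and not necessarily what the repository does. `TinyClassifier` and `register_embedding_gradient_hook` are hypothetical names:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)


class TinyClassifier(nn.Module):
    """Toy stand-in for the model: embeddings, mean pooling, linear head."""

    def __init__(self, vocab_size=10, hidden_size=4, num_labels=3):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, hidden_size)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids):
        return self.classifier(self.embeddings(input_ids).mean(dim=1))


def register_embedding_gradient_hook(layer, store):
    def forward_hook(module, inputs, output):
        # store.append runs during backward and receives
        # d(loss)/d(embedding output).
        output.register_hook(store.append)

    return layer.register_forward_hook(forward_hook)


model = TinyClassifier()
embedding_gradients = []
handle = register_embedding_gradient_hook(model.embeddings, embedding_gradients)

input_ids = torch.tensor([[1, 5, 7]])
logits = model(input_ids)
label = logits.argmax(dim=-1)  # treat the prediction as the target
loss = F.cross_entropy(logits, label)
model.zero_grad()
loss.backward()
handle.remove()

print(embedding_gradients[0].shape)  # torch.Size([1, 3, 4])
```

The captured tensor has one gradient vector per token, which is the quantity that gets summed and normalized afterwards.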

After normalizing the gradients (so that they sum to 1), we save the gradients, the tokens, the predicted label and its probability to the dictionary:

...
instances_with_grads["instance_" + str(idx + 1)] = grads
instances_with_grads["instance_" + str(idx + 1)]['tokens_input_1'] = [t[0] for t in tokens]
instances_with_grads["instance_" + str(idx + 1)]['label_input_1'] = label.item()
instances_with_grads["instance_" + str(idx + 1)]['prob_input_1'] = prob.max().item()
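
The normalization step itself is not shown above. A minimal sketch of it, assuming that each token's gradient vector is summed over the embedding dimension and the absolute values are divided by their L1 norm, could look like this. The function name and the numbers are illustrative only:

```python
def normalize_token_gradients(token_grads):
    """Collapse each token's gradient vector to one score; scores sum to 1."""
    summed = [sum(vector) for vector in token_grads]
    norm = sum(abs(value) for value in summed)  # L1 norm
    if norm == 0:
        # Avoid division by zero when all gradients vanish.
        return [0.0 for _ in summed]
    return [abs(value) / norm for value in summed]


# Illustrative gradients for three tokens with a 2-dimensional embedding.
token_grads = [[0.5, 0.5], [-1.0, -1.0], [0.25, 0.75]]
scores = normalize_token_gradients(token_grads)
print(scores)  # [0.25, 0.5, 0.25]
```

Taking the absolute value means that a score only tells how strongly a token influenced the decision, not in which direction.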

As a result, we get a dictionary with the tokens and their weights in the sentence. I added a utility method to visualize it. Let’s have a look at an example of the output.

I trained DistilBert on a subset of the Medium Post Titles dataset with 93 classes, which correspond to the categories of the articles. The dataset consists of titles and subtitles, so let’s test the model on the title of this article plus its first sentence as the subtitle (you can find a usage example in this notebook):

As we can see, the [CLS] token has the largest impact on the model’s decision. In BERT, this token is used for the Next Sentence Prediction objective during pre-training: its embedding is fed to a linear layer with softmax activation to train the model to predict whether the second of two sentences separated by the [SEP] token really follows the first. The interesting point is that I did not follow the common practice of using this token for classification (I used the mean of the output embeddings). Also, note that the model gives the right prediction.

To sum up, I think these methods are a great tool for analyzing your NLP models and can help you find their weaknesses. There is one more great tool, BertViz, which lets you look inside the model, so I suggest trying it if you are interested in exploring your models.

ML Engineer at Simporter Inc.
