
Huggingface - Chapter 2. Pretrained model & tokenizer

Chapter 2. Using Transformers

1. Tokenizer

  • Preprocesses sentences into a form the Transformer model can handle
    • Splits text into word, subword, or symbol units => tokens
    • Maps each token to an integer
    • Adds any extra inputs that may be useful to the model
  • AutoTokenizer class
    • Provides tokenizers for a wide range of pretrained models
    • Default checkpoint for sentiment-analysis: distilbert-base-uncased-finetuned-sst-2-english
      from transformers import AutoTokenizer
    
      checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
      tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    
      raw_inputs = [
          "I've been waiting for a HuggingFace course my whole life.",
          "I hate this so much!",
      ]
    
      inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
      print(inputs)
    
      ----------
      OUTPUT
      ----------
      {'input_ids': tensor([
          [ 101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102],
          [ 101, 1045, 5223, 2023, 2061, 2172, 999, 102, 0, 0, 0, 0, 0, 0, 0, 0]]), 
      'attention_mask': tensor([
          [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
          [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
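
    • A quick check (minimal sketch): convert_ids_to_tokens maps the input IDs back to tokens, making the added special tokens and padding visible
      print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist()))
      print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][1].tolist()))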
    

2. Pretrained Model

  • AutoModel class
    • Downloads a pretrained model, just like the tokenizer does
    • Batch size: the number of sequences processed at once (2 here)
    • Sequence length: the length of each sequence's numerical representation (16 here)
    • Hidden size: the vector dimension of each model input (768 here)
      from transformers import AutoModel
    
      checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
      model = AutoModel.from_pretrained(checkpoint)
    
      outputs = model(**inputs)
      print(outputs.last_hidden_state.shape)
    
      ----------
      OUTPUT
      ----------
      torch.Size([2, 16, 768])
    
  • AutoModel + Task
    • Model (retrieve the hidden states)
    • ForCausalLM
    • ForMaskedLM
    • ForMultipleChoice
    • ForQuestionAnswering
    • ForSequenceClassification
    • ForTokenClassification
      from transformers import AutoModelForSequenceClassification
    
      model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
      outputs = model(**inputs)
    
      print(outputs.logits.shape)
      print(outputs.logits)
    
      ----------
      OUTPUT
      ----------
      torch.Size([2, 2])
      tensor([[-1.5607,  1.6123],
          [ 4.1692, -3.3464]], grad_fn=<AddmmBackward>)
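
    • To read these logits as probabilities (a minimal sketch): apply softmax and map the indices to label names via the model config
      import torch

      predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
      print(predictions)
      print(model.config.id2label)  # e.g. {0: 'NEGATIVE', 1: 'POSITIVE'} for this checkpoint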
    

3. Models

  • Config file
    • Holds the many parameters that define the model architecture
      BertConfig {
      [...]
      "hidden_size": 768,
      "intermediate_size": 3072,
      "max_position_embeddings": 512,
      "num_attention_heads": 12,
      "num_hidden_layers": 12,
      [...]
      }
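
    • The listing above is simply what printing a config object shows (sketch):
      from transformers import BertConfig

      config = BertConfig()
      print(config)  # prints hidden_size, num_hidden_layers, etc.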
    
  • Python
    • Build a model from the config defined above
    • Model is randomly initialized!
      from transformers import BertConfig, BertModel
    
      # Building the config
      config = BertConfig()
      # Building the model from the config
      model = BertModel(config)
    
      # Loading a pretrained model is just as simple
      # Downloaded weights are cached in ~/.cache/huggingface/transformers
      # Available checkpoints: https://huggingface.co/models?filter=bert
      model = BertModel.from_pretrained("bert-base-cased")
    
  • Save model
    • Saving is just as simple
    • Writes two files: config.json and pytorch_model.bin
      • config.json: the model architecture
      • pytorch_model.bin: the state dictionary (the model weights)
      model.save_pretrained("directory_on_my_computer")
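
    • A model saved this way can be reloaded from the local directory (a minimal sketch):
      # Reload the model from the directory written by save_pretrained
      model = BertModel.from_pretrained("directory_on_my_computer")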
    
  • Inference
    • The model can be called just like a regular torch module
    • It accepts many different arguments, but only the input IDs are required
      import torch

      # Illustrative token IDs (e.g., produced by tokenizer.convert_tokens_to_ids)
      encoded_sequences = [[101, 7592, 999, 102], [101, 4658, 1012, 102]]

      model_inputs = torch.tensor(encoded_sequences)
      output = model(model_inputs)
    

4. Tokenizers

  • Representative algorithms
    • Byte-level BPE, as used in GPT-2
    • WordPiece, as used in BERT
    • SentencePiece or Unigram, as used in several multilingual models
      from transformers import AutoTokenizer
      tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
      # Encoding
      sequence = "Using a Transformer network is simple"
      tokens = tokenizer.tokenize(sequence)
      # Decoding
      decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
      print(decoded_string)
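
    • The hard-coded IDs passed to decode above come from the token-to-ID step (minimal sketch):
      # Convert the tokens into vocabulary IDs
      ids = tokenizer.convert_tokens_to_ids(tokens)
      print(tokens)
      print(ids)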
    

5. Handling multiple sequences

  • Transformers models expect multiple sequences (a batch) as input
  • Use a batch to group several sequences together
    • Sequences of different lengths must be padded to the same length
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load tokenizer
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# Load model
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

# apply tokenizer
tokens = tokenizer.tokenize(sequence)
# token to index
ids = tokenizer.convert_tokens_to_ids(tokens)

#################################################################
# Feeding a single sequence of IDs like this to the model will fail:
# Transformers models expect a batch (multiple sequences) as input by default
input_ids = torch.tensor(ids)

# This works: wrapping the IDs in a list adds the batch dimension
input_ids = torch.tensor([ids])
#################################################################
output = model(input_ids)
  • Caveat when padding
    • Padding changes the logits of the padded sentence (when the pad tokens are not masked)
      sequence1_ids = [[200, 200, 200]]  # sentence 1
      sequence2_ids = [[200, 200]]  # sentence 2
      batched_ids = [[200, 200, 200], [200, 200, tokenizer.pad_token_id]]  # batch
    
      print(model(torch.tensor(sequence1_ids)).logits)
      print(model(torch.tensor(sequence2_ids)).logits)
      print(model(torch.tensor(batched_ids)).logits)
    
      ----------
      OUTPUT
      ----------
      tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward>)
      tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward>)
      tensor([[ 1.5694, -1.3895],
              [ 1.3373, -1.2163]], grad_fn=<AddmmBackward>)
    
    • An attention mask must be passed so the model ignores the pad tokens
      batched_ids = [
          [200, 200, 200],
          [200, 200, tokenizer.pad_token_id],
      ]

      attention_mask = [
          [1, 1, 1],
          [1, 1, 0],  # mask out the trailing pad token
      ]
      outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
    
      ----------
      OUTPUT
      ----------
      tensor([[ 1.5694, -1.3895],
              [ 0.5803, -0.4125]], grad_fn=<AddmmBackward>)
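
    • In practice, calling the tokenizer with padding=True builds both input_ids and attention_mask automatically (a minimal sketch; `sentences` and `batch` are illustrative names):
      sentences = ["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"]
      batch = tokenizer(sentences, padding=True, return_tensors="pt")
      outputs = model(**batch)
      print(outputs.logits)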
    

6. Putting it all together

from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequences = [
  "I've been waiting for a HuggingFace course my whole life.",
  "So have I!"
]

# Several ways to build model_inputs!
# The tokenizer adds [CLS] at the start and [SEP] at the end of each sentence
# Pad up to the length of the longest sequence in the batch
model_inputs = tokenizer(sequences, padding="longest")

# Pad up to the model's max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")

# Pad sequences up to the specified max_length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)

# Truncate sequences longer than max_length
model_inputs = tokenizer(sequences, max_length=8, truncation=True)

# Returns PyTorch tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")

# Returns TensorFlow tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="tf")

# Returns NumPy arrays
model_inputs = tokenizer(sequences, padding=True, return_tensors="np")

# Inference
tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)
