
Huggingface - Chapter 2. Pretrained model & tokenizer

Chapter 2. Using Transformers

1. Tokenizer

  • Preprocesses sentences into a form the Transformer model can handle
    • Splits text into word, subword, or symbol units => tokens
    • Maps each token to an integer
    • Adds any extra inputs that may be useful to the model
  • AutoTokenizer class
    • Provides tokenizers for a wide range of pretrained models
    • Default checkpoint for sentiment-analysis: distilbert-base-uncased-finetuned-sst-2-english
      from transformers import AutoTokenizer
    
      checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
      tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    
      raw_inputs = [
          "I've been waiting for a HuggingFace course my whole life.",
          "I hate this so much!",
      ]
    
      inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
      print(inputs)
    
      ----------
      OUTPUT
      ----------
      {'input_ids': tensor([
          [ 101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102],
          [ 101, 1045, 5223, 2023, 2061, 2172, 999, 102, 0, 0, 0, 0, 0, 0, 0, 0]]), 
      'attention_mask': tensor([
          [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
          [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
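
    • A quick check (minimal sketch): convert_ids_to_tokens maps the input IDs back to tokens, making the added special tokens and padding visible
      print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist()))
      print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][1].tolist()))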
    

2. Pretrained Model

  • AutoModel class
    • Downloads a pretrained model, just like the tokenizer does
    • Batch size: the number of sequences processed at once (2 here)
    • Sequence length: the length of each sequence's numerical representation (16 here)
    • Hidden size: the vector dimension of each model input (768 here)
      from transformers import AutoModel
    
      checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
      model = AutoModel.from_pretrained(checkpoint)
    
      outputs = model(**inputs)
      print(outputs.last_hidden_state.shape)
    
      ----------
      OUTPUT
      ----------
      torch.Size([2, 16, 768])
    
  • AutoModel + Task
    • Model (retrieve the hidden states)
    • ForCausalLM
    • ForMaskedLM
    • ForMultipleChoice
    • ForQuestionAnswering
    • ForSequenceClassification
    • ForTokenClassification
      from transformers import AutoModelForSequenceClassification
    
      model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
      outputs = model(**inputs)
    
      print(outputs.logits.shape)
      print(outputs.logits)
    
      ----------
      OUTPUT
      ----------
      torch.Size([2, 2])
      tensor([[-1.5607,  1.6123],
          [ 4.1692, -3.3464]], grad_fn=<AddmmBackward>)
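
    • To read these logits as probabilities (a minimal sketch): apply softmax and map the indices to label names via the model config
      import torch

      predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
      print(predictions)
      print(model.config.id2label)  # e.g. {0: 'NEGATIVE', 1: 'POSITIVE'} for this checkpoint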
    

3. Models

  • Config file
    • Holds the many parameters that define the model architecture
      BertConfig {
      [...]
      "hidden_size": 768,
      "intermediate_size": 3072,
      "max_position_embeddings": 512,
      "num_attention_heads": 12,
      "num_hidden_layers": 12,
      [...]
      }
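
    • The listing above is simply what printing a config object shows (sketch):
      from transformers import BertConfig

      config = BertConfig()
      print(config)  # prints hidden_size, num_hidden_layers, etc.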
    
  • Python
    • Build a model from the config defined above
    • Model is randomly initialized!
      from transformers import BertConfig, BertModel
    
      # Building the config
      config = BertConfig()
      # Building the model from the config
      model = BertModel(config)
    
      # Loading a pretrained model is just as simple
      # Downloaded weights are cached in ~/.cache/huggingface/transformers
      # Available checkpoints: https://huggingface.co/models?filter=bert
      model = BertModel.from_pretrained("bert-base-cased")
    
  • Save model
    • Saving is just as simple
    • Writes two files: config.json and pytorch_model.bin
      • config.json: the model architecture
      • pytorch_model.bin: the state dictionary (the model weights)
      model.save_pretrained("directory_on_my_computer")
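
    • A model saved this way can be reloaded from the local directory (a minimal sketch):
      # Reload the model from the directory written by save_pretrained
      model = BertModel.from_pretrained("directory_on_my_computer")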
    
  • Inference
    • The model can be called just like a regular torch module
    • It accepts many different arguments, but only the input IDs are required
      import torch

      # Illustrative token IDs (e.g., produced by tokenizer.convert_tokens_to_ids)
      encoded_sequences = [[101, 7592, 999, 102], [101, 4658, 1012, 102]]

      model_inputs = torch.tensor(encoded_sequences)
      output = model(model_inputs)
    

4. Tokenizers

  • Representative algorithms
    • Byte-level BPE, as used in GPT-2
    • WordPiece, as used in BERT
    • SentencePiece or Unigram, as used in several multilingual models
      from transformers import AutoTokenizer
      tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
      # Encoding
      sequence = "Using a Transformer network is simple"
      tokens = tokenizer.tokenize(sequence)
      # Decoding
      decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
      print(decoded_string)
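
    • The hard-coded IDs passed to decode above come from the token-to-ID step (minimal sketch):
      # Convert the tokens into vocabulary IDs
      ids = tokenizer.convert_tokens_to_ids(tokens)
      print(tokens)
      print(ids)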
    

5. Handling multiple sequences

  • Transformers models expect multiple sequences (a batch) as input
  • Use a batch to group several sequences together
    • Sequences of different lengths must be padded to the same length
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load tokenizer
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# Load model
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

# apply tokenizer
tokens = tokenizer.tokenize(sequence)
# token to index
ids = tokenizer.convert_tokens_to_ids(tokens)

#################################################################
# Feeding a single sequence of IDs like this to the model will fail:
# Transformers models expect a batch (multiple sequences) as input by default
input_ids = torch.tensor(ids)

# This works: wrapping the IDs in a list adds the batch dimension
input_ids = torch.tensor([ids])
#################################################################
output = model(input_ids)
  • Caveat when padding
    • Padding changes the logits of the padded sentence (when the pad tokens are not masked)
      sequence1_ids = [[200, 200, 200]]  # sentence 1
      sequence2_ids = [[200, 200]]  # sentence 2
      batched_ids = [[200, 200, 200], [200, 200, tokenizer.pad_token_id]]  # batch
    
      print(model(torch.tensor(sequence1_ids)).logits)
      print(model(torch.tensor(sequence2_ids)).logits)
      print(model(torch.tensor(batched_ids)).logits)
    
      ----------
      OUTPUT
      ----------
      tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward>)
      tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward>)
      tensor([[ 1.5694, -1.3895],
              [ 1.3373, -1.2163]], grad_fn=<AddmmBackward>)
    
    • An attention mask must be passed so the model ignores the pad tokens
      batched_ids = [
          [200, 200, 200],
          [200, 200, tokenizer.pad_token_id],
      ]

      attention_mask = [
          [1, 1, 1],
          [1, 1, 0],  # mask out the trailing pad token
      ]
      outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
    
      ----------
      OUTPUT
      ----------
      tensor([[ 1.5694, -1.3895],
              [ 0.5803, -0.4125]], grad_fn=<AddmmBackward>)
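
    • In practice, calling the tokenizer with padding=True builds both input_ids and attention_mask automatically (a minimal sketch; `sentences` and `batch` are illustrative names):
      sentences = ["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"]
      batch = tokenizer(sentences, padding=True, return_tensors="pt")
      outputs = model(**batch)
      print(outputs.logits)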
    

6. Putting it all together

from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequences = [
  "I've been waiting for a HuggingFace course my whole life.",
  "So have I!"
]

# Several ways to build model_inputs!
# The tokenizer adds [CLS] at the start and [SEP] at the end of each sentence
# Pad up to the length of the longest sequence in the batch
model_inputs = tokenizer(sequences, padding="longest")

# Pad up to the model's max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")

# Pad sequences up to the specified max_length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)

# Truncate sequences longer than max_length
model_inputs = tokenizer(sequences, max_length=8, truncation=True)

# Returns PyTorch tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")

# Returns TensorFlow tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="tf")

# Returns NumPy arrays
model_inputs = tokenizer(sequences, padding=True, return_tensors="np")

# Inference
tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)
