KyungHyun Lim

ML/AI/SW Developer

허깅 페이스 Trainer를 이용해 학습 중 샘플 출력해보기 (QA task)

Oct 26, 2021

NLP
ML_AI

0. Intro

라이브러리 버전
- transformers == 4.5.0
목표
- 학습 중 정해진 step마다 샘플(모델의 예측) 출력하기

1. Trainer 구조 살펴보기: def train()

transformers에 trainer.py에는 Trainer 클래스가 정의되어 있다.
그리고, 학습을 실행하는 핵심 함수인 train() 함수가 정의되어 있다.
학습과정에서 무엇인가 추가를 하고 싶다면 여기를 이해하고, 오버라이딩을 통해 수정을 해야한다.

def train(
        self,
        resume_from_checkpoint: Optional[Union[str, bool]] = None,
        trial: Union["optuna.Trial", Dict[str, Any]] = None,
        **kwargs,
    ):

굉장히 많은 설정들을 거쳐 쭉~ 스크롤을 내리다 보면, 드디어 학습을 시작하는 부분을 찾을 수 있다.

# Line: 1078
for epoch in range(epochs_trained, num_train_epochs):

# Line: 1101
for step, inputs in enumerate(epoch_iterator):

4.5 버전에서는 한 step을 진행하는 함수가 다음과 같이 짜여저 있다. 모델의 output을 출력하고 싶은데, 없다. loss만 반환하고 있다. 당황하지 말고 training_step으로 이동해 보자

# Line 1120
else:
    tr_loss += self.training_step(model, inputs)
self._total_flos += float(self.floating_point_ops(inputs))

또 굉장히 긴 함수 하나를 발견 할 수 있다. 어려울 수도 있지만, torch를 어느정도 다루어 봤으면 이 함수의 역활은 모델에 input을 주고, loss를 계산해 반환해 준다는 것만 알고 있으면 된다. 그럼 어디서 loss를 계산하는지부터 찾아 보자!

# Line 1495:
def training_step(self, model: nn.Module, inputs: Dict[str, Union[torch.Tensor, Any]]) -> torch.Tensor:
    # Line 1520:
    if self.use_amp:
        with autocast():
            loss = self.compute_loss(model, inputs)
    else:
        loss = self.compute_loss(model, inputs)

찾았다. 로스를 계산하는 부분이다. 또 함수로 들어가야 할 것 같다. 이제 찾아야 할 것은 model의 output 이란것만 기억하자

# Line 1546:
def compute_loss(self, model, inputs, return_outputs=False):

바로 아래 정의 되어 있기 때문에 비교적 찾기가 쉬웠다. 눈에 띄는 점이 있다. 필요한 output 반환이 False로 설정 되어 있다. 이부분을 오버라이딩해서 내용은 그대로 사용하고, 옵션만 True로 바꾸어 주자!

# Line 1546: False -> True
def compute_loss(self, model, inputs, return_outputs=True):

그러면, 이제 compute_loss 함수는 loss와 output을 반환하게 된다. 때문에 training_step함수도 오버라이딩을 통해 약간의 수정을 해주어야한다. 아까 기본 함수구현은 loss만 받기 때문에, output도 반환 받을 수 있도록 수정을 해주자. 그리고나서, 함수 끝에 반환값에도 output을 추가해 주자.

# Line 1495:
def training_step(self, model: nn.Module, inputs: Dict[str, Union[torch.Tensor, Any]]) -> torch.Tensor:
    # Line 1520:
    if self.use_amp:
        with autocast():
            loss, output = self.compute_loss(model, inputs)
    else:
        loss, output = self.compute_loss(model, inputs)
    (생략)
    # Line 1544:
    return loss.detach(), output 

이제 필요한 output을 def train() 함수 안에서 사용할 수 있게 되었다. 하지만 수정이 좀필요하다. output을 training_step 함수로 부터 반환을 받아야 하기 때문에 loss를 바로 더해줄 수 가 없어 아래와 같이 수정을 해야한다.
사실 고버전에서는 (4.11) 이미 로스를 따로 반환 받고 그아래서 처리해주기 때문에 변경이 더 쉽다.

# 원래 코드
# Line 1120
else:
    tr_loss += self.training_step(model, inputs)
self._total_flos += float(self.floating_point_ops(inputs))

# 수정할 코드
else:
    tr_loss_step, output_step = self.training_step(model, inputs)
tr_loss += tr_loss_step
self._total_flos += float(self.floating_point_ops(inputs))

2. output 활용하기

열심히 찾아온 output을 활용할 차례이다!
좀 길기도 한것 같은데 어려울 것 없다.
- output_step: 열심히 찾아온 model의 output
- 이것만 기억하자!

if step % 50 == 0: # 50 step 마다 샘플 출력
    amount = 10 # 출력할 샘플의 개수 설정
    
    # QA 모델이기 때문에 반환 값에 answer의 위치를 나타내는 start index와 end index를 예측하는 logit이 존재한다. torch.argmax를 통해 인덱스로 변환!
    start_idxs = torch.argmax(output_step['start_logits'], dim=1).detach().cpu().numpy()
    end_idxs = torch.argmax(output_step['start_logits'], dim=1).detach().cpu().numpy()
    
    # Ground_truth의 인덱스 추출!
    gt_start_idxs = inputs['start_positions'].detach().cpu().numpy()
    gt_end_idxs = inputs['end_positions'].detach().cpu().numpy()

    # 출력할 샘플의 개수 만큼 반복문 실행
    for idx in range(amount):  
        # tokenizer를 이용해 decoding 하자!
        question_context = self.tokenizer.decode(inputs['input_ids'][idx], skip_special_tokens=True)
        prediction = self.tokenizer.decode(inputs['input_ids'][idx][start_idxs[idx]:end_idxs[idx]+1], skip_special_tokens=True)
        ground_truth = self.tokenizer.decode(inputs['input_ids'][idx][gt_start_idxs[idx]:gt_end_idxs[idx]+1], skip_special_tokens=True)
        
        # decoding한 결과를 출력
        logging_console(question_context, prediction, ground_truth)

logging_console은 간단한 함수이다. 저기에 써놓으면 뭔가 지저분해 보여서 따로 함수로 작성해 주었다.

def logging_console(question_context, predictions, ground_truth):
    print('Question and Context: ')
    print(question_context)
    print('Prediction: ')
    print(predictions)
    print('Answer: ')
    print(ground_truth)

3. Custom trainer 만들기

Trainer 클래스를 상속받는 MYQuestionAnsweringTrainer를 만들어 주자

class MYQuestionAnsweringTrainer(Trainer):

이제 이안에서 1번에서 알아본 것과 같이 오버라이딩 해주자!

class MYQuestionAnsweringTrainer(Trainer):
    def __init__(self, *args, model_tokenizer=None, **kwargs):
        super().__init__(*args, **kwargs)
        # 여기선 특별하게 deconding을 위해 토크나이저를 받아야한다.
        self.tokenizer = model_tokenizer

    def train():
        ...

    def training_step():
        ...

    def compute_loss():
        ...

이제 학습 중 모델이 제대로 학습을 하고 있는지 아닌지 확인 할 수 있다!