
AttributeError in Colab #17

Open
dlttmd opened this issue Nov 30, 2023 · 3 comments

dlttmd commented Nov 30, 2023

Hello! I'm a student studying Python!
I was trying to use the KoBERT model and ran into an error while loading the tokenizer, so I'm posting a question here!

tokenizer = KoBertTokenizer.from_pretrained('monologg/kobert')

When I run the code, I get:

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'BertTokenizer'.
The class this function is called from is 'KoBertTokenizer'.

AttributeError Traceback (most recent call last)
in <cell line: 1>()
----> 1 tokenizer = KoBertTokenizer.from_pretrained('monologg/kobert')

5 frames
/content/drive/MyDrive/tokenization_kobert.py in get_vocab(self)
    123 
    124     def get_vocab(self):
--> 125         return dict(self.token2idx, **self.added_tokens_encoder)
    126 
    127     def __getstate__(self):

AttributeError: 'KoBertTokenizer' object has no attribute 'token2idx'

Is there a way to fix this error?!


55soup commented Jan 15, 2024

Did you manage to solve this???


55soup commented Mar 27, 2024

Solution

Loading the model

from transformers import BertModel, DistilBertModel

distilbert_model = DistilBertModel.from_pretrained('monologg/distilkobert')

Loading the tokenizer

Since loading the tokenizer with KoBertTokenizer raises the error, I loaded the original KoBERT tokenizer instead.

from kobert_tokenizer import KoBERTTokenizer

tokenizer = KoBERTTokenizer.from_pretrained('skt/kobert-base-v1')
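
As a quick check that this tokenizer loads and encodes text (a minimal sketch; the sample sentence is arbitrary):

# Sanity check of the KoBERT tokenizer loaded above.
text = "한국어 모델을 공유합니다."
print(tokenizer.tokenize(text))   # SentencePiece subword pieces
print(tokenizer.encode(text))     # token ids with special tokens added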

DistilBERTClassifier

In def forward, remove the segment_ids that the original KoBERT model took.

import torch
import torch.nn as nn

class DistilBERTClassifier(nn.Module):
    def __init__(self,
                 distilbert,
                 hidden_size=768,
                 num_classes=6,  ## adjust to your number of labels ##
                 dr_rate=None,
                 params=None):
        super(DistilBERTClassifier, self).__init__()
        self.distilbert = distilbert
        self.dr_rate = dr_rate

        self.classifier = nn.Linear(hidden_size, num_classes)
        if dr_rate:
            self.dropout = nn.Dropout(p=dr_rate)

    def gen_attention_mask(self, token_ids, valid_length):
        # Mark the first valid_length positions of each sequence as real tokens.
        attention_mask = torch.zeros_like(token_ids)
        for i, v in enumerate(valid_length):
            attention_mask[i][:v] = 1
        return attention_mask.float()

    def forward(self, token_ids, valid_length):
        attention_mask = self.gen_attention_mask(token_ids, valid_length)

        # Changed part: DistilBERT has no pooler output, so mean-pool the
        # last hidden state and return a single tensor.
        pooler = self.distilbert(input_ids=token_ids,
                                 attention_mask=attention_mask.to(token_ids.device)
                                 ).last_hidden_state.mean(dim=1)

        out = self.dropout(pooler) if self.dr_rate else pooler
        return self.classifier(out)

Creating the model

model = DistilBERTClassifier(distilbert_model, dr_rate=0.5).to(device)
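
A quick shape check of the classifier (a sketch, assuming the distilbert_model, tokenizer, and device defined above; random ids are used only to verify the output dimensions):

# Batch of 2 sequences padded to length 16, valid lengths 10 and 16.
dummy_ids = torch.randint(0, tokenizer.vocab_size, (2, 16)).to(device)
dummy_len = torch.tensor([10, 16])
with torch.no_grad():
    print(model(dummy_ids, dummy_len).shape)  # expected: torch.Size([2, 6]) = (batch, num_classes)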

Training

out = model(token_ids, valid_length)

In the training loop, remove the segment_ids that used to be passed into the model.

for e in range(start_ephochs, num_epochs):
    train_acc = 0.0
    test_acc = 0.0
    model.train()
    for batch_id, (token_ids, valid_length, segment_ids, label) in tqdm(enumerate(train_dataloader), total=len(train_dataloader)):
        optimizer.zero_grad()
        token_ids = token_ids.long().to(device)
        segment_ids = segment_ids.long().to(device)
        valid_length= valid_length
        label = label.long().to(device)
        # print(f"token_ids: {token_ids}\n valid_length: {valid_length}\n segment_ids: {segment_ids}\n label: {label}")
        out = model(token_ids, valid_length)
        loss = loss_fn(out, label)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        optimizer.step()
        scheduler.step()  # Update learning rate schedule
        train_acc += calc_accuracy(out, label)
        if batch_id % log_interval == 0:
            print("epoch {} batch id {} loss {} train acc {}".format(e+1, batch_id+1, loss.data.cpu().numpy(), train_acc / (batch_id+1)))
    torch.save(model.state_dict(), f'/content/gdrive/MyDrive/ColabNotebooks/checkpoint/checkpoint_epoch_{e+1}.ckpt')
    # torch.save(model.state_dict(), f'/content/gdrive/MyDrive/ColabNotebooks/checkpoint/checkpoint_epoch_5.1.ckpt')
    print("epoch {} train acc {}".format(e+1, train_acc / (batch_id+1)))
    # Evaluation: switch off dropout and disable gradient tracking.
    model.eval()
    with torch.no_grad():
        for batch_id, (token_ids, valid_length, segment_ids, label) in tqdm(enumerate(test_dataloader), total=len(test_dataloader)):
            token_ids = token_ids.long().to(device)
            label = label.long().to(device)
            out = model(token_ids, valid_length)
            test_acc += calc_accuracy(out, label)
    print("epoch {} test acc {}".format(e+1, test_acc / (batch_id+1)))

dohyun1411 commented

Downgrading transformers solved the problem: pip install transformers==4.20.1
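
In Colab that would look roughly like the cell below (a sketch; restart the runtime after installing so the older version is picked up, and it assumes tokenization_kobert.py from the traceback above is importable, e.g. its Drive folder is on sys.path):

!pip install transformers==4.20.1
# Runtime > Restart runtime, then:
import transformers
print(transformers.__version__)   # expect 4.20.1

from tokenization_kobert import KoBertTokenizer
tokenizer = KoBertTokenizer.from_pretrained('monologg/kobert')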
