Tokenizer vocabulary

#28

by DjTobalito - opened Jul 11, 2024

edited Jul 11, 2024

Hi,
I am using XLM-RoBERTa for multilingual classification with success, and I am trying to understand the tokenizer a bit better.
Naively, I expected common short words in the languages present in the dataset to appear in the tokenizer.vocab dictionary.
But for French, for example, the word "oui" ("yes" in French) does not seem to be in the tokenizer.vocab dictionary.

Am I misunderstanding the tokenizer.vocab dictionary?
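
For reference, here is a minimal sketch of the kind of lookup described above. It runs on a toy dictionary, not the real XLM-RoBERTa vocabulary: the entries, ids, and the helper name `lookup_word` are hypothetical. It assumes a SentencePiece-style vocabulary, where a piece that starts a word is stored with a leading "▁" (U+2581) marker, so a bare-string lookup can miss a word that is present in its word-start form.

```python
# Toy stand-in for tokenizer.vocab: hypothetical entries and ids,
# NOT the real XLM-RoBERTa vocabulary.
WORD_START = "\u2581"  # "▁", the SentencePiece marker for a piece that begins a word

toy_vocab = {
    "<s>": 0,
    "</s>": 2,
    WORD_START + "oui": 10,
    WORD_START + "non": 11,
    "s": 12,
}


def lookup_word(vocab, word):
    """Return (key, id) pairs for `word`, trying the bare and word-start forms."""
    candidates = [word, WORD_START + word]
    return [(key, vocab[key]) for key in candidates if key in vocab]


# A bare-string membership test misses the word-start entry in the toy vocabulary.
print("oui" in toy_vocab)
# Trying both forms finds it.
print(lookup_word(toy_vocab, "oui"))
```

With the real tokenizer, the same idea would be to test `"▁oui" in tokenizer.vocab` or to inspect the output of `tokenizer.tokenize("oui")`. Whether "oui" is stored that way in this model's vocabulary is an assumption that has not been verified here.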
