Generative AI models don't process text the same way humans do. Understanding their token-based internal environments can explain some of their strange behaviors – and stubborn limitations.
Most models, from small on-device models like Gemma to OpenAI's industry-leading GPT-4o, are built on an architecture called a transformer. Because of the way transformers make associations between text and other types of data, they can't ingest or output raw text – at least not without a huge amount of computation.
For pragmatic and technical reasons, today's transformer models work with text that has been broken down into smaller, bite-sized pieces called tokens – a process known as tokenization.
Tokens can be words, like “fantastic”. Or they can be syllables, like “fan”, “tas” and “tic”. Depending on the tokenizer – the model that performs the tokenization – they can even be individual letters in words (e.g. “f”, “a”, “n”, “t”, “a”, “s”, “t”, “i”, “c”).
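To see what that looks like in practice, here is a minimal sketch using the open-source tiktoken library as a stand-in tokenizer (the article doesn't name a specific one, so this is only an illustration; exact splits vary by tokenizer):

```python
# Minimal tokenization sketch. tiktoken is used here only as an example
# tokenizer (an assumption, not a claim about any particular model).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["fantastic", "Fantastic!", "antidisestablishmentarianism"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    # Common words often map to a single token; rarer or inflected words
    # get chunked into several subword pieces.
    print(f"{text!r} -> {pieces}")
```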
This approach allows transformers to take in more information (in the semantic sense) before hitting an upper limit known as the context window. However, tokenization can also introduce bias.
Some tokens have odd spacing, which can derail a transformer. For example, a tokenizer might encode “once upon a time” as “once”, “upon”, “a”, “time”, while encoding “once upon a ” (with a trailing space) as “once”, “upon”, “a”, “ ”. Depending on how a model is prompted – with “once upon a” or “once upon a ” – the results can be completely different, because the model doesn't understand (as a human does) that the meaning is the same.
Tokenizers also handle case differently. “Hello” isn't necessarily the same as “HELLO” to a model; “hello” is usually one token (depending on the tokenizer), while “HELLO” can be as many as three (“HE”, “El” and “O”). That's why many transformers fail the capital letter test.
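Both quirks are easy to observe directly. The sketch below again uses tiktoken purely as an illustrative tokenizer (an assumption, not a statement about any specific model):

```python
# Spacing and capitalization quirks, shown with tiktoken as an example.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def show(text: str) -> None:
    pieces = [enc.decode([i]) for i in enc.encode(text)]
    print(f"{text!r:22} -> {pieces}")

# The same words with and without extra whitespace produce different token
# sequences, even though a human reads them as identical.
show("once upon a time")
show("once upon a time ")
show(" once upon a time")

# Lowercase words are often a single token; all-caps versions may be split
# into several pieces.
show("hello")
show("HELLO")
```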
“It's kind of hard to get around the question of what exactly a ‘word' should be for a language model, and even if we got human experts to agree on a perfect token vocabulary, models would probably still find it useful to ‘chunk' things even further,” Sheridan Feucht, a PhD student studying large language model interpretability at Northeastern University, told TechCrunch. “My guess would be that there's no such thing as a perfect tokenizer due to this kind of fuzziness.”
This “fuzziness” creates even more problems in languages other than English.
Many tokenization methods assume that a space in a sentence denotes a new word. That's because they were designed with English in mind. But not all languages use spaces to separate words. Chinese and Japanese don't – nor do Korean, Thai or Khmer.
A 2023 Oxford study found that, due to differences in how non-English languages are tokenized, it can take twice as long for a transformer to complete a task in a language other than English. The same study – and another – found that users of less “token-efficient” languages are likely to see worse model performance yet pay more for usage, since many AI providers charge per token.
Tokenizers for logographic writing systems – systems in which printed symbols represent words without any relation to pronunciation, as in Chinese – often treat each character as a separate token, leading to high token counts. Similarly, tokenizers processing agglutinative languages – languages in which words are built from small meaningful elements called morphemes, as in Turkish – tend to turn each morpheme into a token, increasing the total number of tokens. (The equivalent word for “hello” in Thai, สวัสดี, is six tokens.)
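As a rough illustration of the gap, the sketch below counts tokens for a greeting in several languages. The counts depend entirely on the tokenizer; tiktoken is used here only as an example:

```python
# Token-count gap between languages, sketched with tiktoken as an example
# tokenizer (an assumption; other tokenizers will give different counts).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

greetings = {
    "English": "hello",
    "Thai": "สวัสดี",
    "Chinese": "你好",
    "Turkish": "merhaba",
}

for language, word in greetings.items():
    print(f"{language:8} {word!r:12} -> {len(enc.encode(word))} token(s)")
```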
In 2023, Yennie Jun, an AI researcher at Google DeepMind, conducted an analysis comparing the tokenization of different languages and its downstream effects. Using a dataset of parallel texts translated into 52 languages, Jun showed that some languages require up to 10 times more tokens to capture the same meaning as in English.
Beyond linguistic inequalities, tokenization could explain why today’s models are poor at mathematics.
Digits are rarely tokenized consistently. Because they don't really know what numbers are, tokenizers may treat “380” as one token but represent “381” as a pair (“38” and “1”) – effectively destroying the relationships between digits and the results in equations and formulas. The result is transformer confusion; a recent paper showed that models struggle to understand repetitive numerical patterns and context, particularly temporal data. (See: GPT-4 thinks 7,735 is greater than 7,926.)
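The inconsistency is easy to see by tokenizing a few numbers. Again, tiktoken is only an illustrative stand-in here; the exact splits differ across models:

```python
# How unevenly digits get split into tokens, sketched with tiktoken
# (an assumption, not the tokenizer of any particular model).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for number in ["380", "381", "7735", "7926", "1000000"]:
    pieces = [enc.decode([i]) for i in enc.encode(number)]
    # Some numbers come back as a single token, others as two or more chunks,
    # so the model never sees a consistent digit-by-digit representation.
    print(f"{number:>8} -> {pieces}")
```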
That's also why models aren't good at solving anagram problems or reversing words.
Tokenization clearly presents challenges for generative AI. Can these be solved?
Perhaps.
Feucht points to “byte-level” state space models like MambaByte, which can ingest far more data than transformers without a performance penalty because they do away with tokenization entirely. MambaByte, which works directly with the raw bytes representing text and other data, is competitive with some transformer models on language-analyzing tasks while better handling “noise” like words with swapped characters, odd spacing and capitalized characters.
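To see what “byte-level” means in practice, here is a tiny sketch of the raw-byte view of text such a model would consume (this isn't MambaByte itself, just an illustration of its input format):

```python
# The raw-byte view of text that a byte-level model consumes: no tokenizer,
# so nothing to mangle spacing, case or non-English scripts – but the
# sequences get much longer.
text = "Once upon a time สวัสดี"

byte_ids = list(text.encode("utf-8"))  # one integer (0-255) per byte
print(f"{len(text)} characters -> {len(byte_ids)} byte-level inputs")
print(byte_ids[:12], "...")
```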
However, models like MambaByte are still in the early research stages.
“It's probably best to let models look at characters directly without imposing tokenization, but right now that's just computationally infeasible for transformers,” Feucht said. “For transformer models specifically, computation scales quadratically with sequence length, and so we really want to use short text representations.”
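A back-of-the-envelope sketch of that quadratic scaling (illustrative numbers only):

```python
# Self-attention compares every position in a sequence with every other one,
# so the score matrix a transformer computes grows with the square of the
# sequence length – which is why longer byte-level inputs are so costly.
for n in [1_000, 10_000, 100_000]:  # sequence length, in tokens or bytes
    pairwise_scores = n * n  # one attention score per pair of positions
    print(f"length {n:>7,} -> {pairwise_scores:>15,} scores per layer and head")
```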
Unless there's a breakthrough in tokenization, new model architectures look to be the key.