
Tokenization

Here's a further explanation of tokenization, the process of identifying meaningful character sequences (tokens) in unstructured text. Splitting text on whitespace, on all non-alphanumeric characters, or on both will not always work well. For example, in this sentence:


In New York, Sean O'Shea can't get enough sleep.


There are two words for which the tokenization could vary:



shea

oshea

o'shea

o' shea

o shea


can't

cant

can t


. . . and we would not want to split 'New York' on whitespace, because it should be a single token.
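To make the problem concrete, here is a minimal sketch of the two naive strategies applied to the example sentence. The variable names are only illustrative, and the character class assumes plain ASCII text.

```python
import re

sentence = "In New York, Sean O'Shea can't get enough sleep."

# Strategy 1: split on whitespace only.
# Punctuation stays glued to the neighbouring word ('York,' and 'sleep.').
whitespace_tokens = sentence.split()

# Strategy 2: split on every run of non-alphanumeric characters.
# Apostrophes now break words apart ('O', 'Shea', 'can', 't').
alnum_tokens = [t for t in re.split(r"[^A-Za-z0-9]+", sentence) if t]

print(whitespace_tokens)
print(alnum_tokens)
```

Neither result keeps 'New York' together, and the second one turns "O'Shea" and "can't" into fragments.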


Some words, such as 'bona fides', may or may not be written with a space. It may be clear that a hyphenated word like 'e-discovery' should be one token and that a phrase like 'poorly-thought-out strategy' should consist of four tokens, but it is unclear whether a company name like 'Mercedes-Benz' should be one token or two.
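One possible way to handle hyphens is to split on them by default and keep a short list of exceptions. The sketch below is only an illustration of that idea: the `KEEP_HYPHENATED` set and both function names are hypothetical, and whether 'Mercedes-Benz' belongs on the list is exactly the kind of decision that has to be made case by case.

```python
import re

# Hypothetical keep-list: hyphenated forms we choose to treat as one token.
KEEP_HYPHENATED = {"e-discovery"}

def split_hyphens(word):
    """Return [word] if it is on the keep-list, else its hyphen-separated parts."""
    if word.lower() in KEEP_HYPHENATED:
        return [word]
    return [part for part in word.split("-") if part]

def tokenize_hyphen_aware(text):
    """Find words (optionally hyphenated), then apply the keep-list policy."""
    tokens = []
    for word in re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text):
        tokens.extend(split_hyphens(word))
    return tokens

print(tokenize_hyphen_aware("e-discovery"))
print(tokenize_hyphen_aware("poorly-thought-out strategy"))
print(tokenize_hyphen_aware("Mercedes-Benz"))
```

With this keep-list, 'Mercedes-Benz' comes out as two tokens; adding "mercedes-benz" to the set would make it one.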


A lexeme is a sequence of characters in the source data that matches the pattern for a token. Tokenization often works by using regular expressions to find lexemes in a stream of text and then categorizing each lexeme as a token of a particular type.
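As a rough illustration of regex-based tokenization, the following sketch combines several named patterns into one regular expression and reports each lexeme together with its category. The category names, the patterns, and the hard-coded 'New York' phrase are all assumptions made for this example; a real tokenizer would need a much larger phrase list and a deliberate policy for hyphens, which this sketch simply treats as punctuation.

```python
import re

# Hypothetical token categories. Order matters: earlier patterns win,
# so the multi-word phrase is tried before the general word pattern.
TOKEN_SPEC = [
    ("PHRASE", r"New York"),
    ("WORD", r"[A-Za-z]+(?:'[A-Za-z]+)*"),  # allows internal apostrophes
    ("NUMBER", r"\d+"),
    ("PUNCT", r"[^\w\s]"),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)

def tokenize(text):
    """Yield (category, lexeme) pairs; whitespace between matches is skipped."""
    for match in TOKEN_RE.finditer(text):
        yield match.lastgroup, match.group()

tokens = list(tokenize("In New York, Sean O'Shea can't get enough sleep."))
for category, lexeme in tokens:
    print(category, repr(lexeme))
```

Here 'New York', "O'Shea", and "can't" each come out as a single lexeme, while the comma and the final period are categorized separately as punctuation.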



