
Tokenization

Here's a further explanation of tokenization, the process of identifying meaningful character sequences (tokens) in unstructured text. Splitting on whitespace, on non-alphanumeric characters, or on both will not always work well. For example, consider this sentence:


In New York, Sean O'Shea can't get enough sleep.


There are two words for which the tokenization could vary:



shea

oshea

o'shea

o' shea

o shea


can't

cant

can t


. . . and we would not want whitespace to separate 'New York', which should be a single token.
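To make the problem concrete, here is a minimal sketch that applies both naive strategies to the example sentence. The variable names are illustrative only, and the character class assumes plain ASCII text.

```python
import re

sentence = "In New York, Sean O'Shea can't get enough sleep."

# Strategy 1: split on whitespace only.
# Punctuation stays glued to words ("York,", "sleep.") and "New York" is split in two.
whitespace_tokens = sentence.split()

# Strategy 2: split on every run of non-alphanumeric characters.
# Punctuation disappears, but apostrophes now break "O'Shea" and "can't" apart.
alnum_tokens = [t for t in re.split(r"[^A-Za-z0-9]+", sentence) if t]

print(whitespace_tokens)
print(alnum_tokens)
```

Neither result keeps 'New York' together, and the second one produces the 'o shea' and 'can t' variants listed above.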


Some terms, such as 'bona fides', may or may not be written with a space. It may be clear that a hyphenated word like 'e-discovery' should be one token and that a phrase like 'poorly-thought-out strategy' should consist of four tokens, but it is unclear whether a company name like 'Mercedes-Benz' should be one token or two.
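One possible way to encode hyphenation decisions like these is an explicit exception list. The sketch below is only an illustration: `KEEP_HYPHENATED` and `split_hyphens` are hypothetical names, and a real system would need a curated lexicon rather than a one-entry set.

```python
# Hypothetical exception list: hyphenated terms that should remain a single token.
KEEP_HYPHENATED = {"e-discovery"}

def split_hyphens(words):
    """Split hyphenated words into parts unless they are on the exception list."""
    tokens = []
    for word in words:
        if "-" in word and word.lower() not in KEEP_HYPHENATED:
            # Not on the list: treat each hyphen-separated part as its own token.
            tokens.extend(part for part in word.split("-") if part)
        else:
            tokens.append(word)
    return tokens

print(split_hyphens("poorly-thought-out strategy".split()))
print(split_hyphens(["e-discovery"]))
print(split_hyphens(["Mercedes-Benz"]))
```

With this list, 'Mercedes-Benz' is split into two tokens; adding it to `KEEP_HYPHENATED` would keep it as one. The code does not settle which choice is right, it only makes the decision explicit.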


A lexeme is a sequence of characters in the source data that matches the pattern for a token. Tokenization often works by using regular expressions to find lexemes in a stream of text, which are then categorized as tokens.
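As a rough sketch of that regular-expression approach, the following uses Python's `re` module with named groups so that each matched lexeme is labeled with a token category. The categories (`WORD`, `NUMBER`, `PUNCT`) and their patterns are assumptions made for illustration, not a standard scheme.

```python
import re

# Hypothetical token categories. Order matters: earlier alternatives are tried first.
TOKEN_SPEC = [
    ("WORD", r"[A-Za-z]+(?:['-][A-Za-z]+)*"),  # letters, with internal apostrophes or hyphens
    ("NUMBER", r"\d+(?:\.\d+)?"),              # integers and simple decimals
    ("PUNCT", r"[^\w\s]"),                     # any single punctuation character
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text):
    """Yield (category, lexeme) pairs for each lexeme found in text."""
    for match in TOKEN_RE.finditer(text):
        # lastgroup is the name of the category whose pattern matched.
        yield match.lastgroup, match.group()

tokens = list(tokenize("In New York, Sean O'Shea can't get enough sleep."))
for category, lexeme in tokens:
    print(category, lexeme)
```

Under these patterns, "O'Shea" and "can't" each survive as a single `WORD`, while the comma and the period become separate `PUNCT` tokens. 'New York' is still two tokens; joining multi-word names would need an additional step, such as a lookup list.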



