Tokenization
top of page

Tokenization

Here's a further explanation of tokenization, the process of identifying character sequences in unstructured text. Identifying tokens on the basis of whitespace and/or all non-alphanumeric characters will not always work well. For example, in this sentence:


In New York, Sean O'Shea can't get enough sleep.


There are two words for which the tokenization could vary



shea

oshea

o'shea

o' shea

o shea


can't

cant

can t


. . . and we would not want to use whitespace to separate, 'New York', which should be a single token.


Some words, such as 'bona fides', may or may not use a space. It may be clear that a hyphenated word like 'e-discovery' should be one token, a phrase like 'poorly-thought-out strategy' should consist of four tokens, but it's unclear whether or not a company name like 'Mercedes-Benz' should be one token or two.


Lexeme is the term used to identify a sequence of characters from source data that matches a token. Tokenization often works by using regular expressions to find lexemes in a stream of text, which are then categorized as tokens.




Sean O'Shea has more than 20 years of experience in the litigation support field with major law firms in New York and San Francisco.   He is an ACEDS Certified eDiscovery Specialist and a Relativity Certified Administrator.

The views expressed in this blog are those of the owner and do not reflect the views or opinions of the owner’s employer.

If you have a question or comment about this blog, please make a submission using the form to the right. 

Your details were sent successfully!

© 2015 by Sean O'Shea . Proudly created with Wix.com

bottom of page