
LITIGATION SUPPORT TIP OF THE NIGHT

Featured on the ACEDS blog.

The views expressed in this blog are those of the owner and do not reflect the views or opinions of the owner’s employer. All content provided on this blog is for informational purposes only. The owner of this blog makes no representations as to the accuracy or completeness of any information on this site or found by following any link on this site. The owner will not be liable for any errors or omissions in this information nor for the availability of this information. The owner will not be liable for any losses, injuries, or damages from the display or use of this information. This policy is subject to change at any time. The owner is not an attorney, and nothing posted on this site should be construed as legal advice. Litigation Support Tip of the Night does not provide confirmation that any e-discovery technique or conduct is compliant with legal, regulatory, contractual or ethical requirements.

See my post on Running Regex Searches With a Grep Utility on the ILTA litigation support blog.

New tips for paralegals and litigation support professionals are posted to this site each week. Click on the blog headings for more detail.

See How-To Videos on my YouTube channel.

Nov 6, 2020

Tokenization

Here's a further explanation of tokenization, the process of identifying character sequences in unstructured text. Identifying tokens on the basis of whitespace and/or all non-alphanumeric characters will not always work well. For example, in this sentence:

In New York, Sean O'Shea can't get enough sleep.

There are two words for which the tokenization could vary:

shea

oshea

o'shea

o' shea

o shea

can't

cant

can t

. . . and we would not want to use whitespace to separate 'New York', which should be a single token.
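To see how the two naive approaches treat the example sentence, here is a minimal Python sketch. The variable names are my own, and lowercasing the text first is an assumption made only to match the lowercase variants listed above.

```python
import re

sentence = "In New York, Sean O'Shea can't get enough sleep."

# Approach 1: split on whitespace only. Punctuation stays attached
# to the neighboring word ("york," and "sleep.").
whitespace_tokens = sentence.lower().split()

# Approach 2: split on every run of non-alphanumeric characters.
# The apostrophes break "o'shea" and "can't" into two pieces each.
alnum_tokens = [t for t in re.split(r"[^a-z0-9]+", sentence.lower()) if t]

print(whitespace_tokens)
print(alnum_tokens)
```

The first approach keeps "o'shea" and "can't" whole but leaves "york," and "sleep." with punctuation attached. The second produces "o", "shea", "can" and "t". Neither one treats 'New York' as a single token.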

Some terms, such as 'bona fides', may or may not be written with a space. It may be clear that a hyphenated word like 'e-discovery' should be one token and that a phrase like 'poorly-thought-out strategy' should consist of four tokens, but it is unclear whether a company name like 'Mercedes-Benz' should be one token or two.
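One way to handle the hyphen question is to keep a list of terms that should stay joined and split everything else. This is only a sketch of that idea. The function name split_hyphens and the KEEP_JOINED list are hypothetical, and deciding what goes on the list is a judgment call for each project.

```python
# Hypothetical allow-list of hyphenated terms to keep as single tokens.
KEEP_JOINED = {"e-discovery"}

def split_hyphens(text, keep_joined=KEEP_JOINED):
    """Split text on whitespace, then split hyphenated words
    unless they appear in the keep_joined set."""
    tokens = []
    for word in text.lower().split():
        # Drop punctuation attached to the start or end of the word.
        word = word.strip(".,;:!?\"'")
        if not word:
            continue
        if word in keep_joined:
            tokens.append(word)
        else:
            tokens.extend(part for part in word.split("-") if part)
    return tokens

print(split_hyphens("A poorly-thought-out e-discovery strategy"))
print(split_hyphens("Mercedes-Benz"))
print(split_hyphens("Mercedes-Benz", {"mercedes-benz"}))
```

With the default list, 'Mercedes-Benz' comes out as two tokens. Passing a set that contains 'mercedes-benz' keeps it as one, which shows that the answer depends on the policy you choose and not on the text itself.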

Lexeme is the term used to identify a sequence of characters from source data that matches a token. Tokenization often works by using regular expressions to find lexemes in a stream of text, which are then categorized as tokens.
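As a rough illustration of regular expressions finding lexemes and sorting them into token categories, here is a small sketch using Python's re module. The category names WORD, NUMBER and PUNCT and the patterns themselves are my own assumptions, not a standard.

```python
import re

# Hypothetical token categories, each paired with a pattern for its lexemes.
TOKEN_SPEC = [
    ("WORD", r"[A-Za-z]+(?:'[A-Za-z]+)*"),  # letters, allowing internal apostrophes
    ("NUMBER", r"\d+"),                      # runs of digits
    ("PUNCT", r"[^\w\s]"),                   # any single punctuation character
]

# Combine the patterns into one regex with a named group per category.
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def lex(text):
    """Return (category, lexeme) pairs found in the text, in order."""
    return [(m.lastgroup, m.group()) for m in MASTER.finditer(text)]

for category, lexeme in lex("Sean O'Shea can't sleep."):
    print(category, lexeme)
```

Here "O'Shea" and "can't" are each matched as a single WORD lexeme because the pattern allows an apostrophe between letters. Changing that one pattern would change the tokenization, as in the variants listed earlier.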
