Unicode compliance

May 27, 2020

When considering which electronic discovery software to use, confirm that its processing is Unicode compliant, meaning that it can search and display foreign-language documents written in non-Latin scripts such as Chinese, Japanese, Korean, or Cyrillic. Processing that is not Unicode compliant may render text as boxes or random symbols, something most of us have had the misfortune of encountering before.
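To make the failure concrete, here is a minimal Python sketch of two ways text gets damaged. The sample string and both helper functions are hypothetical illustrations, not the behavior of any particular e-discovery product.

```python
# Hypothetical sample text; not taken from any real document set.
sample = "Привет, мир"  # Russian for "Hello, world"


def to_ascii_lossy(text: str) -> str:
    """Simulate a tool that forces text into ASCII and drops what does not fit."""
    return text.encode("ascii", errors="replace").decode("ascii")


def misread_utf8_as_latin1(text: str) -> str:
    """Simulate reading UTF-8 bytes with the wrong single-byte encoding."""
    return text.encode("utf-8").decode("latin-1")


lossy = to_ascii_lossy(sample)
garbled = misread_utf8_as_latin1(sample)

print(lossy)          # every Cyrillic letter becomes "?"
print(repr(garbled))  # a run of unrelated accented letters and control codes

# The second kind of damage is reversible if the original bytes were kept.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == sample)
```

The first kind of damage is permanent because the original characters are discarded. The second can often be undone, which is one reason to preserve the native files rather than only the extracted text.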

UNICODE

ASCII

The ASCII character encoding is limited to 128 characters and covers only the unaccented Latin alphabet, digits, and basic punctuation. The Unicode standard has room for more than a million different characters, and its UTF-8 encoding can represent all of them, including Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Chinese, Japanese, Korean, and other major scripts. The first 128 characters of UTF-8 are the same as the 128 characters of ASCII, so plain ASCII text is already valid UTF-8. Most text on web pages uses the UTF-8 encoding. It's important to confirm with an e-discovery vendor that the tools they use for data collection and processing support Unicode.
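The relationship between ASCII and UTF-8 can be checked with a few lines of Python. This is only an illustrative sketch; the sample characters are arbitrary choices.

```python
import sys

# Build a string holding all 128 ASCII characters and encode it both ways.
ascii_text = "".join(chr(i) for i in range(128))
identical = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(identical)  # True: ASCII text is already valid UTF-8

# Size of the Unicode code space as seen by Python.
code_points = sys.maxunicode + 1
print(code_points)  # 1114112


def utf8_length(ch: str) -> int:
    """Return how many bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))


# One sample character each from Latin, Greek, Cyrillic, Hebrew, Chinese, Korean.
widths = {ch: utf8_length(ch) for ch in ["A", "Ω", "Я", "א", "中", "한"]}
print(widths)
```

Latin letters take one byte, the Greek, Cyrillic, and Hebrew samples take two, and the Chinese and Korean samples take three. A tool that assumes one byte per character will therefore split these characters apart.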

Not all data is in Unicode. Email systems may use other encodings, some of them proprietary. In Japan, the Shift-JIS encoding is widely used for email text. MS Exchange uses Unicode for PST archives. However, individual email messages saved with a .msg extension don't use Unicode for the email header fields. Tools that collect local .msg files may garble the text of email headers unless an adjustment is made for the Outlook encoding.
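One way a processing tool might cope with non-Unicode text is to try a list of candidate encodings in order. The sketch below is an assumption about how such a fallback could look, not how any specific tool works; the function name `decode_with_fallback` and the candidate list are hypothetical.

```python
from typing import Sequence, Tuple


def decode_with_fallback(
    data: bytes,
    candidates: Sequence[str] = ("utf-8", "shift_jis", "cp1252"),
) -> Tuple[str, str]:
    """Try each candidate encoding in turn; return (text, encoding_used)."""
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: keep what can be read and mark the rest as unreadable.
    return data.decode("utf-8", errors="replace"), "utf-8 (lossy)"


# Simulated message bodies (hypothetical sample: Japanese for "invoice").
sjis_bytes = "請求書".encode("shift_jis")
utf8_bytes = "請求書".encode("utf-8")

sjis_text, sjis_encoding = decode_with_fallback(sjis_bytes)
utf8_text, utf8_encoding = decode_with_fallback(utf8_bytes)

print(sjis_text, sjis_encoding)
print(utf8_text, utf8_encoding)
```

Trial decoding is a heuristic: a permissive single-byte encoding late in the list can accept bytes that really belong to something else. When a message declares its character set in a header, that declaration is a better starting point than guessing.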

Even if processing software is Unicode compliant, it will still be necessary to use separate language detection software to determine which languages are present. Identifying the encoding can determine the script, but not necessarily the language. Cyrillic text, for example, could be Russian, Ukrainian, or Bulgarian.
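The gap between script and language can be shown with Python's standard `unicodedata` module. The sketch below uses the first word of each character's Unicode name as a rough stand-in for its script; that shortcut is an assumption made for illustration and is not the official Unicode Script property.

```python
import unicodedata
from collections import Counter


def scripts_in(text: str) -> Counter:
    """Count letters by the first word of their Unicode name (a rough script label)."""
    counts: Counter = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "UNKNOWN")
            counts[name.split(" ")[0]] += 1
    return counts


russian = "Привет"    # "Hello" in Russian
ukrainian = "Привіт"  # "Hello" in Ukrainian

print(scripts_in(russian))
print(scripts_in(ukrainian))
print(scripts_in("Hello"))
```

Both Cyrillic words produce the same script count, so nothing at the character level says which language is present. That decision needs a language detection step that looks at vocabulary or letter statistics.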

LITIGATION SUPPORT TIP OF THE NIGHT
