
Searching for Non-ASCII Text in Text Files

Here's a supplement to the Tip of the Night for January 22, 2022, which discussed how to format a text file so that it loads without errors in Lexis TextMap. When preparing a text file to load into a deposition transcript review application, be sure to remove any text that is not ASCII, the encoding standard widely used for transcripts.


Commonly used characters such as:

  1. A dash – [which can be replaced with a hyphen - ]

  2. Curly quotes “ [which can be replaced with straight quotes "]

  3. Smart apostrophes ‘ [which can be replaced with a straight apostrophe ' ]

. . . are not ASCII text. When a platform like TextMap loads a file containing these characters, they are not converted correctly, which results in garbled text such as:

Let’s

. . . instead of:

Let's
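The garbling above is consistent with UTF-8 bytes being read as Windows-1252; that TextMap does exactly this internally is an assumption, not something confirmed here. The minimal Python sketch below reproduces the effect and then swaps the characters listed above for their ASCII equivalents. The function names are hypothetical, and the em dash and the closing quote forms are added to the replacement table as an assumption, since only the opening forms appear in the list.

```python
# Replacement table for the characters listed above.
# Assumption: the em dash and the closing quote forms are included as well.
REPLACEMENTS = str.maketrans({
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
    "\u201c": '"',   # left curly double quote
    "\u201d": '"',   # right curly double quote
    "\u2018": "'",   # left curly single quote
    "\u2019": "'",   # right curly single quote / smart apostrophe
})


def to_ascii_punctuation(text: str) -> str:
    """Replace curly quotes and dashes with their plain ASCII equivalents."""
    return text.translate(REPLACEMENTS)


def garble(text: str) -> str:
    """Simulate UTF-8 bytes being misread as Windows-1252 (an assumed cause)."""
    # errors="replace" avoids an exception for bytes Windows-1252 leaves undefined.
    return text.encode("utf-8").decode("cp1252", errors="replace")


original = "Let\u2019s"                  # "Let" + smart apostrophe + "s"
print(garble(original))                 # Letâ€™s
print(to_ascii_punctuation(original))   # Let's
```

Running the replacement step before loading the file leaves only plain ASCII punctuation, so there is nothing left to be misread.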


So if you see an em dash in a text editor like this:



. . . in TextMap it will display like this:



You can find non-ASCII characters in Notepad++ by going to Search . . . Find characters in range.



A dialog box will open that will give you the option to search for non-ASCII characters.



. . . this will allow you to jump to each non-ASCII character in the text file one by one.


If you're not using Notepad++, you can run this regular expression search to find any non-ASCII characters:

[^\x00-\x7F]+
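The expression matches any run of one or more characters outside the range 0x00 to 0x7F, which is the full ASCII range. As a rough sketch of how it could be used outside a text editor, the Python snippet below reports the line and column of each match. The function name and the sample text are hypothetical, chosen only for illustration.

```python
import re

# The same pattern as above: one or more characters outside 0x00-0x7F.
NON_ASCII = re.compile(r"[^\x00-\x7F]+")


def find_non_ascii(text: str) -> list:
    """Return (line_number, column, run) for every run of non-ASCII characters."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in NON_ASCII.finditer(line):
            # Columns are reported 1-based, as most text editors show them.
            hits.append((line_no, match.start() + 1, match.group()))
    return hits


# Hypothetical transcript excerpt with a smart apostrophe and an en dash.
sample = "Q. Let\u2019s begin.\nA. Yes \u2013 I agree."
for line_no, col, run in find_non_ascii(sample):
    code_points = ", ".join(f"U+{ord(ch):04X}" for ch in run)
    print(f"line {line_no}, column {col}: {code_points}")
```

To check a real transcript, the text could be read with `open(path, encoding="utf-8").read()` and passed to the same function; an empty result means the file is ASCII-only.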





