Searching for Non-ASCII Text in Text Files
Here's a supplement to the Tip of the Night for January 22, 2022, which discussed how to format a text file so that it loads without errors in Lexis TextMap. When preparing a text file to load into a deposition transcript review application, be sure to remove any characters that are not ASCII, the encoding standard widely used for transcripts.
Commonly used characters such as:
A dash – [which can be replaced with a hyphen - ]
Curly quotes “ [which can be replaced with straight quotes "]
Smart apostrophes ‘ [which can be replaced with a straight apostrophe ' ]
. . . are not ASCII text. When a platform like TextMap loads a file containing these characters, they are not converted correctly, which results in garbled text such as:
Let’s
. . . instead of:
Let's
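One plausible cause of this garbling is that the file was saved as UTF-8 and then read as Windows-1252, so the three bytes of the curly apostrophe each appear as a separate character. That is an assumption; the encodings TextMap actually uses are not stated here. The Python sketch below reproduces the effect and then applies the replacements listed above. The names `REPLACEMENTS`, `to_ascii`, and `garble` are illustrative and not part of any TextMap tooling.

```python
# Illustrative mapping of common "smart" characters to ASCII equivalents.
REPLACEMENTS = {
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
    "\u201c": '"',   # left curly double quote
    "\u201d": '"',   # right curly double quote
    "\u2018": "'",   # left curly single quote
    "\u2019": "'",   # right curly single quote / smart apostrophe
}

def to_ascii(text: str) -> str:
    """Replace the characters in REPLACEMENTS with their ASCII stand-ins."""
    return text.translate(str.maketrans(REPLACEMENTS))

def garble(text: str) -> str:
    """Simulate reading UTF-8 bytes as Windows-1252 (an assumed cause)."""
    return text.encode("utf-8").decode("cp1252", errors="replace")

original = "Let\u2019s"           # "Let" + curly apostrophe + "s"
garbled = garble(original)

print(len(original), len(garbled))  # 5 7: one character became three
print(to_ascii(original))           # Let's
```

Running the replacement before loading the file leaves only characters that a single-byte reader can display unchanged.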
So if you see an em dash in a text editor like this:
. . . in TextMap it will display like this:
You can find non-ASCII characters in Notepad++ by going to Search > Find characters in range
A dialog box will open that will give you the option to search for non-ASCII characters.
. . . this will allow you to jump to each non-ASCII character in the text file one by one.
If you're not using Notepad++, you can run this regular expression search to find any non-ASCII characters:
[^\x00-\x7F]+
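The same pattern can also be run outside a text editor. As a minimal sketch, the Python snippet below applies it line by line and reports where each run of non-ASCII characters occurs. The `find_non_ascii` helper and the sample text are hypothetical and used only for illustration.

```python
import re

# Same pattern as above: one or more characters outside the ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7F]+")

def find_non_ascii(text: str):
    """Return (line number, column, matched text) for each non-ASCII run."""
    hits = []
    # Split on "\n" only, so unusual Unicode line separators are still reported.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for match in NON_ASCII.finditer(line):
            hits.append((line_no, match.start() + 1, match.group()))
    return hits

sample = "Q. Let\u2019s begin.\nA. Yes \u2013 of course."
for line_no, col, chars in find_non_ascii(sample):
    # !a prints the match as an ASCII-safe escape such as '\u2019'
    print(f"line {line_no}, column {col}: {chars!a}")
```

To check a real transcript, read its contents into a string first, for example with `open(path, encoding="utf-8").read()`, assuming the file is saved as UTF-8.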