There have been a number of times in my career when I’ve needed to convert files to plain text. That means plain: no smart quotes, Unicode, extended ASCII characters, or other funny business. Here’s how to use Notepad++ to quickly remove all of these characters from a text file when your plaintext isn’t plain enough.
Filter Unicode and extended ASCII: Notepad++ to the rescue
First, open the file in Notepad++, the open source text editor for Windows. You can do this job with a simple search and replace. You can pull up search and replace from the menu, or just hit Ctrl+H. Ctrl+H is one of my favorite keyboard shortcuts. Remember it as hunt.
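Notepad++ has several search options. Be sure to choose regular expression search mode. Then copy and paste this into the Find what field: [^\x00-\x7F]+
That pattern matches any run of one or more characters outside the 7-bit ASCII range, which is hex 00 through 7F.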
Under Replace with, enter something suitable. If you just need to confirm you caught every problematic character and plan to clean up manually afterward, use a character sequence that isn’t likely to show up elsewhere in the text, like two ampersands. That makes each replacement easy to find later. If you just want to clean up the file and be done with it, I suggest you use a space.
Click Replace All, and Notepad++ will replace every run of those problematic characters with what you specified.
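If you'd rather script it, the same search and replace can be sketched in Python. This is a minimal sketch, not part of the Notepad++ workflow above: it assumes the text is already loaded into a string, and the function name replace_non_ascii is made up for illustration.

```python
import re

# Same pattern as the Notepad++ "Find what" field: one or more
# consecutive characters outside the 7-bit ASCII range.
NON_ASCII_RUN = re.compile(r"[^\x00-\x7F]+")


def replace_non_ascii(text: str, replacement: str = " ") -> str:
    """Replace each run of non-ASCII characters with `replacement`.

    replace_non_ascii is a hypothetical helper name, not a library function.
    """
    # A function is used as the replacement so that any backslashes in
    # `replacement` are inserted literally instead of being treated as escapes.
    return NON_ASCII_RUN.sub(lambda _match: replacement, text)


# Curly quotes around "Hello", marked with two ampersands for later cleanup.
sample = "\u201cHello\u201d world"
print(replace_non_ascii(sample, "&&"))
```

Because the pattern ends in +, a run of several adjacent non-ASCII characters collapses into a single replacement, just as it does in Notepad++.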
If you have a small number of files to clean up, Notepad++ is a convenient way to handle the problem and get you out of the bind.
Where problematic characters come from
Some software likes to insert problematic characters like these. Smart quotes are a common culprit, but the Unicode zero-width space, U+200B, is another notorious example. And I recently found non-standard-width spaces in, of all things, a vulnerability scan.
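Invisible characters like U+200B are hard to spot by eye, so it can help to list what is actually in the text before replacing anything. Here is a rough sketch using Python's standard unicodedata module. The function name non_ascii_report and the sample string are invented for illustration.

```python
import unicodedata
from collections import Counter


def non_ascii_report(text: str) -> Counter:
    """Count each non-ASCII character, keyed by code point and Unicode name.

    non_ascii_report is a hypothetical helper name.
    """
    counts = Counter()
    for ch in text:
        if ord(ch) > 0x7F:
            # Some code points have no name, so supply a fallback label.
            name = unicodedata.name(ch, "UNNAMED")
            counts[f"U+{ord(ch):04X} {name}"] += 1
    return counts


# Made-up sample: curly quotes plus a zero-width space hidden mid-sentence.
sample = "\u201cquoted\u201d text with a hidden\u200bspace"
for label, count in non_ascii_report(sample).items():
    print(count, label)
```

Seeing the names often points to the source, such as a word processor for quotation marks or a web export for zero-width spaces.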
I was processing a CSV file at work in Python and getting errors like UnicodeEncodeError: 'charmap' codec can't encode character '\u0147' in position 1297: character maps to <undefined>. Extended ASCII, ANSI, and Unicode characters in plaintext files sometimes cause problems like that.
I opened the files in Notepad++ and did a find/replace, replacing [^\x00-\x7F]+ with a space character, to clean up the problematic characters so the files would parse.
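When there are too many files to open by hand, the same cleanup could be scripted. The sketch below is one possible approach, not what I did at the time. It assumes the input is UTF-8 (adjust the encoding argument if yours isn't), and the function name clean_file and the file names in the usage comment are hypothetical.

```python
import re

# Same pattern used in the Notepad++ find/replace.
NON_ASCII_RUN = re.compile(r"[^\x00-\x7F]+")


def clean_file(src, dst, replacement=" ", encoding="utf-8"):
    """Write a copy of src to dst with non-ASCII runs replaced.

    Returns the number of runs replaced. clean_file is a hypothetical
    helper; the UTF-8 default is an assumption about the input.
    """
    # newline="" leaves line endings untouched, which matters for CSV files.
    with open(src, encoding=encoding, newline="") as f:
        text = f.read()
    cleaned, count = NON_ASCII_RUN.subn(lambda _match: replacement, text)
    # Writing as ASCII fails loudly if anything non-ASCII slipped through.
    with open(dst, "w", encoding="ascii", newline="") as f:
        f.write(cleaned)
    return count


# Example call, with made-up file names:
# replaced = clean_file("scan_results.csv", "scan_results_clean.csv")
```

Writing to a separate output file keeps the original intact, so you can compare the two if the replacement count looks higher than expected.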