u200b: What it is, and why it messes up your code or data

Last Updated on September 6, 2023 by Dave Farquhar

I was pushing some old data through an API at work when I received a weird error message. The API coughed up a hairball. It responded that I had u200b at position 154, and if I needed that character, I’d have to encode it. But I looked at position 154 and it was a number. Nothing weird. Some APIs render this problem character as u+200b. So what’s u200b, why does a problematic invisible character exist, and how do you clean it up?

U+200B, which you’ll often see written as u200b, is the Unicode zero-width space, a non-printing character. It’s meant to assist typographers in creating page layouts, and it’s extremely useful in certain languages that don’t use the Roman alphabet. But many of us who use the Roman alphabet may go a lifetime without needing it.

Table of contents

  1. u200b or u+200b character: the invisible space
  2. What ligatures are and what they have to do with non-printing spaces
  3. The problem with u200b or u+200b
  4. Filtering non-printing spaces
  5. Filtering u200b in Python

u200b or u+200b character: the invisible space

[Image: u200b, alias u+200b] Combining characters into a ligature makes text look nicer, but it can also change meanings, especially in certain languages. To keep the letter combination “fl” from looking like a capital “A,” you insert the character u200b or u+200b between them. Or tricksters can inject it to cause problems for computers.

u200b, sometimes rendered as u+200b, is an invisible space. It hadn’t been invented yet when I was in journalism school, but it would have neatly solved the toughest problem they ever gave me in my magazine design class. They handed me some text that just looked awful on a printed page, and we had to make it flow neatly. Strategic use of a non-printing space would have allowed me to give the page layout software some hints where it could break up words and numbers so my right margins didn’t look like a worn-out saw blade.

When u200b falls at the end of a line, the computer treats it like a space and wraps the text there. When u200b falls in the middle of a line, the computer just skips it.

Our software is better than it was in the early ’90s, so designers rarely need manual hints anymore.

But u200b still solves a problem with ligatures. Ligatures are another obscure topic from my j-school days that you probably see all the time, but may have never noticed or thought about.

What ligatures are and what they have to do with non-printing spaces

When certain characters occur next to each other, you can blend their elements into a single compound character. One of the most common examples is when the letters “f” and “i” appear next to each other. The dot in the i blends into the tip of the curve of the f, and the cross stroke from the f blends into the i’s downstroke. Scribes in ancient times found that ligatures saved time, and after the invention of movable type, typesetters continued using them. Ligatures are a hallmark of traditional, high-end typography. One of the main reasons a printed page from Adobe InDesign looks better than the same page from Microsoft Publisher is Adobe’s use of ligatures.

The problem with ligatures is that in some languages, they change the meaning. In the Roman alphabet, the ligature you get from combining “f” and “l” illustrates the problem. In most fonts, it looks an awful lot like the capital letter “A.” Since “flame” is a word and “Aame” isn’t, we can read that without getting confused, at least in English. But I know of a situation in another language where use of ligatures changes the word for “Wednesday” into “environment.” A comedy writer could have all sorts of fun with conditions like that, but in normal situations where you want meanings to be precise, u200b solves the problem. By inserting u200b to prevent the ligature, you can let Wednesday be Wednesday, and environment be environment, without forcing the reader to get it from context.

Every language has situations like that. Shakespeare makes more sense once you know he pronounced the words “oar,” “hour,” and “whore” the same way. Yes, I’m telling you Shakespeare’s plays weren’t rated G. Some languages’ issues are more prominent when spoken, and some are more prominent when written.

But while u200b can solve problems for humans, it can make problems for computers.

The problem with u200b or u+200b

The problem with u+200b is it’s a character that computers see and humans don’t. If I type the word “flower” with and without u200b, it still looks like the same word to you, but to the computer it isn’t. This confuses find and replace, for example. Tricksters can use this to prevent matches where things should match. And that’s what happened to me with that data I was pushing around. The conditions that existed three years ago when the data was created ignored u200b. Under the conditions I’m using today, that u200b character that slipped into the data caused problems.
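Here’s a quick way to see the mismatch for yourself. This is a minimal sketch, not taken from the data I was working with: the variable names are made up for illustration, and it uses only Python’s standard unicodedata module.

```python
import unicodedata

plain = "flower"
sneaky = "f\u200blower"  # zero-width space hidden between "f" and "l"

# The two strings usually look the same on screen, but they are not equal.
same = plain == sneaky
lengths = (len(plain), len(sneaky))

# A substring search fails too, which is why find and replace gets confused.
found = "flower" in sneaky

# unicodedata reveals what the hidden character actually is.
hidden_name = unicodedata.name(sneaky[1])

print(same, lengths, found, hidden_name)
# False (6, 7) False ZERO WIDTH SPACE
```

The extra character counts toward the length and breaks both the equality test and the substring search, even though nothing looks wrong.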

What a u200b character was doing in that data, I don’t know or care. I just needed to be able to push that data before close of business. So here’s how I got there.

Filtering non-printing spaces like u200b or u+200b

I cleaned up my data using a tool to filter out nonprintable characters like u200b. A better method would be for my program to filter the data, just in case. As a security guy, I should know people are going to do weird things to be cute, subversive, or just because they can.
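Before filtering anything, it can help to see exactly where the invisible characters are hiding. The following is a rough sketch under a couple of assumptions: find_invisible is a hypothetical helper name, and it flags every character in Unicode’s “format” category (Cf), which includes u200b but also others you may want to keep. The index it reports is zero-based, which may or may not match how a given API counts positions.

```python
import unicodedata

def find_invisible(text):
    """Return (index, code point, name) for each format character in text."""
    hits = []
    for index, char in enumerate(text):
        # Category "Cf" covers format characters such as U+200B.
        if unicodedata.category(char) == "Cf":
            name = unicodedata.name(char, "UNKNOWN")
            hits.append((index, f"U+{ord(char):04X}", name))
    return hits

# Hypothetical sample: a number with a zero-width space tucked inside it.
sample = "order 1\u200b54"
print(find_invisible(sample))
# [(7, 'U+200B', 'ZERO WIDTH SPACE')]
```

A report like this would have told me right away why position 154 looked like an ordinary number.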

If the data resides on your system, here’s how you can filter problem characters in Notepad++.

I use and recommend Python, so I’ll also present a way to filter u200b characters in Python, which can be helpful when dealing with an API.

Filtering u200b in Python

Nothing works like an example. So here’s a snippet that filters the data in Python before passing it to a routine that makes the API call.

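# loop through each element of my data
for i in data:
    # delete all instances of u200b/u+200b (replace() returns a new string)
    i = i.replace('\u200b', '')
    call_api(key, i)

The key is line 4, which replaces all instances of u200b in the variable in question (my loop variable, in this case) with nothing and assigns the result back. Python strings are immutable, so calling replace() without the assignment would leave the data unchanged.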
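If you want to strip more than one invisible character in a single pass, str.translate is one option. This is only a sketch: strip_invisible and INVISIBLE are hypothetical names, and the choice to also drop U+FEFF (the zero-width no-break space, often seen as a byte order mark) is my assumption about what tends to cause trouble. Extend the set only with characters you’re sure your data never needs.

```python
# Characters to strip: zero-width space and zero-width no-break space (BOM).
INVISIBLE = "\u200b\ufeff"

# Three-argument maketrans maps every character in the third argument to None.
_TABLE = str.maketrans("", "", INVISIBLE)

def strip_invisible(text):
    """Return text with every character listed in INVISIBLE removed."""
    return text.translate(_TABLE)

# Hypothetical sample data standing in for the records sent to the API.
data = ["4\u200b2", "plain", "\ufeffheader"]
cleaned = [strip_invisible(item) for item in data]
print(cleaned)
# ['42', 'plain', 'header']
```

Building the translation table once and reusing it keeps the loop simple when there are a lot of records to clean.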

The same concept works in other scripting or programming languages, of course.
