u200b: What it is, and why it messes up your code or data

I was pushing some old data through an API at work when I received a weird error message. The API coughed up a hairball. It responded that I had u200b at position 154, and if I needed that character, I’d have to encode it. But I looked at position 154 and it was a number. Nothing weird. So what’s u200b, why does a problematic character exist, and how do you clean it up?

U200b, written U+200B in Unicode notation and officially named the zero-width space, is a non-printing space. It’s meant to assist typographers in doing page layouts, and it’s extremely useful in certain languages that don’t use the Roman alphabet. But those of us who use the Roman alphabet may go a lifetime without needing it.

What u200b does

Combining certain letters into a single glyph, called a ligature, makes text look nicer, but it also can change meanings, especially in certain languages. To keep the letter combination “fl” from looking like a capital “A,” you insert the character u200b between them. But this can also cause problems for computers.

u200b is an invisible space. It hadn’t been invented yet when I was in journalism school, but it would have neatly solved the toughest problem they ever gave me in my magazine design class. They handed me some text that just looked awful on a printed page, and we had to make it flow neatly. Strategic use of a non-printing space would have allowed me to give the page layout software some hints where it could break up words and numbers so my right margins didn’t look like a worn-out saw blade.

When u200b falls at a point where the line needs to break, the computer treats it like a space and wraps the text there. When u200b falls in the middle of a line, the computer just skips it.

Our software is better than it was in the early 90s, so designers rarely have to use manual hints now.

But u200b still solves a problem with ligatures. Ligatures are another obscure topic from my j-school days that you probably see all the time, but may have never noticed or thought about.

What ligatures are and what they have to do with non-printing spaces

When certain characters occur next to each other, you can blend their elements into a single compound character. One of the most common examples is when the letters “f” and “i” appear next to each other. The dot in the i blends into the tip of the curve of the f, and the cross stroke from the f blends into the i’s downstroke. Scribes in ancient times found that ligatures saved time, and after the invention of movable type, typesetters continued using them. Ligatures are a hallmark of traditional, high-end typography. One of the main reasons a printed page from Adobe InDesign looks better than the same page from Microsoft Publisher is Adobe’s use of ligatures.

The problem with ligatures is that in some languages, they change the meaning. In the Roman alphabet, the ligature you get from combining “f” and “l” illustrates the problem. In most fonts, it looks an awful lot like the capital letter “A.” Since “flame” is a word and “Aame” isn’t, we can read that without getting confused, at least in English. But I know of a situation in another language where use of ligatures changes the word for “Wednesday” into “environment.” A comedy writer could have all sorts of fun with conditions like that, but in normal situations where you want meanings to be precise, u200b solves the problem. By inserting u200b to prevent the ligature, you can let Wednesday be Wednesday, and environment be environment, without forcing the reader to get it from context.

Every language has situations like that. Shakespeare makes more sense once you know he pronounced the words “oar,” “hour,” and “whore” the same way. Yes, I’m telling you Shakespeare’s plays weren’t rated G. Some languages’ issues are more prominent when spoken, and some are more prominent when written.

But while u200b can solve problems for humans, it can make problems for computers.

The problem with u200b

The problem with u200b is it’s a character that computers see and humans don’t. If I type the word “flower” with and without u200b, it still looks like the same word to you, but to the computer it isn’t. This confuses find and replace, for example. Tricksters can use this to prevent matches where things should match. And that’s what happened to me with that data I was pushing around. The conditions that existed three years ago when the data was created ignored u200b. Under the conditions I’m using today, that u200b that slipped into the data caused problems.
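To make that concrete, here’s a minimal Python sketch. The variable names are mine and purely illustrative. It shows two strings that print identically but don’t compare as equal, plus one way to reveal and locate the hidden character.

```python
# Two strings that look identical on screen; the second hides U+200B after the "f".
plain = "flower"
sneaky = "f\u200blower"

same = plain == sneaky                # False: the computer sees different strings
found = "flower" in sneaky            # False: a plain search misses the match
lengths = (len(plain), len(sneaky))   # the second string is one character longer

# ascii() escapes every non-ASCII character, which makes the intruder visible.
revealed = ascii(sneaky)

# Positions (0-based) of every U+200B in the string.
positions = [i for i, ch in enumerate(sneaky) if ch == "\u200b"]

print(same, found, lengths, revealed, positions)
```

Printing the escaped form is a quick way to see why a position that looks like an ordinary digit on screen can still trip an error.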

What u200b was doing in that data, I don’t know or care. I just needed to be able to push that data before close of business. So here’s how I got there.

Filtering non-printing spaces

I cleaned up my data using a tool to filter out nonprintable characters like u200b. A better method would be for my program to filter the data, just in case. As a security guy, I should know people are going to do weird things to be cute, subversive, or just because they can. I use and recommend Python, so here’s how to filter u200b in Python.

Filtering u200b in Python

Nothing works like an example. So here’s a snippet that filters the data in Python before passing it to a routine that makes the API call.

# loop through each element of my data
for i in data:
    # delete all instances of u200b
    i = i.replace("\u200b", "")
    call_api(key, i)

The key is line 4, which replaces all instances of u200b in the variable in question (my index variable in the loop, in this case), with nothing.
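If you’d rather filter defensively inside the program, the same idea can be wrapped in a small helper. This is a sketch, not the code I ran that day. The function name strip_invisible is hypothetical, and the decision to drop every character in Unicode’s “Cf” (format) category, which covers u200b and its zero-width cousins, is my assumption. It’s an aggressive filter, since it also removes joiners that some scripts and emoji sequences rely on, so narrow it if your data needs those.

```python
import unicodedata

def strip_invisible(text):
    """Return text without Unicode format (category Cf) characters.

    Hypothetical helper for illustration. Cf includes U+200B (zero-width
    space), U+200C/U+200D (zero-width non-joiner/joiner), and U+FEFF.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

# Sample input with three different invisible characters mixed in.
cleaned = strip_invisible("15\u200b4 flo\u200cwer\ufeff")
print(cleaned)
```

Ordinary spaces survive because they belong to a different category, so only the invisible characters are dropped.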

The same concept works in other scripting or programming languages, of course.
