Understanding and avoiding visually ambiguous characters in IDs

gajus | 279 points

I had this exact situation at work when they shipped millions of devices with serial numbers, and didn't leave out any letter or number. Customers had so much trouble reading them accurately, I had to make a regex script that generated every possible typ0 permutation of what the customer said, and then it would list only matches from the factory database. From there, folks would try to correlate other info like dates to figure out what their real serial number probably was. It was a nightmare. Ironically several of the digits never changed, and some were just 0 1 or 2 to represent which factory made it, so there was no need for the entire character set in the first place. They seem to have been convinced we'd produce 8 quadrillion devices.

geor9e | 11 days ago
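
The lookup geor9e describes could be sketched roughly like this: turn each character the customer read out into a class of look-alikes, then keep only the serials that actually exist. The confusion groups and the function names below are illustrative assumptions, not the original script.

```python
import re

# Hypothetical look-alike groups; a real list would depend on the label font.
CONFUSABLE_GROUPS = ["0OQD", "1IL", "2Z", "5S", "8B", "6G"]


def candidate_pattern(reported: str) -> re.Pattern:
    """Build a regex matching every serial the customer might have meant."""
    parts = []
    for ch in reported.upper():
        # Use the whole look-alike group if the character belongs to one.
        group = next((g for g in CONFUSABLE_GROUPS if ch in g), ch)
        if len(group) > 1:
            parts.append("[" + re.escape(group) + "]")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def plausible_serials(reported: str, known_serials: list[str]) -> list[str]:
    """Return the known serials that equal the reported one up to look-alikes."""
    pattern = candidate_pattern(reported)
    return [s for s in known_serials if pattern.fullmatch(s.upper())]
```

Matching against the database with one regex avoids materialising every permutation, which grows quickly with serial length.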

Encoding should also depend on the user. Base 32 (crockford & rfc 4648) has a nice unambiguous alphabet for compact representation and explanation of why. However if your users are speaking aloud you might want a word list representation, “TIDE ITCH SLOW REIN RULE MOT”, like s/key rfc 1751. DO NOT invent your own word lists; there are an infinite number of dragons lying in wait for idioms, homophones, dialects, etc. Don't be like me and unintentionally create a major incident like “wet clam butterfly.”

donavanm | 11 days ago

This brings up memories.

One day while sick, I distracted myself from being sick by writing up a silly module to do arithmetic in arbitrary bases. And, because it was easy I stuck it on CPAN. https://metacpan.org/pod/Math::Fleximal is the module.

Of all of the silly things I'd done, I would have sworn that this is the one that should never generate a support request. But it did! Why? Well I'd included a demonstration of how to turn hexadecimal into an alphanumeric code. And someone had the bright idea of using the same thing to turn long numbers into readable codes!

My module worked, but I was still a bit flabbergasted that THIS wound up in production somewhere!!

btilly | 11 days ago

The author makes a point of avoiding letters that are hard to distinguish even when spelled out in handwriting, but the example table includes the number 7. I cannot count the number of times I have found it hard to distinguish between someone's 7 and 1.

It helps if you draw a horizontal bar on the 7 but many don't, so you can never really be sure if a 7 is in fact a 1 with the serif or vice versa.

vesinisa | 11 days ago

If you use both upper and lower case, you are likely to eventually be surprised by some third party system or protocol that is case insensitive. I even found a commercial system which allowed users to choose IDs with case sensitivity (iD and id being distinct) but if you query it for one which does not exist they do case insensitive matching and return the wrong data.

When I reported this bug they said it was for convenience!

jzwinck | 11 days ago

I thought this was good, neat UX: on the Nintendo Switch I was entering a serial number for some DLC, and the on-screen keyboard had all the ambiguous character keys disabled, which means that the serial numbers are generated without any ambiguous characters.

I'm not sure if this UX was built into the OS, or just part of the game I was playing (Mario + Rabbids Sparks of Hope).

waltbosz | 10 days ago

KeepassXC (open source password manager application) uses colour to make passwords more readable. They use one color for each "class" of character: uppercase, lowercase, numbers, symbols, ...

This is an extremely simple idea, but especially with random passwords this helps a lot even if the font is already hyperlegible.

atoav | 11 days ago

So cool to read an article discussing a problem I run into on a regular basis.

Whenever I'm creating a 2FA backup on a piece of paper, anxiety hits me every time I cross over certain characters, o/0, v/u, 5/S, etc. I've come to add some fanciness to how I write these characters for this exact reason.

On "Phonetic similarity", reminds me of how I chose my wifi password. I wanted a common word with multiple consonants that a 3rd grader could spell, so I could share the password with a single phrase and have it be unambiguous. Ended up choosing "vacation".

matthewtse | 10 days ago

I love conversations like this. These are arguably not the most cutting edge or exciting topics but hold a lot of significance and power to make life easier for humans (and machines too).

Some of these are areas of best practices that, when done really well -- people may not even notice it. That's an unfortunate fact of life that comes up often -- where the attention to detail and sincerity that people bring to the table often gets lumped under "obviously it should be that way, nothing special to see or applaud here".

albert_e | 11 days ago

As long as we are pointing out mistakes in the article:

9qg6G8B2Z5SIl170O (ariel)

The name of the font is Arial, not Ariel. (No mermaids here, move along)

loloquwowndueo | 10 days ago

Other prior art is the use of a modified base 58 encoding in Bitcoin addresses.

https://en.bitcoin.it/wiki/Base58Check_encoding

nemoniac | 10 days ago

> not only to avoid visually ambiguous characters, but also to avoid spelling words in common languages.

Or you should do the opposite - use real dates/words in ID and your visual confusion almost disappears (though there is a bunch of ambiguity here as well in similar pronunciation, so also not perfect). Humans aren't robots, so they shouldn't be forced to read a meaningless list of random letters.

(example of geospatial system of coordinates based on that is what3words)

eviks | 11 days ago

This post has some overlap with work I did a while back on a "coupon code" system that is optimised for users taking a code printed on paper and entering it into a web form. A number of measures were employed to avoid/correct transcription errors.

Example, docs and links here: https://www.mclean.net.nz/cpan/couponcode/

grantmnz | 11 days ago

I wish my parents had access to this when they chose to call me Iain Dooley.

The world has almost unanimously decided my name is now Lain.

dools | 11 days ago

I'm an American living in Germany. When I first arrived, the way Germans write the digit 1 surprised me. They write it with the upper hook thing very long, almost like a capital lambda (Λ), which sometimes makes 1 and A visually ambiguous. This isn't really a problem, just something funny about moving to a new country.

NickHoff | 10 days ago

This seems slightly flawed in that it completely removes all members of a similar set rather than normalizing to a single element per similar set.

Thus after normalization, '1lI' would become '111'. This allows you to add seven characters back to the author's code generation alphabet without re-introducing any ambiguity.

shkkmo | 11 days ago
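
A minimal sketch of the normalization shkkmo suggests, assuming a made-up set of confusion groups (the article's actual groups may differ): generate IDs using only one representative per group, and fold whatever the user types onto that representative before comparing.

```python
# Hypothetical confusion sets; the first character of each is the canonical one.
CONFUSION_SETS = ["1lI", "0oO", "5S", "2Z"]

# Map every member of a set to that set's canonical character.
CANONICAL = {ch: group[0] for group in CONFUSION_SETS for ch in group}


def normalize(code: str) -> str:
    """Fold each ambiguous character onto its canonical representative."""
    return "".join(CANONICAL.get(ch, ch) for ch in code)


def same_id(a: str, b: str) -> bool:
    """Compare two codes after normalization."""
    return normalize(a) == normalize(b)
```

With these groups, `normalize("1lI")` gives `"111"`, matching the example in the comment.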

Years ago I worked support at an ISP whose usernames were 12-digit numbers. Most regular users and 1st-level support do not know the NATO phonetic alphabet. An easy trick is then to read back the number for confirmation but use another grouping of digits. Most users read 1 digit at a time, so I would read back 2. One-Two becomes twelve. If they used 2 digits, I would for ease use 3 rather than 1. This is a very easy way to do fake "checksumming" with regular people.

Tangent: All numbers started with 12, which in effect made them 10 digits. They worked together with a banking system and the bank folks thought 10 digits was not secure enough, so they complied and added 12 in front of everything.

clan | 10 days ago
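
The read-back trick clan describes amounts to regrouping the same digits. A small sketch, with a hypothetical helper name, of how a support tool might prepare the read-back:

```python
def regroup(digits: str, size: int) -> list[str]:
    """Split a digit string into chunks of `size`, ignoring spaces and dashes."""
    cleaned = "".join(ch for ch in digits if ch.isdigit())
    return [cleaned[i:i + size] for i in range(0, len(cleaned), size)]
```

If the customer reads "1 2 0 4 5 6", reading back `regroup("120456", 2)` as "twelve, zero-four, fifty-six" forces them to re-parse the number rather than just agree.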

I have realized that there is a big design space here, as I recently did a write-up of my take, Id30. 30 bits of information encoded base 32 into six chars, eg bpv3uq, zvaec2 or rfmbyz, with some handling of ambiguous chars on decoding.

https://magnushoff.com/blog/id30/

maggit | 10 days ago

Related reading, from the font designer's side: “Oh, oh, zero!” by Charles Bigelow (of Bigelow and Holmes, makers of typefaces like Lucida and Wingdings), published in TUGboat, the journal of the TeX Users Group: https://tug.org/TUGboat/tb34-2/tb107bigelow-zero.pdf

(There's also a “footnote” by Donald Knuth: https://www.tug.org/TUGboat/tb35-3/tb111knut-zero.pdf, and follow-up by Bigelow: https://tug.org/TUGboat/tb36-3/tb114bigelow.pdf)

svat | 10 days ago

An alternative would be to print IDs using https://en.wikipedia.org/wiki/FE-Schrift, which was specifically designed to make normally similar characters look different.

jeroen | 10 days ago

> In some cases, you might also want to avoid characters that sound similar when spoken. For example, b and p can sound similar when spoken out loud. This can be especially important in situations where IDs are communicated verbally.

In many cases these kinds of IDs are just an encoding of a ground-truth that is a big integer or a sequence of bytes, and that means we don't have to use ASCII-character granularity, we can also use words.

True, that creates a certain cultural bias for wherever you get the words from, but it opens up new possibilities for error correction and detection, both by the computer and also by the humans transcribing things.

Terr_ | 11 days ago

On Linux you can use Theodore Ts'o's pwgen tool with the -B arg.

-B, --ambiguous Don't use characters that could be confused by the user when printed, such as 'l' and '1', or '0' or 'O'. This reduces the number of possible passwords significantly, and as such reduces the quality of the passwords. It may be useful for users who have bad vision, but in general use of this option is not recommended.

    >pwgen -B 32
    oos9upoVieghuew7aeb3iev3jiequeiw
    acohthahpie7ae4aeboshahWiengieth
    yahW3qua3atheeP9jo4aiY3zeepoosh3
    Noh4ooth4ohzeec4zug3ephoo7meich7
    oozae9Eireix4Chaiboz9dofie4Xunof
    Mohj3uupee9ahngahh9on9sujee9ehae
    weimah9aiXeis3owaexei4uh3ibeecai
    PaeV7eeChaezahruNgeequoh7zok7thi
    eeJieyah4exiephaiPootei4dokoojoh
    fohhah3Eec3bah7aeR9iedah7Ve3ea7o
    vahs4eich4pheisoug9aiR3ohChoh7Ch
    eth9KaeLahdie7ahy9ohCiebohphuse9
    ieye3udumaengai9ies7kae4geeque9T
    iesoh9eosohthoongaeroo4ehiishohY
    mee4ohjei4ohmika3taijei3Yaixosei
    ohWoo4eapid7miebee9pooKai3oofeis
    Eechook9quohp7se7ees9thaefahb9an
    aht3quooV4eiph9ap7aiw4wee7oi7eij
    ishep3weeh7Eero9ohdohth9MietooJ4
    Kai9aich9Jee9Angeihee9eehei9esie
    toonaix4xe3Moob3zaic3Eesahs9ahy3
    gaey9doozee7sei9quuPae3vohph4Huo
    ouYaephahcog3peiw7iecoo7eetheeph
    eeNgiezae7oongi7uena7eenaezuT7co
    tai9vuace9eV7Paih7ieN3Ahghiegh3v
    VaeteeMoobeixai9ingeyahYuzaipaht
    eeng7vei7pho4Ahpoa4kahgheethahz7
    phas4theiThu4uqu7iCh3Aepha3shae3
    ieRep3kaideeHeekiNgequieng9raeYo
    eegahsh9aizooshee9too9oojiox4Lei
    ovohcaePahM9thaebajuChoo3pipheej
    oowaimeiWahf4Neighoo3Eeyah3uvi4v
    vi4choiThei3eisohw4iP9huehohs4oe
    ukuchiethaquax3hieChouMahpooy4ee
    aegheeyeemeNeevehud9ohng3dai4jai
    eth3iedah9Tee3wohneisoo4aicuToos
    iecap7EeJ7raixiuseesiNou9ooT9fie
    ied3ooveingu7fu7dahdaaYe9tai7ien
    eijee7iKighaingaiChei7giemu4chi3
    Thie3faih3ahshooRunohwoaghoh4Aev

TwoNineFive | 8 days ago

Also avoid lowercase rn which can be mistaken for m.

And avoiding vowels can help avoid offensive words within a generated code:

FUKFUK9 - https://www.replacements.com/china-fukagawa-fuk9/c/27446

KUNT1 - https://id.made-in-china.com/co_gzberlin/product_Power-Steer...

base32 removes the I,O,U but other words with A,E need to be avoided too - no vowels helps avoid words in English.

robocat | 11 days ago

I'm a fan of z-base-32 for this.

https://philzimmermann.com/docs/human-oriented-base-32-encod...

Command line tool at https://github.com/tv42/zbase32

    $ echo hello, world | zbase32-encode
    pb1sa5dxfoo8q551pt1yw

    $ entropy 16 | zbase32-encode
    y64s31aq6cgjoko9fwbuasf4ce

yencabulator | 10 days ago

Doesn't help when you have to match the person's name and they have these characters in them. My name contains the letter "o" and I once had a lot of trouble getting something done at the bank. Multiple staff had to crowd around the computer to figure it out. Eventually somebody discovered that when I had opened my account, that o had been entered as a 0 for some reason and the font they were using, also for printing, showed them looking almost identical.

EnigmaFlare | 10 days ago

"visually unambiguous dictionary" to the author. It's well known that some people have a hard time distinguishing p/b/d/q.

bdjsiqoocwk | 10 days ago

The Latin/English alphabet is common but not universal. I believe this challenge is why TOTP codes use Arabic numerals. The user's keyboard can type these reasonably. Spoken is always a challenge. Even an English speaking audience will pronounce "0" as zero, oh, or zed.

8organicbits | 10 days ago

In handwriting there is a difference between European and American. In Europe we don't really have a problem with 1 vs 7 or g vs 9. But our nines and ones do look like gs and sevens to Americans.

I heard an American making a joke that

"I have gg problems but European handwriting ain't 7 of them."

kuboble | 10 days ago

A few years ago, I created a system that generates a serial number from a prefix and a 32-bit unsigned integer and fixes up this kind of input error when passing the serial.

https://github.com/pallas/gubbins

ThePallas | 11 days ago

I came up with base24[1] for this. There are some letters that can be ambiguous but I kept them to make it case insensitive.

[1]: https://www.kuon.ch/post/2020-02-27-base24/

kuon | 11 days ago

See also Douglas Crockford's Base 32: https://www.crockford.com/base32.html

This takes the approach of allowing ambiguous characters by decoding them to the same value, and also considers the problem of accidental obscenities.

re | 11 days ago
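
A rough sketch of the approach re mentions, where look-alike input characters decode to the same value. The alphabet and the O/I/L aliases follow Crockford's published description; the function names are mine, and the optional check symbol from the spec is left out to keep this short.

```python
CROCKFORD_ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"  # no I, L, O, U

# Decoding table: canonical symbols plus the look-alike aliases.
_DECODE = {ch: i for i, ch in enumerate(CROCKFORD_ALPHABET)}
_DECODE.update({"O": 0, "I": 1, "L": 1})


def encode(n: int) -> str:
    """Encode a non-negative integer using the canonical symbols only."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    if n == 0:
        return "0"
    out = []
    while n:
        n, remainder = divmod(n, 32)
        out.append(CROCKFORD_ALPHABET[remainder])
    return "".join(reversed(out))


def decode(text: str) -> int:
    """Decode case-insensitively, ignoring hyphens and accepting aliases."""
    n = 0
    for ch in text.upper().replace("-", ""):
        if ch not in _DECODE:
            raise ValueError(f"invalid symbol: {ch!r}")
        n = n * 32 + _DECODE[ch]
    return n
```

Because the encoder never emits I, L, O or U, a user who types `il` or `IL` instead of `11` still lands on the same number.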

UuidExtensions[1], a C# library, has a way of generating / encoding IDs that has several useful properties:

1. IDs can be generated anywhere (client-side, server-side, etc.) and are still unique
2. IDs are ordered by time
3. IDs don't use L and O because those can be confused for other characters

I've found it very handy in my travels.

[1] https://github.com/stevesimmons/uuid7-csharp?tab=readme-ov-f...

octopoc | 9 days ago

Modern bitcoin addresses use a base-32 character set that leaves out some of the most ambiguous pairs and also permutes the address ordering so that the most visually similar remaining characters produce single-bit errors, which are better handled by the address's error-detecting (and potentially correcting) code.

https://github.com/bitcoin/bips/blob/master/bip-0173.mediawi...

nullc | 10 days ago

Recently I came up with something similar: https://gist.github.com/ceving/cb68c8f2392255c5ed4ea65a6a199...

But I use an alphabet with 32 characters: abcdefghikmnopqrstuvwxyz23456789

I prefer 32 characters, because that makes it possible to pack 5 random bytes into a token with 8 characters.

ceving | 8 days ago
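
The arithmetic behind ceving's point: 5 bytes are 40 bits, and 32 symbols carry 5 bits each, so 8 characters hold them exactly with no padding. A sketch using the alphabet quoted above; the bit order and function names are my assumptions and may not match the linked gist.

```python
import secrets

ALPHABET = "abcdefghikmnopqrstuvwxyz23456789"  # 32 symbols, as quoted in the comment


def encode_5_bytes(data: bytes) -> str:
    """Map exactly 5 bytes (40 bits) onto 8 symbols of 5 bits each."""
    if len(data) != 5:
        raise ValueError("expected exactly 5 bytes")
    n = int.from_bytes(data, "big")
    # Take eight 5-bit groups, most significant group first.
    return "".join(ALPHABET[(n >> shift) & 0x1F] for shift in range(35, -1, -5))


def new_token() -> str:
    """Generate a random 8-character token from 5 random bytes."""
    return encode_5_bytes(secrets.token_bytes(5))
```

With a 31- or 33-symbol alphabet the bits would not divide evenly, which is one reason power-of-two alphabets are convenient.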

“Oh By”[1], The universal shortener, has had protections for this built in from the very beginning.

Since the whole point is the ability to convey a message in the physical world with chalk or pencil or whatever – we needed to make sure that characters were unambiguous.

So there are no zeros or ‘o’ characters or ones or ‘l’ characters… I think there were one or two other rules that govern this but I can’t think of them right now…

[1] https://0x.co

rsync | 10 days ago

Honestly, stuff like this is why I stick with (case-insensitive) hexadecimal for user-facing IDs. I find hex to be the sweet spot between "decently sized alphabet to keep ID lengths down" and "easy to read, communicate, and enter manually". It's also fairly resistant to accidentally generating IDs which will offend your users (unless your users are 1337-speaking time-traveling pre-teens from 2002 who are going to snicker at "b00b5"), which is a nice perk.

kibwen | 11 days ago

Also do not use the same character repeated in a "long" sequence. I hate this with IBANs. Too often there's something like '000000' right in the middle of an IBAN and in case copy and paste is not possible I end up counting the number of zeroes at least thrice. Groups of four characters separated by spaces would help in this case but that's another topic.

junga | 10 days ago

I did my PhD on (malicious) visual impersonation of domain names using many of the techniques described here. There are many references to other visual doppelganger techniques included in my paper here: https://par.nsf.gov/servlets/purl/10256904

My research focused solely on the .com domain name space, so our character set was limited.

geoffreysimpson | 10 days ago

An approach we are trying is speakable IDs. Three characters for the type of thing, then four random words from a list of clean words with 5 characters:

xxx_flown-moons-deary-flake

bckr | 11 days ago

> I would be wary of excluding characters just because they look like other characters when combined

I wish the author would have said more about this. Why be wary?

criddell | 10 days ago

This is why I only ever use xterm with the default bitmap font, it's literally the only one where I'm absolutely sure which character is which.

dusted | 10 days ago

Telephone equipment avoids the letters i and o in the alphabetical designation sequence for this reason, they look like numerals 1 and 0.

myself248 | 11 days ago

And TFA doesn't even mention Unicode, scripts, ASCII, Latin, nothing. As you can imagine it all gets much worse with Unicode (though through no fault of the Unicode Consortium). See Unicode TR#39 [0].

  [0] https://unicode.org/reports/tr39/

cryptonector | 10 days ago

It would be helpful to also add a screenshot for that font overview, because: https://imgur.com/a/h7Ks1Qj

And even on systems which do have these fonts, they may not always be exactly the same.

arp242 | 11 days ago

> However, as the number of members in the set increases, the number of possible IDs increases exponentially. Case-sensitive: 53^8 = 62,259,690,411,361 Case-insensitive: 22^8 = 54,875,873,536

Nitpick, but isn't this polynomial in the number of members of the set?

benaubin | 10 days ago
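
benaubin's nitpick can be checked numerically. For a fixed ID length n, the count k^n is a degree-n polynomial in the alphabet size k; it is exponential only in the length. A short sketch (the helper name is mine, the 53 and 22 figures are the ones quoted from the article):

```python
def id_space(alphabet_size: int, length: int) -> int:
    """Number of distinct IDs of a fixed length over an alphabet."""
    return alphabet_size ** length


# Figures quoted from the article.
case_sensitive = id_space(53, 8)
case_insensitive = id_space(22, 8)

# Growing the alphabet at fixed length: doubling k multiplies the space by 2**8.
alphabet_growth = id_space(44, 8) // id_space(22, 8)  # 256

# Growing the length at fixed alphabet: each extra character multiplies by k.
length_growth = id_space(22, 9) // id_space(22, 8)  # 22
```

So adding characters to the alphabet helps, but adding one more position to the ID usually helps more.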

>Avoiding Confusion With Alphanumeric Characters

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3541865/

croes | 10 days ago

I suppose the first line of defense is a QR code URL. I don't think anyone really enjoys typing long codes.

After that there's ECC. A few extra bytes for a reed-solomon code will fix a lot of issues.

eternityforest | 10 days ago

Letters l and I are visually indistinguishable when written in Arial.

p0w3n3d | 10 days ago

We could always use 1s and 0s, maybe group them in eights. Tongue in cheek, but I guess that would be a valid (even if extreme) solution.

thih9 | 10 days ago

How come neither v nor u are in the final set?

They’re not even mentioned and don’t look like anything else, except maybe each other in some typefaces.

jonplackett | 11 days ago

> When it matters?

This applies to usernames too! It's easy to phish if platforms render capital I and lowercase l the same

jadengeller | 11 days ago

If we include handwriting then lowercase n and u can be hard to distinguish if written in cursive

croes | 10 days ago

A friend told me about how his work had some senior IT mgrs, who'd clearly been playing with their iPhones too long, decide that the firm shouldn't use Ids at all any more, and started pushing this without consulting the business, even though it was totally inappropriate given how widely they were needed... Caused mayhem and needless arguments!

nmstoker | 10 days ago

My OCD approves of this idea. Let’s also add, IDs cannot start with 0 or O.

iblaine | 11 days ago

just use numbers and crossbar your 7s - problem gone.

if someone's writing is incompetent tell them. if you can't then they ruined it for themselves by being shit at writing the number 7.

jheriko | 10 days ago

my work id has a 0 and a O in it and it drives me crazy. i only remember it due to muscle memory on the keyboard

denimnerd42 | 10 days ago

Another confusing thing is doing this:

    xxxxx-xxxxx-xxxxx-xxxxx

Instead of something like this:

    xxxxx-xx-xxxxx-xxx-xxxxx

Something could also be said about such a scheme lacking an embedded checksum.

Here's an IBAN (bank account number) in the EU (which thankfully uses a checksum as part of the account number):

    LU29 0022 1712 5582 7000
      ^^
      ||
      two checkdigits

Also some companies think they're "smart" because they pick numbers like this:

    LU29 002 0000 0001 8000

Repeating the same digit, usually a zero, a shitload of times ain't smart. It's fucking dumb.

TacticalCoder | 10 days ago
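
For anyone curious how the two check digits TacticalCoder points at are verified, here is a minimal sketch of the standard mod-97 check. It only validates the checksum; it does not check country-specific lengths or bank codes, and the function name is mine.

```python
def iban_checksum_ok(iban: str) -> bool:
    """Return True if the IBAN passes the mod-97 check-digit test."""
    compact = iban.replace(" ", "").upper()
    if len(compact) < 5 or not (compact.isascii() and compact.isalnum()):
        return False
    # Move the country code and check digits to the end.
    rearranged = compact[4:] + compact[:4]
    # Letters become two-digit numbers: A=10, B=11, ..., Z=35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1
```

The example IBAN in the comment passes this check, while changing its last digit makes it fail, which is exactly the kind of transcription slip the check digits exist to catch.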

I guess I better stop using Bozos_Gismos

jgbmlg | 11 days ago

Four quick thoughts:

- We haven't solved this already? Who hasn't tried to read some code and couldn't tell O from 0 or l from 1, etc.?

- Aside from ambiguous characters you have to be aware of spelling and leet spelling. e.g., 53X, S3X, 5EX, etc.

- FFS stop with the 10+ character strings without spaces or hyphens. There's no reason for that.

- Not everyone has perfect vision. Ambiguous characters *and* less than perfect vision (often with no spaces / hyphens) is a mortal UX sin.

We've all been on the wrong end of these, and yet they are common enough - in 2024??!!? - that they need to be mentioned here.

chiefalchemist | 10 days ago

cl looks like d in some fonts or with bad kerning

branon | 10 days ago

Out of curiosity, does anyone know why this post would be removed from the front page?

I was excited to see that the post is getting engagement. I saw it in 3rd position. Then I checked an hour later and it was nowhere to be seen.

I am assuming this is some sort of opportunistic algorithm at play that gives a chance to a post, but removes it if it is not performing, but curious if anyone has more details.

gajus | 11 days ago