Plus a tool and tips for defenders.
In this article, I will describe how Unicode — the encoding standard behind emojis and other special text characters — can be used to craft payloads for various exploits such as punycode attacks, Case Mapping Collisions and Cross-Site Scripting (XSS). I will also share steps that developers can take to mitigate such attacks.
From 🥺 to 😂, you probably send dozens of emojis every day while communicating with your family and friends. Even though emojis look like pictures, you can select, copy and paste them just like normal text. Why? That is because emojis are characters in the Unicode text encoding standard. The Unicode text encoding standard encodes and represents many types of characters: the Latin alphabet, Arabic numerals, and of course emojis.
From ASCII to Unicode
Originally, digital text was mostly encoded in ASCII, a 7-bit code that assigns a character to each number from hexadecimal 00 to 7F. For example, the letter ‘A’ is represented by the hexadecimal number 41 in ASCII. A full 8-bit byte can hold 16 * 16 = 256 values, but standard ASCII defines only the first 128; the upper half was later filled in mutually incompatible ways by various “extended ASCII” code pages. Of those 128 slots, 33 are reserved for control characters such as backspace and newline, which leaves 95 printable characters: just enough to cover the Latin alphabet, digits, punctuation and the space.
In the long run, it became impractical to rely on ASCII, as international users needed to write in different languages and other character sets. Unicode thus emerged as a new standard for encoding text. Its code space runs from U+0000 to U+10FFFF, which gives 1,114,112 possible “code points”. UTF-8, the main Unicode encoding format used by the web, represents each code point in one to four bytes. The latest Unicode 13.0 standard was released just a few months ago and contains 143,859 characters. In fact, some of these characters are yet to be supported by most browsers and operating systems. To prove my point, you can set a reminder now to check this article again in a couple of months. By then, your device might be able to render this: 🥲. According to Emojipedia, Unicode 13.0 defines code point U+1F972 as the “Smiling Face with Tear” emoji.
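To make the difference between the two standards concrete, here is a small sketch you can paste into a browser console or Node.js. It assumes a runtime that provides the standard TextEncoder global, and the helper name describeChar is mine, purely for illustration.

```javascript
// Hypothetical helper: report a character's code point and its size in UTF-8.
function describeChar(ch) {
  const codePoint = ch.codePointAt(0);
  return {
    codePoint: 'U+' + codePoint.toString(16).toUpperCase().padStart(4, '0'),
    utf8Bytes: new TextEncoder().encode(ch).length,
  };
}

console.log(describeChar('A'));   // { codePoint: 'U+0041', utf8Bytes: 1 }
console.log(describeChar('🥲'));  // { codePoint: 'U+1F972', utf8Bytes: 4 }
```

The letter ‘A’ keeps its ASCII value and fits in a single byte, while the emoji sits far outside the ASCII range and needs four.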
Example #1: Punycode Attacks
One of the earliest malicious uses of Unicode was in punycode phishing campaigns. Attackers relied on the fact that the Domain Name System (DNS), which links domain names to IP addresses, is limited to the ASCII character set, while most browsers and operating systems support Unicode. As such, when a victim enters a Unicode URL in the browser’s address bar, the URL is rendered as Unicode but the request is actually sent to the “punycode” ASCII representation of the domain name. In one example, copying and pasting “аpple.com” into your browser would have sent you to “xn--pple-43d.com” instead. This is because the above “аpple.com” uses the Cyrillic alphabet’s “а” (U+0430) rather than the Latin alphabet’s “a” (U+0061). These two characters are visually identical in most default fonts, making it hard for individuals to detect the difference. On browsers that are not secured against punycode attacks, the user could easily fall victim to phishing.
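You can reproduce the mismatch yourself. The sketch below assumes a runtime whose built-in URL class performs the punycode conversion (modern browsers and recent Node.js versions do). The function mixesLatinAndCyrillic is a hypothetical heuristic of my own, not the logic browsers actually use to decide what to display.

```javascript
// The first letter below is Cyrillic "а" (U+0430), not Latin "a" (U+0061).
const spoofed = '\u0430pple.com';
const genuine = 'apple.com';

console.log(spoofed === genuine);                     // false
console.log(new URL('https://' + spoofed).hostname);  // xn--pple-43d.com

// Hypothetical heuristic: flag hostnames that mix Latin and Cyrillic letters.
function mixesLatinAndCyrillic(hostname) {
  return /\p{Script=Latin}/u.test(hostname) &&
         /\p{Script=Cyrillic}/u.test(hostname);
}

console.log(mixesLatinAndCyrillic(spoofed));  // true
console.log(mixesLatinAndCyrillic(genuine));  // false
```

Note that a mixed-script check like this only catches the example above. A domain spelled entirely with lookalike Cyrillic letters would pass it, so treat it as one signal rather than a complete defence.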
Example #2: Case Mapping Collisions
There are other, more impactful ways to weaponise Unicode. In the article Hacking GitHub with Unicode’s dotless ‘i’, the author lays out one attack scenario exploited by security researcher John Gracey. Some Unicode characters are vulnerable to Case Mapping Collisions, which occur when two different characters are uppercased or lowercased into the same character. For example, you can open your browser console right now and enter this JavaScript code:
> "ß".toUpperCase() == "SS"
> true
Unfortunately for developers everywhere, Unicode allows certain characters like the German grapheme “ß” (pronounced “eszett”), which looks nothing like “SS”, to be upper- or lowercased into plain ASCII letters. Due to this strange behaviour, Gracey was able to hack GitHub’s password reset flow. Here’s a brief sketch of the flow:
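// Sketch only: database and sendResetEmail are hypothetical stand-ins.
const resetUser = database.findUserWithEmail(attackerInput.toLowerCase());
// Bug: the token is mailed to the raw input, not to the address on file.
if (resetUser !== null) sendResetEmail(resetUser.resetToken(), attackerInput);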
- An attacker enters an email address with Unicode characters that when lowercased, would match the victim’s email address. GitHub lowercased any user input when searching the database for a matching email — a reasonable step to allow for consistency and uniqueness checks.
- Once GitHub found a user that matched the lowercase email, it sent a password reset token email to the original, non-lowercased email address entered by the attacker. Since the attacker has access to this email, the attacker receives the victim user’s reset account link, allowing them to take over the account.
There are a few caveats here, such as the fact that the attacker requires an email host that supports Unicode characters. Nevertheless, the aforementioned example demonstrates that Unicode case mapping collisions can be exploited for some interesting attacks.
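The core of the fix can be sketched in a few lines. This is not GitHub’s actual code: the in-memory Map stands in for a real database, and usersByEmail, resetRecipient and the kiwi.example addresses are all hypothetical names chosen for illustration. The point is simply that the reset email should go to the address on file rather than to whatever the requester typed.

```javascript
// Hypothetical in-memory user store standing in for a real database.
const usersByEmail = new Map([
  ['victim@kiwi.example', { email: 'victim@kiwi.example' }],
]);

// Returns the address a reset email should go to, or null if none matches.
function resetRecipient(submittedEmail) {
  const user = usersByEmail.get(submittedEmail.toLowerCase());
  if (!user) return null;
  // Always use the stored address, never the attacker-controlled input.
  return user.email;
}

// The domain below starts with the Kelvin sign (U+212A), which lowercases
// to an ASCII "k" and therefore collides with the victim's address.
const attackerInput = 'victim@\u212Aiwi.example';

console.log(attackerInput.toLowerCase() === 'victim@kiwi.example');  // true
console.log(resetRecipient(attackerInput));  // victim@kiwi.example
```

Even though the lookup still matches the victim’s account, the token now only ever reaches the victim’s own inbox.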
Example #3: Cross-Site Scripting
Building on the previous example, case mapping collisions can also be exploited to allow attackers to bypass Web Application Firewalls (WAF) to inject XSS payloads.
Take this simple toy example of a page with the following JavaScript code:
document.location = getQueryParam('redirect').toUpperCase();
Typically, entering a query parameter of redirect=javascript:alert() would be enough to trigger this DOM XSS, but most WAFs would block such an obvious payload. If you replace javascript with JAVAſCRıPT, however, you might just be able to slip through, depending on how the WAF parses the input. When the page uppercases the value, “ſ” (U+017F) becomes “S” and “ı” (U+0131) becomes “I”, turning the scheme into plain JAVASCRIPT, which browsers treat the same as javascript because URL schemes are case-insensitive.
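Here is a minimal sketch of that bypass. The regular expression is a hypothetical stand-in for a WAF rule (real WAFs are considerably more elaborate), but it shows why a case-insensitive match alone does not help.

```javascript
// Hypothetical WAF rule: block anything containing "javascript:" in any case.
const naiveFilter = /javascript:/i;

// ſ is U+017F (Latin small letter long s); ı is U+0131 (dotless i).
const payload = 'JAVA\u017FCR\u0131PT:alert()';

console.log(naiveFilter.test('javascript:alert()'));  // true  (blocked)
console.log(naiveFilter.test(payload));               // false (slips through)
console.log(payload.toUpperCase());                   // JAVASCRIPT:ALERT()
```

Notice that the whole value is uppercased, including alert(), so a working payload for this toy page would also need a body that survives uppercasing.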
There are even more ways to abuse and exploit Unicode collisions. To help you quickly generate Unicode collisions, I have written a tool called Unicollider. Simply enter any plaintext you want to collide with, and the tool will generate an equivalent uppercase and lowercase collision.
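If you are curious how such collisions can be found, a brute-force search is enough. The sketch below is not Unicollider’s implementation, just one simple approach under my own assumptions: it scans the Basic Multilingual Plane for non-ASCII characters whose upper- or lowercase form equals a given target string. The name findCollisions is hypothetical.

```javascript
// Find non-ASCII characters whose upper- or lowercase form is exactly `target`.
function findCollisions(target) {
  const hits = [];
  for (let cp = 0x80; cp <= 0xFFFF; cp++) {
    // Skip surrogate code units, which are not characters on their own.
    if (cp >= 0xD800 && cp <= 0xDFFF) continue;
    const ch = String.fromCodePoint(cp);
    if (ch.toUpperCase() === target || ch.toLowerCase() === target) {
      hits.push(ch);
    }
  }
  return hits;
}

console.log(findCollisions('S'));   // includes "ſ" (U+017F)
console.log(findCollisions('I'));   // includes "ı" (U+0131)
console.log(findCollisions('k'));   // includes "K" (U+212A, Kelvin sign)
console.log(findCollisions('SS'));  // includes "ß" (U+00DF)
```

Because the target is a string rather than a single character, the same function also picks up one-to-many mappings such as “ß” to “SS”.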
Given the wide variety of ways attackers can use and abuse Unicode, it is not possible to recommend a catch-all solution. However, the golden rule always applies: never blindly trust user input. The GitHub bug occurred because the server sent the password reset email to the address the user typed in, rather than to the address stored for the matched account, without checking that the two were identical. For WAFs and other server-side defences, watch out for Unicode edge cases that could slip through a regex filter. Validate and sanitise input in the same form in which it will eventually be used, and do so before performing any risky operations.
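As one illustration of that advice, here is a sketch of a check for the redirect toy example. It is an assumption-laden example rather than a complete defence: isSafeRedirect and ALLOWED_PROTOCOLS are hypothetical names, and the idea is to validate the value after applying the same transformation the page applies, then compare the parsed scheme against an allowlist.

```javascript
// Hypothetical allowlist of schemes the page is willing to redirect to.
const ALLOWED_PROTOCOLS = new Set(['http:', 'https:']);

function isSafeRedirect(rawValue) {
  // Apply the same transformation the page applies before checking.
  const transformed = rawValue.toUpperCase();
  let url;
  try {
    url = new URL(transformed);
  } catch {
    return false; // not an absolute URL; reject rather than guess
  }
  // URL normalises the scheme to lowercase, so "HTTPS:" becomes "https:".
  return ALLOWED_PROTOCOLS.has(url.protocol);
}

console.log(isSafeRedirect('https://example.com/home'));      // true
console.log(isSafeRedirect('javascript:alert()'));            // false
console.log(isSafeRedirect('JAVA\u017FCR\u0131PT:alert()'));  // false
```

An allowlist applied to the final, transformed value does not care which exotic characters the attacker started with, which is why it tends to hold up better than a blocklist applied to the raw input.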
By creating countless new character sets and inputs, Unicode inadvertently widened the attack surface for malicious actors. This has led to numerous innovative exploits that bypass seemingly reasonable checks and defences. Developers should stay cognisant of the risks of handling user input and err on the side of caution.