layout | title | permalink |
---|---|---|
default |
Character Transformations |
character-transformations/ |
This section attempts to explore the various ways in which characters and strings can be transformed by software processes. Such transformations are not vulnerabilities necessarily, but could be exploited by clever attackers.
As an example, consider an attacker trying to inject script (i.e. cross-site scripting, or XSS attack) into a Web-application which utilizes a defensive input filter. The attacker finds that the application performs a lowercase operation on the input after filtering, and by injecting special characters they can exploit that behavior. That is, the string "script" is prevented by the filter, but the string "scrİpt" is allowed.
- Round Trip
- Best Fit Mappings
- Charset Transcoding and Character Mappings
- Normalization
- Canonicalization of Non-Shortest Form UTF-8
- Over-Consumption
- Handling the Unexpected
- Upper and Lower Casing
- Buffer Overflows
- Controlling Syntax
- Charset Mismatch
In practice, globalized software must be capable of handling many different character sets, and converting data between them. The process for supporting this requirement can generally look like the following:
- Accept data from any character set, e.g. Unicode, shift_jis, ISO-8859-1.
- Transform, or convert, data to Unicode for processing and storage.
- Transform data to original or other character set for output and display.
In this pattern, Unicode is used as the broker. With support for such a large character repertoire, Unicode will often have a character mapping for both sides of this transaction. To illustrate this, consider the following Web application transaction.
- An application end-user inputs their full name using characters encoded from the shift_jis character set.
- Before storing in the database, the application converts the user-input to Unicode's UTF-8 format.
- When visiting the Web page, the user's full name will be returned in UTF-8 format, unless other conditions cause the data to be returned in a different encoding. Such conditions may be based on the Web application's configuration or the user's Web browser language and encoding settings. Under these types of conditions, the Web application will convert the data to the requested encoding.
The round-trip conversions illustrated here can lead to numerous issues that will be further discussed. While it serves as a good example, this isn't the only case where such issues can arise.
The "best-fit" phenomena occurs when a character X gets transformed to an entirely different character Y. This can occur for reasons such as:
- A framework API transforms input to a different character encoding by default.
- Data is marshalled from a wide string type (multi-byte character representation) such as UTF-16, to a non-wide string (single-byte character representation) such as US-ASCII.
- Character X in the source encoding doesn't exist in the destination encoding, so the software attempts to find a 'best-fit' match.
In general, best-fit mappings occur when characters are transcoded between Unicode and another encoding. It's often the case that the source encoding is Unicode and the destination is another charset such as shift_jis, however, it could happen in reverse as well. Best-fit mappings are different than character set transcoding which is discussed in another section of this guide.
Software vulnerabilities may arise when best-fit mappings occur. To name a few:
- Best-fit mappings are often not reversible, so data is irrevocably lost. For example, a common best-fit operation would transform a U+FF1C FULLWIDTH LESS-THAN SIGN < to the U+003C LESS-THAN SIGN, or the ASCII < used in HTML. Once converted down to the ASCII <, there’s no reliable way to convert back to the FULLWIDTH source.
- Characters can be manipulated to bypass string handling filters, such as cross-site scripting (XSS) filters, WAF's, and IDS devices.
- Characters can be manipulated to abuse logic in software. Such as when the characters can be used to access files on the file system. In this case, a best-fit mapping to characters such as ../ or file:// could be damaging.
For example, consider a Web-application that’s implemented a filter to prevent XSS (cross-site scripting) attacks. The filter attempts to block most dangerous characters, and operates at an outermost layer in the application. The implementation might look like:
- An input validation filter rejects characters such as <, >, ', and " in a Web-application accepting UTF-8 encoded text.
- An attacker sends in a U+FF1C FULLWIDTH LESS-THAN SIGN < in place of the ASCII <.
- The attacker’s input looks like: <script>
- After passing through the XSS filter unchanged, the input moves deeper into the application.
- Another API, perhaps at the data access layer, is configured to use a different character set such as windows-1252.
- On receiving the input, a data access layer converts the multi-byte UTF-8 text to the single-byte windows-1252 code page, forcing a best-fit conversion to the dangerous characters the original XSS filter was trying to block.
- The attacker’s input successfully persists to the database.
Shawn Steele describes the security issues well on his blog, it's a highly recommended short read for the level of coverage he provides regarding Microsoft's API's:
Best Fit in WideCharToMultiByte and System.Text.Encoding Should be Avoided. Windows and the .Net Framework have the concept of "best-fit" behavior for code pages and encodings. Best fit can be interesting, but often its not a good idea. In WideCharToMultiByte() this behavior is controlled by a WC_NO_BEST_FIT_CHARS flag. In .Net you can use the EncoderFallback to control whether or not to get Best Fit behavior. Unfortunately in both cases best fit is the default behavior. In Microsoft .Net 2.0 best fit is also slower.
As a software engineer, it's important to understand the API's being used directly, and in some cases indirectly (by other processing on the stack). The following table of common library API's lists known behaviors:
Library | API | Best-fit default | Can override | Guidance |
.NET 2.0 | System.Text.Encoding | Yes | Yes | Specify EncoderReplacementFallback in the Encoding constructor. |
.NET 3.0 | System.Text.Encoding | Yes | Yes | Specify EncoderReplacementFallback in the Encoding constructor. |
.NET 3.0 | DllImport | Yes | Yes | To properly and more safely deal with this, you can use the MarshallAsAttribute class to specify a LPWStr type instead of a LPStr. [MarshalAs(UnmanagedType.LPWStr)] |
Win32 | WideCharToMultiByte | Yes | Yes | Set the WC_NO_BEST_FIT_CHARS flag. |
Java | TBD | ... | ||
ICU | TBD | ... |
Another important note Shawn Steel tells us on his blog is that Microsoft does not intend to maintain the best-fit mappings. For these and other security reasons it's a good idea to avoid best-fit type of behavior.
The following table lists test cases to run from a black-box, external perspective. By interpreting the output/rendered data, a tester can determine if a best-fit conversion may be happening. Note that the mapping tables for best-fit conversions are numerous and large, leading to a nearly insurmountable number of permutations. To top it off, the best-fit behavior varies between vendors, making for an inconsistent playing field that does not lend well to automation. For this reason, focus here will be on data that is known to either normalize or best-fit. The table below is not comprehensive by any means, and is only being provided with the understanding that something is better than nothing.
Target char | Target code point | Test code point | Name |
o | \u006F | \u2134 | SCRIPT SMALL O |
o | \u006F | \u014D | LATIN SMALL LETTER O WITH MACRON |
s | \u0073 | \u017F | LATIN SMALL LETTER LONG S |
I | \u0049 | \u0131 | LATIN SMALL LETTER DOTLESS I |
i | \u0069 | \u0129 | LATIN SMALL LETTER I WITH TILDE |
K | \u004B | \u212A | KELVIN SIGN |
k | \u006B | \u0137 | LATIN SMALL LETTER K WITH CEDILLA |
A | \u0041 | \uFF21 | FULLWIDTH LATIN CAPITAL LETTER A |
a | \u0061 | \u03B1 | GREEK SMALL LETTER ALPHA |
" | \u0022 | \u02BA | MODIFIER LETTER DOUBLE PRIME |
" | \u0022 | \u030E | COMBINING DOUBLE VERTICAL LINE ABOVE |
" | \u0027 | \uFF02 | FULLWIDTH QUOTATION MARK |
' | \u0027 | \u02B9 | MODIFIER LETTER PRIME |
' | \u0027 | \u030D | COMBINING VERTICAL LINE ABOVE |
' | \u0027 | \uFF07 | FULLWIDTH APOSTROPHE |
< | \u003C | \uFF1C | FULLWIDTH LESS-THAN SIGN |
< | \u003C | \uFE64 | SMALL LESS-THAN SIGN |
< | \u003C | \u2329 | LEFT-POINTING ANGLE BRACKET |
< | \u003C | \u3008 | LEFT ANGLE BRACKET |
< | \u003C | \u00AB | LEFT-POINTING DOUBLE ANGLE QUOTATION MARK |
> | \u003E | \u00BB | RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK |
> | \u003E | \u3009 | RIGHT ANGLE BRACKET |
> | \u003E | \u232A | RIGHT-POINTING ANGLE BRACKET |
> | \u003E | \uFE65 | SMALL GREATER-THAN SIGN |
> | \u003E | \uFF1E | FULLWIDTH GREATER-THAN SIGN |
: | \u003A | \u2236 | RATIO |
: | \u003A | \u0589 | ARMENIAN FULL STOP |
: | \u003A | \uFE13 | PRESENTATION FORM FOR VERTICAL COLON |
: | \u003A | \uFE55 | SMALL COLON |
: | \u003A | \uFF1A | FULLWIDTH COLON |
These test cases are largely derived from the public best-fit mappings provided by the Unicode Consortium. These are provided to software vendors but do not necessarily they were implemented as documented. In fact, any software vendor such as Microsoft, IBM, Oracle, can implement these mappings as they desire.
Sometimes characters and strings are transcoded from a source character set into a destination character set. On the surface this phenomena may seem similar to best-fit mappings, but the process is quite different. In general, when software transcodes data from source charset X to destination charset Y, it follows either a data-driven mapping table or an algorithmic formula.
For the most part this process is data-driven. While these tables are standardized somewhere there may be differences between vendors. ICU maintains a list of its character set mapping tables online. Also, ICU's Converter Explorer tool lets you browse the maintained charset mapping tables.
Data may be transcoded directly from a source charset to a destination charset, however it's also common to use Unicode as the broker. In the latter case the software will first transcode the source charset to Unicode, and from there to the destination charset. Some vendors such as Microsoft are known to leverage the Private Use Area (PUA) when transcoding to Unicode, when a direct mapping cannot be found or when a source byte sequence is invalid or illegal. It's important to be aware of a few pitfalls during the transcoding process.
- When data is transcoded to the PUA, converting it again from the PUA may have unexpected consequences.
- Data can change length, particularly if transcoding to/from a single-byte charset leads to a multi-byte character in the other charset.
As a software engineer building a mechanism for transcoding data between charsets, it's important to understand these pitfalls and handle these unexpected cases gracefully.
Software vulnerabilities can arise through charset transcodings. To name a few:
- Transcoding data is not always reversible, so data can be irrevocably lost.
- Characters can be manipulated to bypass string handling filters, such as cross-site scripting (XSS) filters, WAF's, and IDS devices.
- Characters can be manipulated to abuse logic in software. For example, characters transcoded into ../ or file:// would prove detrimental in file handling operations.
In Unicode, Normalization of characters and strings follows a specification defined in the Unicode Standard Annex #15: Unicode Normalization Forms. The details of Normalization are not for the faint of heart and will not be discussed in this guide. For engineers and testers, it's at least important to understand that there are four Normalization forms defined:
- NFC - Canonical Decomposition
- NFD - Canonical Decomposition, followed by Canonical Composition
- NFKC - Compatibility Decomposition
- NFKD - Compatibility Decomposition,followed by Canonical Composition
When testing for security vulnerabilities, we're often most interested in the compatibility decomposition forms (NFKC, NFKD), but occassionally the canonical decomposition forms will produce interesting transformations as well. Cases where characters, and sequences of characters, transform into something different than the original source, might be used to bypass filters or produce other exploits. Consider the following image, which depicts the result of normalizing with either NFKC or NFKD for the character U+FE64 SMALL LESS-THAN SIGN.
In the above example, the character U+FE64 will transform into U+003C, which might lead to security vulnerability in HTML applications. Consider the next example which shows the result of either NFD or NFKD decomposition applied to the "Turkish I" character U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE.
As a software engineer, it becomes evident that Unicode normalization plays an important role, and that it is not always an explicit choice. Often times normalization is applied implicitly by the underlying framework, platform, or Web browser. It's important to understand the API's being used directly, and in some cases indirectly (by other processing on the stack).
The following table of common library API's lists known behaviors:
Library | API | Default | Can override | Notes |
.NET | System.Text.Encoding | NFC | Yes | |
Win32 | ||||
Java | ||||
ICU | ||||
Ruby | ||||
Python | ||||
PHP | ||||
Perl | ||||
JavaScript | String.prototype.normalize() | NFC | Yes | MDN |
The following table captures how Web browsers normalize URLs. Differences in normalization and character transformations can lead to incompatibility as well as security vulnerability. source
Description | MSIE 9 | FF 5.0 | Chrome 12 | Safari 5 | Opera 11.5 |
Applies normalization in the path | No | No | No | Yes - NFC | No |
Applies normalization in the query | No | No | No | Yes - NFC | No |
Applies normalization in the fragment | No | No | Yes - NFC | Yes - NFC | No |
The following table lists test cases to run from a black-box, external perspective. By interpreting the output/rendered data, a tester can determine if a normalization transformation may be happening.
Target char | Target code point | Test code point | Name |
o | \u006F | \u2134 | SCRIPT SMALL O |
s | \u0073 | \u017F | LATIN SMALL LETTER LONG S |
K | \u004B | \u212A | KELVIN SIGN |
A | \u0041 | \uFF21 | FULLWIDTH LATIN CAPITAL LETTER A |
" | \u0027 | \uFF02 | FULLWIDTH QUOTATION MARK |
' | \u0027 | \uFF07 | FULLWIDTH APOSTROPHE |
< | \u003C | \uFF1C | FULLWIDTH LESS-THAN SIGN |
< | \u003C | \uFE64 | SMALL LESS-THAN SIGN |
> | \u003E | \uFE65 | SMALL GREATER-THAN SIGN |
> | \u003E | \uFF1E | FULLWIDTH GREATER-THAN SIGN |
: | \u003A | \uFE13 | PRESENTATION FORM FOR VERTICAL COLON |
: | \u003A | \uFE55 | SMALL COLON |
: | \u003A | \uFF1A | FULLWIDTH COLON |
TODO If you've determined that input is being normalized but need different characters to exploit the logic, you may use the accompanying test case database.
The UTF-8 encoding algorithm allows for a single code point to be represented in multiple ways. That is, while the Latin letter 'A' is normally represented using the byte 0x41 in UTF-8, it's non-shortest form, or overlong, encoding would be any of the following:
- 0xC1 0x81
- 0xE0 0x81 0x81
- 0xF0 0x80 0x81 0x81
- etc...
Earlier versions of the Unicode Standard applied Postel's law, or, the robustness principle of 'be conservative in what you do, be liberal in what you accept from others.' While the 'generation' of non-shortest form UTF-8 was forbidden, the 'interpretation' of was allowed. That changed with Unicode Standard version 3.0, when the requirement changed to prohibit both interpretation and generation. In fact, both the 'generation' and 'interpretation' of non-shortest form UTF-8 are currently prohibited by the standard, with one exception - that 'interpretation' only applies to the Basic Multilingual Plane (BMP) code points between U+0000 and U+FFFF. In terms of the common security vulnerabilities discussed in this document, that exception has no bearing, as the ASCII range of characters are not exempt.
Given the history of security vulnerabilities around overlong UTF-8, many frameworks have defaulted to a more secure position of disallowing these forms to be both generated and interpreted. However, it seems that some software still interprets non-shortest form UTF-8 for BMP characters, including ASCII. A common pattern in software follows:
Process A performs security checks, but does not check for non-shortest forms.
Process B accepts the byte sequence from process A, and transforms it into UTF-16 while interpreting non-shortest forms.
The UTF-16 text may then contain characters that should have been filtered out by process A. source
The overlong form of UTF-8 byte sequences is currently considered an illegal byte sequence. It's therefore a good test case to attempt in software such as Web applications, browsers, and databases.
Some notes about canonicalization and UTF-8 encoded data.
- The ASCII range (0x00 to 0x7F) is preserved in UTF-8.
- UTF-8 can encode any Unicode character U+000000 through U+10FFFF using any number of bytes, thus leading to the non-shortest form problem.
- The Unicode standard (3.0 and later) requires that a code point be serializd in UTF-8 using a byte sequence of one to four bytes in length. The Corrigendum #1: UTF-8 Shortest Form introduced this conformance requirement.
Non-shortest form UTF-8 has been the vector for critical vulnerabilities in the past. From the Microsoft IIS 4.0 and 5.0 directory traversal vulnerability of 2000, which was rediscovered in the product's WebDAV component in 2009.
Some of the common security vulnerabilities that use non-shortest form UTF-8 as an attack vector include:
- Directory/folder traversal.
- Bypassing folder and file access filters.
- Bypassing HTML and XSS filters.
- Bypassing WAF and NID's type devices.
As a developer trying to protect against this, it becomes important to understand the API's being used directly, and in some cases indirectly (by other processing on the stack). The following table of common library API's lists known behaviors:
Library | API | Allows non-shortest UTF8 | Can override | Notes |
.NET 2.0 | System.Text.Encoding | |||
.NET 3.0 | System.Text.Encoding | |||
ICU | System.Text.Encoding |
As a tester/bug hunter looking for the vulnerabilities, the following table lists test cases to run from a black-box, external perspective. The data in this table presents the first few non-shortest forms (NSF) UTF-8 as URL encoded data %NN. If you need raw bytes instead, these same hex values apply. All of the target chars in the first column are ASCII.
Target | NSF 1 | NSF 2 | NSF 3 | Notes | |
A | %C1%81 | %E0%81%81 | %F0%80%81%81 | Latin A useful as a base test case. | |
" | %C0%A2 | %E0%80%A2 | %F0%80%80%A2 | Double quote | |
' | %C0%A7 | %E0%80%A7 | %F0%80%80%A7 | Single quote | |
< | %C0%BC | %E0%80%BC | %F0%80%80%BC | Less-than sign | |
> | %C0%BE | %E0%80%BE | %F0%80%80%BE | Greater-than sign | |
. | %C0%AE | %E0%80%AE | %F0%80%80%AE | Full stop | |
/ | %C0%AF | %E0%80%AF | %F0%80%80%AF | Solidus | |
\ | %C1%9C | %E0%81%9C | %F0%80%81%9C | Reverse solidus |
The Unicode Transformation Formats (e.g. UTF-8 and UTF-16) serialize code points into legal, or well-formed, byte sequences, also called code units. For example, consider the following code points and their corresponding well-formed code units in UTF-8 format.
Code point | Description | UTF-8 byte sequence |
U+0041 | LATIN CAPITAL LETTER A | 0x41 |
U+FF21 | FULLWIDTH LATIN CAPITAL LETTER A | 0xEC 0xBC 0xA1 |
U+00C0 | LATIN CAPITAL LETTER A WITH GRAVE | 0xC3 0x80 |
And following are the same code points in their corresponding well-formed UTF-16 (little endian) format.
Code point | Description | UTF-16LE byte sequence |
U+0041 | LATIN CAPITAL LETTER A | 0x00 0x41 |
U+FF21 | FULLWIDTH LATIN CAPITAL LETTER A | 0xFF 0x21 |
U+00C0 | LATIN CAPITAL LETTER A WITH GRAVE | 0x00 0xC0 |
Consider a UTF-8 decoder consuming a stream of data from a file. It encounters a well-formed byte sequence like:
<41 C3 80 41>
This sequence is made up of three well-formed sub-sequences. First is the <41>, second is the <C3 80>, and third is the <41>. The second subsequence <C3 80> is a two-byte sequence. The lead byte C3 indicates a two-byte sequence, and the trailing byte 80 is a valid trailing byte. The table below indicates these relationahips. Now consider that the UTF-8 decoder encounters an ill-formed byte sequence:
<41 C2 C3 80 41>
Taken apart, there are three minimally well-formed subsequences <41>, <C3 80>, and <41>. However, the <C2> is ill-formed because it doesn't have a valid trailing byte, which would be required per the table below.
Code point | First byte | Second byte | Third byte | Fourth byte |
U+0000..U+007F | 00..7F | |||
U+0080..U+07FF | C2..DF | 80..BF | ||
U+0800..U+0FFF | E0 | A0..BF | 80..BF | |
U+1000..U+CFFF | E1..EC | 80..BF | 80..BF | |
U+D000..U+D7FF | ED | 80..9F | 80..BF | |
U+E000..U+FFFF | EE..EF | 80..BF | 80..BF | |
U+10000..U+3FFFF | F0 | 90..BF | 80..BF | 80..BF |
U+40000..U+FFFFF | F1..F3 | 80..BF | 80..BF | 80..BF |
U+100000..U+10FFFF | F4 | 80..BF | 80..BF | 80..BFsource |
The table above shows the legal and valid UTF-8 byte sequences, as defined by the Unicode Standard 5.0. The lower ASCII range 00..7F has always been preserved in UTF-8. Multi-byte sequences start at code point U+0080 and continue from two to four bytes. For example, code point U+0700 would be encoded in UTF-8 as a two byte sequence, with the lead byte somewhere in the range of C2..DF.
Over-consumption of well-formed byte sequences has been the vector for critical vulnerabilities. These generally expose widespread issues when they affect a widely used library. One example can be found in the Internationalization Components for Unicode (ICU) in 2009, which would leave almost any Web-application exposed to cross-site scripting (XSS) threats since software such as Apple's Safari Web browser exposed the flaw. Even Web-applications with strong HTML/XSS filters can be vulnerable when the Web browser is non-conformant.
The following input illustrates the over-consumption attack vector, where an attacker controls the img element's src attribute, followed by a text fragment in the HTML. The [0xC2] represents the attacker's UTF-8 lead byte with an invalid trailing byte, the double quote " which gets consumed in the resultant string. The HTML text portion including the onerror text is also attacker-controlled input. The entire payload becomes:
<img src="#[0xC2]"> " onerror="alert(1)"</ br>
The resultant string after over-consumption:
<img src="#> " onerror="alert(1)"</ br>
Although the above is a broken fragment of HTML because the img element is not properly closed, most browsers will render it as an img element with an onerror event handler.
Some of the common security vulnerabilities that exploit an over-consumption flaw as an attack vector include:
- Bypassing folder and file access filters.
- Bypassing parser-based filters such as HTML and XSS filters.
- Bypassing detection signatures in WAF and NID's type devices.
As a developer trying to protect against this, it again becomes important to understand how the API's being used will handle ill-formed byte sequences. The following table of common library API's lists known behaviors:
Library | API | Allows ill-formed UTF8 | Can override | Notes |
.NET 2.0 | System.Text.Encoding | No | No | |
.NET 3.0 | UTF8Encoding | No | No | |
ICU | System.Text.Encoding | Yes | Yes |
As a tester/bug hunter looking for the vulnerabilities, the following table lists test cases to run from a black-box, external perspective. The data in this table presents byte sequences that could elicit over-consumption. You can substitute a % before each byte value to create a URL-encoded value for use in testing. This would be applicable for passing ill-formed byte sequences in a Web-application.
Source bytes | Expected safe result | Desired unsafe result | Notes |
C2 22 3C | 22 3C | 3C | Error handling of C2 overconsumed the trailing 22. |
" | %C0%A2 | %E0%80%A2 | Double quote |
Over-consumption typically happens at a layer lower than most developers work at. It's more likely to be in the frameworks, the browsers, the database, etc. If designing a character set or Unicode layer, be sure to include an error condition for cases where valid lead bytes are followed by invalid trailing bytes.
Through error handling, filtering, or other cases of input validation, problematic characters or raw bytes might be replaced or deleted. In these cases, it's important that the resultant string or byte sequence does not introduce a vulnerability. This problem is not specific to Unicode by any means, and can occur with any character set. However as will be discussed, Unicode has a good solution.
TODO
U+2073
e.g. half of a surrogate pair
The following input illustrates a dangerous character substitution. In this case, the application uses input validation to detect when a string contains characters such as < and then sanitizes such character’s by replacing them with a . period, or full stop. Internally, the application fetches files from a file share in the form:
file://sharename/protected/user-01/files
By exploiting the character substitution logic, an attacker could perform directory traversal attacks on the application:
file://sharename/protected/user-01/../user-002/files
An application may choose to delete characters when invalid, illegal, or unexpected data is encountered. This can also be problematic if not handled carefully. In general, it's safer to replace with Unicode's REPLACEMENT CHARACTER U+FFFD than it is to delete.
Consider a Web-browser that deletes certain special characters such as a mid-stream Unicode BOM when encountered in its HTML parsing. An attacker injects the following HTML which includes the Unicode BOM represented by U+FEFF. The existence of this character allows the attacker's input to bypass the Web-application's cross-site scripting filter, which rejects an occurrence of <script>.
<scr[U+FEFF]ipt>
The Unicode BOM has special meaning in the standard, and in most software. The following image illustrates some of the special properties associated with this character:
TODO add image
The Unicode BOM is recommend input for most software test cases, and can be especially useful when test text parsers such as HTML and XML.
Handle error conditions securely by replacing with the Unicode REPACEMENT CHARACTER U+FFFD. If that's impractical for some reason then choose a safe replacement that doesn't have syntactical meaning in the protocol being used. Some common examples include ? and #.
Strings are transformed through upper and lower casing operations, and sometimes in ways that weren't intended. This behavior can be exploited if performed at the wrong time. For example, if a casing operation is performed anywhere in the stack after a security check, then a special character like U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE could be used to bypass a cross-site scripting filter.
toLower("İ") == "i"
Another aspect of casing operations is that the length of characters and strings can change, depending on the input. The following should never be assumed:
toLower("scrİpt") == "script"
Another aspect of casing operations is that the length of characters and strings can change, depending on the input. The following should never be assumed:
len(x) != len(toLower(x))
Common frameworks handle string comparison in different ways. The following table captures the behavior of classes intended for case-sensitive and case-insensitive string comparison.
Library | API | Is Dangerous | Can override | Notes |
.NET 1.0 | StringComparer | |||
.NET 2.0 | StringComparer | |||
.NET 3.0 | StringComparer | |||
Win32 | CompareStringOrdinal | |||
Win32 | lstrcmpi | |||
Win32 | CompareStringEx | |||
ICU C | ucol_strcoll | |||
ICU C | ucol_strcollIter | Allows for comparing two strings that are supplied as character iterators (UCharIterator). This is useful when you need to compare differently encoded strings using strcoll | ||
ICU C++ | Collator::Compare | |||
ICU C | u_strCaseCompare | Compare two strings case-insensitively using full case folding. | ||
ICU C | u_strcasecmp | Compare two strings case-insensitively using full case folding. | ||
ICU C | u_strncasecmp | Compare two strings case-insensitively using full case folding. | ||
ICU Java | caseCompare | Compare two strings case-insensitively using full case folding. | ||
ICU Java | Collator.compare | |||
POSIX | strcoll |
Buffer overflows can occur through improper assumptions about characters versus bytes, and also about string sizes after casing and normalization operations.
The following table from UTR 36 illustrates the maximum expansion factors for casing operations on the edge-case characters in Unicode. These inputs make excellent test cases.
Operation | UTF | Factor | Sample | |
Lower | 8 | 1.5 | Ⱥ | U+023A |
16, 32 | 1 | A | U+0041 | |
Upper | 8, 16, 32 | 3 | ΐ | U+0390 |
source: Unicode Technical Report #36
The following table from UTR 36 illustrates the maximum expansion factors for normalization operations on the edge case characters in Unicode. These inputs make excellent test cases.
Operation | UTF | Factor | Sample | |
NFC | 8 | 3X | 𝅘𝅥𝅮 | U+1D160 |
16, 32 | 3X | שּׁ | U+FB2C | |
NFD | 8 | 3X | ΐ | U+0390 |
16, 32 | 4X | ᾂ | U+1F82 | |
NFKC/NFKD | 8 | 11X | ﷺ | U+FDFA |
16, 32 | 18X |
White space and line feeds affect syntax in parsers such as HTML, XML and javascript. By interpreting characters such as the 'Ogham space mark' and 'Mongolian vowel separator' as whitespace software can allow attacks through the system. This could give attackers control over the parser, and enable attacks that might bypass security filters. Several characters in Unicode are assigned the 'white space' category and also the 'white space' binary property. Depending on how software is designed, these characters may literally be treated as a space character U+0020.
For example, the following illustration shows the special white space properties associated with the U+180E MONGOLIAN VOWEL SEPARATOR character.
TODO: add image
If a Web browser interprets this character as white space U+0020, then the following HTML fragment would execute script:
<a href=#[U+180E]onclick=alert()>
When software cannot accurately determine the character set of the text it is dealing with, then it must decide to either error or make an assumption. User-agents most commonly must deal with this problem, as they’re faced with interpreting data from a large assortment of character sets. There are no standards that define how to handle situations of character set mismatch, and vendor implementations vary greatly.
Consider the following diagram, in which a Web browser receives an HTTP response with an HTTP charset of ISO-8859-1 defined, and a meta tag charset of shift_jis defined in the HTML.
TODO add image
When an attacker can exploit can control charset declarations, they can control the software’s behavior and in some cases setup an attack.