UTF-8 is a character encoding used to store and exchange text digitally. It is part of the Unicode standard and can represent virtually all the world's written characters. Its efficient storage and wide adoption make it the most widely used encoding on the Internet and in software applications.
UTF-8 is designed as a variable-length encoding, meaning that some characters take up fewer bytes than others. This makes it compatible with older systems that support ASCII, while allowing it to accommodate a wide range of special and international characters.
The emergence of Unicode and UTF-8 has made software and websites more globally accessible. Without a universal encoding such as UTF-8, systems would experience problems correctly representing different languages and symbols.
Some reasons why UTF-8 is the preferred choice for text encoding:
Global support: All languages and symbols can be stored without the need for separate encodings.
Compatibility with ASCII: Old systems that only support ASCII can still read UTF-8-encoded text correctly.
Efficiency: Frequently used characters take up less storage space, which helps with fast processing.
Web standard: Browsers and Web servers use UTF-8 by default, reducing compatibility issues.
Unicode is a standard that assigns unique numeric values (code points) to characters from different languages and symbol sets. UTF-8 is a way of storing these code points in a computer-friendly format.
For example:
The letter A has Unicode code point U+0041 and is stored in UTF-8 as 1 byte (0x41).
The symbol € has Unicode code point U+20AC and is stored in UTF-8 as 3 bytes (0xE2 0x82 0xAC).
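You can inspect this mapping yourself in any language with Unicode support; here is a minimal Python sketch (the hex() separator argument assumes Python 3.8 or newer):
print(hex(ord("A")))                 # 0x41 – the Unicode code point
print("A".encode("utf-8").hex())     # 41 – stored as a single byte
print("€".encode("utf-8").hex(" "))  # e2 82 ac – stored as three bytes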
This makes UTF-8 a flexible and scalable solution for modern software and Web development.
UTF-8 was developed in 1992 by Ken Thompson and Rob Pike, two engineers at Bell Labs. They designed this character encoding as a more efficient way to store and process Unicode characters, with a focus on compatibility with ASCII and saving space.
The original idea was to create a variable-length encoding that:
Would be backward compatible with the existing ASCII format.
Would handle frequently used characters more efficiently by making them take up less storage space.
Would not cause byte conflicts on older systems that were not Unicode-compatible.
UTF-8 was formally presented in January 1993 at the USENIX conference, and in 1996 it was specified in RFC 2044. Later, in 2003, it was finalized in RFC 3629, which limited valid code points to the range U+0000 through U+10FFFF (1,114,112 code points in total).
Before the advent of Unicode and UTF-8, there were many character encodings such as ISO 8859-1 (Latin-1), Shift-JIS, and Windows-1252. This caused major compatibility problems in international communication and file exchange.
Problems with older encodings:
Limited character sets – Each encoding could only support a limited number of languages.
Incompatibility between systems – A text file encoded in Windows-1252 could be unreadable on a system that used ISO 8859-1.
Limited byte lengths – Most single-byte encodings could represent at most 256 characters, far too few for languages with large symbol sets.
Unicode offered a solution by introducing one universal character set. However, fixed-width Unicode encodings such as UCS-2 (the predecessor of UTF-16) and UTF-32 used 2 or 4 bytes per character, which was inefficient for English and other Latin-based texts.
Since its introduction, UTF-8 has spread rapidly and become the dominant character encoding on the Internet and in software applications.
Key developments:
2008: Google reported that UTF-8 had overtaken all other encodings to become the most common encoding on the Web.
2010: More than 50% of all websites used UTF-8 as the default encoding.
2019: More than 95% of all websites were encoded in UTF-8.
Now: UTF-8 is the standard encoding for most operating systems, databases, and programming languages.
Thanks to its wide support and efficiency, UTF-8 is the most widely used character encoding in the world.
UTF-8 is a variable-length character encoding, meaning that some characters require fewer bytes than others. This ensures that the encoding is efficient while remaining compatible with ASCII.
Each Unicode code point is stored in UTF-8 as a sequence of one to four bytes.
ASCII characters (0-127) are stored as one byte (compatible with ASCII).
Other Unicode characters take up two, three or four bytes, depending on their code point.
Here is an overview of how characters are stored in UTF-8:
1 byte: U+0000 to U+007F – bit pattern 0xxxxxxx (identical to ASCII)
2 bytes: U+0080 to U+07FF – 110xxxxx 10xxxxxx
3 bytes: U+0800 to U+FFFF – 1110xxxx 10xxxxxx 10xxxxxx
4 bytes: U+10000 to U+10FFFF – 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
This means that commonly used characters such as letters and numbers take up little storage space, while rare symbols or non-Latin characters require more bytes.
Let's look at some Unicode characters and see how they are stored in UTF-8:
A (U+0041): 0x41 – 1 byte
€ (U+20AC): 0xE2 0x82 0xAC – 3 bytes
😀 (U+1F600): 0xF0 0x9F 0x98 0x80 – 4 bytes
As you can see, the letter A uses only 1 byte, the euro sign uses 3 bytes, and an emoji such as 😀 uses 4 bytes.
A major advantage of UTF-8 is that ASCII characters remain exactly the same. This means that a file with only ASCII characters in UTF-8 will be read the same by older systems that understand only ASCII.
For example, the ASCII string:
Hello
is stored the same in UTF-8 as it is in ASCII:
0x48 0x65 0x6C 0x6C 0x6F
But if you add a Unicode character, such as an é (U+00E9), the encoding will change to:
0x48 0x65 0x6C 0x6C 0xC3 0xA9
Here you can see that é is stored as 2 bytes (0xC3 0xA9), while the rest remains unchanged.
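You can reproduce these byte sequences with Python's encode() method; a minimal sketch (the hex() separator argument assumes Python 3.8 or newer):
print("Hello".encode("utf-8").hex(" "))  # 48 65 6c 6c 6f
print("Hellé".encode("utf-8").hex(" "))  # 48 65 6c 6c c3 a9 – é becomes two bytes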
Overlong encodings are incorrect UTF-8 representations of characters that use more bytes than necessary.
For example, the ASCII character A is encoded correctly as 0x41 (1 byte). But in an overlong encoding, the same character could be stored as 11000001 10000001 (0xC1 0x81, 2 bytes), which is unnecessary and unsafe.
Why are overlong encodings a problem?
They can be used in security exploits to hide unwanted characters.
They are not allowed by the Unicode standard.
Modern browsers and software reject overlong encodings, as the sketch below shows.
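A minimal Python sketch of this rejection (the exact error message may differ between Python versions):
try:
    b"\xc1\x81".decode("utf-8")  # overlong two-byte encoding of "A" (normally just 0x41)
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0xc1 ...: invalid start byte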
What happens when an application encounters an invalid UTF-8 byte?
There are three possible approaches:
Ignore error – The corrupt byte is skipped.
Replace with a standard character – Often the “replacement character” � (U+FFFD) is used.
Raise an error – Some systems reject the input entirely.
Example:
If a byte is missing from a 2-byte character, the software may replace it with � to indicate that the text is corrupted.
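Python's decode() method exposes all three approaches through its errors parameter; a minimal sketch:
data = b"Hell\xc3"  # the last byte starts a 2-byte sequence that is cut off
print(data.decode("utf-8", errors="replace"))  # 'Hell�' – replacement character
print(data.decode("utf-8", errors="ignore"))   # 'Hell' – corrupt byte skipped
try:
    data.decode("utf-8")  # default errors="strict"
except UnicodeDecodeError as e:
    print(e)  # the input is rejected entirely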
Surrogates are special code points used in UTF-16 to encode characters outside the Basic Multilingual Plane (BMP). But in UTF-8, surrogates are invalid and should not be used.
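You can verify this in Python, which refuses to encode a lone surrogate; a minimal sketch:
try:
    "\ud800".encode("utf-8")  # surrogate code points (U+D800 to U+DFFF) are reserved for UTF-16 pairs
except UnicodeEncodeError as e:
    print(e)  # surrogates not allowed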
Byte Order Mark (BOM) is an optional marker (U+FEFF) that some systems place at the beginning of a file to signal that the content is encoded in UTF-8.
Usually not needed in UTF-8, because UTF-8 has no byte order issues.
Can cause problems if software does not process a BOM correctly.
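Python's codecs module exposes the UTF-8 BOM, and the utf-8-sig codec strips it on decoding; a minimal sketch:
import codecs
print(codecs.BOM_UTF8)           # b'\xef\xbb\xbf' – the BOM as UTF-8 bytes
data = codecs.BOM_UTF8 + "Hello".encode("utf-8")
print(data.decode("utf-8"))      # '\ufeffHello' – BOM kept as a character
print(data.decode("utf-8-sig"))  # 'Hello' – BOM stripped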
Several character encodings are available, but UTF-8 has set the standard because of its flexibility and broad support. Nevertheless, other encodings such as UTF-16 and UTF-32 are still used in specific situations. Let's look at the main differences.
When do you use which encoding?
Use UTF-8 as the default choice.
It is the most efficient and compatible option for almost all applications.
Suitable for the Web, databases, and operating systems.
Use UTF-16 if you handle many Asian characters.
Some older Windows and Java applications work with UTF-16 by default.
Not efficient for ASCII texts.
Use UTF-32 if storage space is not an issue, and you need direct access to characters.
Rarely used outside specialized applications, such as certain internal data structures in software.
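The size differences between these encodings are easy to measure in Python; a minimal sketch (note that Python's utf-16 and utf-32 codecs prepend a BOM):
s = "Hello"
print(len(s.encode("utf-8")))   # 5 bytes – 1 byte per ASCII character
print(len(s.encode("utf-16")))  # 12 bytes – 2 per character plus a 2-byte BOM
print(len(s.encode("utf-32")))  # 24 bytes – 4 per character plus a 4-byte BOM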
Before Unicode became popular, many computers and systems used locale-specific encodings, such as ISO 8859-1 (Latin-1) and Windows-1252.
Why is UTF-8 better than older encodings?
Universal language support: One encoding for all languages and symbols.
No character corruption: No “mojibake” (unreadable characters) when misinterpreted.
Standardized: Supported by all modern operating systems and software.
UTF-8 has emerged as the most widely used character encoding in the world, mainly due to its versatility and efficiency. Here are the main advantages of UTF-8:
UTF-8 supports all Unicode characters, meaning it can be used for any script in the world. This makes it ideal for international communications, websites and software.
One of the biggest advantages of UTF-8 is that all ASCII characters (0-127) remain the same. This means that:
Existing ASCII-based systems can use UTF-8 without problems.
Old software can continue to process ASCII content without conversion.
For example, the string:
Hello
is stored the same in both ASCII and UTF-8:
0x48 0x65 0x6C 0x6C 0x6F
This avoids compatibility issues with older systems.
Because UTF-8 uses variable length, it takes up less space for commonly used characters than other Unicode encodings such as UTF-16 or UTF-32.
Storage size comparison:
ASCII characters take up only 1 byte.
Chinese, Japanese, and Arabic characters may require 2 to 4 bytes.
UTF-16 and UTF-32 always take up more space for English-language texts.
According to W3Techs, more than 95% of all websites today are encoded in UTF-8. This is because:
W3C and modern browsers use UTF-8 as the standard.
It is the most efficient encoding for mixed languages and symbols.
Most databases and programming languages support UTF-8 by default:
MySQL and PostgreSQL use UTF-8 as the standard for text fields.
JSON and XML are almost always stored in UTF-8.
The broad support makes UTF-8 the safest choice for text storage and data exchange.
Although UTF-8 has become the standard for character encoding, it also has some disadvantages, especially in specific situations. Here are the main limitations:
Although UTF-8 is efficient for ASCII characters (1 byte per character), some Unicode characters can take up more space.
Comparison of characters in UTF-8 vs. other encodings:
Chinese and Japanese characters take 3 to 4 bytes in UTF-8, while in UTF-16 most of them need only 2 bytes.
For English-language texts, UTF-8 is more efficient, but for large amounts of Asian characters, UTF-16 can be more compact.
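You can check this trade-off directly in Python; a minimal sketch with three Japanese characters (the utf-16 codec adds a 2-byte BOM):
s = "日本語"
print(len(s.encode("utf-8")))   # 9 bytes – 3 bytes per character
print(len(s.encode("utf-16")))  # 8 bytes – 2 per character plus a 2-byte BOM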
Because characters have variable length (1 to 4 bytes), it can be more difficult to work with UTF-8-encoded text in programming languages and databases.
Examples of complications:
String length calculation: A character is not always 1 byte, so length() functions can give unexpected results.
Substring operations: Cutting a UTF-8 byte string without regard to character boundaries can corrupt characters.
Retrieving the nth character: Because characters have variable length, you often have to iterate over the string instead of indexing directly. The sketch after this list illustrates these pitfalls.
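A minimal Python sketch of both pitfalls, contrasting character count with byte count and slicing a byte string mid-character:
s = "héllo"
print(len(s))                  # 5 characters
print(len(s.encode("utf-8")))  # 6 bytes – é takes 2 bytes
b = s.encode("utf-8")
print(b[:2].decode("utf-8", errors="replace"))  # 'h�' – the slice cuts through é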
Although UTF-8 is the standard in modern systems, old programs and devices may still expect ISO 8859-1 or Windows-1252. This can lead to:
Unreadable characters (mojibake) if a system does not decode correctly.
Problems with file conversion when migrating older databases.
Some binary-oriented protocols do not work as well with UTF-8. For example:
System logs and low-level network information may contain extra bytes due to UTF-8.
Special characters such as 💜 or 😃 can take up more bytes, affecting file formats such as JSON.
Unlike UTF-32, where each character is always 4 bytes, the size in UTF-8 varies. This can lead to:
Slower processing in random access (for example, in databases and search indexes).
More complex processing in programming languages where characters must be searched or replaced quickly.
Although UTF-8 has some disadvantages, the advantages clearly outweigh them:
Most efficient encoding for mixed languages
Web standard and universal support
Backwards compatibility with ASCII
For most applications, UTF-8 is the best choice. Other encodings such as UTF-16 and UTF-32 are used only in very specific cases.
UTF-8 is used in virtually all modern technologies. From websites and databases to programming languages and operating systems, UTF-8 is the standard encoding because of its flexibility and broad support. Here are some of the main applications.
The Web runs on UTF-8. HTML, CSS, and JavaScript files are UTF-8 encoded by default, and modern browsers expect this encoding.
To ensure that a Web page correctly uses UTF-8, add the following meta tag in the <head> section of your HTML document:
<meta charset="UTF-8">
This will display characters correctly regardless of language.
Prevents wrong characters (mojibake) in HTML.
Ensures that special characters such as €, ñ, ä, 和 are displayed correctly.
Improves SEO because search engines support UTF-8 by default.
Modern databases such as MySQL, PostgreSQL, and SQLite support UTF-8 as a standard.
Supports multiple languages in the same database.
Prevents foreign characters from being lost in storage.
Compatible with Web applications that use UTF-8.
When creating a database or table in MySQL, you can set UTF-8 as the default:
CREATE DATABASE my_database CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
MySQL's older utf8 implementation does not support 4-byte Unicode characters such as emojis (😀). Always use utf8mb4 for full Unicode support.
Most modern programming languages support UTF-8 as a standard.
Example: UTF-8 strings in Python
text = "Hello, world! 🌍"
print(text.encode("utf-8"))  # Output: b'Hello, world! \xf0\x9f\x8c\x8d'
Here the Unicode emoji 🌍 is correctly converted to UTF-8 bytes.
Operating systems such as Windows, macOS, and Linux support UTF-8 for file names, terminal display and applications.
Older Windows systems used Windows-1252 or UTF-16, but modern versions fully support UTF-8 in cmd and PowerShell.
To enable UTF-8 in Windows terminal:
chcp 65001
This switches the terminal to UTF-8 mode, displaying special characters correctly.
Almost all modern APIs, JSON files, and XML files use UTF-8 as the standard encoding.
Example: JSON with UTF-8
{
"naam": "Jörg Müller",
"stad": "München",
"emoji": "😀"
}
The JSON specification (RFC 8259) requires UTF-8 for data exchanged between systems, making it easy to exchange data globally without character corruption.
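In Python, for example, you can keep non-ASCII characters intact when generating JSON; a minimal sketch (by default json.dumps escapes them to \uXXXX sequences):
import json
data = {"name": "Jörg Müller", "emoji": "😀"}
print(json.dumps(data, ensure_ascii=False))  # {"name": "Jörg Müller", "emoji": "😀"}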
Websites: HTML, CSS, and JavaScript use UTF-8 by default.
Databases: MySQL and PostgreSQL support UTF-8 (utf8mb4).
Programming languages: Almost every language uses UTF-8 as a standard for strings.
Operating systems: macOS, Linux and Windows support UTF-8.
APIs and JSON: Data exchange via JSON and XML is standard in UTF-8.
UTF-8 is the most versatile and efficient encoding for modern technologies and remains the best choice for any application.
UTF-8 has not just become popular; it is an officially recognized and widely supported standard within various industries. From international organizations to programming languages and operating systems, UTF-8 is used almost everywhere.
UTF-8 is officially enshrined in several standards and specifications, including RFC 3629 (IETF), the Unicode Standard, and ISO/IEC 10646; the WHATWG HTML standard also requires UTF-8 for new Web documents.
These standards ensure that UTF-8 is consistently used worldwide in software, hardware, and network protocols.
Virtually all modern programming languages support UTF-8 directly or provide native support for Unicode.
Many frameworks such as Django, React, Angular, and Node.js use UTF-8 by default to ensure compatibility.
UTF-8 is the recommended encoding for text in databases because it is compatible with multiple languages and prevents character loss.
Note:
MySQL's legacy utf8 character set does NOT support 4-byte Unicode characters such as emojis. Always use utf8mb4!
PostgreSQL and SQLite support UTF-8 by default without additional configuration.
Operating systems support UTF-8 to correctly process file names, text input and applications.
Windows users sometimes need to manually switch to UTF-8, for example in the terminal with:
chcp 65001
However, since Windows 10, UTF-8 is better supported by default.
All modern web browsers support UTF-8 and use it by default for web pages.
Web pages without a specific encoding setting are usually interpreted as UTF-8 by browsers, indicating how universal the standard is.
Many network and file formats support UTF-8 to ensure global compatibility.
Important for developers:
JSON and XML support UTF-8 by default, so manual conversions are not required.
Emails with international characters must use UTF-8 to be displayed correctly in all clients.
UTF-8 is the most widely used character encoding in the world. Its flexibility, efficiency, and universal compatibility have made it the standard for Web development, databases, programming languages and operating systems.
Widest support: Programming languages, databases, operating systems, and networks all support UTF-8.
Global compatibility: Supports all languages and symbols without conversion problems.
Standardized by Unicode and W3C: UTF-8 is the recommended encoding for the Web and software.
Due to its universal adoption and efficient storage, UTF-8 remains the best choice for text processing, storage, and data exchange.
UTF-8 is a character encoding that can store all Unicode characters using 1 to 4 bytes per character. It is the standard encoding for the Web and modern software.
A character's UTF-8 code is the byte sequence that represents it, usually written in binary or hexadecimal. For example, A has the UTF-8 code 0x41, and € has 0xE2 0x82 0xAC.
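Going the other way, decoding the bytes recovers the character; a minimal Python sketch:
print(bytes.fromhex("e2 82 ac").decode("utf-8"))  # €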
UTF-8 supports all 1,114,112 Unicode characters (from U+0000 to U+10FFFF), although not all code points are in use.
All characters can be encoded in UTF-8, but incorrectly encoded characters or old encodings such as Windows-1252 can cause problems.