UTF-8 is a character encoding used to store and exchange text digitally. It is part of the Unicode standard and can represent virtually all the world's written characters. Its efficient storage and wide adoption make it the most widely used encoding on the Internet and in software applications.
UTF-8 is designed as a variable-length encoding, meaning that some characters take up fewer bytes than others. This makes it compatible with older systems that support ASCII, while allowing it to accommodate a wide range of special and international characters.
The emergence of Unicode and UTF-8 has made software and websites more globally accessible. Without a universal encoding such as UTF-8, systems would experience problems correctly representing different languages and symbols.
Some reasons why UTF-8 is the preferred choice for text encoding:
Global support: All languages and symbols can be stored without the need for separate encodings.
Compatibility with ASCII: Old systems that only support ASCII can still read UTF-8-encoded text correctly.
Efficiency: Frequently used characters take up less storage space, which helps with fast processing.
Web standard: Browsers and Web servers use UTF-8 by default, reducing compatibility issues.
Unicode is a standard that assigns unique numeric values (code points) to characters from different languages and symbol sets. UTF-8 is a way of storing these code points in a computer-friendly format.
For example:
The letter A has Unicode code point U+0041 and is stored in UTF-8 as 1 byte (0x41).
The symbol € has Unicode code point U+20AC and is stored in UTF-8 as 3 bytes (0xE2 0x82 0xAC).
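You can inspect this mapping yourself in any language with Unicode support; here is a minimal Python sketch (the hex() separator argument assumes Python 3.8 or newer):
print(hex(ord("A")))                 # 0x41 – the Unicode code point
print("A".encode("utf-8").hex())     # 41 – stored as a single byte
print("€".encode("utf-8").hex(" "))  # e2 82 ac – stored as three bytes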
This makes UTF-8 a flexible and scalable solution for modern software and Web development.
UTF-8 was developed in 1992 by Ken Thompson and Rob Pike, two engineers at Bell Labs. They designed this character encoding as a more efficient way to store and process Unicode characters, with a focus on compatibility with ASCII and saving space.
The original idea was to create a variable-length encoding that:
Would be backward compatible with the existing ASCII format.
Would handle frequently used characters more efficiently by making them take up less storage space.
Would not cause byte conflicts on older systems that were not Unicode-compatible.
UTF-8 was formally presented in January 1993 at the USENIX conference, and in 1996 it was specified in RFC 2044. Later, in 2003, it was finalized in RFC 3629, which limited valid code points to the range U+0000 through U+10FFFF (1,114,112 code points in total).
Before the advent of Unicode and UTF-8, there were many character encodings such as ISO 8859-1 (Latin-1), Shift-JIS, and Windows-1252. This caused major compatibility problems in international communication and file exchange.
Problems with older encodings:
Limited character sets – Each encoding could only support a limited number of languages.
Incompatibility between systems – A text file encoded in Windows-1252 could be unreadable on a system that used ISO 8859-1.
Limited byte lengths – Most single-byte encodings could represent at most 256 characters, far too few for languages with large symbol sets.
Unicode offered a solution by introducing one universal character set. However, fixed-width Unicode encodings such as UCS-2 (the predecessor of UTF-16) and UTF-32 used 2 or 4 bytes per character, which was inefficient for English and other Latin-based texts.
Since its introduction, UTF-8 has spread rapidly and become the dominant character encoding on the Internet and in software applications.
Key developments:
2008: Google reported that UTF-8 had overtaken all other encodings to become the most common encoding on the Web.
2010: More than 50% of all websites used UTF-8 as the default encoding.
2019: More than 95% of all websites were encoded in UTF-8.
Now: UTF-8 is the standard encoding for most operating systems, databases, and programming languages.
Thanks to its wide support and efficiency, UTF-8 is the most widely used character encoding in the world.
UTF-8 is a variable-length character encoding, meaning that some characters require fewer bytes than others. This ensures that the encoding is efficient while remaining compatible with ASCII.
Each Unicode code point is stored in UTF-8 as a sequence of one to four bytes.
ASCII characters (0-127) are stored as one byte (compatible with ASCII).
Other Unicode characters take up two, three or four bytes, depending on their code point.
Here is an overview of how characters are stored in UTF-8:
1 byte: U+0000 to U+007F – bit pattern 0xxxxxxx (identical to ASCII)
2 bytes: U+0080 to U+07FF – 110xxxxx 10xxxxxx
3 bytes: U+0800 to U+FFFF – 1110xxxx 10xxxxxx 10xxxxxx
4 bytes: U+10000 to U+10FFFF – 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
This means that commonly used characters such as letters and numbers take up little storage space, while rare symbols or non-Latin characters require more bytes.
Let's look at some Unicode characters and see how they are stored in UTF-8:
A (U+0041): 0x41 – 1 byte
€ (U+20AC): 0xE2 0x82 0xAC – 3 bytes
😀 (U+1F600): 0xF0 0x9F 0x98 0x80 – 4 bytes
As you can see, the letter A uses only 1 byte, the euro sign uses 3 bytes, and an emoji such as 😀 uses 4 bytes.
A major advantage of UTF-8 is that ASCII characters remain exactly the same. This means that a file with only ASCII characters in UTF-8 will be read the same by older systems that understand only ASCII.
For example, the ASCII string:
Hello
is stored the same in UTF-8 as it is in ASCII:
0x48 0x65 0x6C 0x6C 0x6F
But if you add a Unicode character, such as an é (U+00E9), the encoding will change to:
0x48 0x65 0x6C 0x6C 0xC3 0xA9
Here you can see that é is stored as 2 bytes (0xC3 0xA9), while the rest remains unchanged.
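You can reproduce these byte sequences with Python's encode() method; a minimal sketch (the hex() separator argument assumes Python 3.8 or newer):
print("Hello".encode("utf-8").hex(" "))  # 48 65 6c 6c 6f
print("Hellé".encode("utf-8").hex(" "))  # 48 65 6c 6c c3 a9 – é becomes two bytes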
Overlong encodings are incorrect UTF-8 representations of characters that use more bytes than necessary.
For example, the ASCII character A is encoded correctly as 0x41 (1 byte). But in an overlong encoding, the same character could be stored as 11000001 10000001 (0xC1 0x81, 2 bytes), which is unnecessary and unsafe.
Why are overlong encodings a problem?
They can be used in security exploits to hide unwanted characters.
They are not allowed by the Unicode standard.
Modern browsers and software reject overlong encodings, as the sketch below shows.
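A minimal Python sketch of this rejection (the exact error message may differ between Python versions):
try:
    b"\xc1\x81".decode("utf-8")  # overlong two-byte encoding of "A" (normally just 0x41)
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0xc1 ...: invalid start byte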
What happens when an application encounters an invalid UTF-8 byte?
There are three possible approaches:
Ignore error – The corrupt byte is skipped.
Replace with a standard character – Often the “replacement character” � (U+FFFD) is used.
Raise an error – Some systems reject the input entirely.
Example:
If a byte is missing from a 2-byte character, the software may replace it with � to indicate that the text is corrupted.
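Python's decode() method exposes all three approaches through its errors parameter; a minimal sketch:
data = b"Hell\xc3"  # the last byte starts a 2-byte sequence that is cut off
print(data.decode("utf-8", errors="replace"))  # 'Hell�' – replacement character
print(data.decode("utf-8", errors="ignore"))   # 'Hell' – corrupt byte skipped
try:
    data.decode("utf-8")  # default errors="strict"
except UnicodeDecodeError as e:
    print(e)  # the input is rejected entirely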
Surrogates are special code points used in UTF-16 to encode characters outside the Basic Multilingual Plane (BMP). But in UTF-8, surrogates are invalid and should not be used.
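You can verify this in Python, which refuses to encode a lone surrogate; a minimal sketch:
try:
    "\ud800".encode("utf-8")  # surrogate code points (U+D800 to U+DFFF) are reserved for UTF-16 pairs
except UnicodeEncodeError as e:
    print(e)  # surrogates not allowed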
Byte Order Mark (BOM) is an optional marker (U+FEFF) that some systems place at the beginning of a file to signal that the content is encoded in UTF-8.
Usually not needed in UTF-8, because UTF-8 has no byte order issues.
Can cause problems if software does not process a BOM correctly.
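Python's codecs module exposes the UTF-8 BOM, and the utf-8-sig codec strips it on decoding; a minimal sketch:
import codecs
print(codecs.BOM_UTF8)           # b'\xef\xbb\xbf' – the BOM as UTF-8 bytes
data = codecs.BOM_UTF8 + "Hello".encode("utf-8")
print(data.decode("utf-8"))      # '\ufeffHello' – BOM kept as a character
print(data.decode("utf-8-sig"))  # 'Hello' – BOM stripped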
Several character encodings are available, but UTF-8 has set the standard because of its flexibility and broad support. Nevertheless, other encodings such as UTF-16 and UTF-32 are still used in specific situations. Let's look at the main differences.
When do you use which encoding?
Use UTF-8 as the default choice.
It is the most efficient and compatible option for almost all applications.
Suitable for the Web, databases, and operating systems.
Use UTF-16 if you handle many Asian characters.
Some older Windows and Java applications work with UTF-16 by default.
Not efficient for ASCII texts.
Use UTF-32 if storage space is not an issue, and you need direct access to characters.
Rarely used outside specialized applications, such as certain internal data structures in software.
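The size differences between these encodings are easy to measure in Python; a minimal sketch (note that Python's utf-16 and utf-32 codecs prepend a BOM):
s = "Hello"
print(len(s.encode("utf-8")))   # 5 bytes – 1 byte per ASCII character
print(len(s.encode("utf-16")))  # 12 bytes – 2 per character plus a 2-byte BOM
print(len(s.encode("utf-32")))  # 24 bytes – 4 per character plus a 4-byte BOM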
Before Unicode became popular, many computers and systems used locale-specific encodings, such as ISO 8859-1 (Latin-1) and Windows-1252.
Why is UTF-8 better than older encodings?
Universal language support: One encoding for all languages and symbols.
No character corruption: No “mojibake” (unreadable characters) when misinterpreted.
Standardized: Supported by all modern operating systems and software.
UTF-8 has emerged as the most widely used character encoding in the world, mainly due to its versatility and efficiency. Here are the main advantages of UTF-8:
UTF-8 supports all Unicode characters, meaning it can be used for any script in the world. This makes it ideal for international communications, websites and software.
One of the biggest advantages of UTF-8 is that all ASCII characters (0-127) remain the same. This means that:
Existing ASCII-based systems can use UTF-8 without problems.
Old software can continue to process ASCII content without conversion.
For example, the string:
Hello
is stored the same in both ASCII and UTF-8:
0x48 0x65 0x6C 0x6C 0x6F
This avoids compatibility issues with older systems.
Because UTF-8 uses variable length, it takes up less space for commonly used characters than other Unicode encodings such as UTF-16 or UTF-32.
Storage size comparison:
ASCII characters take up only 1 byte.
Chinese, Japanese, and Arabic characters may require 2 to 4 bytes.
UTF-16 and UTF-32 always take up more space for English-language texts.
According to W3Techs, more than 95% of all websites today are encoded in UTF-8. This is because:
W3C and modern browsers use UTF-8 as the standard.
It is the most efficient encoding for mixed languages and symbols.
Most databases and programming languages support UTF-8 by default:
MySQL and PostgreSQL use UTF-8 as the standard for text fields.
JSON and XML are almost always stored in UTF-8.
The broad support makes UTF-8 the safest choice for text storage and data exchange.
Although UTF-8 has become the standard for character encoding, it also has some disadvantages, especially in specific situations. Here are the main limitations:
Although UTF-8 is efficient for ASCII characters (1 byte per character), some Unicode characters can take up more space.
Comparison of characters in UTF-8 vs. other encodings:
Chinese and Japanese characters take 3 to 4 bytes in UTF-8, while in UTF-16 most of them need only 2 bytes.
For English-language texts, UTF-8 is more efficient, but for large amounts of Asian characters, UTF-16 can be more compact.
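You can check this trade-off directly in Python; a minimal sketch with three Japanese characters (the utf-16 codec adds a 2-byte BOM):
s = "日本語"
print(len(s.encode("utf-8")))   # 9 bytes – 3 bytes per character
print(len(s.encode("utf-16")))  # 8 bytes – 2 per character plus a 2-byte BOM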
Because characters have variable length (1 to 4 bytes), it can be more difficult to work with UTF-8-encoded text in programming languages and databases.
Examples of complications:
String length calculation: A character is not always 1 byte, so length() functions can give unexpected results.
Substring operations: Cutting a UTF-8 byte string without regard to character boundaries can corrupt characters.
Retrieving the nth character: Because characters have variable length, you often have to iterate over the string instead of indexing directly. The sketch after this list illustrates these pitfalls.
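A minimal Python sketch of both pitfalls, contrasting character count with byte count and slicing a byte string mid-character:
s = "héllo"
print(len(s))                  # 5 characters
print(len(s.encode("utf-8")))  # 6 bytes – é takes 2 bytes
b = s.encode("utf-8")
print(b[:2].decode("utf-8", errors="replace"))  # 'h�' – the slice cuts through é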
Although UTF-8 is the standard in modern systems, old programs and devices may still expect ISO 8859-1 or Windows-1252. This can lead to:
Unreadable characters (mojibake) if a system does not decode correctly.
Problems with file conversion when migrating older databases.
Some binary-oriented protocols do not work as well with UTF-8. For example:
System logs and low-level network information may contain extra bytes due to UTF-8.
Special characters such as 💜 or 😃 can take up more bytes, affecting file formats such as JSON.
Unlike UTF-32, where each character is always 4 bytes, the size in UTF-8 varies. This can lead to:
Slower processing in random access (for example, in databases and search indexes).
More complex processing in programming languages where characters must be searched or replaced quickly.
Although UTF-8 has some disadvantages, the advantages clearly outweigh them:
Most efficient encoding for mixed languages
Web standard and universal support
Backwards compatibility with ASCII
For most applications, UTF-8 is the best choice. Other encodings such as UTF-16 and UTF-32 are used only in very specific cases.
UTF-8 is used in virtually all modern technologies. From websites and databases to programming languages and operating systems, UTF-8 is the standard encoding because of its flexibility and broad support. Here are some of the main applications.
The Web runs on UTF-8. HTML, CSS, and JavaScript files are UTF-8 encoded by default, and modern browsers expect this encoding.
To ensure that a Web page correctly uses UTF-8, add the following meta tag in the <head> section of your HTML document:
<meta charset="UTF-8">
This will display characters correctly regardless of language.
Prevents wrong characters (mojibake) in HTML.
Ensures that special characters such as €, ñ, ä, 和 are displayed correctly.
Improves SEO because search engines support UTF-8 by default.
Modern databases such as MySQL, PostgreSQL, and SQLite support UTF-8 as a standard.
Supports multiple languages in the same database.
Prevents foreign characters from being lost in storage.
Compatible with Web applications that use UTF-8.
When creating a database or table in MySQL, you can set UTF-8 as the default:
CREATE DATABASE my_database CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
MySQL's older utf8 implementation does not support 4-byte Unicode characters such as emojis (😀). Always use utf8mb4 for full Unicode support.
Most modern programming languages support UTF-8 as a standard.
Example: UTF-8 strings in Python
text = "Hello, world! 🌍"
print(text.encode("utf-8"))  # Output: b'Hello, world! \xf0\x9f\x8c\x8d'
Here the Unicode emoji 🌍 is correctly converted to UTF-8 bytes.
Operating systems such as Windows, macOS, and Linux support UTF-8 for file names, terminal display and applications.
Older Windows systems used Windows-1252 or UTF-16, but modern versions fully support UTF-8 in cmd and PowerShell.
To enable UTF-8 in Windows terminal:
chcp 65001
This switches the terminal to UTF-8 mode, displaying special characters correctly.
Almost all modern APIs, JSON files, and XML files use UTF-8 as the standard encoding.
Example: JSON with UTF-8
{
"naam": "Jörg Müller",
"stad": "München",
"emoji": "😀"
}
The JSON specification (RFC 8259) requires UTF-8 for data exchanged between systems, making it easy to exchange data globally without character corruption.
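In Python, for example, you can keep non-ASCII characters intact when generating JSON; a minimal sketch (by default json.dumps escapes them to \uXXXX sequences):
import json
data = {"name": "Jörg Müller", "emoji": "😀"}
print(json.dumps(data, ensure_ascii=False))  # {"name": "Jörg Müller", "emoji": "😀"}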
Websites: HTML, CSS, and JavaScript use UTF-8 by default.
Databases: MySQL and PostgreSQL support UTF-8 (utf8mb4).
Programming languages: Almost every language uses UTF-8 as a standard for strings.
Operating systems: macOS, Linux and Windows support UTF-8.
APIs and JSON: Data exchange via JSON and XML is standard in UTF-8.
UTF-8 is the most versatile and efficient encoding for modern technologies and remains the best choice for any application.
UTF-8 has not just become popular; it is an officially recognized and widely supported standard within various industries. From international organizations to programming languages and operating systems, UTF-8 is used almost everywhere.
UTF-8 is officially enshrined in several standards and specifications, including RFC 3629 (IETF), the Unicode Standard, and ISO/IEC 10646; the WHATWG HTML standard also requires UTF-8 for new Web documents.
These standards ensure that UTF-8 is consistently used worldwide in software, hardware, and network protocols.
Virtually all modern programming languages support UTF-8 directly or provide native support for Unicode.
Many frameworks such as Django, React, Angular, and Node.js use UTF-8 by default to ensure compatibility.
UTF-8 is the recommended encoding for text in databases because it is compatible with multiple languages and prevents character loss.
Note:
MySQL's legacy utf8 character set does NOT support 4-byte Unicode characters such as emojis. Always use utf8mb4!
PostgreSQL and SQLite support UTF-8 by default without additional configuration.
Operating systems support UTF-8 to correctly process file names, text input and applications.
Windows users sometimes need to manually switch to UTF-8, for example in the terminal with:
chcp 65001
However, since Windows 10, UTF-8 is better supported by default.
All modern web browsers support UTF-8 and use it by default for web pages.
Web pages without a specific encoding setting are usually interpreted as UTF-8 by browsers, indicating how universal the standard is.
Many network and file formats support UTF-8 to ensure global compatibility.
Important for developers:
JSON and XML support UTF-8 by default, so manual conversions are not required.
Emails with international characters must use UTF-8 to be displayed correctly in all clients.
UTF-8 is the most widely used character encoding in the world. Its flexibility, efficiency, and universal compatibility have made it the standard for Web development, databases, programming languages and operating systems.
Widest support: Programming languages, databases, operating systems, and networks all support UTF-8.
Global compatibility: Supports all languages and symbols without conversion problems.
Standardized by Unicode and W3C: UTF-8 is the recommended encoding for the Web and software.
Due to its universal adoption and efficient storage, UTF-8 remains the best choice for text processing, storage, and data exchange.
UTF-8 is a character encoding that can store all Unicode characters using 1 to 4 bytes per character. It is the standard encoding for the Web and modern software.
A character's UTF-8 code is the byte sequence that represents it, usually written in binary or hexadecimal. For example, A has the UTF-8 code 0x41, and € has 0xE2 0x82 0xAC.
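Going the other way, decoding the bytes recovers the character; a minimal Python sketch:
print(bytes.fromhex("e2 82 ac").decode("utf-8"))  # €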
UTF-8 supports all 1,114,112 Unicode characters (from U+0000 to U+10FFFF), although not all code points are in use.
All characters can be encoded in UTF-8, but incorrectly encoded characters or old encodings such as Windows-1252 can cause problems.