UTF-8

What is UTF-8?

UTF-8 is a character encoding used to digitally store and exchange text. It encodes the Unicode character set and can represent virtually all of the world's written characters. Its efficient storage and wide adoption have made it the most widely used encoding on the Internet and in software applications.

UTF-8 is designed as a variable-length encoding, meaning that some characters take up fewer bytes than others. This makes it compatible with older systems that support ASCII, while allowing it to accommodate a wide range of special and international characters.

Why is UTF-8 important?

The emergence of Unicode and UTF-8 has made software and websites more globally accessible. Without a universal encoding such as UTF-8, systems would experience problems correctly representing different languages and symbols.

Some reasons why UTF-8 is the preferred choice for text encoding:
It is fully backward compatible with ASCII.
It stores common characters compactly, using only 1 byte for ASCII characters.
It can represent every Unicode character, covering virtually all written languages.
It is supported by practically all browsers, operating systems, databases, and programming languages.

The relationship between UTF-8 and Unicode

Unicode is a standard that assigns unique numeric values (code points) to characters from different languages and symbol sets. UTF-8 is a way of storing these code points in a computer-friendly format.

For example, the letter A has the Unicode code point U+0041 and is stored in UTF-8 as the single byte 0x41, while the euro sign € (U+20AC) is stored as the three bytes 0xE2 0x82 0xAC.

This makes UTF-8 a flexible and scalable solution for modern software and Web development.
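As a small illustration (plain Python, standard library only), the sketch below prints the Unicode code point and the UTF-8 byte sequence for a few characters:

# Show the Unicode code point and the UTF-8 byte sequence of a few characters.
for ch in ["A", "€", "😀"]:
    code_point = ord(ch)              # numeric Unicode code point
    utf8_bytes = ch.encode("utf-8")   # UTF-8 byte sequence
    print(f"{ch}  U+{code_point:04X}  ->  {utf8_bytes.hex(' ').upper()}")

Running this prints, for example, "A  U+0041  ->  41" and "€  U+20AC  ->  E2 82 AC", showing how one code point can map to one or several bytes.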

History of UTF-8

UTF-8 was developed in 1992 by Ken Thompson and Rob Pike, two engineers at Bell Labs. They designed this character encoding as a more efficient way to store and process Unicode characters, with a focus on compatibility with ASCII and saving space.

The original idea was to create a variable-length encoding that:
remained fully compatible with existing ASCII text,
used as few bytes as possible for the most common characters,
and could still represent every Unicode code point.

UTF-8 was first presented publicly in 1993. In 1996 it was included in the Unicode 2.0 standard and specified in RFC 2044. Later, in 2003, it was finalized in RFC 3629, which limited the valid range to the code points U+0000 through U+10FFFF (1,114,112 code points in total).

Why was UTF-8 needed?

Before the advent of Unicode and UTF-8, there were many character encodings such as ISO 8859-1 (Latin-1), Shift-JIS, and Windows-1252. This caused major compatibility problems in international communication and file exchange.

Problems with older encodings:
Each encoding covered only a limited set of languages or scripts.
The same byte values represented different characters in different encodings.
Exchanging files between systems with different encodings produced garbled text (mojibake).

Unicode offered a solution by introducing one universal character set. However, the first Unicode encodings such as UTF-16 and UTF-32 used 2 or 4 bytes per character, which was inefficient for English and other Latin-based texts.

How has UTF-8 evolved?

Since its introduction, UTF-8 has spread rapidly and become the dominant character encoding on the Internet and in software applications.

Key developments:
Adoption as the default encoding for HTML, XML, and JSON on the web.
Native support in modern operating systems, programming languages, and databases.
Growth to more than 95% of all websites (according to W3Techs), making it the de facto web standard.

Thanks to its wide support and efficiency, UTF-8 is the most widely used character encoding in the world.

How does UTF-8 work?

UTF-8 is a variable-length character encoding, meaning that some characters require fewer bytes than others. This keeps the encoding efficient while remaining compatible with ASCII.

The basic principle of variable-length encoding

Each Unicode code point is stored in UTF-8 as a sequence of one to four bytes.

Here is an overview of how characters are stored in UTF-8:
U+0000 to U+007F (ASCII): 1 byte
U+0080 to U+07FF: 2 bytes
U+0800 to U+FFFF: 3 bytes
U+10000 to U+10FFFF: 4 bytes

This means that commonly used characters such as letters and numbers take up little storage space, while rare symbols or non-Latin characters require more bytes.

Examples of UTF-8 encoding

Let's look at some Unicode characters and see how they are stored in UTF-8:
A (U+0041): 0x41 (1 byte)
€ (U+20AC): 0xE2 0x82 0xAC (3 bytes)
😀 (U+1F600): 0xF0 0x9F 0x98 0x80 (4 bytes)

As you can see, the letter A uses only 1 byte, the euro sign uses 3 bytes, and an emoji such as 😀 uses 4 bytes.
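If you want to check these byte sequences yourself, a minimal Python sketch (standard library only) can decode the raw bytes from the examples above back into characters:

# Decode the raw UTF-8 byte sequences from the examples above back into characters.
samples = [b"\x41", b"\xe2\x82\xac", b"\xf0\x9f\x98\x80"]
for raw in samples:
    print(raw.decode("utf-8"), "<-", raw.hex(" ").upper(), f"({len(raw)} byte(s))")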

ASCII compatibility of UTF-8

A major advantage of UTF-8 is that ASCII characters remain exactly the same. This means that a file with only ASCII characters in UTF-8 will be read the same by older systems that understand only ASCII.

For example, the ASCII string:

Hello

is stored the same in UTF-8 as it is in ASCII:

0x48 0x65 0x6C 0x6C 0x6F

But if you add a Unicode character, such as an é (U+00E9), the encoding will change to:

0x48 0x65 0x6C 0x6C 0xC3 0xA9

Here you can see that é is stored as 2 bytes (0xC3 0xA9), while the rest remains unchanged.
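A short Python sketch (standard library only) makes this visible: a pure-ASCII string encodes to exactly the ASCII byte values, and adding an é changes only the bytes for that one character:

# Pure ASCII text has identical bytes in ASCII and UTF-8.
print("Hello".encode("utf-8").hex(" ").upper())   # 48 65 6C 6C 6F
# Adding a non-ASCII character only affects that character's bytes.
print("Hellé".encode("utf-8").hex(" ").upper())   # 48 65 6C 6C C3 A9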

Overlong encodings

Overlong encodings are incorrect UTF-8 representations of characters that use unnecessarily many bytes.

For example, the ASCII character A is encoded correctly as 0x41 (1 byte). But in an overlong encoding, the same character could be stored as 11000001 10000001 (0xC1 0x81, 2 bytes), which is unnecessary and unsafe.
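Conforming decoders are required to reject such sequences. As an illustration, Python's built-in UTF-8 decoder refuses the overlong form of A:

# A strict UTF-8 decoder rejects the overlong 2-byte encoding of 'A'.
overlong_a = bytes([0b11000001, 0b10000001])  # 0xC1 0x81
try:
    overlong_a.decode("utf-8")
except UnicodeDecodeError as err:
    print("Rejected:", err)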

Why are overlong encodings a problem?

They allow the same character to be written in more than one way, which breaks the assumption that each character has exactly one valid byte sequence. Attackers have used overlong encodings to smuggle characters such as / or NUL past security filters that only check the shortest form. For this reason, the UTF-8 specification requires decoders to reject overlong sequences.

Error handling in UTF-8

What happens when an application encounters an invalid UTF-8 byte?

There are three possible approaches:
Reject the input and report an error (strict handling).
Replace the invalid bytes with the replacement character � (U+FFFD).
Ignore the invalid bytes and continue.

Example:

If a byte is missing from a 2-byte character, the software may replace it with � to indicate that the text is corrupted.
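In Python, for example, these strategies correspond to the errors argument of bytes.decode; the truncated byte string below is a made-up example of a missing continuation byte:

# b"Caf\xc3" ends in the first byte of a 2-byte sequence; the second byte is missing.
broken = b"Caf\xc3"
print(broken.decode("utf-8", errors="replace"))  # Caf� - replacement character
print(broken.decode("utf-8", errors="ignore"))   # Caf  - invalid bytes dropped
try:
    broken.decode("utf-8", errors="strict")      # default: raise an error
except UnicodeDecodeError as err:
    print("Strict decoding failed:", err)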

Surrogates and Byte Order Mark (BOM)

Surrogates are special code points (U+D800 to U+DFFF) used in UTF-16 to encode characters outside the Basic Multilingual Plane. In UTF-8, surrogate code points are invalid and must not appear in encoded text.

The Byte Order Mark (BOM) is an optional marker (U+FEFF, encoded as the bytes 0xEF 0xBB 0xBF) that some systems place at the beginning of a file to signal that the content is encoded in UTF-8. It is not required, and many tools omit it.
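As a small illustration in Python: the UTF-8 codec refuses to encode a lone surrogate, while the separate utf-8-sig codec adds the BOM when encoding and strips it when decoding:

# Lone surrogate code points cannot be encoded as UTF-8.
try:
    "\ud800".encode("utf-8")
except UnicodeEncodeError as err:
    print("Surrogate rejected:", err)

# The 'utf-8-sig' codec prepends the BOM (0xEF 0xBB 0xBF) when encoding
# and silently removes it again when decoding.
with_bom = "Hello".encode("utf-8-sig")
print(with_bom.hex(" ").upper())     # EF BB BF 48 65 6C 6C 6F
print(with_bom.decode("utf-8-sig"))  # Hello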

Difference between UTF-8 and other encodings

Several character encodings are available, but UTF-8 has set the standard because of its flexibility and broad support. Nevertheless, other encodings such as UTF-16 and UTF-32 are still used in specific situations. Let's look at the main differences.

UTF-8 vs. UTF-16 vs. UTF-32

UTF-8: 1 to 4 bytes per character, backward compatible with ASCII, dominant on the web.
UTF-16: 2 or 4 bytes per character, not ASCII compatible, used internally by Windows, Java, and JavaScript.
UTF-32: always 4 bytes per character, simple to index but wasteful in storage.
When do you use which encoding?

UTF-8 is the best choice for web pages, files, APIs, and data exchange. UTF-16 mainly appears where a platform already uses it internally, such as the Windows API, Java, and JavaScript strings. UTF-32 is only worthwhile when a fixed width per code point makes processing easier and storage size does not matter.
Comparison with older encodings such as ISO 8859-1 and Windows-1252

Before Unicode became popular, many computers and systems used locale-specific encodings, such as ISO 8859-1 (Latin-1) and Windows-1252.

Why is UTF-8 better than older encodings?

Unlike these legacy code pages, UTF-8 covers every language with a single encoding, so text no longer has to be converted between regional character sets and characters outside a given code page can no longer be lost.

Advantages of UTF-8

UTF-8 has emerged as the most widely used character encoding in the world, mainly due to its versatility and efficiency. Here are the main advantages of UTF-8:

Universal compatibility

UTF-8 supports all Unicode characters, meaning it can be used for any script in the world. This makes it ideal for international communications, websites and software.

Compatibility with ASCII

One of the biggest advantages of UTF-8 is that all ASCII characters (0-127) keep exactly the same byte values. This means that:
Every existing ASCII file is already valid UTF-8.
Software that only understands ASCII can still read UTF-8 text that contains only ASCII characters.
No conversion step is needed when upgrading old ASCII-based systems to UTF-8.

For example, the string:

Hello

is stored the same in both ASCII and UTF-8:

0x48 0x65 0x6C 0x6C 0x6F

This avoids compatibility issues with older systems.

Efficient storage and processing

Because UTF-8 uses variable length, it takes up less space for commonly used characters than other Unicode encodings such as UTF-16 or UTF-32.

Storage size comparison:

ASCII characters take up only 1 byte.

Chinese, Japanese, and Arabic characters may require 2 to 4 bytes.

UTF-16 and UTF-32 always take up more space for English-language texts.
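To see the trade-off in practice, here is a minimal Python sketch comparing the encoded size of an English and a Japanese sentence (the -le codec variants are used so that no byte order mark is added):

english = "Hello, world!"
japanese = "こんにちは"  # 5 characters

for label, text in [("English", english), ("Japanese", japanese)]:
    print(label,
          "UTF-8:", len(text.encode("utf-8")), "bytes,",
          "UTF-16:", len(text.encode("utf-16-le")), "bytes,",
          "UTF-32:", len(text.encode("utf-32-le")), "bytes")

The English sentence is smallest in UTF-8 (13 bytes versus 26 in UTF-16), while the Japanese one is actually smaller in UTF-16, which is exactly the trade-off described in this section and the next.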

Standard encoding on the web

According to W3Techs, more than 95% of all websites today are encoded in UTF-8. This is because:
HTML5 and later treat UTF-8 as the default and recommended encoding for web documents.
All modern browsers read and write UTF-8 by default.
One encoding works for every language, so multilingual sites need no special handling.

Support in databases and programming languages

Most databases and programming languages support UTF-8 by default:
Databases such as MySQL (utf8mb4), PostgreSQL, and SQLite store text as UTF-8.
Languages such as Python, JavaScript, PHP, and Go read and write UTF-8 out of the box.

The broad support makes UTF-8 the safest choice for text storage and data exchange.

Disadvantages of UTF-8

Although UTF-8 has become the standard for character encoding, it also has some disadvantages, especially in specific situations. Here are the main limitations:

Higher storage space for certain characters

Although UTF-8 is efficient for ASCII characters (1 byte per character), some Unicode characters can take up more space.

Comparison of characters in UTF-8 vs. other encodings:
A Latin letter such as A: 1 byte in UTF-8, 2 bytes in UTF-16.
The euro sign €: 3 bytes in UTF-8, 2 bytes in UTF-16.
Chinese or Japanese characters: usually 3 bytes in UTF-8, 2 bytes in UTF-16.
An emoji such as 😀: 4 bytes in both UTF-8 and UTF-16.

Processing complexity

Because characters have variable length (1 to 4 bytes), it can be more difficult to work with UTF-8-encoded text in programming languages and databases.

Examples of complications:
The length of a string in characters differs from its length in bytes.
You cannot jump to the n-th character with a simple byte offset.
Cutting a byte string at an arbitrary position can split a multi-byte character in half and produce invalid UTF-8.
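The following minimal Python sketch shows two of these pitfalls: character count versus byte count, and cutting a byte string in the middle of a multi-byte character:

text = "héllo"
encoded = text.encode("utf-8")

print(len(text))      # 5 characters
print(len(encoded))   # 6 bytes, because é takes 2 bytes

# Slicing the bytes after 2 bytes cuts the é in half and produces invalid UTF-8.
try:
    encoded[:2].decode("utf-8")
except UnicodeDecodeError as err:
    print("Invalid slice:", err)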

Problems with older systems

Although UTF-8 is the standard in modern systems, old programs and devices may still expect ISO 8859-1 or Windows-1252. This can lead to:
Garbled text (mojibake), for example é being displayed as Ã© when UTF-8 bytes are read as Windows-1252.
Data loss or question marks when text is converted to an encoding that cannot represent all characters.

Overhead with binary-oriented formats

Some binary-oriented formats and protocols do not work as well with UTF-8. For example:
Fixed-width record formats that reserve a set number of bytes per character must allow for up to 4 bytes per character.
Length fields expressed in characters no longer match the number of bytes actually stored.

No fixed byte size per character

Unlike UTF-32, where each character is always 4 bytes, the size in UTF-8 varies. This can lead to:
Slower random access, because finding the n-th character requires scanning from the start of the string.
More work when calculating string lengths, substrings, or column widths.

The advantages outweigh the disadvantages

Although UTF-8 has some disadvantages, its advantages clearly outweigh them: it is ASCII-compatible, compact for the most common characters, and supported virtually everywhere.

For most applications, UTF-8 is the best choice. Other encodings such as UTF-16 and UTF-32 are used only in very specific cases.

Use of UTF-8 in practice

UTF-8 is used in virtually all modern technologies. From websites and databases to programming languages and operating systems, UTF-8 is the standard encoding because of its flexibility and broad support. Here are some of the main applications.

Use of UTF-8 in web development

The Web runs on UTF-8. HTML, CSS, and JavaScript files are UTF-8 encoded by default, and modern browsers expect this encoding.

How do you set UTF-8 in HTML?

To ensure that a Web page correctly uses UTF-8, add the following meta tag in the <head> section of your HTML document:

<meta charset="UTF-8">

This ensures that characters are displayed correctly, regardless of the language.

Why is this important?

Without an explicit charset declaration, the browser has to guess the encoding, and a wrong guess turns accented characters, symbols, and emojis into garbled text.

Use of UTF-8 in databases

Modern databases such as MySQL, PostgreSQL, and SQLite support UTF-8 as a standard.

Why is UTF-8 important in databases?

It allows names, addresses, and content in any language to be stored in the same column without conversion, and it prevents characters from being lost or corrupted when data is exchanged between applications.

Configuring MySQL for UTF-8

When creating a database or table in MySQL, you can set UTF-8 as the default:

CREATE DATABASE my_database CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

Why utf8mb4 and not utf8?

MySQL's older utf8 implementation does not support 4-byte Unicode characters such as emojis (😀). Always use utf8mb4 for full Unicode support.
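The underlying reason is easy to verify: an emoji needs 4 UTF-8 bytes, one more than the legacy utf8 character set can store, as this small Python check shows:

# Emojis need 4 UTF-8 bytes; MySQL's legacy utf8 stores at most 3 bytes per character.
for ch in ["é", "€", "😀"]:
    print(ch, len(ch.encode("utf-8")), "bytes")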

Use of UTF-8 in programming languages

Most modern programming languages support UTF-8 as a standard.

Example: UTF-8 strings in Python

text = "Hello, world! 🌍"
print(text.encode("utf-8"))  # Output: b'Hello, world! \xf0\x9f\x8c\x8d'

Here the Unicode emoji 🌍 is correctly converted to UTF-8 bytes.

Using UTF-8 in operating systems

Operating systems such as Windows, macOS, and Linux support UTF-8 for file names, terminal display and applications.

Windows and UTF-8

Older Windows systems used Windows-1252 or UTF-16, but modern versions fully support UTF-8 in cmd and PowerShell.

To enable UTF-8 in Windows terminal:

chcp 65001

This switches the terminal to UTF-8 mode, displaying special characters correctly.

Use of UTF-8 in APIs and data exchange

Almost all modern APIs, JSON files, and XML files use UTF-8 as the standard encoding.

Example: JSON with UTF-8

{
    "name": "Jörg Müller",
    "city": "München",
    "emoji": "😀"
}

The JSON standard (RFC 8259) requires UTF-8 for data exchanged between systems, making it easy to exchange data globally without character corruption.
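As an illustration, Python's json module writes UTF-8 directly; with ensure_ascii=False the characters stay readable instead of being \u-escaped (the file name is just an example):

import json

data = {"name": "Jörg Müller", "city": "München", "emoji": "😀"}

# ensure_ascii=False keeps the characters as-is instead of escaping them.
with open("person.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=4)

print(json.dumps(data, ensure_ascii=False))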

UTF-8 is everywhere

UTF-8 is the most versatile and efficient encoding for modern technologies and remains the best choice for any application.

Standards and support for UTF-8

UTF-8 has not just become popular; it is an officially recognized and widely supported standard within various industries. From international organizations to programming languages and operating systems, UTF-8 is used almost everywhere.

Standards and specifications

UTF-8 is officially defined in several standards and specifications:
The Unicode Standard, maintained by the Unicode Consortium.
ISO/IEC 10646, the international standard for the Universal Coded Character Set.
RFC 3629, which specifies UTF-8 for use on the Internet.
The WHATWG Encoding Standard, which makes UTF-8 the encoding for the web.

These standards ensure that UTF-8 is consistently used worldwide in software, hardware, and network protocols.

Support in programming languages and frameworks

Virtually all modern programming languages support UTF-8 directly or provide native support for Unicode.

Many frameworks such as Django, React, Angular, and Node.js use UTF-8 by default to ensure compatibility.

Support in databases

UTF-8 is the recommended encoding for text in databases because it is compatible with multiple languages and prevents character loss.

Note: in MySQL, choose the utf8mb4 character set rather than the older utf8 (utf8mb3), so that 4-byte characters such as emojis can be stored.

Support in operating systems

Operating systems support UTF-8 to correctly process file names, text input and applications.

Windows users sometimes need to manually switch to UTF-8, for example in the terminal with:

chcp 65001

However, since Windows 10, UTF-8 is better supported by default.

Support in web browsers

All modern web browsers support UTF-8 and use it by default for web pages.

Web pages without a specific encoding setting are usually interpreted as UTF-8 by browsers, indicating how universal the standard is.

Support in network protocols and files

Many network and file formats support UTF-8 to ensure global compatibility.

Important for developers: always declare the encoding explicitly, for example with charset=utf-8 in the HTTP Content-Type header, the <meta charset="UTF-8"> tag in HTML, and the connection or column settings of your database, so that receiving systems never have to guess.

UTF-8 is the global standard

UTF-8 is the most widely used character encoding in the world. Its flexibility, efficiency, and universal compatibility have made it the standard for Web development, databases, programming languages and operating systems.

Due to its universal adoption and efficient storage, UTF-8 remains the best choice for text processing, storage, and data exchange.

Frequently Asked Questions
What is UTF-8 encoding?

UTF-8 is a character encoding that can store all Unicode characters using 1 to 4 bytes per character. It is the standard encoding for the web and modern software.


What is the UTF-8 code?

A character's UTF-8 code is its byte sequence, usually written in binary or hexadecimal. For example, the letter A has the UTF-8 code 0x41, and € has 0xE2 0x82 0xAC.


How many UTF-8 characters are there?

UTF-8 can represent all 1,114,112 Unicode code points (U+0000 to U+10FFFF, excluding the surrogate range), although only a fraction of them are currently assigned to characters.


What are non-UTF-8 characters?

Strictly speaking, every Unicode character can be encoded in UTF-8. In practice, "non-UTF-8 characters" usually refers to bytes that are not valid UTF-8, for example text stored in an older encoding such as Windows-1252, which can cause decoding errors or garbled output.

