Posted 2025-05-26 | Tutorial | 3 min read

Unicode and Character Sets: The Essential Guide for Beginners in Web Engineering

The Problem: Computers Only Understand Numbers

Computers don’t understand letters, symbols, or emojis. They only understand numbers (0s and 1s). So how does your computer display text? It uses a system that assigns each character a unique number.

Before Unicode: The Mess

In the early days, different systems used different number-to-character mappings:

ASCII: Used the numbers 0-127 for basic English letters, digits, and symbols
Various code pages: Each country or language had its own system for characters beyond ASCII

This created chaos. A document written in Russian might display as gibberish on a computer set up for Arabic text.
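As a rough illustration of that failure, the Python sketch below encodes a Russian word with Windows-1251 and then decodes the same bytes with Windows-1256 (an Arabic code page). The two code pages and the sample word are choices made for this example only.

```python
# Encode Russian text with a Cyrillic code page (Windows-1251).
original = "Привет"
raw = original.encode("cp1251")

# A machine set up for Arabic might decode the same bytes as Windows-1256.
# errors="replace" guards against any byte the target code page leaves undefined.
garbled = raw.decode("cp1256", errors="replace")

# The bytes never changed; only their interpretation did.
print(len(raw), "bytes; decoded text matches original:", garbled == original)

# Decoding with the code page that was used for encoding recovers the text.
restored = raw.decode("cp1251")
```

Nothing in the bytes themselves records which code page produced them, which is why the receiving system had to guess.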

Enter Unicode: The Solution

Unicode is a universal standard that assigns a unique number (called a “code point”) to every character it covers: the letters of virtually every writing system in use today, plus emojis, symbols, and more.

Examples:

Letter “A” = U+0041 (65 in decimal)
Greek letter “α” = U+03B1 (945 in decimal)
Emoji “😀” = U+1F600 (128512 in decimal)
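The examples above can be checked with Python's built-in ord() and chr(). This is a small sketch; the helper name describe is made up for illustration.

```python
def describe(ch: str) -> str:
    """Return the code point of a single character as 'U+XXXX (decimal)'."""
    cp = ord(ch)  # ord() maps a one-character string to its code point
    return f"U+{cp:04X} ({cp} in decimal)"

latin_a = describe("A")
greek_alpha = describe("α")
grinning_face = describe("\U0001F600")  # the 😀 emoji, written as an escape

for line in (latin_a, greek_alpha, grinning_face):
    print(line)

# chr() goes the other way: from a code point back to the character.
assert chr(0x0041) == "A"
```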

Encoding: How Unicode Numbers Become Bytes

Unicode defines what number each character gets, but computers need to store these numbers as bytes. This is where encoding comes in.

UTF-8: The Most Common Encoding

UTF-8 is the dominant way to encode Unicode text:

Uses 1-4 bytes per code point
ASCII characters, including English letters, use 1 byte (efficient for English text, and identical to plain ASCII)
Characters from other scripts use 2-4 bytes
Can represent all Unicode characters
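To see the variable width in practice, the sketch below encodes four sample characters and counts the resulting bytes. The sample characters and the helper name utf8_length are chosen for illustration.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes a string occupies once encoded as UTF-8."""
    return len(ch.encode("utf-8"))

sizes = {
    "A": utf8_length("A"),                    # ASCII letter
    "é": utf8_length("\u00e9"),               # accented Latin letter
    "王": utf8_length("\u738b"),               # CJK ideograph
    "emoji": utf8_length("\U0001F600"),       # 😀
}

for label, size in sizes.items():
    # ascii() keeps the output printable on any terminal
    print(ascii(label), "->", size, "byte(s)")
```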

Other Encodings

UTF-16: Uses 2 or 4 bytes per code point
UTF-32: Uses exactly 4 bytes per code point
Legacy encodings: ISO-8859-1, Windows-1252, etc. (avoid these for new content; each one covers only a small set of characters)
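A quick comparison of these encodings, as a sketch. The "-le" codec variants are used here only so that Python does not prepend a byte order mark, which would inflate the counts.

```python
letter = "A"
emoji = "\U0001F600"  # 😀, a code point outside the Basic Multilingual Plane

utf16_sizes = (len(letter.encode("utf-16-le")), len(emoji.encode("utf-16-le")))
utf32_sizes = (len(letter.encode("utf-32-le")), len(emoji.encode("utf-32-le")))
print("UTF-16 bytes:", utf16_sizes)
print("UTF-32 bytes:", utf32_sizes)

# A legacy single-byte encoding simply cannot represent most characters.
try:
    "\u738b".encode("iso-8859-1")  # 王
    legacy_ok = True
except UnicodeEncodeError:
    legacy_ok = False
print("ISO-8859-1 can encode the CJK sample:", legacy_ok)
```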

Critical Rules for Web Developers

1. Always Specify Encoding in HTML

<meta charset="UTF-8">

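Put this in your HTML <head> section, as early as possible: browsers only look for it within the first 1024 bytes of the document. Without it (or a charset in the HTTP Content-Type header), browsers have to guess the encoding and can guess wrong.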

2. Use UTF-8 Everywhere

Save your HTML, CSS, and JavaScript files as UTF-8
Configure your database to use UTF-8
Set your web server to serve UTF-8
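On the server side, "serve UTF-8" mostly means encoding the response body as UTF-8 and saying so in the Content-Type header. The sketch below assumes a Python WSGI application; the function name app and the page content are hypothetical, and other server stacks have their own equivalent setting.

```python
def app(environ, start_response):
    """Minimal WSGI app that declares and sends UTF-8."""
    html = '<!doctype html><meta charset="UTF-8"><p>José 王伟</p>'
    body = html.encode("utf-8")  # encode explicitly instead of relying on defaults
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),  # tells the browser how to decode
        ("Content-Length", str(len(body))),            # length in bytes, not characters
    ]
    start_response("200 OK", headers)
    return [body]
```

To try it locally, the standard library's wsgiref.simple_server.make_server("", 8000, app).serve_forever() can host this function.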

3. Handle Form Submissions Properly

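When users submit forms, the browser should send the data as UTF-8 and your server should decode it as UTF-8. Browsers default to the page's own encoding, so a UTF-8 page already submits UTF-8; the accept-charset attribute makes that explicit:

<form accept-charset="UTF-8">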
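On the receiving end, decoding a URL-encoded form body as UTF-8 might look like the sketch below. It uses Python's urllib.parse; the field names and values are made up, and most web frameworks do this step for you.

```python
from urllib.parse import parse_qs, urlencode

# Simulate what a browser sends for a UTF-8 form submission.
raw_body = urlencode({"name": "José", "city": "王伟"}, encoding="utf-8")
print(raw_body)  # percent-encoded, ASCII-only text

# Decode the percent-escapes as UTF-8. errors="strict" makes invalid
# byte sequences raise instead of being silently replaced.
fields = parse_qs(raw_body, encoding="utf-8", errors="strict")
name = fields["name"][0]
city = fields["city"][0]
```

Using errors="strict" is a design choice for this sketch: it surfaces badly encoded input early rather than storing replacement characters.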

4. Database Storage

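Configure your database columns to store UTF-8. In MySQL, use utf8mb4 rather than utf8: the older utf8 character set stores at most 3 bytes per character and cannot hold emojis.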

-- MySQL example
CREATE TABLE users (
    name VARCHAR(255) CHARACTER SET utf8mb4
);

Common Mistakes and Solutions

Mistake: “Mojibake” (Garbled Text)

Problem: Text displays as “Ã¡Ã©ÃÃ³Ãº” instead of “áéíóú”
Cause: Text encoded as UTF-8 but decoded as a single-byte legacy encoding such as Windows-1252 or ISO-8859-1
Solution: Ensure consistent UTF-8 usage throughout your application
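The mechanism is easy to reproduce. This sketch encodes the sample text as UTF-8, decodes it as ISO-8859-1 (chosen here because it accepts every byte value), and then reverses the mistake.

```python
original = "\u00e1\u00e9\u00ed\u00f3\u00fa"  # "áéíóú"

# Wrong decoding: each 2-byte UTF-8 sequence turns into two Latin-1 characters.
garbled = original.encode("utf-8").decode("latin-1")
print(len(original), "characters became", len(garbled))

# If nothing else has altered the text, re-encoding with the wrongly
# assumed encoding and decoding as UTF-8 recovers the original.
repaired = garbled.encode("latin-1").decode("utf-8")
```

The repair step only works when the garbled text has not been modified or re-saved in between, so it is a diagnostic aid rather than a substitute for consistent UTF-8.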

Mistake: Question Marks or Empty Boxes

Problem: Characters display as “?” or “□”
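Cause: The font doesn’t support those characters, or the text passed through an encoding that cannot represent them
Solution: Use web fonts that support international characters, and check that every step of your pipeline uses UTF-8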

Mistake: Truncated Text

Problem: Text gets cut off unexpectedly
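Cause: Limiting or slicing text by bytes instead of characters, which can cut a multi-byte character in half
Solution: Use string functions that understand Unicode, and size storage limits with multi-byte characters in mind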
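When a limit really is expressed in bytes (a database column, for instance), one possible approach is to cut the encoded bytes and discard any incomplete character at the end. The helper truncate_utf8 below is a hypothetical name for this sketch.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut text so its UTF-8 form fits in max_bytes without splitting a character."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # Slicing bytes may leave a partial multi-byte sequence at the end;
    # errors="ignore" drops that incomplete tail instead of raising.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")

name = "Jos\u00e9"                   # "José": 4 characters, 5 bytes in UTF-8
naive = name.encode("utf-8")[:4]     # ends in the middle of "é"
safe = truncate_utf8(name, 4)        # keeps only whole characters
print(naive, ascii(safe))
```

This keeps code points intact but can still separate a base character from a combining mark or split a multi-part emoji, so treat it as a starting point.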

Testing Your Application

Test with international characters: Try names like “José” and “王伟”
Test with emojis: Modern applications should handle “👨‍💻” properly
Test copy-paste: Users should be able to paste text from other applications
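A starting point for such checks, as a sketch: the test strings above written out as code points, with a round trip through UTF-8. The dictionary keys and the helper name round_trips are made up for this example; a real test would send these strings through your forms and database rather than a local encode/decode.

```python
samples = {
    "jose": "Jos\u00e9",                            # José
    "wang_wei": "\u738b\u4f1f",                     # 王伟
    "technologist": "\U0001F468\u200D\U0001F4BB",   # 👨‍💻 = man + zero-width joiner + laptop
}

def round_trips(text: str) -> bool:
    """True if text survives encoding to UTF-8 and decoding back."""
    return text.encode("utf-8").decode("utf-8") == text

# For each sample: (code points, UTF-8 bytes, survives round trip)
report = {
    key: (len(value), len(value.encode("utf-8")), round_trips(value))
    for key, value in samples.items()
}
for key, row in report.items():
    print(key, row)
```

Note that the emoji is one visible symbol but three code points, which is why it makes a good test case for length limits and string slicing.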

The Bottom Line

Character encoding in modern web development is simple if you follow one rule: Use UTF-8 everywhere. Set it in your HTML meta tags, save your files as UTF-8, configure your database for UTF-8, and your international users will thank you.

Most character encoding problems disappear if you consistently use UTF-8 from the start. Don’t mix encodings, don’t use legacy character sets, and always be explicit about your encoding choices.

https://mehamasum.github.io/blog/2025/5/unicode-for-beginners/

Author

Mehedi Hasan Masum

Posted on

2025-05-26

Licensed under

CC BY-NC-SA 4.0

#ai edited unicode
