Unicode and Character Sets: The Essential Guide for Beginners in Web Engineering
The Problem: Computers Only Understand Numbers
Computers don’t understand letters, symbols, or emojis. They only understand numbers (0s and 1s). So how does your computer display text? It uses a system that assigns each character a unique number.
Before Unicode: The Mess
In the early days, different systems used different number-to-character mappings:
- ASCII: Used the numbers 0-127 for basic English letters, digits, and symbols
- Various code pages: Each country or language had its own system for characters beyond ASCII
This created chaos. A document written in Russian might display as gibberish on a computer set up for Arabic text.
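To make the code-page problem concrete, here is a small Python sketch. The sample word and the two code pages (Windows-1251 for Cyrillic, Windows-1256 for Arabic) are illustrative choices, not taken from any particular system.

```python
# The same bytes mean different things under different legacy code pages.
russian = "Привет"                      # "Hello" in Russian
raw = russian.encode("cp1251")          # Windows-1251: one byte per Cyrillic letter

# A machine set up for Windows-1256 (Arabic) reads the very same bytes differently.
# errors="replace" guards against any byte the target code page leaves undefined.
misread = raw.decode("cp1256", errors="replace")

print(raw.hex(" "))                     # the stored bytes never change
print(misread)                          # same bytes, unrelated characters
print(raw.decode("cp1251") == russian)  # only the right code page recovers the text
```

Nothing in the bytes themselves says which code page produced them, which is why the receiving system had to guess.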
Enter Unicode: The Solution
Unicode is a universal standard that aims to assign a unique number (called a “code point”) to every character in every writing system, plus emojis, symbols, and more.
Examples:
- Letter “A” = U+0041 (65 in decimal)
- Greek letter “α” = U+03B1 (945 in decimal)
- Emoji “😀” = U+1F600 (128512 in decimal)
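The code points above can be checked with Python's built-in `ord` and `chr`. The helper name `code_point_label` is made up for this sketch.

```python
def code_point_label(ch: str) -> str:
    """Format a single character's code point in the usual U+XXXX notation."""
    return f"U+{ord(ch):04X}"

for ch in ["A", "α", "😀"]:
    # ord() returns the code point as an integer; chr() goes the other way.
    print(ch, code_point_label(ch), ord(ch))

print(chr(0x03B1))  # turns the number back into the character
```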
Encoding: How Unicode Numbers Become Bytes
Unicode defines what number each character gets, but computers need to store these numbers as bytes. This is where encoding comes in.
UTF-8: The Most Common Encoding
UTF-8 is the dominant way to encode Unicode text:
- Uses 1-4 bytes per character
- ASCII characters, including English letters, use 1 byte (efficient for English text)
- Characters from other scripts use 2-4 bytes
- Can represent all Unicode characters
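A quick way to see the 1-4 byte range is to encode a few characters and count the bytes. This is a minimal sketch; `utf8_len` is a hypothetical helper and the sample characters are arbitrary.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes the text occupies once encoded as UTF-8."""
    return len(ch.encode("utf-8"))

for ch in ["A", "α", "王", "😀"]:
    encoded = ch.encode("utf-8")
    # Show the character, its byte count, and the raw bytes in hex.
    print(ch, utf8_len(ch), encoded.hex(" "))
```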
Other Encodings
- UTF-16: Uses 2 or 4 bytes per character
- UTF-32: Uses exactly 4 bytes per character
- Legacy encodings: ISO-8859-1, Windows-1252, etc. (avoid these!)
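The sketch below compares how much space the same text takes in each Unicode encoding, and shows why a legacy encoding is a poor fit for international text. The helper names are hypothetical. The `-le` codec variants are used so that Python does not prepend a byte order mark and inflate the counts.

```python
def encoded_sizes(text: str) -> dict:
    """Byte count of the text in each Unicode encoding (no byte order mark)."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

def can_encode(text: str, encoding: str) -> bool:
    """True if every character in the text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

print(encoded_sizes("A"))               # the byte count differs in every encoding
print(encoded_sizes("😀"))              # outside the 16-bit range, all three need 4 bytes
print(can_encode("α", "iso-8859-1"))    # Greek is not part of ISO-8859-1
```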
Critical Rules for Web Developers
1. Always Specify Encoding in HTML
<meta charset="UTF-8">
Put this in your HTML <head> section, within the first 1024 bytes of the document. Without it, browsers guess the encoding and often guess wrong.
2. Use UTF-8 Everywhere
- Save your HTML, CSS, and JavaScript files as UTF-8
- Configure your database to use UTF-8
- Set your web server to serve UTF-8, for example by sending the header Content-Type: text/html; charset=utf-8
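On the server side, "serving UTF-8" comes down to two things: encoding the response body as UTF-8 and saying so in the Content-Type header. Below is a minimal sketch using Python's standard-library WSGI server. The function names, page content, and port are assumptions for illustration, and a production server would normally be configured rather than hand-written.

```python
from wsgiref.simple_server import make_server

PAGE = '<!doctype html><html><head><meta charset="UTF-8"></head><body>José, 王伟, 😀</body></html>'

def build_response():
    """Return (headers, body) with the body encoded as UTF-8 and declared as such."""
    body = PAGE.encode("utf-8")
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),  # length in bytes, not characters
    ]
    return headers, body

def app(environ, start_response):
    """Minimal WSGI application that serves the page above."""
    headers, body = build_response()
    start_response("200 OK", headers)
    return [body]

def serve(port: int = 8000):
    """Run a local development server (not called automatically)."""
    with make_server("127.0.0.1", port, app) as server:
        server.serve_forever()
```

Calling `serve()` starts the server locally. Note that Content-Length is computed from the encoded bytes, which is larger than the character count whenever the page contains non-ASCII text.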
3. Handle Form Submissions Properly
When users submit forms, make sure your server interprets the data as UTF-8:
<form accept-charset="UTF-8">
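On the receiving end, a URL-encoded form body arrives as ASCII text with percent-escaped bytes, and the server has to decide which encoding those bytes are in. The sketch below uses Python's standard `urllib.parse`. The function name `parse_form_body` and the field names are made up for illustration, and most web frameworks do this step for you.

```python
from urllib.parse import parse_qs, urlencode

def parse_form_body(raw_body: bytes) -> dict:
    """Decode an application/x-www-form-urlencoded body, treating escapes as UTF-8."""
    query = raw_body.decode("ascii")  # percent-encoded bodies are plain ASCII on the wire
    # errors="strict" raises instead of silently accepting bytes that are not valid UTF-8.
    return parse_qs(query, encoding="utf-8", errors="strict")

# Simulate what a browser would send for a UTF-8 form.
body = urlencode({"name": "José", "city": "北京"}).encode("ascii")
print(body)
print(parse_form_body(body))
```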
4. Database Storage
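Configure your database columns to store UTF-8. In MySQL, choose the utf8mb4 character set: the older utf8 alias (utf8mb3) stores at most 3 bytes per character and cannot hold emoji.
-- MySQL example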
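Whatever database you use, a round-trip check is a cheap way to confirm that the storage layer preserves non-ASCII text. The sketch below uses an in-memory SQLite database purely as a stand-in, because it ships with Python. It does not exercise MySQL configuration, and the table and column names are invented.

```python
import sqlite3
from contextlib import closing

def round_trip(text: str) -> str:
    """Store the text in a throwaway table and read it back."""
    with closing(sqlite3.connect(":memory:")) as conn:
        conn.execute("CREATE TABLE users (name TEXT)")
        # Parameter binding passes the string through without manual encoding.
        conn.execute("INSERT INTO users (name) VALUES (?)", (text,))
        (stored,) = conn.execute("SELECT name FROM users").fetchone()
        return stored

sample = "王伟 😀"
print(round_trip(sample) == sample)
```

Running the same kind of check against your real database, with an emoji in the sample, is a quick way to catch a column that is limited to 3-byte characters.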
Common Mistakes and Solutions
Mistake: “Mojibake” (Garbled Text)
Problem: Text displays as “Ã¡Ã©Ã­Ã³Ãº” instead of “áéíóú”
Cause: Text encoded as UTF-8 but decoded as a single-byte legacy encoding such as Windows-1252 or ISO-8859-1
Solution: Ensure consistent UTF-8 usage throughout your application
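Mojibake is easy to reproduce, which helps when diagnosing it. The sketch below encodes text as UTF-8 and then decodes it as ISO-8859-1, chosen here as a typical wrong guess. The repair step only works when the wrong decoding was lossless, so treat it as a diagnostic trick rather than a general fix.

```python
original = "áéíóú"
utf8_bytes = original.encode("utf-8")            # two bytes per accented letter

# Wrong: each byte is read as its own ISO-8859-1 character.
garbled = utf8_bytes.decode("iso-8859-1")
print(garbled)

# Undo the wrong step, then decode with the right encoding.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
print(repaired == original)
```

Every character in the garbled string corresponds to a single byte of the original, which is why the text roughly doubles in length.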
Mistake: Question Marks or Empty Boxes
Problem: Characters display as “?” or “□”
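Cause: The font has no glyph for those characters (“□”), or the text was converted to an encoding that cannot represent them (“?”)
Solution: Use web fonts that support international characters, and check that the text is stored and served as UTF-8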
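The encoding-related half of this problem can be reproduced in a few lines. This sketch shows where a literal “?” comes from (a lossy conversion to a legacy encoding) and where the replacement character U+FFFD comes from (decoding bytes that are not valid UTF-8). The sample strings are arbitrary, and font problems are a separate issue that code like this cannot detect.

```python
name = "王伟"

# Encoding into a charset that lacks these characters: each becomes "?".
lossy = name.encode("iso-8859-1", errors="replace")
print(lossy)

# Decoding ISO-8859-1 bytes as UTF-8: the invalid byte becomes U+FFFD.
latin1_bytes = "José".encode("iso-8859-1")
bad = latin1_bytes.decode("utf-8", errors="replace")
print(bad)
```

Once either substitution has happened the original characters are gone, so the fix has to be made upstream of where the data was converted.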
Mistake: Truncated Text
Problem: Text gets cut off unexpectedly
Cause: Counting bytes instead of characters
Solution: Use proper string functions that understand Unicode
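When a limit really is expressed in bytes (a fixed-size column, for instance), one option is to cut at the byte limit and then drop any partial character left at the end. This is a sketch under that assumption. `truncate_utf8` is a hypothetical helper, and it works on code points, so it can still split a multi-code-point emoji.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Shorten text to at most max_bytes of UTF-8 without leaving a broken character."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # The slice may end mid-character; errors="ignore" drops that incomplete tail.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")

print(len("José"), len("José".encode("utf-8")))  # characters vs. bytes
print(truncate_utf8("José", 4))                  # the 2-byte "é" does not fit
```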
Testing Your Application
- Test with international characters: Try names like “José” and “王伟”
- Test with emojis: Modern applications should handle “👨‍💻” (one visible emoji made of three code points joined by a zero-width joiner) properly
- Test copy-paste: Users should be able to paste text from other applications
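The test strings above are useful because they stress different layers: accented Latin, CJK, and an emoji. The sketch below measures each one three ways. It assumes the emoji in the list is the "man technologist" sequence (man, zero-width joiner, laptop), written with escapes so the joiner cannot be lost. The `measure` helper is made up for illustration.

```python
def measure(text: str) -> dict:
    """Report the sizes that different layers of a web stack may count."""
    return {
        "code_points": len(text),                           # Python's len()
        "utf8_bytes": len(text.encode("utf-8")),            # storage and network size
        "utf16_units": len(text.encode("utf-16-le")) // 2,  # what JavaScript's .length reports
    }

technologist = "\U0001F468\u200D\U0001F4BB"  # man + zero-width joiner + laptop

for sample in ["José", "王伟", technologist]:
    print(sample, measure(sample))
```

A length limit, a database column, and a front-end validator can each count a different one of these numbers, so a test suite should include at least one string where they disagree.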
The Bottom Line
Modern web development is simple if you follow one rule: Use UTF-8 everywhere. Set it in your HTML meta tags, save your files as UTF-8, configure your database for UTF-8, and your international users will thank you.
The days of character encoding problems are over if you consistently use UTF-8 from the start. Don’t mix encodings, don’t use legacy character sets, and always be explicit about your encoding choices.
https://mehamasum.github.io/blog/2025/5/unicode-for-beginners/