Unicode and Character Sets: The Essential Guide for Beginners in Web Engineering

The Problem: Computers Only Understand Numbers

Computers don’t understand letters, symbols, or emojis. They only understand numbers (0s and 1s). So how does your computer display text? It uses a system that assigns each character a unique number.

Before Unicode: The Mess

In the early days, different systems used different number-to-character mappings:

  • ASCII: Used numbers 0-127 for basic English letters, numbers, and symbols
  • Various code pages: Each country or language had its own system for characters beyond ASCII

This created chaos. A document written in Russian might display as gibberish on a computer set up for Arabic text.
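The mismatch is easy to reproduce. The sketch below is only an illustration: it assumes Python, and it picks Windows-1251 (Cyrillic) and Windows-1256 (Arabic) as two representative code pages, since the text above does not name specific ones. It writes a Russian word with one table and reads the same bytes back with the other.

```python
# Illustrative assumption: cp1251 stands in for "a Russian system" and
# cp1256 for "a computer set up for Arabic text".
original = "Привет"

# The Russian system stores each letter as one byte using its own table.
raw_bytes = original.encode("cp1251")

# The Arabic system looks the same bytes up in a different table.
misread = raw_bytes.decode("cp1256", errors="replace")

# Reading with the table that wrote the bytes recovers the text.
recovered = raw_bytes.decode("cp1251")

print(len(raw_bytes))         # one byte per letter
print(misread)                # unrelated characters, not Russian
print(recovered == original)
```

The bytes never change; only the lookup table does, which is why the same file could look fine on one machine and like gibberish on another.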

Enter Unicode: The Solution

Unicode is a universal standard that assigns a unique number (called a “code point”) to every character in every language, plus emojis, symbols, and more.

Examples:

  • Letter “A” = U+0041 (65 in decimal)
  • Greek letter “α” = U+03B1 (945 in decimal)
  • Emoji “😀” = U+1F600 (128512 in decimal)
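These values can be checked in most languages. As a small sketch in Python (the helper name `describe` is made up for this example), `ord` returns a character's code point and `chr` goes the other way:

```python
def describe(char: str) -> str:
    """Format one character's code point as 'U+XXXX (decimal)'."""
    code_point = ord(char)
    return f"U+{code_point:04X} ({code_point})"


for char in ["A", "α", "😀"]:
    print(char, describe(char))

# chr() is the inverse of ord(): code point in, character out.
print(chr(0x03B1))
```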

Encoding: How Unicode Numbers Become Bytes

Unicode defines what number each character gets, but computers need to store these numbers as bytes. This is where encoding comes in.

UTF-8: The Most Common Encoding

UTF-8 is the dominant way to encode Unicode text:

  • Uses 1-4 bytes per character
  • English letters use 1 byte (efficient for English text)
  • Other languages may use 2-4 bytes
  • Can represent all Unicode characters
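To see the variable width in practice, the following sketch (Python assumed; `utf8_length` is a hypothetical helper) encodes one character from each size class and counts the resulting bytes:

```python
def utf8_length(char: str) -> int:
    """Number of bytes UTF-8 needs for the given text."""
    return len(char.encode("utf-8"))


# One sample per size class: ASCII, accented Latin, CJK, emoji.
for char in ["A", "é", "王", "😀"]:
    encoded = char.encode("utf-8")
    print(char, utf8_length(char), encoded.hex(" "))
```

The ASCII letter comes out as the same single byte it has always been, which is why plain English text is valid UTF-8 without any conversion.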

Other Encodings

  • UTF-16: Uses 2 or 4 bytes per character
  • UTF-32: Uses exactly 4 bytes per character
  • Legacy encodings: ISO-8859-1, Windows-1252, etc. (avoid these!)
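For a rough feel of the size trade-off between the three UTF encodings, here is a sketch in Python. The function name `encoded_sizes` is an assumption for this example, and the `-le` codec variants are used so that no byte order mark is added to the count.

```python
def encoded_sizes(text: str) -> dict:
    """Byte length of the same text under each UTF encoding."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {name: len(text.encode(name)) for name in encodings}


print(encoded_sizes("Hello"))  # ASCII-only text
print(encoded_sizes("😀"))     # a character outside the 16-bit range
```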

Critical Rules for Web Developers

1. Always Specify Encoding in HTML

<meta charset="UTF-8">

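Put this in your HTML <head> section, as early as possible: browsers only scan the first 1024 bytes of the document for it. Without it (or a charset in the HTTP Content-Type header), browsers guess the encoding and often guess wrong.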

2. Use UTF-8 Everywhere

  • Save your HTML, CSS, and JavaScript files as UTF-8
  • Configure your database to use UTF-8
  • Set your web server to serve UTF-8
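What "serving UTF-8" means depends on your server, but the common thread is a `charset=utf-8` parameter on the Content-Type header. As one possible sketch, here is a minimal WSGI application in Python; the names `PAGE` and `app` are made up for this example, and a real deployment would normally set the header in the web server or framework configuration instead.

```python
PAGE = (
    "<!DOCTYPE html>"
    '<html><head><meta charset="UTF-8"><title>Demo</title></head>'
    "<body>José 王伟 😀</body></html>"
)


def app(environ, start_response):
    """Hypothetical WSGI app that labels its response as UTF-8."""
    # Encode the text explicitly so the bytes match the declared charset.
    body = PAGE.encode("utf-8")
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        # Content-Length counts bytes, not characters.
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

To try it locally, one option is the standard library's `wsgiref.simple_server.make_server("127.0.0.1", 8000, app)` followed by `serve_forever()`.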

3. Handle Form Submissions Properly

When users submit forms, make sure the browser sends the data as UTF-8 and your server decodes it as UTF-8. The accept-charset attribute covers the browser side:

<form accept-charset="UTF-8">
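On the server, the submitted body arrives as percent-encoded bytes that still have to be decoded. A minimal sketch, assuming Python and a standard URL-encoded form body (the function name `parse_form_body` and the field names are hypothetical), could look like this:

```python
from urllib.parse import parse_qs


def parse_form_body(raw_body: bytes) -> dict:
    """Decode an application/x-www-form-urlencoded body as UTF-8."""
    # The percent-encoded wire format itself is plain ASCII.
    query = raw_body.decode("ascii")
    # errors="strict" raises on bytes that are not valid UTF-8
    # instead of silently substituting replacement characters.
    return parse_qs(query, encoding="utf-8", errors="strict")


form = parse_form_body(b"name=Jos%C3%A9&mood=%F0%9F%98%80")
print(form)
```

Most web frameworks do this step for you; the point is only that the decoding has to use the same encoding the browser used to send the data.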

4. Database Storage

Configure your database columns to store UTF-8:

-- MySQL example
CREATE TABLE users (
  name VARCHAR(255) CHARACTER SET utf8mb4
);
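The `mb4` suffix matters: MySQL's older `utf8` character set stores at most three bytes per character, so it cannot hold characters whose UTF-8 form needs four bytes, such as most emoji. As a rough sketch in Python (the helper `needs_four_byte_utf8` is made up for illustration), you can check whether a given string contains such characters:

```python
def needs_four_byte_utf8(text: str) -> bool:
    """True if any character needs four bytes in UTF-8.

    Code points above U+FFFF are exactly the ones that take four bytes.
    """
    return any(ord(char) > 0xFFFF for char in text)


for sample in ["José", "王伟", "😀"]:
    print(sample, needs_four_byte_utf8(sample))
```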

Common Mistakes and Solutions

Mistake: “Mojibake” (Garbled Text)

Problem: Text displays as “Ã¡Ã©Ã­Ã³Ãº” instead of “áéíóú”
Cause: Text encoded as UTF-8 but interpreted as something else
Solution: Ensure consistent UTF-8 usage throughout your application
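Mojibake can be reproduced, and sometimes undone, in a few lines. The sketch below assumes Python and assumes the wrong decoder was Windows-1252, which is a common case but not the only one:

```python
original = "áéíóú"

# Correct bytes, wrong decoder: each 2-byte UTF-8 sequence is read
# as two separate Windows-1252 characters.
garbled = original.encode("utf-8").decode("cp1252")

# If nothing was lost along the way, reversing the two steps
# recovers the original text.
repaired = garbled.encode("cp1252").decode("utf-8")

print(garbled)
print(repaired == original)
```

The repair only works when the wrong decoder is known and every byte survived the trip, so treat it as a diagnostic aid rather than a fix. The real solution is still to declare and use UTF-8 at every step.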

Mistake: Question Marks or Empty Boxes

Problem: Characters display as “?” or “□”
Cause: The font doesn’t support those characters, or the text was converted with the wrong encoding
Solution: Use web fonts that support international characters, and check that the text is read and written as UTF-8

Mistake: Truncated Text

Problem: Text gets cut off unexpectedly
Cause: Counting bytes instead of characters
Solution: Use proper string functions that understand Unicode
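Here is one way the byte-counting bug shows up, together with a possible workaround when a byte limit is unavoidable (for example a fixed-size column). This is a sketch in Python, and `truncate_utf8` is a hypothetical helper, not a standard function:

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Shorten text to at most max_bytes of UTF-8 without splitting a character."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # errors="ignore" drops the incomplete sequence left at the cut.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


# Naive approach: slicing bytes cuts "é" (2 bytes) in half.
naive = "José".encode("utf-8")[:4]
try:
    naive.decode("utf-8")
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False

print(naive_ok)
print(truncate_utf8("José", 4))
```

When the limit is in characters rather than bytes, ordinary string slicing on decoded text is usually enough.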

Testing Your Application

  1. Test with international characters: Try names like “José” and “王伟”
  2. Test with emojis: Modern applications should handle “👨‍💻” properly
  3. Test copy-paste: Users should be able to paste text from other applications
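A first automated check along these lines might look like the sketch below. It assumes Python; the names `SAMPLES` and `survives_round_trip` are made up for this example, and a real test would push the samples through your actual form, database, and template instead of a bare encode/decode.

```python
# The last sample is the 👨‍💻 emoji written as its three code points:
# man, zero-width joiner, laptop.
SAMPLES = ["José", "王伟", "\U0001F468\u200D\U0001F4BB"]


def survives_round_trip(text: str, encoding: str = "utf-8") -> bool:
    """True if text comes back unchanged after encoding and decoding."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeError:
        return False


for sample in SAMPLES:
    print(
        sample,
        survives_round_trip(sample),             # UTF-8
        survives_round_trip(sample, "latin-1"),  # a legacy encoding
    )
```

The emoji sample is worth keeping in the list because it is one visible symbol made of several code points, which tends to expose length and truncation bugs.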

The Bottom Line

Modern web development is simple if you follow one rule: Use UTF-8 everywhere. Set it in your HTML meta tags, save your files as UTF-8, configure your database for UTF-8, and your international users will thank you.

The days of character encoding problems are over if you consistently use UTF-8 from the start. Don’t mix encodings, don’t use legacy character sets, and always be explicit about your encoding choices.

https://mehamasum.github.io/blog/2025/5/unicode-for-beginners/

Author

Mehedi Hasan Masum

Posted on

2025-05-26

Licensed under

CC BY-NC-SA 4.0
