About text encoding, Unicode, UTF-8 etc ...
This article is a valuable resource for understanding text encodings.
Takeaways ...
- There Ain’t No Such Thing As Plain Text. It does not make sense to have a string without knowing what encoding it uses.
- Unicode is not an encoding. Unicode is a standard meant to give a unique code point (e.g. U+0639) to every single character (of every single language on earth).
- UTF-8 is one of the most widely used encodings for Unicode, among several others. It encodes the characters from 32 to 127 exactly as ASCII does (in fact, every code point from 0 to 127). It is also a variable-width encoding scheme. Plain English (without accents) takes only one byte per character, just like ASCII, while most characters in Asian languages take three bytes. The original design allowed sequences of up to six bytes, but the current standard (RFC 3629) limits UTF-8 to four bytes per character.
- Emails should indicate the encoding in the Content-Type specification. An email message should carry a header of the form
Content-Type: text/plain; charset="UTF-8"
- Web servers can send a similar Content-Type specification in one of the response headers (not as part of the HTML file itself).
- Web pages (HTML files) should indicate the encoding in a <meta> tag (in the Content-Type specification) as early in the file as possible. This looks like a chicken-and-egg problem (how can we even read the HTML file without first knowing its encoding?), but the first few characters of any HTML page are plain English, and most of the encodings in use today interpret the characters from 32 to 127 the same way (as their ASCII values), so it works. That's why we have to put the encoding information as early in the file as possible (in the <head> tag), so that the browser can properly decode the rest of the page.
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
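The takeaways so far can be checked with a small Python sketch. It is only an illustration, not part of the original article: the helper names `utf8_length` and `charset_from_content_type` are hypothetical, the sample characters are arbitrary, and the standard library's `email.message.Message` is used here merely as a convenient way to parse a Content-Type value.

```python
from email.message import Message


def utf8_length(ch: str) -> int:
    """Number of bytes UTF-8 needs for one character (hypothetical helper)."""
    return len(ch.encode("utf-8"))


def charset_from_content_type(value: str, default: str = "us-ascii") -> str:
    """Pull the charset parameter out of a Content-Type value (hypothetical helper).

    The fallback of "us-ascii" is an assumption made for this sketch.
    """
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() returns the charset in lower case, or None if absent.
    return msg.get_content_charset() or default


# A code point is just a number; the encoding decides how many bytes it becomes.
for ch in ["A", "\u00e9", "\u0639", "\u4e2d", "\U0001F600"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex()}")

# The ASCII-range start of an HTML page has the same bytes in many encodings,
# which is why a browser can read the <meta> tag before knowing the charset.
prefix = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
same_bytes = (
    prefix.encode("ascii") == prefix.encode("utf-8") == prefix.encode("latin-1")
)
print("prefix bytes identical:", same_bytes)

# Decoding with the wrong encoding does not fail loudly; it yields wrong text.
mojibake = "\u00e9".encode("utf-8").decode("latin-1")
print("UTF-8 bytes of U+00E9 read as Latin-1:", mojibake)

print(charset_from_content_type('text/plain; charset="UTF-8"'))
```

The last call reads the charset from the same header shown in the email takeaway. The Latin-1 decode illustrates why a string is not usable without knowing its encoding: the bytes are valid under both encodings, but only one of them gives back the intended character.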
- Webservers can send the