Character Set & Unicode Tools and Conversions

Convert character to number (Unicode code point)

This tool shows Unicode details about any character (letter), including decimal/hex code point and HTML/URL encode syntax.
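The same details can be reproduced in a few lines of Python. This is only a sketch, not the tool's own implementation: the helper name `describe_char` is hypothetical, and the URL form assumes the character is percent-encoded as UTF-8.

```python
from urllib.parse import quote


def describe_char(ch: str) -> dict:
    """Return code point details for one character (hypothetical helper)."""
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    code_point = ord(ch)
    return {
        "decimal": code_point,
        "hex": f"U+{code_point:04X}",
        "html": f"&#{code_point};",
        # quote() percent-encodes the UTF-8 bytes of the character
        "url": quote(ch, safe=""),
    }


info = describe_char("\u20ac")  # the euro sign
print(info["decimal"], info["hex"], info["html"], info["url"])

# The reverse direction: number to character
print(chr(info["decimal"]) == "\u20ac")
```

`ord()` and `chr()` are the built-in conversions in each direction; the rest is string formatting.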


Convert number to character


Unicode and UTF-8

Unicode is a standard for representing text and symbols from writing systems around the world. Unicode text can be stored in several encodings: the most popular is UTF-8; other examples are UTF-16 and UTF-7. UTF-8 is a variable-length encoding that uses one to four bytes per character, and its codes for the basic Latin characters are identical to ASCII. The Unicode website gives the following definition: "Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language."
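To see the variable-length behaviour in practice, the following Python sketch encodes a few sample characters and reports how many bytes each one takes. The helper name `utf8_bytes` and the sample characters are illustrative choices only.

```python
def utf8_bytes(ch: str) -> bytes:
    """Encode a single character as UTF-8 (hypothetical helper)."""
    return ch.encode("utf-8")


# Samples: Latin A, e with acute accent, euro sign, an emoji
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    encoded = utf8_bytes(ch)
    print(f"U+{ord(ch):04X}: {len(encoded)} byte(s), hex {encoded.hex(' ')}")

# Basic Latin text yields the same bytes in UTF-8 and ASCII
same_as_ascii = "Hello".encode("utf-8") == "Hello".encode("ascii")
print(same_as_ascii)
```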

Setting a charset in programming, servers & other files



Apache .htaccess file

You can use .htaccess to set a default character set for all your documents. If AddDefaultCharset is switched On without naming a character set, Apache falls back to ISO-8859-1. Apache sends the character set in the Content-Type HTTP header of its response to the browser.

To set a default charset for your whole site add the following code to your .htaccess file:

AddDefaultCharset UTF-8

To serve just your .html documents as UTF-8 add the following line:

AddCharset UTF-8 .html

or:

AddType 'text/html; charset=UTF-8' html

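AddCharset specifies just the charset; AddType specifies both the MIME type and the charset in one line.

You can also limit the setting to specific files with Files or FilesMatch (or, in the main server configuration, to directories with Directory).

# Only match HTML files here: ForceType would also serve .css or .js as text/html
<FilesMatch "\.(htm|html)$">
ForceType 'text/html; charset=UTF-8'
</FilesMatch>
<Files "index.php">
ForceType 'text/html; charset=UTF-8'
</Files>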

You can also create a new extension (index.utf8 is served as a Unicode UTF-8 document, while index.html is served as ISO-8859-1):

AddCharset UTF-8 .utf8
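To check which charset a server actually declares, you can inspect the Content-Type response header. The Python sketch below only parses a header value and makes no request; the helper name `charset_from_content_type` and the sample header strings are assumptions for illustration.

```python
from email.message import Message


def charset_from_content_type(header_value: str):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lower-cases the value it finds
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))
```

A header without a charset parameter yields `None`, which is the case where the browser has to fall back to other hints such as a META tag.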

PHP

Use the header function to send an HTTP header:

header("Content-Type: text/html; charset=UTF-8");

You must call this function before any output is sent to the browser.


Python

Declare the encoding of the source file in its first or second line:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

In Python 3, UTF-8 is the default source encoding, so the declaration is only needed for Python 2 or for source files saved in another encoding.
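The coding declaration only covers the source file itself. When reading or writing text files, `open()` may fall back to a locale-dependent encoding, so it is safer to pass `encoding` explicitly. A minimal sketch, with an illustrative file name and sample text:

```python
import os
import tempfile

text = "Gr\u00fc\u00dfe, \u4e16\u754c"  # German and Chinese sample text

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")
    # Write and read with an explicit encoding instead of the platform default
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, "r", encoding="utf-8") as f:
        round_trip = f.read()
    # Read the raw bytes to compare character and byte counts
    with open(path, "rb") as f:
        raw = f.read()

print(round_trip == text)
print(len(text), "characters,", len(raw), "bytes")
```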


HTML

Set a META tag. There is a short version (introduced in HTML5) and a long version (also compatible with earlier versions such as HTML 4 and XHTML):

<meta charset="utf-8">
or
<meta http-equiv="content-type" content="text/html; charset=UTF-8" />

You can use the short version unless you are targeting old browsers like IE6/IE7. Both versions work in HTML5. If both are present, browsers use the first one they find, and a charset in the HTTP header overrules both.

Add the META tag in the <head> section of your HTML document, within the first 1024 bytes. Browsers might ignore this statement if your document starts with a BOM (see below).


XML

In the first line of the XML document:

<?xml version='1.0' encoding='utf-8'?>
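
An XML parser reads this declaration to decide how to decode the bytes that follow. As a rough illustration, the Python sketch below serializes the same sample element in two encodings and parses both with the standard library's ElementTree; the sample documents are made up for this example.

```python
import xml.etree.ElementTree as ET

content = "<p>caf\u00e9</p>"

# The same element serialized with two different declared encodings
utf8_doc = ("<?xml version='1.0' encoding='utf-8'?>" + content).encode("utf-8")
latin1_doc = ("<?xml version='1.0' encoding='iso-8859-1'?>" + content).encode("iso-8859-1")

# The byte streams differ, but the parsed text is the same
utf8_text = ET.fromstring(utf8_doc).text
latin1_text = ET.fromstring(latin1_doc).text

print(utf8_doc != latin1_doc)
print(utf8_text == latin1_text == "caf\u00e9")
```

The declared encoding has to match how the bytes were really saved; a mismatch typically leads to a parse error or garbled text.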

Text files: BOM (Byte Order Mark)

The BOM-header or Byte Order Mark is the character U+FEFF ("zero-width no-break space"), saved at the beginning of a text document to tell editors, browsers and other programs which Unicode encoding the file uses (UTF-8, UTF-16 or UTF-32). In UTF-8 it is stored as the bytes EF BB BF in hex (239 187 191 in decimal). Many editors automatically add a BOM-header once you specify that the encoding is UTF-8. Some editors also offer alternatives to the BOM-header, for example a "UTF-8 Cookie", where the editor remembers that the document is UTF-8 by setting a cookie on your system.

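BOM-headers can cause problems with some scripting languages such as PHP: the BOM bytes are sent to the browser as ordinary output, so you may see some strange characters (the BOM itself) flash for a fraction of a second before the page loads. Because they count as output, they can also prevent later header() calls from working.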
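
If you suspect a file starts with a BOM, you can check for it and strip it. The Python sketch below uses the standard `codecs` module and the `utf-8-sig` codec; the sample data is made up for illustration.

```python
import codecs

# Sample data: a UTF-8 BOM followed by ordinary text
data = codecs.BOM_UTF8 + "hello".encode("utf-8")

# The three BOM bytes, in hex and in decimal
print(data[:3].hex(" "), list(data[:3]))

has_bom = data.startswith(codecs.BOM_UTF8)

# "utf-8" keeps the BOM as the character U+FEFF; "utf-8-sig" strips it if present
with_bom = data.decode("utf-8")
without_bom = data.decode("utf-8-sig")

print(has_bom, len(with_bom), len(without_bom))
```

Decoding with `utf-8-sig` also works for files that have no BOM, so it is a convenient choice when the input may or may not carry one.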