Programming: Converting Latin to Unicode (UTF-8)

Converting from Latin to UTF-8 (and back) in your code


PHP

To convert from Latin ISO-8859-1 to UTF-8 (PHP.net):

utf8_encode($data)

And to convert back from UTF-8 to ISO-8859-1 (PHP.net):

utf8_decode($data)

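If you need to convert to/from other character sets, look at iconv() or mb_convert_encoding(). Both utf8_encode() and utf8_decode() are deprecated as of PHP 8.2, so new code should prefer these alternatives.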

Notes:
Do not save your PHP files with a UTF-8 BOM (Byte Order Mark). PHP sends the BOM bytes to the browser as output, so they can show up as stray characters at the top of a page or between included PHP files.
In older PHP versions, some native functions such as strtolower(), strtoupper() and ucfirst() might not work correctly with UTF-8 strings. Possible solutions: convert to Latin first, use multibyte equivalents such as mb_strtolower() and mb_strtoupper(), or add the following line to your code:

setlocale(LC_CTYPE, 'C');
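
To check whether a file was saved with a BOM, look for the three bytes EF BB BF at its start. Below is a minimal sketch in Python, used here only as a neutral scripting language; the function names has_utf8_bom and strip_utf8_bom are made up for this example.

```python
import codecs

def has_utf8_bom(data):
    """Return True if the byte string starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)

def strip_utf8_bom(data):
    """Return the byte string without a leading UTF-8 BOM, if one is present."""
    if has_utf8_bom(data):
        return data[len(codecs.BOM_UTF8):]
    return data

# Example: the contents of a PHP file that an editor saved with a BOM.
sample = codecs.BOM_UTF8 + b"<?php echo 'hello'; ?>"
print(has_utf8_bom(sample))                   # True
print(has_utf8_bom(strip_utf8_bom(sample)))   # False
```

To use it on a real file, read the file in binary mode and pass the bytes to has_utf8_bom.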

Perl

To convert from ISO-8859-1 to UTF-8 (from_to converts $data in place):

use Encode qw( from_to is_utf8 );
from_to($data, "iso-8859-1", "utf8");

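You can use the following routine to check whether Perl's internal UTF-8 flag is set on a string. Pass a true value as a second argument to also check that the string contains well-formed UTF-8: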

is_utf8($data)

Python

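To convert from ISO-8859-1 to UTF-8:

source_encoding = "iso-8859-1"
# Bytes as read from a Latin source; \xe9 is 'é' in ISO-8859-1
data = b"Names with international characters like 'Andr\xe9e'"
text = data.decode(source_encoding)  # bytes -> Unicode text
utf8string = text.encode("utf-8")    # Unicode text -> UTF-8 bytes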

To convert back from UTF-8 to ISO-8859-1:

latin1string = utf8string.decode("utf-8").encode("iso-8859-1")

In Python 3, strings (str) are Unicode text, and UTF-8 is the default encoding for source files and for str.encode() and bytes.decode().
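
A common pitfall is converting the same data twice, which turns 'é' into 'Ã©'. The sketch below wraps both directions in small functions and adds a validity check. The function names are made up for this example, and it assumes the input arrives as bytes (for example, read from a file opened in binary mode).

```python
def latin1_to_utf8(data):
    """Re-encode ISO-8859-1 bytes as UTF-8 bytes."""
    return data.decode("iso-8859-1").encode("utf-8")

def utf8_to_latin1(data, errors="strict"):
    """Re-encode UTF-8 bytes as ISO-8859-1 bytes.

    Characters outside ISO-8859-1 (such as the euro sign) raise
    UnicodeEncodeError unless errors is "replace" or "ignore".
    """
    return data.decode("utf-8").encode("iso-8859-1", errors)

def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

latin1 = b"Andr\xe9e"
utf8 = latin1_to_utf8(latin1)
print(utf8_to_latin1(utf8) == latin1)   # True
print(is_valid_utf8(latin1))            # False: a lone 0xE9 byte is not valid UTF-8
print(is_valid_utf8(utf8))              # True
```

Checking is_valid_utf8 before calling latin1_to_utf8 helps avoid double conversion. Note that pure ASCII data is valid in both encodings, and that any byte sequence decodes as ISO-8859-1, so a validity check is only meaningful on the UTF-8 side.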


.NET C#

In C#, use the Encoding class from System.Text:

using System.Text;

Encoding iso = Encoding.GetEncoding("ISO-8859-1");
byte[] isoBytes = iso.GetBytes("Latin to UTF-8: é");
byte[] utf8Bytes = Encoding.Convert(iso, Encoding.UTF8, isoBytes);
string utf8Converted = Encoding.UTF8.GetString(utf8Bytes);

Java

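String text = new String(latin1Bytes, StandardCharsets.ISO_8859_1);
byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);

Decode the ISO-8859-1 bytes into a String, then use String.getBytes(Charset) to encode it as UTF-8 (the charset constants are in java.nio.charset.StandardCharsets; latin1Bytes is a placeholder for your input). Alternatively, use the CharsetEncoder class.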


MySQL

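MySQL uses character sets at every level. There are connection settings such as character_set_connection and collation_connection, and you can specify a character set at the database, table and column level. To convert a character set inside a MySQL query, use CONVERT. In MySQL, utf8 is an alias for the 3-byte utf8mb3, so use utf8mb4 for full UTF-8 support (latin1field and tablename are placeholders):

SELECT CONVERT(latin1field USING utf8mb4) FROM tablename;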

If table joins become slow after you convert the character sets of tables or fields, make sure that all ID fields being joined use the same character set and COLLATE setting; otherwise MySQL may not be able to use its indexes for the join.


HTML

To avoid character set problems, it is sometimes easier to convert your special characters to plain ASCII HTML character references, for example &eacute; or &#233; for é (especially if you are editing HTML files manually).

Use our HTML special character converter.
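
The same conversion can also be scripted. This is a small sketch in Python that uses the built-in "xmlcharrefreplace" error handler to replace each non-ASCII character with a numeric character reference; the function name to_html_ascii is made up for this example.

```python
def to_html_ascii(text):
    """Replace every non-ASCII character with a numeric HTML reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(to_html_ascii(u"Andr\u00e9e"))   # Andr&#233;e
```

This leaves &, < and > untouched; those characters need separate escaping if they are meant as literal text.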


Unix/Linux systems

Use the iconv character set conversion tool. It writes the converted text to standard output, so redirect it to a new file:

iconv -f ISO-8859-1 -t UTF-8 filename.txt > filename-utf8.txt
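
Where iconv is not installed, the iconv command above can be approximated with a short script. The following is a sketch in Python, not a drop-in replacement; the function name convert_file and its parameters are made up for this example.

```python
import io

def convert_file(src_path, dst_path,
                 src_encoding="iso-8859-1", dst_encoding="utf-8"):
    """Read src_path in src_encoding and write it to dst_path in dst_encoding."""
    # newline="" keeps the original line endings untouched.
    with io.open(src_path, "r", encoding=src_encoding, newline="") as src:
        with io.open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
            for line in src:
                dst.write(line)
```

Calling convert_file("filename.txt", "filename-utf8.txt") would then do the conversion. Write to a new file rather than over the source, so that a failed run cannot destroy the original.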

Windows systems

Most good text editors support Unicode and can convert files, for example UltraEdit (File → Conversions → 'ASCII to UTF-8' or 'ASCII to Unicode (16-Bit)').


Thanks to software developers who sent me corrections and updates!