Converting from Latin1 (ISO-8859-1) to UTF-8 (and back) in your code
PHP
Use multibyte functions or iconv to convert between character encodings (codepages). Examples:
// ISO-8859-1 (Latin1) -> UTF-8
// using mbstring
$utf8 = mb_convert_encoding($latin, 'UTF-8', 'ISO-8859-1');
// or using iconv (every Latin1 character exists in UTF-8, so no //TRANSLIT is needed here)
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin);
// UTF-8 -> ISO-8859-1
$latin = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
// or using iconv: //TRANSLIT approximates characters that Latin1 cannot represent, //IGNORE drops them
$latin = iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $utf8);
Notes:
- Use mb_internal_encoding('UTF-8') to set the internal encoding when working with multibyte functions.
- Check for the mbstring extension and fall back to iconv if necessary.
- For case conversions use multibyte-aware functions: mb_strtolower(), mb_strtoupper(), mb_convert_case().
- Avoid saving PHP source files with a UTF-8 BOM: it can emit invisible bytes before your output.
Perl
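To convert a byte string in place:
use Encode qw( from_to decode );
from_to($data, "iso-8859-1", "UTF-8");  # $data now holds UTF-8 bytes
To check whether a byte string is valid UTF-8, attempt a strict decode. Note that is_utf8() only reports Perl's internal UTF8 flag; it does not validate the bytes.
# LEAVE_SRC keeps $data unmodified; FB_CROAK makes decode die on malformed input
my $valid = eval { decode("UTF-8", $data, Encode::FB_CROAK | Encode::LEAVE_SRC); 1 };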
Python
Python 3 example — convert bytes encoded in ISO-8859-1 to UTF-8:
# if you have raw bytes in ISO-8859-1 (Latin1)
latin_bytes = b"Names with international characters like Andr\xe9e"
# decode to a Python 3 str (Unicode)
text = latin_bytes.decode('iso-8859-1')
# get UTF-8 encoded bytes
utf8_bytes = text.encode('utf-8')
# Python 3 strings are Unicode; 'text' already holds the correct Unicode string
print(text)
In Python 3 strings are Unicode by default; prefer decoding bytes with the correct source encoding and then encoding to UTF-8 when you need byte output.
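The reverse direction (UTF-8 back to ISO-8859-1) can fail, because Latin1 only covers the code points U+0000 to U+00FF. The following is a minimal sketch, not a definitive implementation; the helper name `utf8_to_latin1` is hypothetical, and it relies only on the standard `errors` argument of `str.encode`:

```python
def utf8_to_latin1(utf8_bytes: bytes, errors: str = "strict") -> bytes:
    """Re-encode UTF-8 bytes as ISO-8859-1 (hypothetical helper, for illustration)."""
    # Decode first: raises UnicodeDecodeError if the input is not valid UTF-8
    text = utf8_bytes.decode("utf-8")
    # errors="strict" raises UnicodeEncodeError for characters outside Latin1;
    # errors="replace" substitutes "?" for them instead
    return text.encode("iso-8859-1", errors=errors)


# "é" exists in Latin1, so this converts without loss
latin = utf8_to_latin1("Andrée".encode("utf-8"))

# "€" does not exist in ISO-8859-1, so it is replaced by "?"
lossy = utf8_to_latin1("price: 5 €".encode("utf-8"), errors="replace")
```

Keeping the default at "strict" means data loss is never silent; pass "replace" or "ignore" only when a lossy result is acceptable.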
.NET C#
In C# use the System.Text namespace:
using System.Text;
// Re-encode the ISO-8859-1 (Latin1) bytes of a string as UTF-8 bytes
var latin1 = Encoding.GetEncoding("ISO-8859-1");
byte[] latinBytes = latin1.GetBytes(myString);
byte[] utf8Bytes = Encoding.Convert(latin1, Encoding.UTF8, latinBytes);
// Decode the UTF-8 bytes back into a .NET string (UTF-16 internally)
string utf8String = Encoding.UTF8.GetString(utf8Bytes);
Java
// Java 7+ using StandardCharsets
import java.nio.charset.StandardCharsets;
byte[] latin = myString.getBytes(StandardCharsets.ISO_8859_1);
String fromLatin = new String(latin, StandardCharsets.ISO_8859_1);
byte[] utf8 = fromLatin.getBytes(StandardCharsets.UTF_8);
Use StandardCharsets where available to avoid spelling errors and Charset lookups.
MySQL
MySQL applies character sets at every level: there are connection settings such as character_set_connection and collation_connection, and you can specify a character set at the database, table, and column level. To convert a character set inside a MySQL query, use CONVERT ... USING:
-- Prefer utf8mb4 for full Unicode support
SELECT CONVERT(latin1field USING utf8mb4);
If table joins become slow after you convert the character sets of tables or columns, make sure that all ID columns used in the joins have the same character set and collation; with mismatched collations MySQL may be unable to use the indexes on those columns.
HTML
To avoid character set problems it is sometimes easier to convert your special characters to plain ASCII HTML character references such as &eacute; or &#233; (especially if you are editing HTML files manually).
Use our HTML special character converter.
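The same conversion can be done in code. This is a minimal sketch in Python 3, assuming numeric character references are acceptable; the function name `to_ascii_html` is hypothetical, while `html.escape` and the `xmlcharrefreplace` error handler are part of the standard library:

```python
import html


def to_ascii_html(text: str) -> str:
    """Return text as plain ASCII HTML (hypothetical helper, for illustration)."""
    # Escape &, <, > and quotes first so the result is safe to place in markup
    escaped = html.escape(text)
    # Replace every non-ASCII character with a numeric reference like &#233;
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_ascii_html("Andrée & Zoë"))  # prints: Andr&#233;e &amp; Zo&#235;
```

The output contains only ASCII bytes, so it displays correctly regardless of the character set the page is served with.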
Unix/Linux systems
Use the iconv character set conversion tool. It writes the converted text to standard output, so redirect it to a new file:
iconv -f ISO-8859-1 -t UTF-8 filename.txt > filename-utf8.txt
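On systems where the iconv tool is not installed, a file can be converted with a few lines of Python 3 instead. This is a rough sketch under stated assumptions: the function name `convert_file` and the file names in the usage comment are hypothetical, and the source file is assumed to really be in the encoding you name.

```python
def convert_file(src: str, dst: str,
                 from_enc: str = "iso-8859-1", to_enc: str = "utf-8") -> None:
    """Copy src to dst, re-encoding the text (hypothetical helper, for illustration)."""
    # newline="" disables newline translation, so line endings stay untouched
    with open(src, "r", encoding=from_enc, newline="") as fin, \
         open(dst, "w", encoding=to_enc, newline="") as fout:
        # Read in chunks so large files do not have to fit in memory
        for chunk in iter(lambda: fin.read(64 * 1024), ""):
            fout.write(chunk)


# Usage (file names are placeholders):
# convert_file("filename.txt", "filename-utf8.txt")
```

Writing to a separate destination file, as with the iconv redirect, keeps the original intact in case the source encoding was guessed wrongly.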
Thanks to software developers who sent me corrections and updates!