Character Sets and PHP Follow Up
Following up on my Feb 23, 2010 post on difficulties with character sets, I’ve run into yet another example of how this can mess you up.
In PHP, I was using curl to fetch a page from a remote web server, extracting some data with a regex, and inserting it into a MySQL database. The issue I encountered was that my database was in UTF-8 but the text retrieved from the website was in ISO-8859-1 (aka Latin-1, according to Wikipedia). So when I selected a record containing characters that are encoded differently in ISO-8859-1 and UTF-8, I saw either garbage or truncated values. For example, the value ‘Atlantic Förlags’ on the web page showed up in my table as just ‘Atlantic F’.
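To see why the value breaks at the ‘ö’, it helps to look at the raw bytes. This is a minimal sketch, assuming the scraped string arrives as ISO-8859-1 bytes; the variable names are made up for illustration. In ISO-8859-1 the ‘ö’ is the single byte 0xF6, which is not a valid byte sequence in UTF-8, so anything expecting UTF-8 can choke at that point.

```php
<?php
// "Atlantic Förlags" as ISO-8859-1 bytes: the ö is the single byte 0xF6.
$latin1 = "Atlantic F\xF6rlags";

// With the 'u' modifier PCRE treats the subject as UTF-8, and preg_match()
// returns false (not 0 or 1) when the subject is malformed UTF-8.
$isValidUtf8 = preg_match('//u', $latin1) === 1;

var_dump($isValidUtf8);            // bool(false)
echo bin2hex($latin1[10]), "\n";   // f6
```

The same check on a string that is already UTF-8 would come back true, which makes it a cheap test to run before inserting anything.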
So to investigate, I used Firefox to open the URL I was parsing in my PHP script, then right-clicked on the web page and selected ‘View Page Info’ from the popup menu. The Page Info dialog lists the page’s encoding, which is where the ISO-8859-1 showed up.
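The same lookup can be done from the script itself instead of the browser. This is only a sketch, assuming the server declares the charset in its Content-Type header (many declare it only in a meta tag, or not at all). The function name charset_from_content_type is a hypothetical helper, not part of PHP or curl; the header string is the kind of value curl_getinfo() returns for CURLINFO_CONTENT_TYPE after a request.

```php
<?php
// Hypothetical helper: pull the charset parameter out of a Content-Type
// header value. Returns null when no charset is declared.
function charset_from_content_type(?string $contentType): ?string
{
    if ($contentType === null) {
        return null;
    }
    // Matches e.g. "text/html; charset=iso-8859-1" or charset="utf-8".
    if (preg_match('/charset\s*=\s*"?([^";\s]+)/i', $contentType, $m) === 1) {
        return strtoupper($m[1]);
    }
    return null;
}

// With a live request this value would come from:
//   $contentType = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
$contentType = 'text/html; charset=iso-8859-1';
echo charset_from_content_type($contentType), "\n"; // ISO-8859-1
```

If the helper returns null, the script would still have to fall back on a known or guessed encoding for that site.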

Now all I needed to do was use PHP’s string iconv(in_charset, out_charset, str) function to convert from ISO-8859-1 to UTF-8 before inserting each string value into my database, like so:
$trans_text = iconv("ISO-8859-1", "UTF-8", $text);
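Since iconv() returns false when it cannot do the conversion, it may be worth wrapping the call so a failure does not quietly end up in the database. A minimal sketch, where to_utf8 is a hypothetical helper name and ISO-8859-1 is assumed as the default source charset because that is what this particular site used:

```php
<?php
// Hypothetical helper: convert $text from $fromCharset to UTF-8 and fail
// loudly instead of passing a false value on to the database layer.
function to_utf8(string $text, string $fromCharset = 'ISO-8859-1'): string
{
    $converted = iconv($fromCharset, 'UTF-8', $text);
    if ($converted === false) {
        throw new RuntimeException("Could not convert from $fromCharset to UTF-8");
    }
    return $converted;
}

$text = "Atlantic F\xF6rlags";         // ISO-8859-1 bytes as scraped
$trans_text = to_utf8($text);

echo bin2hex($trans_text[10] . $trans_text[11]), "\n"; // c3b6 (UTF-8 for ö)
echo strlen($text), ' -> ', strlen($trans_text), "\n"; // 16 -> 17
```

The converted string is one byte longer because ‘ö’ takes two bytes in UTF-8 instead of one.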
At least that’s how it goes in theory… your mileage may vary.
