Unicode and character sets

To generate PDFs on the fly, for standard letters and the like, I started using FOP. This application can generate PDF, SVG or even HP PCL output from an XSL-FO file. You can find more information at the Apache FOP site. Anyway, the PDFs to be generated may contain accented letters like an ï or an ù. These need to be written as a &#number; reference based on their ISO-8859-1 or -15 character number, as they would otherwise be displayed as mangled characters.

Customer information is stored in a MySQL database. Luckily, MySQL uses the latin1 character set (essentially ISO-8859-1) by default to store data in tables. Because the first 256 Unicode character numbers are the same as those of ISO-8859-1, it's easy to convert every character with a value larger than 127 to a &#number; reference. Everything up to and including 127 can just stay as it is, because it's plain ASCII.

The PHP ord() function returns the byte value of a given character, which for latin1 data is also its character number. So, to convert a string to something usable in XML/HTML, one would use a function like this:

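function latin2entities($string)
{
    // Build the result byte by byte; latin1 is a single-byte encoding.
    $tmp = "";
    for ($i = 0; $i < strlen($string); $i++) {
        if (ord($string[$i]) > 127) {
            // Non-ASCII byte: write it as a numeric character reference.
            $tmp .= sprintf("&#%d;", ord($string[$i]));
        } else {
            // Plain ASCII: copy it unchanged.
            $tmp .= $string[$i];
        }
    }
    return $tmp;
}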

Couldn't be easier 🙂
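For what it's worth, the same conversion can probably also be written with a regular expression instead of a loop. This is only a sketch: the name latin2entities_regex is made up for illustration, and it assumes the input really is latin1 bytes and not UTF-8.

```php
<?php
// Hypothetical alternative to the loop-based function: replace every
// byte in the range 0x80-0xFF with a numeric character reference.
// Assumes $string is latin1 (one byte per character), not UTF-8.
function latin2entities_regex($string)
{
    return preg_replace_callback(
        '/[\x80-\xFF]/',              // no /u flag, so this matches single bytes
        function ($match) {
            return '&#' . ord($match[0]) . ';';
        },
        $string
    );
}

// "\xEF" is an i with diaeresis and "\xE9" an e with acute accent in latin1.
echo latin2entities_regex("na\xEFve caf\xE9"), "\n"; // prints: na&#239;ve caf&#233;
```

The output is pure ASCII, so it should no longer matter which encoding the XSL-FO file itself declares.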

Besides that, the euro sign appears to have a Unicode character number of its own: 8364 (U+20AC). To use it in XML with ISO-8859-1(5), you'd have to write &#8364;, which gives you: € See 🙂
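The euro sign matters for the conversion because ISO-8859-15 stores it as byte 0xA4, while a &#number; reference always wants the Unicode number. ISO-8859-15 differs from ISO-8859-1 in eight byte positions, so a variant for -15 data could look roughly like the sketch below. The function name latin9_to_entities is hypothetical, and it assumes the input really is ISO-8859-15.

```php
<?php
// Hypothetical variant of latin2entities() for ISO-8859-15 input.
function latin9_to_entities($string)
{
    // The eight bytes where ISO-8859-15 differs from ISO-8859-1,
    // mapped to their Unicode character numbers.
    static $map = array(
        0xA4 => 0x20AC, // euro sign
        0xA6 => 0x0160, // capital S with caron
        0xA8 => 0x0161, // small s with caron
        0xB4 => 0x017D, // capital Z with caron
        0xB8 => 0x017E, // small z with caron
        0xBC => 0x0152, // capital ligature OE
        0xBD => 0x0153, // small ligature oe
        0xBE => 0x0178, // capital Y with diaeresis
    );

    $out = '';
    $len = strlen($string);
    for ($i = 0; $i < $len; $i++) {
        $byte = ord($string[$i]);
        if ($byte <= 127) {
            // Plain ASCII: copy it unchanged.
            $out .= $string[$i];
        } else {
            // Use the Unicode number where -15 differs, else the byte value.
            $code = isset($map[$byte]) ? $map[$byte] : $byte;
            $out .= sprintf('&#%d;', $code);
        }
    }
    return $out;
}

// "\xA4" is the euro sign in ISO-8859-15.
echo latin9_to_entities("\xA4 10 for a cr\xE8me br\xFBl\xE9e"), "\n";
// prints: &#8364; 10 for a cr&#232;me br&#251;l&#233;e
```

Feeding the same bytes to the plain latin1 function would give &#164;, which is the generic currency sign and not the euro.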

There is lots more useful information on character sets still to be read, when I have time.
