unidecode_php-0.3.tar.gz - Benjamin C. Wiley Sittler
Benjamin C. Wiley Sittler

unidecode_php-0.3.tar.gz [Dec. 18th, 2008|11:39 pm]

I recently wrote a conversion script and PHP wrapper so that the data from the Perl "last-chance transliterator" Text::Unidecode by Sean M. Burke can be used from PHP: unidecode_php-0.3.tar.gz. To use this you'll need to install the Perl Text::Unidecode module and then run the udec2bin.pl script inside the unidecode_php package.

Example PHP usage:

<?php
require("unidecode.php");
print htmlspecialchars(unidecode("中文", "utf-8"));
?>
produces
Zhong Wen

This allows very basic conversion of lots of Unicode to plain ASCII. It works pretty well for some scripts and languages, is somewhat usable for several more, and fails utterly in some. Here are some examples excerpted from the Emacs-MULE HELLO file, along with their ASCII transliterations:

Europe: ¡Hola!, Grüß Gott, Hyvää päivää, Tere õhtust, Bonġu
Europe: !Hola!, Gruss Gott, Hyvaa paivaa, Tere ohtust, Bongu
Cześć!, Dobrý den, Здравствуйте!, Γειά σας, გამარჯობა
Czesc!, Dobry den, Zdravstvuitie!, Geia sas, gamarjoba
Africa: ሠላም
Africa: szalaame
Middle/Near East: שלום
Middle/Near East: shlvm
South Asia: नमस्ते, ನಮಸ್ಕಾರ, നമസ്കാരം, வணக்கம்
South Asia: nmste, nmskaar, nmskaarN, vnnkkm 
South East Asia: ສະບາຍດີ, สวัสดีครับ, Chào bạn
South East Asia: sabaanydii, swasdiikhrab, Chao ban
East Asia: 你好, 早晨, こんにちは, 안녕하세요
East Asia: Ni Hao , Zao Chen , konnitiha, annyeonghaseyo
Misc: Eĥoŝanĝo ĉiuĵaŭde, ⠓⠑⠇⠇⠕, ∀ p ∈ world • hello p  □
Misc: Ehosango ciujaude, hello, [?] p [?] world * hello p  #
CJK variety: GB(元气,开发), BIG5(元氣,開發), JIS(元気,開発), KSC(元氣,開發)
CJK variety: GB(Yuan Qi ,Kai Fa ), BIG5(Yuan Qi ,Kai Fa ), JIS(Yuan Qi ,Kai Fa ), KSC(Yuan Qi ,Kai Fa )
Unicode charset: Eĥoŝanĝo ĉiuĵaŭde, Γειά σας, שלום, Здравствуйте!
Unicode charset: Ehosango ciujaude, Geia sas, shlvm, Zdravstvuitie!

LANGUAGE (NATIVE NAME)	HELLO
----------------------	-----
Amharic (አማርኛ)	ሠላም
Amharic ('amaarenyaa)	szalaame
Braille	⠓⠑⠇⠇⠕
Braille	hello
Bulgarian (български)	Здравейте
Bulgarian (blgharski)	Zdravieitie
Czech (čeština)	Dobrý den
Czech (cestina)	Dobry den
Danish (dansk)	Hej, Goddag, Halløj
Danish (dansk)	Hej, Goddag, Halloj
English [ˈiŋ-gliʃ]	Hello
English ['iNG-gliS]	Hello
Esperanto	Saluton (Eĥoŝanĝo ĉiuĵaŭde)
Esperanto	Saluton (Ehosango ciujaude)
Estonian (eesti)	Tere päevast, Tere õhtust
Estonian (eesti)	Tere paevast, Tere ohtust
Finnish (suomi)	Hei, Hyvää päivää
Finnish (suomi)	Hei, Hyvaa paivaa
French (français)	Bonjour, Salut
French (francais)	Bonjour, Salut
Georgian (ქართველი)	გამარჯობა
Georgian (k`art`veli)	gamarjoba
German (Deutsch)	Guten Tag, Grüß Gott
German (Deutsch)	Guten Tag, Gruss Gott
Greek (ελληνικά)	Γειά σας
Greek (ellenika)	Geia sas
Hebrew (עברית)	שלום
Hebrew (`bryt)	shlvm
Hungarian (magyar)	Szép jó napot!
Hungarian (magyar)	Szep jo napot!
Hindi (हिंदी)	नमस्ते, नमस्कार ।
Hindi (hiNdii)	nmste, nmskaar  / 
Kannada (ಕನ್ನಡ)	ನಮಸ್ಕಾರ
Kannada (knndd)	nmskaar
Lao (ພາສາລາວ)	ສະບາຍດີ, ຂໍໃຫ້ໂຊກດີ
Lao (phaasaalaaw)	sabaanydii, khMayhoskdii
Malayalam (മലയാളം)	നമസ്കാരം
Malayalam (mlyaallN)	nmskaarN
Maltese (il-Malti)	Bonġu, Saħħa
Maltese (il-Malti)	Bongu, Sahha
Mathematics	∀ p ∈ world • hello p  □
Mathematics	[?] p [?] world * hello p  #
Polish (polski)	Dzień dobry! Cześć!
Polish (polski)	Dzien dobry! Czesc!
Russian (русский)	Здравствуйте!
Russian (russkii)	Zdravstvuitie!
Slovak (slovenčina)	Dobrý deň
Slovak (slovencina)	Dobry den
Slovenian (slovenščina)	Pozdravljeni!
Slovenian (slovenscina)	Pozdravljeni!
Spanish (español)	¡Hola!
Spanish (espanol)	!Hola!
Swedish (svenska)	Hej, Goddag, Hallå
Swedish (svenska)	Hej, Goddag, Halla
Tamil (தமிழ்)	வணக்கம்
Tamil (tmilll)	vnnkkm
Thai (ภาษาไทย)	สวัสดีครับ, สวัสดีค่ะ
Thai (phaasaaaithy)	swasdiikhrab, swasdiikha
Tigrigna (ትግርኛ)	ሰላማት
Tigrigna (tegerenyaa)	salaamaate
Turkish (Türkçe)	Merhaba
Turkish (Turkce)	Merhaba
Ukrainian (українська)	Вітаю
Ukrainian (ukrayins'ka)	Vitaiu
Vietnamese (Tiếng Việt)	Chào bạn
Vietnamese (Tieng Viet)	Chao ban

Japanese (日本語)	こんにちは, コンニチハ
Japanese (Ri Ben Yu )	konnitiha, konnitiha
Chinese (中文,普通话,汉语)	你好
Chinese (Zhong Wen ,Pu Tong Hua ,Yi Yu )	Ni Hao 
Cantonese (粵語,廣東話)	早晨, 你好
Cantonese (Yue Yu ,Guang Dong Hua )	Zao Chen , Ni Hao 
Korean (한글)	안녕하세요, 안녕하십니까
Korean (hangeul)	annyeonghaseyo, annyeonghasibnigga

Copyright (C) 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008
Free Software Foundation, Inc.

This file is part of GNU Emacs.

GNU Emacs is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 3, or (at your option)
any later version.

GNU Emacs is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with GNU Emacs; see the file COPYING.  If not, write to the
Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor,
Boston, MA 02110-1301, USA.

update: version 0.2 adds a couple examples.

update: version 0.3 is better at finding the datafiles when included from another directory.
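
For readers wondering what "finding the datafiles when included from another directory" can involve: one common approach in PHP is to build data paths from the location of the script itself rather than from the caller's working directory. The sketch below only illustrates that general idea and is not taken from the package; the helper name `datafile_path`, the `data` directory, and the file name `x4e.bin` are all hypothetical, and `__DIR__` requires a newer PHP than was current when this post was written (older code would use `dirname(__FILE__)`).

```php
<?php
// Hypothetical helper, not part of unidecode_php: build an absolute path
// to a data file that ships alongside this script. The "data" directory
// and the file naming are assumptions made for illustration only.
function datafile_path(string $name): string
{
    // __DIR__ is the directory of the file containing this code,
    // not the working directory of whichever script included it.
    return __DIR__ . DIRECTORY_SEPARATOR . 'data' . DIRECTORY_SEPARATOR . $name;
}

echo datafile_path('x4e.bin'), "\n";
```

Because the path is anchored to the script's own location, a `require` from any other directory resolves to the same files.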


Comments:
From: (Anonymous)
2010-02-24 12:37 pm (UTC)

A little bug

A nice wrapper around Text::Unidecode!

There's a little bug though: the code initially converts into UCS-4BE (which is outdated and can be replaced by UTF-32), so every character becomes 4 bytes long. The function _unidecode_codepoint(), however, treats characters as 2 bytes long.

Looking at the original Perl code, it becomes pretty clear that the function is handling two-byte Unicode characters, that is, UTF-16 (or UCS-2BE).

It would therefore be more suitable to convert the input to UTF-16 instead of UCS-4BE.

Best regards,

Gerd
From: bsittler
2010-02-25 01:13 am (UTC)

Re: A little bug

The reason for that is to properly turn non-BMP characters into "[?]", rather than the incorrect "[?][?]", without lots of extra complexity. At least at the time I wrote that code (maybe it's different now?) Text::Unidecode didn't have any data for non-BMP characters anyhow.
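
To make the "[?]" versus "[?][?]" point concrete, here is a minimal sketch of the difference between decoding through UCS-4BE and through UTF-16BE. It is not the package's actual code: `TINY_TABLE`, `lookup_unit`, `translit_ucs4`, and `translit_utf16` are hypothetical names, the three-entry table stands in for the real Text::Unidecode data, and the sketch assumes PHP 7 or later with the iconv extension available.

```php
<?php
// Hypothetical miniature table: code point => ASCII replacement.
// The real package loads its data from Text::Unidecode instead.
const TINY_TABLE = [
    0x00FC => 'u',      // ü
    0x4E2D => 'Zhong ', // 中
    0x6587 => 'Wen ',   // 文
];

// Transliterate one unit, falling back to "[?]" when no entry exists.
function lookup_unit(int $unit): string
{
    if ($unit < 0x80) {
        return chr($unit); // ASCII passes through unchanged
    }
    return TINY_TABLE[$unit] ?? '[?]';
}

// Decode via UCS-4BE: every code point is exactly one 32-bit unit.
function translit_ucs4(string $utf8): string
{
    // 'N*' reads consecutive unsigned 32-bit big-endian integers.
    $units = array_values(unpack('N*', iconv('UTF-8', 'UCS-4BE', $utf8)));
    return implode('', array_map('lookup_unit', $units));
}

// Decode via UTF-16BE: non-BMP code points become two surrogate units.
function translit_utf16(string $utf8): string
{
    // 'n*' reads consecutive unsigned 16-bit big-endian integers.
    $units = array_values(unpack('n*', iconv('UTF-8', 'UTF-16BE', $utf8)));
    return implode('', array_map('lookup_unit', $units));
}

// "中文", a space, then MUSICAL SYMBOL G CLEF (U+1D11E, outside the BMP).
$input = "\u{4E2D}\u{6587} \u{1D11E}";

echo translit_ucs4($input), "\n";
echo translit_utf16($input), "\n";
```

With the UCS-4BE path the clef is a single unknown unit and yields one "[?]"; with the UTF-16BE path it arrives as two surrogate units, each of which misses the table, so a naive per-unit lookup yields "[?][?]" unless extra surrogate handling is added.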