from PHPDIG
function isUTF8($str) {
    // Valid UTF-8 survives a UTF-8 -> UTF-32 -> UTF-8 round trip unchanged.
    return $str === mb_convert_encoding(mb_convert_encoding($str, "UTF-32", "UTF-8"), "UTF-8", "UTF-32");
}
mb_detect_encoding
(PHP 4 >= 4.0.6, PHP 5)
mb_detect_encoding — Detect character encoding
Description
string mb_detect_encoding ( string $str [, mixed $encoding_list = mb_detect_order() [, bool $strict = false ]] )
Detects character encoding in string str.
Parameters
- str: The string being detected.
- encoding_list: A list of character encodings. The encoding order may be specified as an array or as a comma-separated list string. If encoding_list is omitted, detect_order is used.
- strict: Specifies whether to use strict encoding detection. The default is FALSE.
Return Values
The detected character encoding or FALSE if the encoding cannot be
detected from the given string.
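A minimal sketch of handling the FALSE return value. The helper name detectOrNull() and the sample byte strings are assumptions made for illustration; only mb_detect_encoding() itself comes from this page. Strict mode is used so that a candidate encoding only qualifies when the whole string is valid in it.

```php
<?php
// Hypothetical helper: returns the detected encoding name, or null when
// none of the candidate encodings fits the string.
function detectOrNull(string $text, array $candidates): ?string
{
    // strict = true: only encodings in which $text is fully valid qualify.
    $encoding = mb_detect_encoding($text, $candidates, true);

    // Compare with === so that FALSE is not confused with a string result.
    return $encoding === false ? null : $encoding;
}

var_dump(detectOrNull("caf\xC3\xA9", ['ASCII', 'UTF-8'])); // string(5) "UTF-8"
var_dump(detectOrNull("\xFF\xFE", ['ASCII', 'UTF-8']));     // NULL
```

Checking the result with === false before using it avoids passing FALSE on to functions that expect an encoding name.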
Examples
Example #1 mb_detect_encoding() example
<?php
/* Detect character encoding with current detect_order */
echo mb_detect_encoding($str);
/* "auto" is expanded according to mbstring.language */
echo mb_detect_encoding($str, "auto");
/* Specify encoding_list character encoding by comma separated list */
echo mb_detect_encoding($str, "JIS, eucjp-win, sjis-win");
/* Use array to specify encoding_list */
$ary[] = "ASCII";
$ary[] = "JIS";
$ary[] = "EUC-JP";
echo mb_detect_encoding($str, $ary);
?>
mb_detect_encoding
sunggsun
15-Aug-2006 09:26
chris AT w3style.co DOT uk
03-Aug-2006 11:22
Based on the preg_match() snippet below, I needed something faster and less specific. That function works and is brilliant, but it scans the entire string and checks that it conforms to UTF-8. I wanted something that only checks whether a string contains UTF-8 characters, so that I could switch the character encoding from iso-8859-1 to utf-8.
I modified the pattern to look only for non-ASCII multibyte sequences in the UTF-8 range and to stop once it finds at least one multibyte sequence. This is quite a lot faster.
<?php
function detectUTF8($string)
{
return preg_match('%(?:
[\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
|\xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
|\xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
|\xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
|[\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
|\xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)+%xs', $string);
}
?>
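The use case described above (switching text from iso-8859-1 to utf-8 only when it is not already UTF-8) might look like the sketch below. It deliberately swaps in mb_check_encoding() for the regular expression so the example is self-contained, and it validates the whole string rather than stopping at the first multibyte sequence. The function name toUtf8() is hypothetical, and the assumption that every non-UTF-8 input is ISO-8859-1 is mine.

```php
<?php
// Hypothetical helper: returns $text as UTF-8.
// Assumption: anything that is not valid UTF-8 is ISO-8859-1.
function toUtf8(string $text): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8 (pure ASCII also lands here)
    }

    // Re-encode each ISO-8859-1 byte as the matching UTF-8 sequence.
    return mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
}

echo bin2hex(toUtf8("caf\xE9")), "\n";     // 636166c3a9
echo bin2hex(toUtf8("caf\xC3\xA9")), "\n"; // 636166c3a9
```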
telemach
28-Jul-2005 03:48
Beware: even if you need to distinguish between UTF-8 and ISO-8859-1 and you use the following detection order (as Chrigu suggests),
mb_detect_encoding('accentuée' , 'UTF-8, ISO-8859-1')
returns ISO-8859-1, while
mb_detect_encoding('accentué' , 'UTF-8, ISO-8859-1')
returns UTF-8.
Bottom line: a trailing 'é' (and probably other accented characters) misleads mb_detect_encoding.
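One possible workaround is the strict parameter documented above. The sketch below assumes the test strings are ISO-8859-1 encoded, so 'é' is the single byte 0xE9. At the end of a string that byte looks like the start of an unfinished UTF-8 sequence, which is likely why non-strict detection is misled. The variable names are illustrative only.

```php
<?php
$order = 'UTF-8, ISO-8859-1';

// ISO-8859-1 bytes: 'é' is 0xE9.
$latin1Trailing = "accentu\xE9";  // 'accentué'  (0xE9 is the last byte)
$latin1Middle   = "accentu\xE9e"; // 'accentuée' (0xE9 followed by 'e')

// With strict = true, an incomplete or invalid UTF-8 sequence rules UTF-8
// out, so ISO-8859-1 is the only candidate left in both cases.
var_dump(mb_detect_encoding($latin1Trailing, $order, true)); // string(10) "ISO-8859-1"
var_dump(mb_detect_encoding($latin1Middle, $order, true));   // string(10) "ISO-8859-1"
```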
Chrigu
29-Mar-2005 05:32
If you need to distinguish between UTF-8 and ISO-8859-1 encoding, list UTF-8 first in your encoding_list:
mb_detect_encoding($string, 'UTF-8, ISO-8859-1');
If you list ISO-8859-1 first, mb_detect_encoding() will always return ISO-8859-1.
php-note-2005 at ryandesign dot com
17-Feb-2005 04:57
Much simpler UTF-8-ness checker using a regular expression created by the W3C:
<?php
// Returns true if $string is valid UTF-8 and false otherwise.
function is_utf8($string) {
// From http://w3.org/International/questions/qa-forms-utf-8.html
return preg_match('%^(?:
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$%xs', $string);
} // function is_utf8
?>
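As a shorter alternative, PCRE's own UTF-8 validation can be used: with the u modifier, preg_match() fails when the subject is not valid UTF-8. This is a different technique from the W3C pattern above, and it is slightly more permissive, because it also accepts ASCII control characters that the W3C pattern rejects. The function name is_utf8_pcre() is hypothetical.

```php
<?php
// Hypothetical helper: true if $string is valid UTF-8 according to PCRE.
function is_utf8_pcre(string $string): bool
{
    // The empty pattern matches any valid subject, so the result is 1.
    // On malformed UTF-8, preg_match() returns false instead of matching.
    return preg_match('//u', $string) === 1;
}

var_dump(is_utf8_pcre("caf\xC3\xA9")); // bool(true)
var_dump(is_utf8_pcre("caf\xE9"));     // bool(false)
```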
jaaks at playtech dot com
14-Jan-2005 09:27
The last example for verifying UTF-8 (valid_utf8() by maarten) has one small bug. If a 10xxxxxx byte occurs alone, i.e. not as part of a multibyte character, it is accepted even though this violates the UTF-8 rules. Make the following replacement to fix it.
Replace
} // goto next char
with
} else {
return false; // 10xxxxxx occurring alone
} // goto next char
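For illustration, here is a compact sketch of a byte-mask validator with that else branch already applied. It is a condensed rewrite under assumptions, not the original author's code: the name valid_utf8_fixed() is hypothetical, and the lead-byte checks are folded into one loop. Like the original, it only checks the bit patterns of lead and continuation bytes, so overlong forms and surrogates are still accepted.

```php
<?php
// Hypothetical compact variant of the validator with the fix applied.
function valid_utf8_fixed(string $string): bool
{
    $len = strlen($string);
    $i = 0;
    while ($i < $len) {
        $byte = ord($string[$i++]);

        // Work out how many continuation bytes the lead byte announces.
        if (($byte & 0x80) === 0x00) {
            $extra = 0;   // 0xxxxxxx: single-byte (ASCII)
        } elseif (($byte & 0xE0) === 0xC0) {
            $extra = 1;   // 110xxxxx
        } elseif (($byte & 0xF0) === 0xE0) {
            $extra = 2;   // 1110xxxx
        } elseif (($byte & 0xF8) === 0xF0) {
            $extra = 3;   // 11110xxx
        } else {
            return false; // 10xxxxxx occurring alone, or an invalid lead byte
        }

        // Each announced continuation byte must exist and look like 10xxxxxx.
        for ($k = 0; $k < $extra; $k++) {
            if ($i >= $len || (ord($string[$i++]) & 0xC0) !== 0x80) {
                return false;
            }
        }
    }
    return true;
}

var_dump(valid_utf8_fixed("\xE2\x82\xAC")); // bool(true)  (euro sign)
var_dump(valid_utf8_fixed("abc\x80def"));   // bool(false) (lone 10xxxxxx byte)
```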
maarten
13-Jan-2005 12:55
Sometimes mb_detect_encoding is not what you need. When using pdflib, for example, you want to VERIFY that a string is correct UTF-8. mb_detect_encoding reports some ISO-8859-1 encoded text as UTF-8.
To verify UTF-8, use the following:
//
// utf8 encoding validation developed based on Wikipedia entry at:
// http://en.wikipedia.org/wiki/UTF-8
//
// Implemented as a recursive descent parser based on a simple state machine
// copyright 2005 Maarten Meijer
//
// This cries out for a C-implementation to be included in PHP core
//
function valid_1byte($char) {
    if(!is_int($char)) return false;
    return ($char & 0x80) == 0x00;
}
function valid_2byte($char) {
    if(!is_int($char)) return false;
    return ($char & 0xE0) == 0xC0;
}
function valid_3byte($char) {
    if(!is_int($char)) return false;
    return ($char & 0xF0) == 0xE0;
}
function valid_4byte($char) {
    if(!is_int($char)) return false;
    return ($char & 0xF8) == 0xF0;
}
function valid_nextbyte($char) {
    if(!is_int($char)) return false;
    return ($char & 0xC0) == 0x80;
}
function valid_utf8($string) {
    $len = strlen($string);
    $i = 0;
    while( $i < $len ) {
        $char = ord(substr($string, $i++, 1));
        if(valid_1byte($char)) { // continue
            continue;
        } else if(valid_2byte($char)) { // check 1 byte
            if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                return false;
        } else if(valid_3byte($char)) { // check 2 bytes
            if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                return false;
            if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                return false;
        } else if(valid_4byte($char)) { // check 3 bytes
            if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                return false;
            if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                return false;
            if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                return false;
        } // goto next char
    }
    return true; // done
}
For a drawing of the state machine, see http://www.xs4all.nl/~mjmeijer/unicode.png and http://www.xs4all.nl/~mjmeijer/unicode2.png