lithium\g11n\Multibyte::is()

public static method

Checks whether a given string is valid UTF-8.

In quick mode it only checks whether the string contains any byte outside printable ASCII (plus tab, line feed, and carriage return), which merely hints that some multibyte encoding is in use. Don't use quick mode to validate the integrity of UTF-8 encoded strings.
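To illustrate why quick mode is not a validity check, here is a minimal standalone sketch. The helper name `quickCheck()` is hypothetical; it reuses the quick-mode pattern shown in the source so the snippet runs without the framework.

```php
<?php
// Hypothetical standalone helper; mirrors the quick-mode pattern of Multibyte::is().
function quickCheck(string $string): bool {
	// Matches any byte outside tab, LF, CR and printable ASCII (0x20-0x7E).
	return (bool) preg_match('/[^\x09\x0A\x0D\x20-\x7E]/m', $string);
}

var_dump(quickCheck('plain ascii'));  // bool(false): only ASCII bytes present
var_dump(quickCheck("caf\xC3\xA9"));  // bool(true): valid UTF-8 ("café")
var_dump(quickCheck("caf\xE9"));      // bool(true): ISO-8859-1 byte, not valid UTF-8
```

The last two calls return the same result although only one of the strings is well-formed UTF-8, which is why quick mode should only be treated as a hint.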

Meaning of the regular expression used in full (non-quick) mode:

'[\x09\x0A\x0D\x20-\x7E]';            // ASCII
'|[\xC2-\xDF][\x80-\xBF]';            // non-overlong 2-byte
'|\xE0[\xA0-\xBF][\x80-\xBF]';        // excluding overlongs
'|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'; // straight 3-byte
'|\xED[\x80-\x9F][\x80-\xBF]';        // excluding surrogates
'|\xF0[\x90-\xBF][\x80-\xBF]{2}';     // planes 1-3
'|[\xF1-\xF3][\x80-\xBF]{3}';         // planes 4-15
'|\xF4[\x80-\x8F][\x80-\xBF]{2}';     // plane 16
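As a rough sketch of how these alternatives accept or reject byte sequences, the following standalone function assembles the same pattern as full mode. The name `isValidUtf8()` is hypothetical and only serves this illustration; it is not part of the library.

```php
<?php
// Hypothetical standalone helper assembling the full-mode pattern of Multibyte::is().
function isValidUtf8(string $string): bool {
	$regex  = '/\A(';
	$regex .= '[\x09\x0A\x0D\x20-\x7E]';            // ASCII
	$regex .= '|[\xC2-\xDF][\x80-\xBF]';            // non-overlong 2-byte
	$regex .= '|\xE0[\xA0-\xBF][\x80-\xBF]';        // excluding overlongs
	$regex .= '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'; // straight 3-byte
	$regex .= '|\xED[\x80-\x9F][\x80-\xBF]';        // excluding surrogates
	$regex .= '|\xF0[\x90-\xBF][\x80-\xBF]{2}';     // planes 1-3
	$regex .= '|[\xF1-\xF3][\x80-\xBF]{3}';         // planes 4-15
	$regex .= '|\xF4[\x80-\x8F][\x80-\xBF]{2}';     // plane 16
	$regex .= ')*\z/m';
	return (bool) preg_match($regex, $string);
}

var_dump(isValidUtf8("\xE2\x82\xAC"));     // bool(true): U+20AC, straight 3-byte
var_dump(isValidUtf8("\xF0\x9F\x98\x80")); // bool(true): U+1F600, plane 1
var_dump(isValidUtf8("\xC0\xAF"));         // bool(false): overlong 2-byte form of "/"
var_dump(isValidUtf8("\xED\xA0\x80"));     // bool(false): surrogate U+D800
var_dump(isValidUtf8("\xF4\x90\x80\x80")); // bool(false): beyond U+10FFFF
var_dump(isValidUtf8("caf\xE9"));          // bool(false): lead byte without continuation bytes
```

Note that the ASCII alternative admits only tab, line feed, carriage return, and 0x20-0x7E, so a string containing other control bytes such as NUL is rejected by this pattern.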

Parameters

  • string $string

    The string to analyze.

  • array $options

    Toggles the mode via the 'quick' option, which defaults to false.

Returns

boolean

Returns true if the string is valid UTF-8. In quick mode, returns true if the string contains any byte outside the checked ASCII range.

Source

	public static function is($string, array $options = []) {
		$defaults = ['quick' => false];
		$options += $defaults;

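		if ($options['quick']) {
			// Quick mode: match any byte outside tab, LF, CR and printable ASCII.
			$regex = '/[^\x09\x0A\x0D\x20-\x7E]/m';
		} else {
			// Full mode: the whole string must consist of well-formed UTF-8 sequences.
			$regex  = '/\A(';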
			$regex .= '[\x09\x0A\x0D\x20-\x7E]';
			$regex .= '|[\xC2-\xDF][\x80-\xBF]';
			$regex .= '|\xE0[\xA0-\xBF][\x80-\xBF]';
			$regex .= '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}';
			$regex .= '|\xED[\x80-\x9F][\x80-\xBF]';
			$regex .= '|\xF0[\x90-\xBF][\x80-\xBF]{2}';
			$regex .= '|[\xF1-\xF3][\x80-\xBF]{3}';
			$regex .= '|\xF4[\x80-\x8F][\x80-\xBF]{2}';
			$regex .= ')*\z/m';
		}
		return (boolean) preg_match($regex, $string);
	}