Revision: 5485
Initial Code
Initial URL
Initial Description
Initial Title
Initial Tags
Initial Language
at March 11, 2008 19:13 by localhorst
Initial Code
function strip_punctuation( $text ) { $urlbrackets = '\[\]\(\)'; $urlspacebefore = ':;\'_\*%@&?!' . $urlbrackets; $urlspaceafter = '\.,:;\'\-_\*@&\/\\\\\?!#' . $urlbrackets; $urlall = '\.,:;\'\-_\*%@&\/\\\\\?!#' . $urlbrackets; $specialquotes = '\'"\*<>'; $fullstop = '\x{002E}\x{FE52}\x{FF0E}'; $comma = '\x{002C}\x{FE50}\x{FF0C}'; $arabsep = '\x{066B}\x{066C}'; $numseparators = $fullstop . $comma . $arabsep; $numbersign = '\x{0023}\x{FE5F}\x{FF03}'; $percent = '\x{066A}\x{0025}\x{066A}\x{FE6A}\x{FF05}\x{2030}\x{2031}'; $prime = '\x{2032}\x{2033}\x{2034}\x{2057}'; $nummodifiers = $numbersign . $percent . $prime; return preg_replace( array( // Remove separator, control, formatting, surrogate, // open/close quotes. '/[\p{Z}\p{Cc}\p{Cf}\p{Cs}\p{Pi}\p{Pf}]/u', // Remove other punctuation except special cases '/\p{Po}(?<![' . $specialquotes . $numseparators . $urlall . $nummodifiers . '])/u', // Remove non-URL open/close brackets, except URL brackets. '/[\p{Ps}\p{Pe}](?<![' . $urlbrackets . '])/u', // Remove special quotes, dashes, connectors, number // separators, and URL characters followed by a space '/[' . $specialquotes . $numseparators . $urlspaceafter . '\p{Pd}\p{Pc}]+((?= )|$)/u', // Remove special quotes, connectors, and URL characters // preceded by a space '/((?<= )|^)[' . $specialquotes . $urlspacebefore . '\p{Pc}]+/u', // Remove dashes preceded by a space, but not followed by a number '/((?<= )|^)\p{Pd}+(?![\p{N}\p{Sc}])/u', // Remove consecutive spaces '/ +/', ), ' ', $text ); }
Initial URL
http://nadeausoftware.com/articles/2007/9/php_tip_how_strip_punctuation_characters_web_page#Removinglineparagraphandwordseparators
Initial Description
When processing text for a search engine or analysis tool, code needs to strip out punctuation, formatting, spacing, and control characters to reveal indexable text. In international text there are hundreds of these characters, and some should be removed in one context, but not in another. This tip shows how.
Initial Title
Strip punctuation from text.
Initial Tags
regex
Initial Language
PHP