Revision: 22577
Updated Code
at January 16, 2010 09:07 by ginoplusio
Updated Code
function webpage2txt($url) {
    $user_agent = "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)";

    $ch = curl_init();                                 // initialize curl handle
    curl_setopt($ch, CURLOPT_URL, $url);               // set url to fetch
    curl_setopt($ch, CURLOPT_FAILONERROR, 1);          // Fail on errors
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);       // allow redirects
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);       // return into a variable
    curl_setopt($ch, CURLOPT_PORT, 80);                // Set the port number
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);             // times out after 15s
    curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
    $document = curl_exec($ch);
    curl_close($ch);                                   // release the handle
    if ($document === false) {
        return "";                                     // request failed or timed out
    }

    $search = array(
        '@<script[^>]*?>.*?</script>@si',              // Strip out javascript
        '@<style[^>]*?>.*?</style>@siU',               // Strip style tags properly
        '@<[\/\!]*?[^<>]*?>@si',                       // Strip out HTML tags
        '@<![\s\S]*?--[ \t\n\r]*>@',                   // Strip multi-line comments including CDATA
        '/\s{2,}/',                                    // Collapse runs of whitespace
    );
    $text = preg_replace($search, "\n", html_entity_decode($document));

    $pat[0] = "/^\s+/";
    $pat[2] = "/\s+\$/";
    $rep[0] = "";
    $rep[2] = " ";
    $text = preg_replace($pat, $rep, trim($text));

    return $text;
}

echo webpage2txt("http://www.repubblica.it");
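To see what the stripping step does without making a network request, the same patterns can be applied to a literal HTML string. This is only a sketch: the helper name `html2txt_sketch` and the sample markup are made up for illustration and are not part of the snippet.

```php
<?php
// Hypothetical helper (not in the original snippet): applies the same
// regular expressions as webpage2txt() to a string instead of a fetched page.
function html2txt_sketch($html) {
    $search = array(
        '@<script[^>]*?>.*?</script>@si',   // script blocks
        '@<style[^>]*?>.*?</style>@siU',    // style blocks
        '@<[\/\!]*?[^<>]*?>@si',            // remaining HTML tags
        '@<![\s\S]*?--[ \t\n\r]*>@',        // multi-line comments
        '/\s{2,}/',                         // runs of whitespace
    );
    // Each match is replaced by a newline, then whitespace runs collapse to one newline.
    $text = preg_replace($search, "\n", html_entity_decode($html));
    // Same leading/trailing cleanup as the snippet.
    return preg_replace(array("/^\s+/", "/\s+\$/"), array("", " "), trim($text));
}

// Made-up sample markup, for illustration only.
$html = '<html><head><style>p { color: red; }</style></head>'
      . '<body><p>Hello &amp; welcome</p><script>alert("x");</script>'
      . '<p>Second line</p></body></html>';

echo html2txt_sketch($html);
```

With this sample input the output is two lines, "Hello & welcome" and "Second line". Note that entities are decoded before the tags are stripped, so an escaped `&lt;b&gt;` in the page text would be treated as a tag and removed as well.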
Revision: 22576
Initial Code
Initial URL
Initial Description
Initial Title
Initial Tags
Initial Language
at January 16, 2010 09:06 by ginoplusio
Initial Code
function webpage2txt($url) {
    $user_agent = "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)";

    $ch = curl_init();                                 // initialize curl handle
    curl_setopt($ch, CURLOPT_URL, $url);               // set url to fetch
    curl_setopt($ch, CURLOPT_FAILONERROR, 1);          // Fail on errors
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);       // allow redirects
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);       // return into a variable
    curl_setopt($ch, CURLOPT_PORT, 80);                // Set the port number
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);             // times out after 15s
    curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
    $document = curl_exec($ch);
    curl_close($ch);                                   // release the handle
    if ($document === false) {
        return "";                                     // request failed or timed out
    }

    $search = array(
        '@<script[^>]*?>.*?</script>@si',              // Strip out javascript
        '@<style[^>]*?>.*?</style>@siU',               // Strip style tags properly
        '@<[\/\!]*?[^<>]*?>@si',                       // Strip out HTML tags
        '@<![\s\S]*?--[ \t\n\r]*>@',                   // Strip multi-line comments including CDATA
        '/\s{2,}/',                                    // Collapse runs of whitespace
    );
    $text = preg_replace($search, "\n", html_entity_decode($document));

    $pat[0] = "/^\s+/";
    $pat[2] = "/\s+\$/";
    $rep[0] = "";
    $rep[2] = " ";
    $text = preg_replace($pat, $rep, trim($text));

    return $text;
}

echo webpage2txt("http://www.rockit.it");
Initial URL
http://www.barattalo.it/2010/01/16/php-web-page-to-text-function/
Initial Description
I found this nice small bot on www.php.net; thanks to the author of the script posted on the preg_replace page. The bot returns the text content of a URL, so it can be used to grab the text of a site and find relevant words to search for.
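As a rough illustration of the "find relevant words" idea, the returned text could be fed to a simple word counter. This is a sketch under assumptions: the function name `top_words_sketch`, the minimum word length, and the sample sentence are all hypothetical, and plain frequency is only one possible notion of relevance.

```php
<?php
// Hypothetical helper (not in the original snippet): rank the most frequent
// words in a plain-text string, e.g. the output of webpage2txt().
function top_words_sketch($text, $limit = 5, $minLength = 4) {
    // Split on anything that is not a letter or digit. The /u flag expects
    // valid UTF-8; preg_split() returns false otherwise.
    $words = preg_split('/[^\p{L}\p{N}]+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    if ($words === false) {
        return array();
    }
    $counts = array();
    foreach ($words as $word) {
        if (strlen($word) < $minLength) {
            continue; // skip short words such as "and" or "the"
        }
        $counts[$word] = isset($counts[$word]) ? $counts[$word] + 1 : 1;
    }
    arsort($counts);                               // highest count first
    return array_slice($counts, 0, $limit, true);  // keep the words as keys
}

// Made-up sample text, for illustration only.
$top = top_words_sketch("Music news: music festival, new music and festival dates", 2);
print_r($top);
```

For the sample sentence the two top entries are `music` (3 occurrences) and `festival` (2). The length check uses byte length and `strtolower()` only lowercases ASCII letters, so accented text would need the mbstring functions instead.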
Initial Title
PHP bot that retrieves the text of a page with CURL
Initial Tags
Initial Language
PHP