Saturday, 3 October 2015

SERP Scrap: Moved permanently issue with curl?

"Moved Permanently" issue when scraping Google with curl?

Recently I have been playing with SERP results and found that the main problem is scraping the result page itself: a plain curl request often comes back with a "Moved Permanently" response instead of the results. I have found two options, which suit different server and local environment scenarios.

These scripts can be used for any "Moved Permanently" issue; they are not limited to SERP parsing.


1 - Allow FollowLocation
function getcurl($url)
{
    // create curl resource
    $ch = curl_init();

    // set url
    curl_setopt($ch, CURLOPT_URL, $url);

    // return the transfer as a string
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');

    // extra line to add for redirection, in case a redirect occurs
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

    // $output contains the output string
    $output = curl_exec($ch);

    // close curl resource to free up system resources
    curl_close($ch);

    return $output;
}

print_r(getcurl("http://www.google.com/search?q=healthy+Diet"));

This first workaround gives you the desired page result. You can then apply simplehtmldom or a similar parser to work with it further.
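To illustrate the parsing step mentioned above, here is a minimal sketch that uses PHP's built-in DOMDocument instead of simplehtmldom to pull the links out of a fetched page. The function name extractLinks and the sample markup are assumptions made up for this example; the markup of a real results page is different and changes over time, so treat this as a starting point only.

```php
<?php
// Hypothetical helper: collect href/text pairs from every <a> tag in an HTML string.
function extractLinks($html)
{
    $links = array();
    if (trim($html) === '') {
        return $links; // nothing to parse
    }

    $dom = new DOMDocument();
    // Real-world pages are rarely valid HTML; keep parser warnings out of the output.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        if ($href !== '') {
            $links[] = array('href' => $href, 'text' => trim($anchor->textContent));
        }
    }
    return $links;
}

// Sample markup invented for this example (not real search output).
$sample = '<html><body><a href="http://example.com/a">First</a><a>No href</a>'
        . '<a href="/b"> Second </a></body></html>';
$links = extractLinks($sample);
print_r($links);
```

In practice you would pass the string returned by getcurl() to the helper and then filter the links down to the ones you care about.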

2 - A Custom function
function geturlBycurl($url)
{
    if (!function_exists('curl_init')) {
        die('cURL Must be installed for geturlBycurl function to work. Ask your host to enable it or uncomment extension=php_curl.dll in php.ini');
    }

    $curl = curl_init();

    // request headers that make the request look like a regular browser
    $header = array();
    $header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
    $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    $header[] = "Cache-Control: max-age=0";
    $header[] = "Connection: keep-alive";
    $header[] = "Keep-Alive: 300";
    $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[] = "Accept-Language: en-us,en;q=0.5";
    $header[] = "Pragma: ";

    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0 Firefox/5.0');
    curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
    // include the response headers in the output so the Location header can be read
    curl_setopt($curl, CURLOPT_HEADER, true);
    curl_setopt($curl, CURLOPT_REFERER, $url);
    curl_setopt($curl, CURLOPT_ENCODING, 'gzip,deflate');
    curl_setopt($curl, CURLOPT_AUTOREFERER, true);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    //curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true); // CURLOPT_FOLLOWLOCATION disabled; redirects are followed manually below
    curl_setopt($curl, CURLOPT_TIMEOUT, 60);

    $html = curl_exec($curl);
    $status = curl_getinfo($curl);
    curl_close($curl);

    if ($status['http_code'] != 200) {
        if ($status['http_code'] == 301 || $status['http_code'] == 302) {
            // split the headers from the body and read the redirect target
            list($header) = explode("\r\n\r\n", $html, 2);
            $matches = array();
            if (!preg_match("/(Location:|URI:)[^\r\n]*/i", $header, $matches)) {
                return '';
            }
            $url = trim(str_replace($matches[1], "", $matches[0]));
            $url_parsed = parse_url($url);
            // follow the redirect by calling this function again with the new URL
            return ($url_parsed !== false) ? geturlBycurl($url) : '';
        }

        // any other status: append the curl info to an error log
        $oline = '';
        foreach ($status as $key => $eline) {
            $oline .= '[' . $key . ']' . (is_array($eline) ? json_encode($eline) : $eline) . ' ';
        }
        $line = $oline . " \r\n " . $url . "\r\n-----------------\r\n";
        $handle = @fopen('./curl.error.log', 'a');
        if ($handle) {
            fwrite($handle, $line);
            fclose($handle);
        }
        return FALSE;
    }

    // note: because CURLOPT_HEADER is enabled, the returned string starts with the response headers
    return $html;
}

print_r(geturlBycurl("http://www.google.com/search?q=healthy+Diet"));

Main source: Stack Overflow.

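If safe mode or open_basedir is enabled, older PHP versions refuse to turn on CURLOPT_FOLLOWLOCATION, so the first option does not work there. In that case the second script, which reads the Location header and follows the redirect itself, can be helpful.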
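The manual redirect handling comes down to reading the Location header out of the raw response. Here is a minimal sketch of just that step, separated out so it can be tried without a network connection. The function name extractRedirectTarget and the sample response are assumptions made up for this example, and relative Location values are returned unchanged rather than resolved against the original URL.

```php
<?php
// Hypothetical helper: return the redirect target from a raw HTTP response, or null if none.
function extractRedirectTarget($rawResponse)
{
    // Headers end at the first blank line; the body (if any) follows.
    $parts = explode("\r\n\r\n", $rawResponse, 2);

    // Header names are case-insensitive, so "Location:" and "location:" both match.
    if (preg_match('/^Location:[ \t]*(\S+)[ \t]*\r?$/mi', $parts[0], $matches)) {
        return $matches[1];
    }
    return null;
}

// Sample response invented for this example.
$sample = "HTTP/1.1 301 Moved Permanently\r\n"
        . "location: https://www.google.com/search?q=healthy+Diet\r\n"
        . "Content-Length: 0\r\n\r\n";
$target = extractRedirectTarget($sample);
echo $target, "\n";
```

Keeping this step in its own function also makes it easier to add a redirect counter around it, so that two pages redirecting to each other cannot cause endless recursion.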
