Saturday, 3 October 2015

SERP Scraping: Moved permanently issue with curl?

Moved permanently in Google Scraping issue with curl?

Recently I have been playing with SERP results and found that the main problem is scraping the result page. I found two options, each useful in different server and local-environment scenarios.

These scripts can be used for any "Moved Permanently" issue; they are not limited to SERP parsing.


1 - Allow FollowLocation
function getcurl($url){
    // create curl resource
    $ch = curl_init();
    // set url
    curl_setopt($ch, CURLOPT_URL, $url);
    // return the transfer as a string
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
    // extra line to add so that curl follows the redirect in case one occurs
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    // $output contains the output string
    $output = curl_exec($ch);
    // close curl resource to free up system resources
    curl_close($ch);
    return $output;
}

print_r(getcurl("http://www.google.com/search?q=healthy+Diet"));
This first workaround gives you the desired result page; you can then apply simplehtmldom or a similar parser to work with it further.
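As a rough illustration of that next step, here is a minimal sketch that pulls the link targets out of an HTML string. It uses PHP's bundled DOM extension instead of simplehtmldom, and both the helper name extractLinks() and the sample markup are made up for this example; real result pages are structured differently.

```php
<?php
// Hypothetical helper: collect the href of every <a> element in an HTML string.
function extractLinks($html)
{
    $doc = new DOMDocument();
    // Real-world pages are rarely valid HTML, so collect parser warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = array();
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        if ($href !== '') {
            $links[] = $href;
        }
    }
    return $links;
}

// Made-up sample markup; in practice $html would be the string returned by getcurl().
$sample = '<html><body><a href="http://example.com/one">One</a>'
        . '<p>text</p><a href="http://example.com/two">Two</a></body></html>';
print_r(extractLinks($sample));
```

The same loop works with any other tag name, so it can be adapted once you know which elements of the fetched page you actually need.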

2 - A Custom function
function geturlBycurl($url){
    if (!function_exists('curl_init')) {
        die('cURL Must be installed for geturlBycurl function to work. Ask your host to enable it or uncomment extension=php_curl.dll in php.ini');
    }

    $curl = curl_init();
    $header = array();
    $header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
    $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    $header[] = "Cache-Control: max-age=0";
    $header[] = "Connection: keep-alive";
    $header[] = "Keep-Alive: 300";
    $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[] = "Accept-Language: en-us,en;q=0.5";
    $header[] = "Pragma: ";

    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0 Firefox/5.0');
    curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
    // include the response headers in the output so a redirect can be detected
    curl_setopt($curl, CURLOPT_HEADER, true);
    curl_setopt($curl, CURLOPT_REFERER, $url);
    curl_setopt($curl, CURLOPT_ENCODING, 'gzip,deflate');
    curl_setopt($curl, CURLOPT_AUTOREFERER, true);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    // CURLOPT_FOLLOWLOCATION stays disabled; the redirect is followed manually below
    //curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_TIMEOUT, 60);

    $html = curl_exec($curl);
    $status = curl_getinfo($curl);
    curl_close($curl);

    if ($status['http_code'] != 200) {
        if ($status['http_code'] == 301 || $status['http_code'] == 302) {
            // split the headers from the body and look for the redirect target
            list($header) = explode("\r\n\r\n", $html, 2);
            $matches = array();
            if (!preg_match("/(Location:|URI:)[^\r\n]*/i", $header, $matches)) {
                return '';
            }
            $url = trim(str_replace($matches[1], "", $matches[0]));
            $url_parsed = parse_url($url);
            // call this function again with the new URL
            // (the original called geturl(), which is not defined here)
            return ($url_parsed !== false) ? geturlBycurl($url) : '';
        }
        // any other status: write the transfer info to a log file
        $oline = '';
        foreach ($status as $key => $eline) {
            $oline .= '['.$key.']'.$eline.' ';
        }
        $line = $oline." \r\n ".$url."\r\n-----------------\r\n";
        $handle = @fopen('./curl.error.log', 'a');
        if ($handle) {
            fwrite($handle, $line);
            fclose($handle);
        }
        return FALSE;
    }
    return $html;
}

print_r(geturlBycurl("http://www.google.com/search?q=healthy+Diet"));
Main source: Stack Overflow.

If safe mode or open_basedir is enabled, the second script can be helpful, because older PHP versions refuse to enable CURLOPT_FOLLOWLOCATION under those settings.
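The manual redirect handling comes down to reading the Location header out of the raw response. Here is a minimal sketch of that one step in isolation, assuming the response was fetched with CURLOPT_HEADER enabled. The helper name extractRedirectUrl() and the sample response are made up for illustration.

```php
<?php
// Hypothetical helper: return the redirect target found in a raw HTTP
// response (headers followed by body), or null when there is none.
function extractRedirectUrl($rawResponse)
{
    // The header block ends at the first blank line.
    $parts = explode("\r\n\r\n", $rawResponse, 2);
    $headerBlock = $parts[0];

    // Header names are case-insensitive; match a single "Location:" line.
    if (preg_match('/^Location:[ \t]*(.*)$/mi', $headerBlock, $matches)) {
        return trim($matches[1]);
    }
    return null;
}

// Made-up response, shaped like what curl returns for a 301.
$response = "HTTP/1.1 301 Moved Permanently\r\n"
          . "Location: http://www.example.com/new\r\n"
          . "Content-Length: 0\r\n\r\n";
var_dump(extractRedirectUrl($response));
```

Only the header block is searched, so a "Location:" string that happens to appear in the page body is not mistaken for a redirect.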

Tuesday, 29 September 2015

PHP - strpos vs stripos

Strpos vs Stripos
The main difference is that strpos is case-sensitive, whereas stripos is case-insensitive.
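A quick side-by-side sketch of that difference, using a made-up sample string and the same needle for both functions:

```php
<?php
$haystack = 'Healthy Diet';

// strpos() is case-sensitive: lowercase 'diet' does not match 'Diet'.
$sensitive = strpos($haystack, 'diet');

// stripos() ignores case, so the same needle is found at offset 8.
$insensitive = stripos($haystack, 'diet');

var_dump($sensitive);   // bool(false)
var_dump($insensitive); // int(8)
```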
STRIPOS
$findme = 'a';
$mystring1 = 'xyz';
$mystring2 = 'ABC';

$pos1 = stripos($mystring1, $findme);
$pos2 = stripos($mystring2, $findme);

// Nope, 'a' is certainly not in 'xyz'
if ($pos1 === false) {
    echo "The string '$findme' was not found in the string '$mystring1'";
}

// Note our use of ===. Simply == would not work as expected
// because the position of 'a' is the 0th (first) character.
if ($pos2 !== false) {
    echo "We found '$findme' in '$mystring2' at position $pos2";
}
Code credit: stripos

STRPOS
$mystring = 'abc';
$findme = 'a';
$pos = strpos($mystring, $findme);

// Note our use of ===. Simply == would not work as expected
// because the position of 'a' was the 0th (first) character.
if ($pos === false) {
    echo "The string '$findme' was not found in the string '$mystring'";
} else {
    echo "The string '$findme' was found in the string '$mystring'";
    echo " and exists at position $pos";
}
Code credit: strpos
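The comments in both examples warn against using ==. Here is a small sketch of why: a match at position 0 is loosely equal to false. The wrapper containsText() is a made-up helper name, not a PHP built-in.

```php
<?php
// Hypothetical helper: true when $needle occurs anywhere in $haystack.
function containsText($haystack, $needle)
{
    // Strict comparison keeps a match at offset 0 from being read as "not found".
    return strpos($haystack, $needle) !== false;
}

$pos = strpos('abc', 'a');   // int(0): found at the very first character

var_dump($pos == false);     // bool(true): loose comparison treats 0 as false
var_dump($pos === false);    // bool(false): strict comparison sees an int, not false
var_dump(containsText('abc', 'a')); // bool(true)
```

On PHP 8 and later, the built-in str_contains() covers the same need when you only want a yes/no answer.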