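Background: an image-fetching script written by a contractor received a 404 response but still saved the nginx output into a .jpg file. When the file was read back later, the image could not be displayed, because its content was the 404 page.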
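When fetching a page with curl, success is usually judged by the return value of curl_exec. But I found that when a site returns a 404 error together with a page body, curl fetches the "page not found" content as well. Judging success by the result of curl_exec alone is therefore misleading, as in the following code: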
// The trailing "-" makes this URL return a 404 page that still has a body.
$url = 'http://www.cq.xinhuanet.com/house/2008-11/24/content_14996426.htm-';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_ENCODING, "gzip, deflate");
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; CIBA; InfoPath.1; .NET CLR 2.0.50727)");
curl_setopt($ch, CURLOPT_MAXREDIRS, 5);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // follow Location redirects automatically
curl_setopt($ch, CURLOPT_TIMEOUT, 10); // timeout in seconds
curl_setopt($ch, CURLOPT_HEADER, 1); // include the response headers in the output
//curl_setopt($ch, CURLOPT_NOBODY, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the response instead of printing it
$contents = curl_exec($ch);
curl_close($ch);
// Misleading check: a 404 page with a body is neither false nor empty, so it passes.
if (false === $contents || empty($contents)) {
    echo "Failed to fetch the page!";
} else {
    echo $contents;
}
I checked the manual and found that curl also provides a curl_getinfo function. The HTTP status code is what should be checked:
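// Same curl_setopt() calls as above; read the status before closing the handle.
$contents = curl_exec($ch);
$http_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
// 4xx codes are client errors and 5xx codes are server errors.
if (false === $contents || $http_code >= 400) {
    echo "Request failed!";
    exit;
} else {
    echo $contents;
}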
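Applied to the image-saving problem from the background, the same idea could be sketched roughly as follows. The function names `is_valid_image_response` and `save_image` are hypothetical, and the Content-Type check assumes the server labels real images as `image/*`; neither comes from the original script. Note that `CURLOPT_HEADER` is left off here, since otherwise the response headers would end up inside the saved file.

```php
<?php
// Hypothetical helper: decide whether a fetched response is worth saving as an image.
// $httpCode and $contentType are expected to come from curl_getinfo().
function is_valid_image_response($body, int $httpCode, ?string $contentType): bool
{
    if ($body === false || $body === '') {
        return false; // transport failure or empty body
    }
    if ($httpCode < 200 || $httpCode >= 300) {
        return false; // e.g. a 404 whose body is an HTML error page
    }
    // Assumption: a real image is served with an image/* Content-Type.
    return $contentType !== null && stripos($contentType, 'image/') === 0;
}

// Hypothetical helper: download $url to $path only when the response passes the check.
function save_image(string $url, string $path): bool
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 5);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $body = curl_exec($ch);
    // Read the status and content type before closing the handle.
    $httpCode = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
    $contentType = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
    curl_close($ch);

    $type = is_string($contentType) ? $contentType : null;
    if (!is_valid_image_response($body, $httpCode, $type)) {
        return false; // nothing is written to disk
    }
    return file_put_contents($path, $body) !== false;
}
```

A call such as `save_image('http://example.com/a.jpg', '/tmp/a.jpg')` (placeholder URL and path) would then return false instead of writing a 404 page into the .jpg file. Keeping the decision in a separate function that takes plain values makes it possible to check the logic without any network access.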
Addendum: here is another one I found online:
Add Time:2014-01-15
Author: jackxiang @ 向东博客
URL: https://jackxiang.com/post/1697/
All rights reserved. Reposts must credit the author and link to the original source and this notice.
Last edited by jackxiang on 2014-1-15 16:29