一步一步学网络爬虫（从python到scrapy）

本文转自sunnyxiaohu的博客,仅为个人学习资料保存,如有侵权,请告知.

大概花了一个星期的时间，学习了一下网络爬虫的知识，现在使用scrapy能爬一些基本的网页，图片，解决网页编码兼容问题，基础的模拟登陆。对于有些模拟登陆，由于其提交的表单要经过js进行处理后提交；更难的其网页也是经js渲染的，要学会一步步去分析，没有太多的去深入，但我会提到基本的分析方法。
参考文章：
1、http://www.runoob.com/ 一个很好的语言语法入门学习的网站，我主要用其学习了Python的语法。
2、http://blog.csdn.net/column/details/why-bug.html 此博客讲了一些网络爬虫的基础知识，包括http,url等，而且一步步讲解了实现爬虫的整个过程。
3、http://doc.scrapy.org/en/latest/intro/tutorial.html scrapy框架的学习教程，从安装讲到应用到常见问题，是个不可多得的参考手册，至少过一遍，对于想深入研究的同学，一定要多看几遍。
4、http://blog.csdn.net/u012150179/article/details/34486677 对于中文输出与保存，实现多网页的爬取，做了实现。
5、http://www.jianshu.com/p/b7f41df6202d
http://www.jianshu.com/p/36a39ea71bfd
对于怎么实现模拟登陆做了较好的解释和实现，当然由于技术的不断更新和动态变化，网站的反爬虫的技术也在不断更新，具体情况，应具体分析。

下面正式进入学习：
环境：ubuntu14.04
一、python
1、python的下载和安装：https://www.python.org/downloads/ 在链接中找到自己需要的版本，记得在研究中基本不用version>3.0的版本，然而有为了支持一些新的功能，基本上version>2.70 and version<3.0是一个比较合适的选择。由于ubuntu14.04的底层有些使用python实现的，所以都带了python,(python2.74的版本或者其它）如果需要不同的版本可在不删除原有版本的基础上下载新版本，并修改软链接即可。ln -s python pythonx.xx中间若有问题，请自行百度解决。
2、python的基础知识学习。熟悉一下基本的语法，重点关注列表，元组，字典，函数和类。其它的若有问题，再返回去学习吧，学习链接在参考中已给出，练习一下，一两天就差不多能搞定了。

二、网络爬虫的基础知识
1、网络爬虫的定义、浏览网页的过程、URI和URL的概念和举例、URL的理解和举例。
2、正则表达式
自己练习一下，如果记不住了看看下面的表。
这里写图片描述

三、scrapy
1、scrapy的安装
http://doc.scrapy.org/en/latest/intro/install.html 根据你自己应用的平台进行选择。比较简单，不做过多的解释。
2、一个scrapy例子
http://doc.scrapy.org/en/latest/intro/tutorial.html 有几点要注意一下：一是知道如何去调试，二是xpath()和css()，还有要学会使用firebox和firebug分析网页源码和表单提交情况，看到前面，我们基本能实现单网页的爬取。
3、讲讲scrapy框架
对scrapy的框架和运行有一个具体的思路之后我们才能更好的了解爬虫的整个情况，尤其是出了问题之后的调试
http://blog.csdn.net/u012150179/article/details/34441655
http://scrapy-chs.readthedocs.org/zh_CN/latest/_images/scrapy_architecture.png
这里写图片描述
4、scrapy自动多网页爬取
总结起来实现有三种方式:第一种，先prase眼前的url,得到各个item后，在获得新的url,调用prase_item进行解析，在prase_item中又callback parse function 。第二种，也就是例子（scrapy的一个例子）中后部分提到的：

def parse(self, response):
        for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

prase只做url的获取，具体的解析工作交给prase_dir_contents去做。第三种、就是根据具体的爬虫去实现了,Generic Spiders（例子中我们用的是Simple Spiders）中用的比较多的有CrawlSpider,这里我们介绍这个的实现

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default). follow
        Rule(LinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item. process
        Rule(LinkExtractor(allow=('item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
    .......

这便是其的Link Extractors机制。
5、中文输出与存储编码
http://blog.csdn.net/u012150179/article/details/34450547
存储：在 item.py中编码：item[‘title’] = [t.encode(‘utf-8’) for t in title] ，在pipe.py中解码：

    import json  
    import codecs  


    class W3SchoolPipeline(object):  
        def __init__(self):  
            self.file = codecs.open('w3school_data_utf8.json', 'wb', encoding='utf-8')  

        def process_item(self, item, spider):  
            line = json.dumps(dict(item)) + '\n'  
            # print line  
            self.file.write(line.decode("unicode_escape"))  
            return item

输出：

for t in title:  
    print t.encode('utf-8')

6、图片或者文件的下载
http://doc.scrapy.org/en/latest/topics/media-pipeline.html
这里，主要提供一个抓取妹子图片的例子。注意怎么实现文件名的替换与改写。

class ImagedownPipeline(object):
    def process_item(self, item, spider):
        if 'image_url' in item:
            #images = []
            dir_path = '%s/%s' % (settings.IMAGES_STORE, spider.name)

#            if not os.path.exists(dir_path):
#                os.makedirs(dir_path)
            for image_url in item['image_url']:
                image_page = item['image_page']
                dir_path = '%s/%s' %(dir_path, image_page)
                if not os.path.exists(dir_path):
                    os.makedirs(dir_path)

                us = image_url.split('/')[3:]
                image_file_name = '_'.join(us)
                file_path = '%s/%s' % (dir_path, image_file_name)
                #images.append(file_path)
                if os.path.exists(file_path):
                    continue

                with open(file_path, 'wb') as handle:
                    response = requests.get(image_url, stream=True)
                    for block in response.iter_content():
                        #if not block:
                         #   break
                        handle.write(block)

            #item['images'] = images
        return item

this kind of process would make more efficiency

class MyImagesPipeline(ImagesPipeline):
    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
#        print settings.IMAGES_STORE + image_paths[0]
        if not image_paths:
            raise DropItem("Item contains no images")
        for img_url in item['image_urls']:
            us = img_url.split('/')[3:]
            newname = '_'.join(us)
        dir_path = settings.IMAGES_STORE+"full/"+item['image_page']+"/"
#        print dir_path
        if not os.path.exists(dir_path):
            os.makedirs(dir_path)
        os.rename(settings.IMAGES_STORE + image_paths[0], dir_path + newname)

        return item

7、scrapy模拟登陆
此处不做过多的解释，给出两个例子。里面分析的很详细
http://www.jianshu.com/p/b7f41df6202d
http://www.jianshu.com/p/36a39ea71bfd
还有，有时需要模拟Request的headers，这在http://doc.scrapy.org/en/latest/topics/request-response.html 中有很好的说明，至于怎么找到对应的headers，可通过firebug或者其它。
8、处理js渲染过的网页
……
9、验证码的识别
…….
10、分布式爬虫的实现
…….

图图家–Richer的黑科技

一步一步学网络爬虫（从python到scrapy）