
    Python Web Scraping: Setting Up the Scrapy Environment

    How to set up the Scrapy environment

    First, install Python. For instructions on setting up the Python environment, see: https://blog.csdn.net/alice_tl/article/details/76793590

    Next, install Scrapy.

    1. Install Scrapy by running pip install Scrapy in the terminal (note: this works best on a network that can reach PyPI without restrictions).
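
    If pip cannot reach pypi.org reliably from your network, one common workaround (an addition here, not part of the original steps) is to point pip at a PyPI mirror with the -i option; the mirror URL below is only one possible choice:

    # Hypothetical alternative: install Scrapy from a PyPI mirror if pypi.org is slow or blocked
    pip install Scrapy -i https://pypi.tuna.tsinghua.edu.cn/simple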

    The progress output from pip install Scrapy looked like this:

    alicedeMacBook-Pro:~ alice$ pip install Scrapy
    Collecting Scrapy
      Using cached https://files.pythonhosted.org/packages/5d/12/a6197eaf97385e96fd8ec56627749a6229a9b3178ad73866a0b1fb377379/Scrapy-1.5.1-py2.py3-none-any.whl
    Collecting w3lib>=1.17.0 (from Scrapy)
      Using cached https://files.pythonhosted.org/packages/37/94/40c93ad0cadac0f8cb729e1668823c71532fd4a7361b141aec535acb68e3/w3lib-1.19.0-py2.py3-none-any.whl
    Collecting six>=1.5.2 (from Scrapy)
     xxxxxxxxxxxxxxxxxxxxx
          File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/setuptools/dist.py", line 380, in fetch_build_egg
            return cmd.easy_install(req)
          File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/setuptools/command/easy_install.py", line 632, in easy_install
            raise DistutilsError(msg)
        distutils.errors.DistutilsError: Could not find suitable distribution for Requirement.parse('incremental>=16.10.1')
        
        ----------------------------------------
    Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/v1/9x8s5v8x74v86vnpqyttqy280000gn/T/pip-install-U_6VZF/Twisted/

    The install fails with an error indicating that Twisted is missing:

    Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/v1/9x8s5v8x74v86vnpqyttqy280000gn/T/pip-install-U_6VZF/Twisted/

    2. Install Twisted. In the terminal, run: sudo pip install twisted==13.1.0

    alicedeMacBook-Pro:~ alice$ pip install twisted==13.1.0
    Collecting twisted==13.1.0
      Downloading https://files.pythonhosted.org/packages/10/38/0d1988d53f140ec99d37ac28c04f341060c2f2d00b0a901bf199ca6ad984/Twisted-13.1.0.tar.bz2 (2.7MB)
        100% |████████████████████████████████| 2.7MB 398kB/s 
    Requirement already satisfied: zope.interface>=3.6.0 in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python (from twisted==13.1.0) (4.1.1)
    Requirement already satisfied: setuptools in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python (from zope.interface>=3.6.0->twisted==13.1.0) (18.5)
    Installing collected packages: twisted
      Running setup.py install for twisted ... error
        Complete output from command /usr/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/v1/9x8s5v8x74v86vnpqyttqy280000gn/T/pip-install-inJwZ2/twisted/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /private/var/folders/v1/9x8s5v8x74v86vnpqyttqy280000gn/T/pip-record-OmuVWF/install-record.txt --single-version-externally-managed --compile:
        running install
        running build
        running build_py
        creating build
        creating build/lib.macosx-10.13-intel-2.7
        creating build/lib.macosx-10.13-intel-2.7/twisted
        copying twisted/copyright.py -> build/lib.macosx-10.13-intel-2.7/twisted
        copying twisted/_version.py -> build/li

    3. Run sudo pip install scrapy again. The installation still fails, this time with an error saying lxml is missing:

    Could not find a version that satisfies the requirement lxml (from Scrapy) (from versions: )

    No matching distribution found for lxml (from Scrapy)

    alicedeMacBook-Pro:~ alice$ sudo pip install Scrapy
    The directory '/Users/alice/Library/Caches/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
    The directory '/Users/alice/Library/Caches/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
    Collecting Scrapy
      Downloading https://files.pythonhosted.org/packages/5d/12/a6197eaf97385e96fd8ec56627749a6229a9b3178ad73866a0b1fb377379/Scrapy-1.5.1-py2.py3-none-any.whl (249kB)
        100% |████████████████████████████████| 256kB 210kB/s 
    Collecting w3lib>=1.17.0 (from Scrapy)
      xxxxxxxxxxxx
      Downloading https://files.pythonhosted.org/packages/90/50/4c315ce5d119f67189d1819629cae7908ca0b0a6c572980df5cc6942bc22/Twisted-18.7.0.tar.bz2 (3.1MB)
        100% |████████████████████████████████| 3.1MB 59kB/s 
    Collecting lxml (from Scrapy)
      Could not find a version that satisfies the requirement lxml (from Scrapy) (from versions: )
    No matching distribution found for lxml (from Scrapy)

    4. Install lxml with: sudo pip install lxml

    alicedeMacBook-Pro:~ alice$ sudo pip install lxml
    The directory '/Users/alice/Library/Caches/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
    The directory '/Users/alice/Library/Caches/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
    Collecting lxml
      Downloading https://files.pythonhosted.org/packages/a1/2c/6b324d1447640eb1dd240e366610f092da98270c057aeb78aa596cda4dab/lxml-4.2.4-cp27-cp27m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (8.7MB)
        100% |████████████████████████████████| 8.7MB 187kB/s 
    Installing collected packages: lxml
    Successfully installed lxml-4.2.4

    5. Install Scrapy once more with sudo pip install scrapy; this time the installation succeeds.

    alicedeMacBook-Pro:~ alice$ sudo pip install Scrapy
    The directory '/Users/alice/Library/Caches/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
    The directory '/Users/alice/Library/Caches/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
    Collecting Scrapy
      Downloading https://files.pythonhosted.org/packages/5d/12/a6197eaf97385e96fd8ec56627749a6229a9b3178ad73866a0b1fb377379/Scrapy-1.5.1-py2.py3-none-any.whl (249kB)
        100% |████████████████████████████████| 256kB 11.5MB/s 
    Collecting w3lib>=1.17.0 (from Scrapy)
      xxxxxxxxx
    Requirement already satisfied: lxml in /Library/Python/2.7/site-packages (from Scrapy) (4.2.4)
    Collecting functools32; python_version < "3.0" (from parsel>=1.1->Scrapy)
      Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='pypi.org', port=443): Read timed out. (read timeout=15)",)': /simple/functools32/
      Downloading https://files.pythonhosted.org/packages/4b/2a/0276479a4b3caeb8a8c1af2f8e4355746a97fab05a372e4a2c6a6b876165/idna-2.7-py2.py3-none-any.whl (58kB)
        100% |████████████████████████████████| 61kB 66kB/s 
    Installing collected packages: w3lib, cssselect, functools32, parsel, queuelib, PyDispatcher, attrs, pyasn1-modules, service-identity, zope.interface, constantly, incremental, Automat, idna, hyperlink, PyHamcrest, Twisted, Scrapy
      Running setup.py install for functools32 ... done
      Running setup.py install for PyDispatcher ... done
      Found existing installation: zope.interface 4.1.1
        Uninstalling zope.interface-4.1.1:
          Successfully uninstalled zope.interface-4.1.1
      Running setup.py install for zope.interface ... done
      Running setup.py install for Twisted ... done
    Successfully installed Automat-0.7.0 PyDispatcher-2.0.5 PyHamcrest-1.9.0 Scrapy-1.5.1 Twisted-18.7.0 attrs-18.1.0 constantly-15.1.0 cssselect-1.0.3 functools32-3.2.3.post2 hyperlink-18.0.0 idna-2.7 incremental-17.5.0 parsel-1.5.0 pyasn1-modules-0.2.2 queuelib-1.5.0 service-identity-17.0.0 w3lib-1.19.0 zope.interface-4.5.0

    6. Check that Scrapy installed correctly by running scrapy --version.

    If the Scrapy version information appears, for example Scrapy 1.5.1 - no active project, the installation succeeded.

    alicedeMacBook-Pro:~ alice$ scrapy --version
    Scrapy 1.5.1 - no active project
     
    Usage:
      scrapy <command> [options] [args]
     
    Available commands:
      bench         Run quick benchmark test
      fetch         Fetch a URL using the Scrapy downloader
      genspider     Generate new spider using pre-defined templates
      runspider     Run a self-contained spider (without creating a project)
      settings      Get settings values
      shell         Interactive scraping console
      startproject  Create new project
      version       Print Scrapy version
      view          Open URL in browser, as seen by Scrapy
     
      [ more ]      More commands available when run from project directory
     
    Use "scrapy command> -h" to see more info about a command

    PS: If you cannot reach pypi.org normally during the process, or do not install with sudo administrator privileges, you will see errors similar to the following:

    ReadTimeoutError: HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out.

    Exception:
    Traceback (most recent call last):
      File "/Library/Python/2.7/site-packages/pip/_internal/basecommand.py", line 141, in main
        status = self.run(options, args)
      File "/Library/Python/2.7/site-packages/pip/_internal/commands/install.py", line 299, in run
        resolver.resolve(requirement_set)
      File "/Library/Python/2.7/site-packages/pip/_internal/resolve.py", line 102, in resolve
        self._resolve_one(requirement_set, req)
      File "/Library/Python/2.7/site-packages/pip/_internal/resolve.py", line 256, in _resolve_one
        abstract_dist = self._get_abstract_dist_for(req_to_install)
      File "/Library/Python/2.7/site-packages/pip/_internal/resolve.py", line 209, in _get_abstract_dist_for
        self.require_hashes
      File "/Library/Python/2.7/site-packages/pip/_internal/operations/prepare.py", line 283, in prepare_linked_requirement
        progress_bar=self.progress_bar
      File "/Library/Python/2.7/site-packages/pip/_internal/download.py", line 836, in unpack_url
        progress_bar=progress_bar
      File "/Library/Python/2.7/site-packages/pip/_internal/download.py", line 673, in unpack_http_url
        progress_bar)
      File "/Library/Python/2.7/site-packages/pip/_internal/download.py", line 897, in _download_http_url
        _download_url(resp, link, content_file, hashes, progress_bar)
      File "/Library/Python/2.7/site-packages/pip/_internal/download.py", line 617, in _download_url
        hashes.check_against_chunks(downloaded_chunks)
      File "/Library/Python/2.7/site-packages/pip/_internal/utils/hashes.py", line 48, in check_against_chunks
        for chunk in chunks:
      File "/Library/Python/2.7/site-packages/pip/_internal/download.py", line 585, in written_chunks
        for chunk in chunks:
      File "/Library/Python/2.7/site-packages/pip/_internal/download.py", line 574, in resp_read
        decode_content=False):
      File "/Library/Python/2.7/site-packages/pip/_vendor/urllib3/response.py", line 465, in stream
        data = self.read(amt=amt, decode_content=decode_content)
      File "/Library/Python/2.7/site-packages/pip/_vendor/urllib3/response.py", line 430, in read
        raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/contextlib.py", line 35, in __exit__
        self.gen.throw(type, value, traceback)
      File "/Library/Python/2.7/site-packages/pip/_vendor/urllib3/response.py", line 345, in _error_catcher
        raise ReadTimeoutError(self._pool, None, 'Read timed out.')
    ReadTimeoutError: HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out.
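
    Both issues hinted at above can usually be addressed directly; a minimal sketch follows (the -H flag is exactly what pip's own warning suggests, while the longer timeout value is an assumption for slow connections):

    # Run sudo with -H so pip uses root's home directory, as pip's warning recommends,
    # and raise pip's network timeout for slow connections to PyPI (the value 100 is arbitrary)
    sudo -H pip install --default-timeout=100 Scrapy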

    With that, the Scrapy environment is set up as described in the guide.

    Common errors when running Scrapy spiders, and how to fix them

    Following the first Spider exercise, the code below is saved as dmoz_spider.py in the tutorial/spiders directory:

    import scrapy
     
    class DmozSpider(scrapy.Spider):
        # Unique name used to launch this spider: scrapy crawl dmoz
        name = "dmoz"
        allowed_domains = ["dmoz.org"]
        start_urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
        ]
     
        def parse(self, response):
            # Save each downloaded page to a file named after the second-to-last URL segment
            filename = response.url.split("/")[-2]
            with open(filename, 'wb') as f:
                f.write(response.body)

    Run scrapy crawl dmoz in the terminal to try to start the spider.

    Error 1:

    Scrapy 1.6.0 - no active project

    Unknown command: crawl

    alicedeMacBook-Pro:~ alice$ scrapy crawl dmoz
    Scrapy 1.6.0 - no active project
     
    Unknown command: crawl
     
    Use "scrapy" to see available commands

    Cause: running startproject on the command line automatically generates a scrapy.cfg file. When you launch the spider from the command line, crawl looks for scrapy.cfg in the current directory (the official documentation explains this as well). If no scrapy.cfg is found, Scrapy assumes there is no active project.

    Solution: cd into the root directory of the dmoz project, i.e. the directory that contains scrapy.cfg, and run scrapy crawl dmoz there.
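
    A quick sketch of the full sequence (assuming the project is named tutorial, as in the paths above):

    scrapy startproject tutorial   # generates tutorial/scrapy.cfg and the project skeleton
    cd tutorial                    # the directory that contains scrapy.cfg
    scrapy crawl dmoz              # crawl can now locate the project and its spiders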

    Under normal circumstances, the output should look like this:

    2014-01-23 18:13:07-0400 [scrapy] INFO: Scrapy started (bot: tutorial)

    2014-01-23 18:13:07-0400 [scrapy] INFO: Optional features available: ...

    2014-01-23 18:13:07-0400 [scrapy] INFO: Overridden settings: {}

    2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled extensions: ...

    2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled downloader middlewares: ...

    2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled spider middlewares: ...

    2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled item pipelines: ...

    2014-01-23 18:13:07-0400 [dmoz] INFO: Spider opened

    2014-01-23 18:13:08-0400 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)

    2014-01-23 18:13:09-0400 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)

    In practice, however, that is not what happened.

    Error 2:

      File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/spiderloader.py", line 71, in load

        raise KeyError("Spider not found: {}".format(spider_name))

    KeyError: 'Spider not found: dmoz'

    alicedeMacBook-Pro:tutorial alice$ scrapy crawl dmoz
    2019-04-19 09:28:23 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: tutorial)
    2019-04-19 09:28:23 [scrapy.utils.log] INFO: Versions: lxml 4.3.3.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 16:39:00) - [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i  14 Aug 2018), cryptography 2.3.1, Platform Darwin-17.3.0-x86_64-i386-64bit
    Traceback (most recent call last):
      File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/spiderloader.py", line 69, in load
        return self._spiders[spider_name]
    KeyError: 'dmoz'
     
    During handling of the above exception, another exception occurred:
     
    Traceback (most recent call last):
      File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/spiderloader.py", line 71, in load
        raise KeyError("Spider not found: {}".format(spider_name))
    KeyError: 'Spider not found: dmoz'

    Cause: the working directory was wrong; you need to be inside the directory where the dmoz project lives.

    Solution: also straightforward: double-check the directory and cd back into the correct one, as sketched below.
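
    One way to verify (an addition here, not from the original write-up) is to run scrapy list from the project root; it only works inside a project directory and prints the names of the registered spiders:

    cd tutorial      # the directory that contains scrapy.cfg
    scrapy list      # should print "dmoz" if the spider is registered
    scrapy crawl dmoz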

    Error 3:

     File "/Library/Python/2.7/site-packages/twisted/internet/_sslverify.py", line 15, in module>
    from OpenSSL._util import lib as pyOpenSSLlib
    ImportError: No module named _util

    alicedeMacBook-Pro:tutorial alice$ scrapy crawl dmoz
    2018-08-06 22:25:23 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: tutorial)
    2018-08-06 22:25:23 [scrapy.utils.log] INFO: Versions: lxml 4.2.4.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 2.7.10 (default, Jul 15 2017, 17:16:57) - [GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.31)], pyOpenSSL 0.13.1 (LibreSSL 2.2.7), cryptography unknown, Platform Darwin-17.3.0-x86_64-i386-64bit
    2018-08-06 22:25:23 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'tutorial'}
    Traceback (most recent call last):
      File "/usr/local/bin/scrapy", line 11, in module>
        sys.exit(execute())
      File "/Library/Python/2.7/site-packages/scrapy/cmdline.py", line 150, in execute
        _run_print_help(parser, _run_command, cmd, args, opts)
      File "/Library/Python/2.7/site-packages/scrapy/cmdline.py", line 90, in _run_print_help
        func(*a, **kw)
      File "/Library/Python/2.7/site-packages/scrapy/cmdline.py", line 157, in _run_command
      t/ssl.py", line 230, in module>
        from twisted.internet._sslverify import (
      File "/Library/Python/2.7/site-packages/twisted/internet/_sslverify.py", line 15, in module>
        from OpenSSL._util import lib as pyOpenSSLlib
    ImportError: No module named _util

    I searched online for a long time without finding a solution. Some bloggers said the problem was with the pyOpenSSL or Scrapy installation, so I reinstalled both pyOpenSSL and Scrapy, but the same error kept appearing and I really did not know how to fix it.

    Later, after reinstalling pyOpenSSL and Scrapy once more, the problem seemed to be resolved; a rough sketch of the reinstall is below, followed by the output of the successful run.
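
    A rough sketch of that reinstall (the exact flags are an assumption; the article only says the two packages were reinstalled):

    # Assumed reconstruction of the reinstall step, not the article's literal commands
    sudo -H pip install --upgrade --force-reinstall pyOpenSSL
    sudo -H pip install --upgrade --force-reinstall Scrapy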

    2019-04-19 09:46:37 [scrapy.core.engine] INFO: Spider opened
    2019-04-19 09:46:37 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2019-04-19 09:46:39 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://www.dmoz.org/robots.txt> (referer: None)
    2019-04-19 09:46:39 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
    2019-04-19 09:46:40 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>: HTTP status code is not handled or not allowed
    2019-04-19 09:46:40 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
    2019-04-19 09:46:40 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/>: HTTP status code is not handled or not allowed
    2019-04-19 09:46:40 [scrapy.core.engine] INFO: Closing spider (finished)
    2019-04-19 09:46:40 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 737,
     'downloader/request_count': 3,
     'downloader/request_method_count/GET': 3,
     'downloader/response_bytes': 2103,
     'downloader/response_count': 3,
     'downloader/response_status_count/403': 3,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2019, 4, 19, 1, 46, 40, 570939),
     'httperror/response_ignored_count': 2,
     'httperror/response_ignored_status_count/403': 2,
     'log_count/DEBUG': 3,
     'log_count/INFO': 9,
     'log_count/WARNING': 1,
     'memusage/max': 65601536,
     'memusage/startup': 65597440,
     'response_received_count': 3,
     'robotstxt/request_count': 1,
     'robotstxt/response_count': 1,
     'robotstxt/response_status_count/403': 1,
     'scheduler/dequeued': 2,
     'scheduler/dequeued/memory': 2,
     'scheduler/enqueued': 2,
     'scheduler/enqueued/memory': 2,
     'start_time': datetime.datetime(2019, 4, 19, 1, 46, 37, 468659)}
    2019-04-19 09:46:40 [scrapy.core.engine] INFO: Spider closed (finished)
    alicedeMacBook-Pro:tutorial alice$ 

    Note that the 403 responses in the final run come from the target website itself rather than from the local setup, so the Scrapy environment is working. This concludes this tutorial on setting up a Scrapy environment for Python web scraping and troubleshooting the most common errors.
