I. Tools used:
Python 3.7, PyCharm, Scrapy
II. Implementation steps
1. Pick a folder and create the Scrapy project. First open a cmd console, cd into that folder, and run scrapy startproject [ ], where [ ] is the name of the project you want to create.
Then cd csdn (note: my project is named csdn) and, inside that folder, run scrapy genspider -t crawl csdn_url runoob.com, where csdn_url is the spider's Python file name and runoob.com is the runoob tutorial site's domain. The crawl template generated this way can crawl pages that match a set of rules, as described below.
The generated project contents look like this:
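For reference, the two commands above produce the standard scrapy startproject layout (the file names below are that template's defaults):

csdn/
    scrapy.cfg            # marks the project root
    csdn/
        __init__.py
        items.py          # item definitions (edited in step 2)
        middlewares.py
        pipelines.py      # item storage (edited in step 2)
        settings.py       # project settings (edited in step 2)
        spiders/
            __init__.py
            csdn_url.py   # the CrawlSpider generated by genspider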
Finally, so that the spider can be run from PyCharm, create a start.py in the csdn directory with the following content:
from scrapy import cmdline

# Equivalent to typing "scrapy crawl csdn_url" on the command line
cmdline.execute("scrapy crawl csdn_url".split())
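Note that start.py must sit next to scrapy.cfg in the project root: Scrapy locates the project settings relative to that file, so running the script from elsewhere fails with a "no active project" error.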
2. Basic Scrapy configuration
First, apply the following settings in settings.py.
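At minimum, the crawl needs robots.txt compliance switched off, a browser-like User-Agent, a polite download delay, and the item pipeline enabled. A minimal sketch (the exact header and delay values here are illustrative assumptions, not requirements):

# settings.py -- minimal additions for this crawl (values are illustrative)

ROBOTSTXT_OBEY = False          # fetch runoob pages directly

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
}

DOWNLOAD_DELAY = 1              # throttle requests so the site is not hammered

ITEM_PIPELINES = {
    'csdn.pipelines.CsdnPipeline': 300,   # enable the pipeline defined in step 2
}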
Then analyze the runoob tutorial pages and write the crawl rules into csdn_url.py. The code is below: start_urls is the initial page link, and rules defines the link pattern to follow; every page of the runoob Python3 tutorial shares the prefix https://www.runoob.com/python3/python3-.
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from csdn.items import CsdnItem


class CsdnUrlSpider(CrawlSpider):
    name = 'csdn_url'
    allowed_domains = ['runoob.com']
    start_urls = ['https://www.runoob.com/python3/python3-tutorial.html']

    rules = (
        # Follow every tutorial page that shares the python3/python3- prefix
        Rule(LinkExtractor(allow=r'https://www.runoob.com/python3/python3-.+'),
             callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        # Page title: the <h1> text plus its <span> suffix, when present
        name = response.xpath('//div[@class="article-intro"]/h1/text()').get()
        if response.xpath('//div[@class="article-intro"]/h1/span/text()').get():
            name += response.xpath('//div[@class="article-intro"]/h1/span/text()').get()
        contents = response.xpath('//div[@class="article-intro"]//text()').getall()

        # Collect every heading (h1/h2/h3) so headings can be set off
        # with blank lines in the assembled text
        title = [name]
        if response.xpath('//div[@class="article-intro"]/h2/text()').get():
            title += response.xpath('//div[@class="article-intro"]/h2/text()').getall()
        if response.xpath('//div[@class="article-intro"]/h3/text()').get():
            title += response.xpath('//div[@class="article-intro"]/h3/text()').getall()
        print("===============")
        print(name)
        print(title)

        content_list = []
        for i in contents:
            # Skip whitespace-only and padding text nodes
            if " " in i:
                continue
            if "\n" in i:
                continue
            if i in title:
                content_list.append("\n")
            content_list.append(i.strip())
            if i in title:
                content_list.append("\n")
        content = " ".join(content_list)
        print(content)
        item = CsdnItem(name=name, content=content)
        print(item)
        yield item
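Before running the full crawl, the XPath expressions above can be tried out interactively with Scrapy's shell:

scrapy shell https://www.runoob.com/python3/python3-tutorial.html
>>> response.xpath('//div[@class="article-intro"]/h1/text()').get()   # should print the page's <h1> title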
Next, set up items.py; this example only scrapes each tutorial page's title and content:
import scrapy


class CsdnItem(scrapy.Item):
    name = scrapy.Field()
    content = scrapy.Field()
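A scrapy.Item behaves like a dict with a fixed set of keys, which is why the spider can build it with CsdnItem(name=..., content=...) and the pipeline below can read item["name"] and item["content"]; assigning a field that was not declared raises a KeyError.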
Finally, set the storage format and paths in pipelines.py, saving each item both as JSON and as txt:
from scrapy.exporters import JsonLinesItemExporter


class CsdnPipeline(object):
    def open_spider(self, spider):
        # Open both files once at startup; opening them inside process_item
        # (with the JSON file in "wb" mode) would overwrite earlier items
        self.fp = open("cainiao.json", "wb")
        self.ft = open("cainiao.txt", "a", encoding="utf-8")
        self.exporter = JsonLinesItemExporter(self.fp, ensure_ascii=False, encoding="utf-8")

    def process_item(self, item, spider):
        # txt copy: title on one line, content on the next
        self.ft.write(str(item["name"]) + '\n')
        self.ft.write(str(item["content"]) + '\n')
        # JSON Lines copy: one JSON object per item
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.fp.close()
        self.ft.close()
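As a side note, the JSON half of this pipeline could also be handled by Scrapy's built-in feed exports, with no pipeline code at all: scrapy crawl csdn_url -o cainiao.jl (adding FEED_EXPORT_ENCODING = 'utf-8' to settings.py keeps the Chinese text readable). The custom pipeline is still needed for the txt copy.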
3. Results
The original page:
The crawled result: