Blog update note: the crawl time has since been cut to 29.4 s (<<<< link)
1. Introduction
League of Legends is a hugely popular game; even someone like me who has never played it knows Yasuo, the Unforgiven (疾风剑豪-亚索), so I'll use him to show the results:
I previously wrote a post about scraping 1080P Honor of Kings wallpapers with multithreading (<<<< article link).
People like to say that Python's multithreading is a lame duck: because of the GIL (Global Interpreter Lock), Python can never run threads truly in parallel. Threads only pay off for IO-bound work such as network requests or file reads and writes, where the operation leaves idle gaps in which the interpreter would otherwise sit doing nothing; switching to another thread fills those gaps and keeps Python working the whole time.
Each thread, however, keeps its own cached state, so every thread switch costs a fair amount.
Coroutines are different. A coroutine switch is nothing more than a user-space swap of CPU context, so even a million switches per second is no problem. Better yet, a coroutine can deliberately suspend itself to go do something else and pick up where it left off later, which puts the programmer in explicit control of when the program yields.
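To make that hand-off concrete, here is a minimal toy of my own (not from the original post): two coroutines share one thread and interleave, because every `await` hands control back to the event loop.

```python
import asyncio

async def worker(name: str):
    for i in range(3):
        print(f'{name} step {i}')
        # awaiting suspends this coroutine; the loop switches to the other task
        await asyncio.sleep(0)

async def main():
    # both coroutines run on a single thread, switching at every await point
    await asyncio.gather(worker('A'), worker('B'))

asyncio.run(main())
```

The output alternates A and B steps even though no thread was ever spawned.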
Talk is cheap, though. If coroutines are really this good, I'd better write an actual program to test them. Please welcome the "victim": League of Legends.
2. Analysis
Python version: 3.6
Install the asynchronous request library: pip install aiohttp
(1) Dependencies
```python
from time import perf_counter
from loguru import logger
import requests
import asyncio
import aiohttp
import os
```
(2) Global variables
```python
# global variables
ROOT_DIR = os.path.dirname(__file__)
os.makedirs(f'{ROOT_DIR}/image', exist_ok=True)  # plain os.mkdir would crash on a re-run
IMG_DIR = f'{ROOT_DIR}/image'
RIGHT = 0  # count of images downloaded successfully
ERROR = 0  # count of images that failed

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'
}
# target url
# skin's url, will be completed with the hero's id.
```
(3) Getting the hero IDs
Finding the pattern takes some time; I went straight to other people's write-ups for the analysis, and it really is the usual routine, so here is the short version. Look at a few requests and the pattern shows itself. The first one, the target url (target_url), returns JSON in which heroId is the hero's number; that number is exactly what is needed to build the skin url (skin_url), and the mainImg field there is the URL of the skin image to download.
```python
def get_hero_id(url):
    """
    get hero's id, to complete base_url.
    :param url: target url
    :return: hero's id
    """
    response = requests.get(url=url, headers=headers)
    info = response.json()
    items = info.get('hero')
    for item in items:
        yield item.get('heroId')
```
This is a generator: it yields the hero IDs one by one for the later code to consume.
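For reference, the hero-list JSON is assumed to look roughly like the comment below (only the fields the generator touches; the endpoint itself is elided in the post), and the generator is consumed lazily:

```python
# assumed shape of the hero-list JSON (illustrative values):
# {"hero": [{"heroId": "1", ...}, {"heroId": "2", ...}, ...]}
for hero_id in get_hero_id(hero_url):  # hero_url: the elided target url
    print(hero_id)                     # '1', '2', ...
```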
(4) Scraping each hero's skin info
```python
async def fetch_hero_url(url):
    """
    fetch hero url, to get skin's info
    :param url: hero url
    :return: None
    """
    async with aiohttp.ClientSession() as session:
        async with session.get(url=url, headers=headers) as response:
            if response.status == 200:
                data = await response.json(content_type='application/x-javascript')
                skins = data.get('skins')  # skin's list
                for skin in skins:
                    info = {}
                    info['hero_name'] = skin.get('heroName') + '_' + skin.get('heroTitle')
                    info['skin_name'] = skin.get('name')
                    info['skin_url'] = skin.get('mainImg')
                    await fetch_skin_url(info)
```
Picking the fields out is the usual routine. As soon as one image's info is assembled, fetch_skin_url() is entered right away to name, download, and save that image.
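One design note (my aside, not from the post): the function above opens a new ClientSession for every hero, whereas aiohttp recommends creating one session and reusing it across requests, since a session pools connections. A hedged sketch of that variant, assuming the same headers as above:

```python
async def fetch_hero(session: aiohttp.ClientSession, url: str):
    # the caller owns the session, so it is opened once and reused
    async with session.get(url, headers=headers) as response:
        if response.status == 200:
            return await response.json(content_type='application/x-javascript')

async def crawl(urls):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch_hero(session, u) for u in urls))
```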
(5) Downloading the images
```python
async def fetch_skin_url(info):
    """
    fetch image, save it to jpg.
    :param info: skin's info
    :return: None
    """
    global RIGHT, ERROR
    path = f'{IMG_DIR}/{info["hero_name"]}'
    make_dir(path)
    name = info['skin_name']
    url = info['skin_url']
    if name.count('/'):
        # '/' would be read as a path separator, so swap in a filename-safe character
        name = name.replace('/', '_')
    if url == '':
        ERROR += 1
        logger.error(f'{name} url error {ERROR}')
    else:
        RIGHT += 1
        async with aiohttp.ClientSession() as session:
            async with session.get(url=url, headers=headers) as response:
                if response.status == 200:
                    with open(f'{path}/{name}.jpg', 'wb') as file:
                        chunk = await response.content.read()
                        logger.success(f'download {name} right {RIGHT}...')
                        file.write(chunk)
                else:
                    ERROR += 1
                    logger.error(f'{name},{url} status!=200')
```
The first thing the function does is create a directory, because the skins are filed by hero. A hero_name will of course repeat across that hero's skins, so make_dir() further down has to check first whether the directory already exists.
Once the directory exists, the url still has to be checked, because the analysis turned up both kinds of entries: some skins carry a mainImg and others have it empty. Comparing a number of them showed that whenever a skin name is duplicated, only the original-skin entry has a url and the rest are blank, and this holds for every hero. It also turned up skin names containing a "/", which is very unfriendly when naming the image file, since Python treats "/" as a directory separator; such names are therefore sanitized first (the "/" swapped for a filename-safe character) before the checks that follow. An empty url just logs an error and bumps the error count, while a successful download logs a success and bumps the success count.
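As a concrete illustration of the sanitizing step (the skin name here is a stand-in, not taken from the crawl data):

```python
name = 'K/DA ALL OUT'          # hypothetical skin name containing '/'
safe = name.replace('/', '_')  # open() would treat '/' as a directory separator
print(safe)                    # K_DA ALL OUT
```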
(6) Creating the directory
```python
def make_dir(path):
    """
    make dir with hero's name
    :param path: path
    :return: path, skin's dir
    """
    if not os.path.exists(path):
        os.mkdir(path)
    return path
```
3. Full code
Based on PyCharm, Python 3.8, aiohttp 3.6.2. The two data-endpoint URLs were elided in the original post; the placeholder assignments below only mark where they belong.
```python
# -*- coding: utf-8 -*-
"""
@author  : Pineapple
@contact : cppjavapython@foxmail.com
@time    : 2020/8/13 13:33
@file    : lol.py
@desc    : fetch lol hero's skins
"""
from time import perf_counter
from loguru import logger
import requests
import asyncio
import aiohttp
import os

start = perf_counter()

# global variables
ROOT_DIR = os.path.dirname(__file__)
os.makedirs(f'{ROOT_DIR}/image', exist_ok=True)  # plain os.mkdir would crash on a re-run
IMG_DIR = f'{ROOT_DIR}/image'
RIGHT = 0  # count of images downloaded successfully
ERROR = 0  # count of images that failed

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'
}
# target url (elided in the original post)
hero_url = '...'
# skin's url, completed with the hero's id (elided in the original post)
base_url = '...'

loop = asyncio.get_event_loop()
tasks = []


def get_hero_id(url):
    """
    get hero's id, to complete base_url.
    :param url: target url
    :return: hero's id
    """
    response = requests.get(url=url, headers=headers)
    info = response.json()
    items = info.get('hero')
    for item in items:
        yield item.get('heroId')


async def fetch_hero_url(url):
    """
    fetch hero url, to get skin's info
    :param url: hero url
    :return: None
    """
    async with aiohttp.ClientSession() as session:
        async with session.get(url=url, headers=headers) as response:
            if response.status == 200:
                data = await response.json(content_type='application/x-javascript')
                skins = data.get('skins')  # skin's list
                for skin in skins:
                    info = {}
                    info['hero_name'] = skin.get('heroName') + '_' + skin.get('heroTitle')
                    info['skin_name'] = skin.get('name')
                    info['skin_url'] = skin.get('mainImg')
                    await fetch_skin_url(info)


async def fetch_skin_url(info):
    """
    fetch image, save it to jpg.
    :param info: skin's info
    :return: None
    """
    global RIGHT, ERROR
    path = f'{IMG_DIR}/{info["hero_name"]}'
    make_dir(path)
    name = info['skin_name']
    url = info['skin_url']
    if name.count('/'):
        # '/' would be read as a path separator, so swap in a filename-safe character
        name = name.replace('/', '_')
    if url == '':
        ERROR += 1
        logger.error(f'{name} url error {ERROR}')
    else:
        RIGHT += 1
        async with aiohttp.ClientSession() as session:
            async with session.get(url=url, headers=headers) as response:
                if response.status == 200:
                    with open(f'{path}/{name}.jpg', 'wb') as file:
                        chunk = await response.content.read()
                        logger.success(f'download {name} right {RIGHT}...')
                        file.write(chunk)
                else:
                    ERROR += 1
                    logger.error(f'{name},{url} status!=200')


def make_dir(path):
    """
    make dir with hero's name
    :param path: path
    :return: None
    """
    if not os.path.exists(path):
        os.mkdir(path)


if __name__ == '__main__':
    for hero_id in get_hero_id(hero_url):
        url = base_url + str(hero_id) + '.js'
        tasks.append(fetch_hero_url(url))
    loop.run_until_complete(asyncio.wait(tasks))
    logger.info(f'count times {perf_counter() - start}s')
    logger.info(f'download RIGHT {RIGHT}, download ERROR {ERROR}')
```
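One plumbing note: `asyncio.get_event_loop()` plus `run_until_complete()` matches the Python versions named above, but on Python 3.10+ the idiomatic entry point is `asyncio.run()`. A minimal sketch of the same main block in that style (same functions and URLs as above):

```python
async def main():
    coros = [fetch_hero_url(base_url + str(hero_id) + '.js')
             for hero_id in get_hero_id(hero_url)]
    await asyncio.gather(*coros)

if __name__ == '__main__':
    asyncio.run(main())
    logger.info(f'count times {perf_counter() - start}s')
```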
4. Wrap-up
Some of you may have finished the download and wondered: only 285 MB, yet it took 45 s? Last time the Honor of Kings wallpapers came to 450 MB in 57 s, so are coroutines slower than multithreading? Not at all. Last time meant 400 wallpapers, i.e. 400 GET requests; this time it was 1385 GET requests. Now which one is fast? And this run only made the network requests asynchronous; the file writes are not async at all.
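If the file writes were to go async as well, one option (my suggestion; the post does not use it) is the third-party aiofiles package, which hands the blocking write to a worker thread behind an async API:

```python
import aiofiles  # pip install aiofiles

async def save_image(path: str, data: bytes):
    # the blocking disk write runs in a thread; the event loop stays free
    async with aiofiles.open(path, 'wb') as file:
        await file.write(data)
```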
That said, there is one fatal problem in here that you have probably noticed: no proxies.
While testing I occasionally got my IP banned, so I added proxies. Free proxies are hopelessly unreliable, though: of the roughly 1000 proxies in a pool, only about 70 worked, and even those couldn't be trusted. With that many proxies to check, validation is slow too, so a proxy that is alive one second is dead the next, and all sorts of exceptions kept being thrown. In the end I gave up on proxies; I'll try this again once the proxy pool has been improved.
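For reference, aiohttp accepts an HTTP proxy per request through the `proxy` argument; a minimal sketch with a placeholder address (aiohttp speaks HTTP proxies natively, not SOCKS):

```python
async def fetch_via_proxy(url: str, proxy: str = 'http://127.0.0.1:8888'):
    # placeholder proxy address; substitute one from your pool
    async with aiohttp.ClientSession() as session:
        async with session.get(url, headers=headers, proxy=proxy) as response:
            return await response.read()
```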