Recently I needed to scrape data (MP3 files) from a target site. The site does not throttle requests, so I tried multithreading and multiprocessing, but neither was satisfactory. In the end I brought out the heavy artillery: asyncio.
Install the required library (asyncio has been part of the standard library since Python 3.4, so only aiohttp actually needs installing):
pip install aiohttp -i https://pypi.tuna.tsinghua.edu.cn/simple
During this kind of crawl, if your concurrency climbs to around 2,000 connections the program will raise: ValueError: too many file descriptors in select(). On the surface the error says that the select() call Python uses has an upper bound on how many open files it can watch; in reality it is an operating-system limit: on Linux the default maximum number of open files is 1024, and on Windows it is 509. Go past that and the program starts failing. My recommendation is to cap the concurrency rather than raise the limit; 100 to 500 concurrent requests usually gives the best throughput.
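If you want to confirm the limit on your own machine, the standard-library resource module can report it. This is just a side check (Linux/macOS only), not part of the crawler itself:

import resource

# (soft, hard) limits on open file descriptors; soft is typically 1024 on Linux
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(soft, hard)

# The soft limit can be raised up to the hard limit, but capping concurrency
# with a semaphore (as below) is usually the simpler and more portable fix:
# resource.setrlimit(resource.RLIMIT_NOFILE, (4096, hard))

The full semaphore-based crawler looks like this: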
#coding:utf-8
import asyncio
import aiohttp

# Target site
url = 'https://www.baidu.com/'

# Fetch coroutine: the semaphore caps how many requests are in flight at once
async def hello(url, semaphore):
    async with semaphore:
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                return await response.read()

# Driver coroutine: build the task list and run it under the semaphore
async def run():
    semaphore = asyncio.Semaphore(500)  # limit concurrency to 500
    to_get = [hello(url, semaphore) for _ in range(1000)]  # 1000 tasks in total
    await asyncio.gather(*to_get)

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(run())
    loop.close()
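Since the goal was MP3 files, here is a minimal sketch of how the same semaphore pattern could also collect the downloaded bytes and write them to disk. The mp3_urls list and the file-naming scheme are made up for illustration; note that a single ClientSession is shared by all requests, which aiohttp recommends over opening a new session per request:

import asyncio
import aiohttp

# Hypothetical list of direct MP3 links (placeholder URLs)
mp3_urls = ['https://example.com/audio/{}.mp3'.format(i) for i in range(10)]

async def fetch_mp3(session, url, semaphore):
    async with semaphore:                       # cap concurrent downloads
        async with session.get(url) as response:
            return url, await response.read()   # read() gives the binary body

async def main():
    semaphore = asyncio.Semaphore(100)
    async with aiohttp.ClientSession() as session:   # one shared session
        tasks = [fetch_mp3(session, u, semaphore) for u in mp3_urls]
        for url, data in await asyncio.gather(*tasks):
            # name each file after the last path segment, e.g. 3.mp3
            with open(url.rsplit('/', 1)[-1], 'wb') as f:
                f.write(data)

asyncio.run(main())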
For requests that need custom headers, cookies, a timeout, or a POST body, the pattern is the same (a fragment meant to live inside an async function):
'''
async with aiohttp.ClientSession(headers=headers, cookies=cookies) as session:
    result_text = None
    try:
        result = await session.post(url, timeout=timeout, data=data)
        result_text = await result.text()
    except Exception:
        raise
    return result_text
'''
Making requests with aiohttp is, for the most part, the same as with requests. In the next example the tasks list has a fixed length; an index page and its follow-up pages can each be crawled with their own rules.
import time
import asyncio
import aiohttp

# Start time
start_time = time.time()

# URLs to fetch
urls = [
    'https://www.baidu.com',
    'https://www.sogou.com',
    'https://www.csdn.net/'
]

async def get_page(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            # text() returns a str, read() returns bytes, json() returns a parsed object;
            # each of them must be awaited before the data is available
            page_text = await response.text()
            # print(page_text)
            print(url)

# Wrap each coroutine in a task
tasks = []
for url in urls:
    c = get_page(url)
    task = asyncio.ensure_future(c)
    tasks.append(task)

# Run all tasks on the event loop
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))

# Elapsed time
end_time = time.time()
print(end_time - start_time)
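On Python 3.7+ the ensure_future / get_event_loop boilerplate can be replaced with asyncio.run and asyncio.gather. A minimal equivalent of the same three-URL timing test (a sketch, sharing one session across requests):

import time
import asyncio
import aiohttp

urls = ['https://www.baidu.com', 'https://www.sogou.com', 'https://www.csdn.net/']

async def get_page(session, url):
    async with session.get(url) as response:
        await response.text()   # fetch the body so the timing is comparable
        print(url)

async def main():
    # one shared session for all requests
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(get_page(session, u) for u in urls))

start_time = time.time()
asyncio.run(main())
print(time.time() - start_time)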