Civilpy - Articles

Python Data Analysis and Visualization by Example: Multithreading and Multiprocessing

Published: 2021-12-03 (public article)


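1. Advice for fellow crawler writers

Multithreading, multiprocessing, and distributed crawling are generally not recommended: they make it very easy to get banned by the remote server, or even to push it past what it can tolerate. For processing in your own local environment, of course, use them as freely as you like!

Besides, free proxy pools only yield a dozen or so proxies a day, and those are not very stable and painfully slow.

Next, about CAPTCHAs: hooking into a CAPTCHA-solving platform is the best option. CAPTCHA recognition? I will cover that when this series wraps up.

And last of all, if you want to reach the point where you understand a site without having to read its code, you still have to build a website yourself, even if it is only in PHP.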

2. Multithreading

# encoding: utf-8
__author__ = 'yeayee'  # 2015-06

from collections import deque
import threading
import time
from threading import current_thread


# Custom thread class: runs funcs(*args) in its own thread
class MyThread(threading.Thread):
    def __init__(self, funcs, args, name=''):
        threading.Thread.__init__(self)
        self.funcs = funcs
        self.name = name
        self.args = args

    def run(self):
        self.funcs(*self.args)


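# Next, the crawling itself
def getContent(que):
    while True:
        try:
            url = que.popleft()  # take one URL and remove it from the queue
        except IndexError:
            break  # queue is empty (another thread may have drained it); stop this worker
        try:
            print('Crawler ' + current_thread().name + ' is fetching: ' + url)
            time.sleep(10)  # for demonstration only; comment out in real use
        except Exception:
            # Swallowing every error like this is a bad habit; log or handle it in real code
            print('The crawler fell into a pit; hold on, it will keep going!')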

que = deque()
visited = set()  # not used in this demo
in_urls = ['wx_nemoon_01','wx_nemoon_02','wx_nemoon_03','wx_nemoon_04','wx_nemoon_05','wx_nemoon_06','wx_nemoon_07']
# URLs gathered in the first pass


for in_url in in_urls:
    get_url = 'http://www.baidu.com/' + in_url
    que.append(get_url)  # add the inner-page URL to the queue

thread = []
for i in range(4):
    wx_nemoon = MyThread(getContent, (que, ), name='ID' + str(i))
    thread.append(wx_nemoon)
for t in thread:
    t.start()
for t in thread:
    t.join()  # wait for every worker to finish
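
The same four-worker fan-out can also be sketched with the standard library's concurrent.futures.ThreadPoolExecutor, which manages the work queue and the joins itself. This is only an illustrative alternative, not the author's method: fetch, crawl, and the delay parameter are hypothetical names, and fetch merely sleeps instead of downloading anything.

```python
from concurrent.futures import ThreadPoolExecutor
import threading
import time


def fetch(url, delay=0.01):
    """Hypothetical stand-in for a real download: sleeps, then reports which thread ran."""
    time.sleep(delay)
    return url, threading.current_thread().name


def crawl(urls, workers=4):
    # The executor owns the work queue, so no manual deque or join() calls are needed.
    with ThreadPoolExecutor(max_workers=workers, thread_name_prefix='ID') as pool:
        return list(pool.map(fetch, urls))  # results come back in input order


if __name__ == '__main__':
    suffixes = ['wx_nemoon_0%d' % i for i in range(1, 8)]
    urls = ['http://www.baidu.com/' + s for s in suffixes]
    for url, worker in crawl(urls):
        print(worker, 'handled', url)
```

Because map() returns results in input order, the caller can pair each result with its URL without extra bookkeeping.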

3. Multiprocessing

# encoding: utf-8
__author__ = 'yeayee'  # 2015-06

from multiprocessing import Pool
import time

in_urls = ['wx_nemoon_01','wx_nemoon_02','wx_nemoon_03','wx_nemoon_04','wx_nemoon_05','wx_nemoon_06','wx_nemoon_07']
# URLs gathered in the first pass

full_urls = []
for in_url in in_urls:
    get_url = 'http://www.baidu.com/' + in_url
    full_urls.append(get_url)  # full URL of the inner page to crawl

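def getContent(url):
    try:
        print('Crawler is fetching: ' + url)
        time.sleep(10)  # for demonstration only; comment out in real use
    except Exception:
        # Swallowing every error like this is a bad habit; log or handle it in real code
        print('The crawler fell into a pit; hold on, it will keep going!')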

if __name__ == '__main__':
    pool = Pool(4)  # 4 worker processes
    pool.map(getContent, full_urls)
    pool.close()
    pool.join()
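
The example above only prints from inside each worker. As a rough sketch of how results could be handed back to the parent process instead, the version below returns a value from each call and collects them through pool.map. The helper names build_urls, fetch, and crawl and the delay parameter are assumptions made for illustration, and fetch sleeps rather than performing a real request.

```python
from multiprocessing import Pool
import os
import time

BASE = 'http://www.baidu.com/'


def build_urls(suffixes, base=BASE):
    """Join each page suffix onto the base address."""
    return [base + s for s in suffixes]


def fetch(url, delay=0.01):
    """Hypothetical stand-in for a download: sleeps, then returns the URL and worker PID."""
    time.sleep(delay)
    return url, os.getpid()


def crawl(urls, processes=4):
    # map() blocks until every result is back, in the same order as the input;
    # leaving the with-block then shuts the pool down.
    with Pool(processes) as pool:
        return pool.map(fetch, urls)


if __name__ == '__main__':
    urls = build_urls(['wx_nemoon_0%d' % i for i in range(1, 8)])
    for url, pid in crawl(urls):
        print('process', pid, 'handled', url)
```

The worker function has to live at module top level so it can be pickled and sent to the child processes, and the `__main__` guard keeps child processes from re-running the pool setup when the module is imported.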