An Example of Resolving Domain Names with Python Multiprocessing

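For work I needed to find the IP addresses that more than ten thousand domains resolve to, and to check whether each one points to the right place. My first thought was the ping command, but its output is hard to process. After some searching I settled on socket.gethostbyname(). At first I wrote it the ordinary, sequential way, and working through the ten-thousand-odd records took several hours, which was very inefficient. The main bottleneck was gethostbyname itself, but I could not find a better way to resolve IPs at the time. Later, prompted by a colleague, I switched to Python multiprocessing, which cut the processing time by more than half and went some way toward making up for gethostbyname's slowness. The complete example follows (the data is fake):

IPs to check against (txt format, one IP per line)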

...
192.168.0.2
192.168.9.2
...

Raw domain data (txt format, one domain per line)

...
xxx.cn
xxxx.com
...

Processed data (txt format, one line per domain: domain + IP + verdict)

...
xxx.cn 192.168.0.1 in
xx2x.cn 192.168.0.2 not in
xx3x.cn unresolved unresolved
...
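
The three kinds of output line can be sketched as one small pure function. This is only an illustration: the name format_result and the convention of passing None for a failed lookup are my own assumptions, and the fields are tab-separated to match what the script writes.

```python
def format_result(domain, ip, expected_ips):
    """Build one output line: domain, resolved IP, verdict.

    `ip` is None when resolution failed (a convention assumed for this sketch).
    """
    if ip is None:
        # Both the IP column and the verdict column read "unresolved"
        return "\t".join([domain, "unresolved", "unresolved"])
    verdict = "in" if ip in expected_ips else "not in"
    return "\t".join([domain, ip, verdict])
```

Keeping the verdict logic separate from the lookup makes it easy to check without touching the network.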

The processing script is as follows:

# coding: utf-8
# Written for Python 3 (print() and open(..., encoding=...))
import socket
from multiprocessing import Pool

# IPs to check against; a set keeps the membership test cheap
ipSet = set()
with open("/path/to/ip.txt", "r") as fip:
    for ip in fip:
        ip = ip.strip()
        if ip:
            ipSet.add(ip)

def URL2IP(url):
    url = url.strip()
    # urlList = url.split("\t")
    try:
        ip = socket.gethostbyname("www." + url)
        if ip in ipSet:
            tip = "in"
        else:
            tip = "not in"
    except (OSError, UnicodeError):
        # Lookup failed (socket.gaierror is an OSError) or the name is malformed
        print(url + " this URL 2 IP ERROR ")
        ip = "unresolved"
        tip = "unresolved"

    return url + "\t" + ip + "\t" + tip

if __name__ == '__main__':
    # domains (blank lines are skipped)
    with open("/path/to/domain.txt", "r", encoding='utf-8') as urllist:
        allUrls = [line for line in urllist if line.strip()]

    p = Pool(8)  # suggested: set this to the number of CPU cores
    resultList = p.map(URL2IP, allUrls)
    p.close()
    p.join()

    # write the result to file
    with open("/path/to/resolve.txt", "w") as resolvelist:
        resolvelist.write("\n".join(resultList))

    print("complete !")

As for how to use multiprocessing in Python, readers can look that up for themselves.
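
As a starting point, here is a minimal sketch of the same map-over-a-pool pattern. Note that it swaps in multiprocessing.pool.ThreadPool, which offers the same map interface as Pool but runs threads instead of processes. That is my substitution, not what the script above does, and whether threads speed up lookups as much as processes depends on the platform's resolver. The names resolve_one and resolve_all and the resolver parameter are hypothetical; the parameter only exists so the function can be tried with a fake lookup instead of real DNS.

```python
import socket
from multiprocessing.pool import ThreadPool


def resolve_one(domain, resolver=socket.gethostbyname):
    """Return (domain, ip); ip is None when the lookup fails."""
    domain = domain.strip()
    try:
        return domain, resolver("www." + domain)
    except (OSError, UnicodeError):
        # socket.gaierror is an OSError; malformed names raise UnicodeError
        return domain, None


def resolve_all(domains, workers=8, resolver=socket.gethostbyname):
    """Resolve every domain concurrently; results keep the input order."""
    with ThreadPool(workers) as pool:
        # map() blocks until all lookups are done, so leaving the
        # with-block afterwards only cleans up idle worker threads.
        return pool.map(lambda d: resolve_one(d, resolver), domains)
```

Calling resolve_all(["xxx.cn", "xxxx.com"]) returns a list of (domain, ip) pairs in the same order as the input, with None in place of the IP for any domain that could not be resolved.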

Tags: python, multiprocessing, Pool
