[Python Web Scraping] The XPath Module

XPath is a language for locating content in XML documents.

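HTML is not strictly XML, but it has the same tree-of-tags structure, so XPath can be used to query HTML pages as well.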

Using XPath from Python requires the lxml module (install it with pip install lxml).

from lxml import etree
xml = """
<note>
    <to>Tove</to>
    <from>Jani</from>
    <heading>Reminder</heading>
    <author>
        <nick id="10">Tom</nick>
        <nick id="11">Bob</nick>
        <nick id="12">Alice</nick>
        <nick id="13">John</nick>
        <div>
            <nick>Carol</nick>
            <div>
                <nick>Dave</nick>
            </div>
        </div>
        <span>
            <nick>Eve</nick>
        </span>
    </author>
  <body>Don't forget me this weekend!</body>
</note>
"""

tree = etree.XML(xml)
result1 = tree.xpath("/note")

Here each / marks one level of the hierarchy, and the leading / starts the path at the root node.
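To make the role of / concrete, here is a minimal sketch. It uses a trimmed-down copy of the sample document (an assumption: only three child elements are kept) and the variable names are illustrative.

```python
from lxml import etree

# Trimmed-down copy of the sample document (assumption: only three children kept).
xml = "<note><to>Tove</to><from>Jani</from><heading>Reminder</heading></note>"
tree = etree.XML(xml)

# An absolute path starts at the document root; xpath() returns a list of elements.
root_matches = tree.xpath("/note")
print(root_matches[0].tag)      # note

# Each further / descends one level; * stands for any element at that level.
child_tags = [el.tag for el in tree.xpath("/note/*")]
print(child_tags)               # ['to', 'from', 'heading']

# <to> is a child of <note>, not the root, so an absolute /to matches nothing.
print(tree.xpath("/to"))        # []
```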

result2 = tree.xpath("/note/from/text()")

text() extracts the text inside a node. For the sample document, result2 is ['Jani'].
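The following sketch shows what text() returns compared with selecting the element itself. The XML is a shortened copy of the sample (an assumption), and the names texts, elements, and first are illustrative.

```python
from lxml import etree

# Shortened copy of the sample document (assumption).
xml = "<note><to>Tove</to><from>Jani</from></note>"
tree = etree.XML(xml)

# With text(), the query returns a list of strings.
texts = tree.xpath("/note/from/text()")
print(texts)                # ['Jani']

# Without text(), the query returns Element objects; .text holds the same string.
elements = tree.xpath("/note/from")
print(elements[0].text)     # Jani

# The result is a list even for a single match, so guard before indexing.
first = texts[0] if texts else None
print(first)                # Jani
```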

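result3 = tree.xpath("/note/author/nick/text()")      # nick elements directly under author
result4 = tree.xpath("/note/author/div/nick/text()")  # nick directly under author/div
result5 = tree.xpath("/note/author//nick/text()")     # // matches nick at any depth below author
result6 = tree.xpath("/note/author/*/nick/text()")    # * is a wildcard for any single element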
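The difference between /, //, and * is easiest to see side by side. This is a minimal sketch on a reduced author element; the nick values are placeholders chosen for illustration, not taken from the original data.

```python
from lxml import etree

# Reduced version of the <author> element; nick values are placeholders (assumption).
xml = """
<author>
    <nick>Tom</nick>
    <nick>Bob</nick>
    <div>
        <nick>Carol</nick>
        <div>
            <nick>Dave</nick>
        </div>
    </div>
    <span>
        <nick>Eve</nick>
    </span>
</author>
"""
tree = etree.XML(xml)

direct = tree.xpath("/author/nick/text()")        # direct children only
in_div = tree.xpath("/author/div/nick/text()")    # exactly author/div/nick
any_depth = tree.xpath("/author//nick/text()")    # every nick below author
one_level = tree.xpath("/author/*/nick/text()")   # nick under any one intermediate element

print(direct)      # ['Tom', 'Bob']
print(in_div)      # ['Carol']
print(any_depth)   # ['Tom', 'Bob', 'Carol', 'Dave', 'Eve']
print(one_level)   # ['Carol', 'Eve']
```

Note that * stands for exactly one level, so the nick nested two levels down is found by // but not by *.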

<!DOCTYPE html>
<html lang="zh-CN">
<head>
    <meta charset="utf-8" />
    <title>Title</title>
</head>
<body>
    <ul>
        <li><a href="https://www.people.com.cn/">人民网</a></li>
        <li><a href="https://www.xinhuanet.com/">新华网</a></li>
        <li><a href="https://www.cctv.com/">央视网</a></li>
    </ul>
    <ol>
        <li><a href="https://weibo.com/">新浪微博</a></li>
        <li><a href="https://www.zhihu.com/">知乎</a></li>
        <li><a href="https://www.douban.com/">豆瓣</a></li>
    </ol>
    <div class="aaa">AAA</div>
    <div class="bbb">BBB</div>
</body>
</html>
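The queries on this page need a parsed tree first. One way to build it, sketched here under the assumption that the markup is available as a string named html_text, is etree.HTML. If the page lives in a file, etree.parse with an etree.HTMLParser() is another option.

```python
from lxml import etree

# Trimmed copy of the sample page (assumption: only the parts queried here are kept).
html_text = """
<html>
<head><title>Title</title></head>
<body>
    <ul>
        <li><a href="https://www.people.com.cn/">人民网</a></li>
    </ul>
    <div class="aaa">AAA</div>
</body>
</html>
"""

tree = etree.HTML(html_text)    # returns the root <html> element
print(tree.tag)                 # html

title = tree.xpath("/html/head/title/text()")
print(title)                    # ['Title']

# The HTML parser is forgiving: a bare fragment is still wrapped in html/body.
fragment = etree.HTML("<p>hello</p>")
wrapped = fragment.xpath("/html/body/p/text()")
print(wrapped)                  # ['hello']
```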

# tree here is the parsed HTML page above, not the earlier XML sample
result1 = tree.xpath("/html/body/ul/li/a/text()")
# Note: XPath positions start at 1, not 0
result2 = tree.xpath("/html/body/ul/li[1]/a/text()")
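Because Python lists start at 0 and XPath positions start at 1, the two are easy to mix up. A small sketch of the difference, using only the ul part of the sample page (variable names are illustrative):

```python
from lxml import etree

html_text = """
<html><body>
    <ul>
        <li><a href="https://www.people.com.cn/">人民网</a></li>
        <li><a href="https://www.xinhuanet.com/">新华网</a></li>
        <li><a href="https://www.cctv.com/">央视网</a></li>
    </ul>
</body></html>
"""
tree = etree.HTML(html_text)

# XPath predicate: [1] is the first li, last() is the final one.
first = tree.xpath("/html/body/ul/li[1]/a/text()")
last = tree.xpath("/html/body/ul/li[last()]/a/text()")
print(first)    # ['人民网']
print(last)     # ['央视网']

# [0] is not an error in XPath; it simply matches nothing.
zero = tree.xpath("/html/body/ul/li[0]/a/text()")
print(zero)     # []

# Python indexing on the returned list is 0-based as usual.
py_first = tree.xpath("/html/body/ul/li")[0].xpath("./a/text()")
print(py_first) # ['人民网']
```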

You can filter elements with a predicate of the form [@attribute='value'].

result3 = tree.xpath("/html/body/ol/li/a[@href='https://www.zhihu.com/']/text()")
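Attribute predicates work on any attribute, and @name on its own returns the attribute values. The sketch below runs a few variations on the ol and div parts of the sample page. The contains() call is a standard XPath 1.0 function, shown here as an optional extra rather than something the original uses.

```python
from lxml import etree

html_text = """
<html><body>
    <ol>
        <li><a href="https://weibo.com/">新浪微博</a></li>
        <li><a href="https://www.zhihu.com/">知乎</a></li>
        <li><a href="https://www.douban.com/">豆瓣</a></li>
    </ol>
    <div class="aaa">AAA</div>
    <div class="bbb">BBB</div>
</body></html>
"""
tree = etree.HTML(html_text)

# Exact match on href.
zhihu = tree.xpath("/html/body/ol/li/a[@href='https://www.zhihu.com/']/text()")
print(zhihu)    # ['知乎']

# The same pattern works for class.
bbb = tree.xpath("/html/body/div[@class='bbb']/text()")
print(bbb)      # ['BBB']

# @href at the end of the path returns the attribute values themselves.
hrefs = tree.xpath("/html/body/ol/li/a/@href")
print(hrefs)    # ['https://weibo.com/', 'https://www.zhihu.com/', 'https://www.douban.com/']

# Partial match with the XPath 1.0 contains() function.
douban = tree.xpath("//a[contains(@href, 'douban')]/text()")
print(douban)   # ['豆瓣']
```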

Extracting details in a loop (note the relative lookup)

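ol_li_list = tree.xpath("/html/body/ol/li")
for li in ol_li_list:
    # Extract the link text and URL from each li
    result4 = li.xpath("./a/text()")  # ./ makes the lookup relative to this li
    result5 = li.xpath("./a/@href")   # @href reads the attribute value
    print(result4, result5)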
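The leading dot is what keeps the lookup inside the current li. As a sketch of why it matters, the example below collects (text, href) pairs with relative paths and then shows what happens when the dot is left out. The reduced page and the names rows, leaked, and scoped are illustrative.

```python
from lxml import etree

html_text = """
<html><body>
    <ul>
        <li><a href="https://www.cctv.com/">央视网</a></li>
    </ul>
    <ol>
        <li><a href="https://weibo.com/">新浪微博</a></li>
        <li><a href="https://www.zhihu.com/">知乎</a></li>
    </ol>
</body></html>
"""
tree = etree.HTML(html_text)
ol_li_list = tree.xpath("/html/body/ol/li")

# Relative paths (starting with ./) search only inside the current li.
rows = []
for li in ol_li_list:
    name = li.xpath("./a/text()")[0]
    href = li.xpath("./a/@href")[0]
    rows.append((name, href))
print(rows)     # [('新浪微博', 'https://weibo.com/'), ('知乎', 'https://www.zhihu.com/')]

# Pitfall: a path starting with // is absolute, even when called on an li,
# so it searches the whole document.
leaked = ol_li_list[0].xpath("//a/text()")
print(leaked)   # ['央视网', '新浪微博', '知乎']

# .// keeps the any-depth search but scopes it to the current li.
scoped = ol_li_list[0].xpath(".//a/text()")
print(scoped)   # ['新浪微博']
```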

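Tip: when a page is very complex, open the browser's developer tools (F12), right-click the element in the inspector, and copy its XPath (in Chrome: Copy > Copy XPath).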
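lxml can produce a similar absolute path itself through the getpath method of the element tree, which may help when checking a path copied from the browser. This is a minimal sketch on a reduced page; the variable names are illustrative, and the path a browser reports can differ from lxml's because browsers work on the live, script-modified DOM.

```python
from lxml import etree

html_text = """
<html><body>
    <ol>
        <li><a href="https://weibo.com/">新浪微博</a></li>
        <li><a href="https://www.zhihu.com/">知乎</a></li>
    </ol>
</body></html>
"""
tree = etree.HTML(html_text)

# Find one element by attribute, then ask lxml for its absolute path.
link = tree.xpath("//a[@href='https://www.zhihu.com/']")[0]
path = tree.getroottree().getpath(link)
print(path)

# Feeding the generated path back into xpath() finds the same element again.
same = tree.xpath(path)[0]
print(same.get("href"))     # https://www.zhihu.com/
```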

Hands-on: scraping data from 猪八戒 (zbj.com)

import requests
from lxml import etree

url = "https://beijing.zbj.com/search/f/?type=new&kw=saas"
resp = requests.get(url)

html = etree.HTML(resp.text)
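# Select the div for each result card. This absolute path mirrors the page layout
# when the post was written and will stop matching if the site changes its markup.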
divs = html.xpath("/html/body/div[6]/div/div/div[2]/div[5]/div[1]/div")
for div in divs:
    price = div.xpath("./div/div/a[1]/div[2]/div[1]/span[1]/text()")[0].strip("¥")
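    # The search keyword appears to sit in its own highlight tag, so text() returns
    # the pieces around it; joining them with "saas" restores the full title.
    title = "saas".join(div.xpath("./div/div/a[1]/div[2]/div[2]/p/text()"))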
    company_name = div.xpath("./div/div/a[2]/div[1]/p/text()")[1][2:]
    company_location = div.xpath("./div/div/a[2]/div[1]/div/span/text()")[0]
    print(title, price, company_name, company_location)

resp.close()
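The loop above indexes each xpath() result with [0], which raises IndexError as soon as one card lacks a field. One possible way to soften that is a small helper that returns a default instead. The sketch below is an illustration only: first_text is a hypothetical helper, and the sample markup with its class names is invented for the demo and does not reflect the real zbj.com layout.

```python
from lxml import etree


def first_text(node, path, default=""):
    """Return the first stripped string matched by path, or default if none match."""
    matches = node.xpath(path)
    return matches[0].strip() if matches else default


# Hypothetical markup standing in for one search-result card (not the real site layout).
sample = """
<div class="card">
    <span class="price">¥1200</span>
    <p class="title">saas platform development</p>
</div>
"""
card = etree.HTML(sample).xpath("//div[@class='card']")[0]

price = first_text(card, "./span[@class='price']/text()").strip("¥")
title = first_text(card, "./p[@class='title']/text()")
# This card has no location element, so the default is returned instead of raising.
location = first_text(card, "./span[@class='location']/text()", default="unknown")

print(title, price, location)   # saas platform development 1200 unknown
```

When fetching the live page, passing timeout= and a headers= dict with a browser-like User-Agent to requests.get is a common precaution, though whether the site needs it is not something the original states.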
