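XPath is a language for locating content in XML documents.
HTML is structured much like XML, so the same queries also work on a parsed HTML page.
To use it from Python, install the lxml module.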
from lxml import etree

xml = """
<note>
    <to>Tove</to>
    <from>Jani</from>
    <heading>Reminder</heading>
    <author>
        <nick id="10">Tom</nick>
        <nick id="11">Bob</nick>
        <nick id="12">Alice</nick>
        <nick id="13">John</nick>
        <div>
            <nick>Nested1</nick>
            <div>
                <nick>Nested2</nick>
            </div>
        </div>
        <span>
            <nick>Nested3</nick>
        </span>
    </author>
    <body>Don't forget me this weekend!</body>
</note>
"""
tree = etree.XML(xml)
result1 = tree.xpath("/note")
Here / separates the levels of the hierarchy, and the leading / stands for the root of the document.
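As a rough illustration of how each / descends one level, the sketch below uses a cut-down version of the note document. The variable names are chosen here for illustration and do not come from the original.

```python
from lxml import etree

# A shortened, illustrative version of the note document
doc = etree.XML("<note><to>Tove</to><from>Jani</from></note>")

root_match = doc.xpath("/note")      # the leading / starts at the document root
child_match = doc.xpath("/note/to")  # each further / goes one level down
no_match = doc.xpath("/to")          # to is not the root element, so nothing matches

print([el.tag for el in root_match])   # ['note']
print([el.tag for el in child_match])  # ['to']
print(no_match)                        # []
```

Note that xpath() returns a list even when only one element matches.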
result2 = tree.xpath("/note/from/text()")
text() extracts the text inside a node, as in the following queries:
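result3 = tree.xpath("/note/author/nick/text()")  # nick elements that are direct children of author
result4 = tree.xpath("/note/author/div/nick/text()")  # nick elements directly inside author/div
result5 = tree.xpath("/note/author//nick/text()")  # // matches nick at any depth below author
result6 = tree.xpath("/note/author/*/nick/text()")  # * is a wildcard for any single element at that level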
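To make the difference between these four paths concrete, here is a small self-contained sketch. The document is a reduced stand-in with made-up names, not the data used above.

```python
from lxml import etree

# Illustrative document: one direct nick, two nested under div, one under span
sample_xml = """
<author>
    <nick>Tom</nick>
    <div>
        <nick>Ann</nick>
        <div><nick>Ben</nick></div>
    </div>
    <span><nick>Cid</nick></span>
</author>
"""
root = etree.XML(sample_xml)

direct = root.xpath("/author/nick/text()")       # direct children only
in_div = root.xpath("/author/div/nick/text()")   # exactly author/div/nick
anywhere = root.xpath("/author//nick/text()")    # any depth below author
one_level = root.xpath("/author/*/nick/text()")  # exactly one element in between

print(direct)     # ['Tom']
print(in_div)     # ['Ann']
print(anywhere)   # ['Tom', 'Ann', 'Ben', 'Cid']
print(one_level)  # ['Ann', 'Cid']
```

The wildcard stands for exactly one level, which is why Ben (two levels down) appears under // but not under *.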
<!DOCTYPE html>
<html lang="zh_CN">
<head>
    <meta charset="utf-8" />
    <title>Title</title>
</head>
<body>
    <ul>
        <li><a href="https://www.people.com.cn/">人民网</a></li>
        <li><a href="https://www.xinhuanet.com/">新华网</a></li>
        <li><a href="https://www.cctv.com/">央视网</a></li>
    </ul>
    <ol>
        <li><a href="https://weibo.com/">新浪微博</a></li>
        <li><a href="https://www.zhihu.com/">知乎</a></li>
        <li><a href="https://www.douban.com/">豆瓣</a></li>
    </ol>
    <div class="aaa">AAA</div>
    <div class="bbb">BBB</div>
</body>
</html>
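The queries on this page need a parsed tree. The text does not show how tree is built, so the following is only one plausible way, assuming the markup is held in a string and parsed with etree.HTML. The shortened, ASCII-only page is a stand-in for the one above.

```python
from lxml import etree

# Stand-in markup; the real page above would be parsed the same way
page_source = """<!DOCTYPE html>
<html>
<head><title>Title</title></head>
<body>
<ul>
<li><a href="https://www.cctv.com/">CCTV</a></li>
</ul>
</body>
</html>"""

# etree.HTML uses the forgiving HTML parser and returns the root <html> element
tree = etree.HTML(page_source)

print(tree.tag)                                # html
print(tree.xpath("/html/head/title/text()"))   # ['Title']
print(tree.xpath("/html/body/ul/li/a/@href"))  # ['https://www.cctv.com/']
```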
result1 = tree.xpath("/html/body/ul/li/a/text()")
# Note: XPath positions start at 1, not 0
result2 = tree.xpath("/html/body/ul/li[1]/a/text()")
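Because Python lists start at 0 while XPath positions start at 1, the two are easy to mix up. The sketch below, on a made-up three-item list, shows what each index selects; last() is a standard XPath function.

```python
from lxml import etree

# Illustrative list with three items
items = etree.HTML(
    "<html><body><ul>"
    "<li><a>A</a></li><li><a>B</a></li><li><a>C</a></li>"
    "</ul></body></html>"
)

first_item = items.xpath("/html/body/ul/li[1]/a/text()")      # position 1 is the first li
last_item = items.xpath("/html/body/ul/li[last()]/a/text()")  # last() is the final position
zero_item = items.xpath("/html/body/ul/li[0]/a/text()")       # no node has position 0

print(first_item)  # ['A']
print(last_item)   # ['C']
print(zero_item)   # []
```

An index of 0 does not raise an error; it simply matches nothing, which can hide the mistake.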
You can filter the matched nodes with a predicate of the form [@attribute='value'].
result3 = tree.xpath("/html/body/ol/li/a[@href='https://www.zhihu.com/']/text()")
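The same predicate form works for any attribute, not only href. As a small sketch on stand-in markup (modelled on the two div elements in the page above), filtering by class and reading attribute values might look like this:

```python
from lxml import etree

# Stand-in markup with two classed divs and one link
snippet = etree.HTML(
    '<html><body>'
    '<div class="aaa">AAA</div>'
    '<div class="bbb">BBB</div>'
    '<a href="https://www.zhihu.com/">Zhihu</a>'
    '</body></html>'
)

by_class = snippet.xpath("/html/body/div[@class='bbb']/text()")
by_href = snippet.xpath("/html/body/a[@href='https://www.zhihu.com/']/text()")
all_classes = snippet.xpath("/html/body/div/@class")  # @class alone returns the values

print(by_class)     # ['BBB']
print(by_href)      # ['Zhihu']
print(all_classes)  # ['aaa', 'bbb']
```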
Extracting details in a loop (note the relative lookup):
ol_li_list = tree.xpath("/html/body/ol/li")
for li in ol_li_list:
    # Extract the link text from each li
    result4 = li.xpath("./a/text()")  # relative lookup: ./ starts from the current li
    result5 = li.xpath("./a/@href")  # @href reads the attribute value
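To show why the leading ./ matters, the sketch below pairs each link's text with its URL and then runs an absolute path from inside the loop for comparison. The markup is a shortened stand-in, and the names links and absolute are illustrative.

```python
from lxml import etree

# Shortened stand-in for the ordered list in the page above
doc = etree.HTML("""<html><body>
<ol>
<li><a href="https://weibo.com/">Weibo</a></li>
<li><a href="https://www.zhihu.com/">Zhihu</a></li>
</ol>
</body></html>""")

links = []
for li in doc.xpath("/html/body/ol/li"):
    texts = li.xpath("./a/text()")  # relative: only the <a> inside this li
    hrefs = li.xpath("./a/@href")
    if texts and hrefs:             # skip items without a link
        links.append((texts[0], hrefs[0]))

# An absolute path ignores the current li and searches from the document root
absolute = doc.xpath("/html/body/ol/li")[0].xpath("/html/body/ol/li/a/text()")

print(links)     # [('Weibo', 'https://weibo.com/'), ('Zhihu', 'https://www.zhihu.com/')]
print(absolute)  # ['Weibo', 'Zhihu']
```

Even though the absolute query was called on the first li, it returns both links, which is usually not what a per-item loop wants.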
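Tip: when a page is extremely complex, open the browser's developer tools (F12), right-click the element in the inspector, and copy its XPath.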
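lxml can produce a similar absolute path itself, which may help when checking a path copied from the browser. This is a minimal sketch on made-up markup, using getroottree() and getpath(); the path the browser gives for a real page can differ from what lxml sees if the browser has altered the DOM.

```python
from lxml import etree

# Made-up markup for illustration
doc = etree.HTML(
    "<html><body><ul><li><a>A</a></li><li><a>B</a></li></ul></body></html>"
)

target = doc.xpath("//a[text()='B']")[0]   # pick an element by its text
path = doc.getroottree().getpath(target)   # absolute XPath to that element

print(path)
print(doc.xpath(path)[0].text)  # B
```

Feeding the generated path back into xpath() returns the same element, which is a quick way to confirm a path is valid for the parsed tree.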
Hands-on: scraping data from 猪八戒 (zbj.com)
import requests
from lxml import etree

url = "https://beijing.zbj.com/search/f/?type=new&kw=saas"
resp = requests.get(url)
html = etree.HTML(resp.text)

# Get the div for each result block
divs = html.xpath("/html/body/div[6]/div/div/div[2]/div[5]/div[1]/div")
for div in divs:
    price = div.xpath("./div/div/a[1]/div[2]/div[1]/span[1]/text()")[0].strip("¥")
    # Join the title's text fragments, re-inserting the search keyword between them
    title = "saas".join(div.xpath("./div/div/a[1]/div[2]/div[2]/p/text()"))
    company_name = div.xpath("./div/div/a[2]/div[1]/p/text()")[1][2:]
    company_location = div.xpath("./div/div/a[2]/div[1]/div/span/text()")[0]
    print(title, price, company_name, company_location)

resp.close()
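Indexing with [0] raises IndexError as soon as one block lacks a field, and long positional paths break when the site layout changes. The sketch below shows one possible way to guard against the first problem. It runs offline on hypothetical markup: the card and price class names and the first() helper are assumptions for illustration, not the real page structure.

```python
from lxml import etree


def first(values, default=""):
    """Return the first XPath result with whitespace stripped, or a default."""
    return values[0].strip() if values else default


# Hypothetical markup standing in for a search results page
sample = """<html><body>
<div class="card"><span class="price">&#165;100</span><p>Build a <em>saas</em> site</p></div>
<div class="card"><p>No price listed</p></div>
</body></html>"""

page = etree.HTML(sample)
rows = []
for card in page.xpath("//div[@class='card']"):
    # A missing price yields "" instead of an IndexError
    price = first(card.xpath("./span[@class='price']/text()")).lstrip("¥")
    # string() concatenates all text under the node, including nested tags
    title = card.xpath("string(./p)").strip()
    rows.append((title, price))

print(rows)  # [('Build a saas site', '100'), ('No price listed', '')]
```

Using string() is an alternative to joining text fragments with the keyword, since it keeps text that sits inside nested tags.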