Python Web Scraping from Scratch (5): The pyQuery Library


Published: 2019-06-06


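What is pyQuery:

　　pyQuery is a powerful and flexible web page parsing library. If you find regular expressions too tedious to write (I can't write them myself), if you find BeautifulSoup's syntax hard to remember, and if you already know jQuery's syntax, then pyQuery is your best choice.

Installing pyQuery: run pip3 install pyquery.

Basic usage of pyQuery:

Initialization:

Initializing from a string: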

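#!/usr/bin/env python
# -*- coding: utf-8 -*-
from pyquery import PyQuery as pq

# Assumed sample markup: the original tags did not survive extraction,
# so the structure, ids, classes, and hrefs below are a reconstruction.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="title" id="link3">Title</a>; and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""

doc = pq(html)    # parse the string into a PyQuery document
print(doc('a'))   # select every <a> element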


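Initializing from a URL:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Initialize from a URL (requires network access)
from pyquery import PyQuery as pq

doc = pq(url='http://www.baidu.com')   # fetch the page and parse it
print(doc('input'))                    # select every <input> element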


Initializing from a file:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Initialize from a local file; baidu.html must exist in the working directory
from pyquery import PyQuery as pq

doc = pq(filename='baidu.html')
print(doc('title'))


Selection works the same way as in jQuery: selecting by id, name, or class all follows jQuery syntax, and so do many other selectors.

Basic CSS selectors:

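#!/usr/bin/env python
# -*- coding: utf-8 -*-
# CSS selectors
from pyquery import PyQuery as pq

# Assumed sample markup: the original tags did not survive extraction,
# so the structure, ids, classes, and hrefs below are a reconstruction.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="title" id="link3">Title</a>; and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""

doc = pq(html)
print(doc('.title'))   # class selector: every element whose class is "title"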


Finding elements:

Descendant elements (find):

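#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Descendant elements
from pyquery import PyQuery as pq

# Assumed sample markup: the original tags did not survive extraction,
# so the structure, ids, classes, and hrefs below are a reconstruction.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="title" id="link3">Title</a>; and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""

doc = pq(html)
items = doc('.title')        # every element whose class is "title"
print(type(items))
print(items)
b_tags = items.find('b')     # <b> elements inside the selected elements
print(type(b_tags))
print(b_tags)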

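This code selects the elements whose class is title (.title is a class selector, not an id selector). In the sample, two elements carry that class: a p tag and an a tag. We then call find to locate the b tag inside them.

Note that find searches inside each of the selected elements, at any depth.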

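children:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Direct children
from pyquery import PyQuery as pq

# Assumed sample markup: the original tags did not survive extraction,
# so the structure, ids, classes, and hrefs below are a reconstruction.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="title" id="link3">Title</a>; and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""

doc = pq(html)
items = doc('.title')
print(items.children())   # direct children of the selected elements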


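You can also pass a selector to children() to filter the result:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Direct children filtered by a selector
from pyquery import PyQuery as pq

# Assumed sample markup: the original tags did not survive extraction,
# so the structure, ids, classes, and hrefs below are a reconstruction.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="title" id="link3">Title</a>; and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""

doc = pq(html)
items = doc('.title')
print(items.children('b'))   # only the direct children that are <b> elements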

The output is the same as in the previous example.

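Parent element:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Parent element
from pyquery import PyQuery as pq

# Assumed sample markup: the original tags did not survive extraction,
# so the structure, ids, classes, and hrefs below are a reconstruction.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="title" id="link3">Title</a>; and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""

doc = pq(html)
items = doc('#link1')
print(items)
print(items.parent())   # the direct parent of #link1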


parent() returns only the direct parent. To get every ancestor, use parents(), which also accepts a selector to filter the ancestors:

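#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Ancestor elements
from pyquery import PyQuery as pq

# Assumed sample markup: the original tags did not survive extraction,
# so the structure, ids, classes, and hrefs below are a reconstruction.
html = """
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Title</a>
<a href="http://example.com/title" class="sister" id="link4">Title</a>
</p>
<p class="story">...</p>
</body>
</html>
"""

doc = pq(html)
items = doc('#link1')
print(items)
print(items.parents('body'))   # only the ancestors that match "body"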


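Sibling elements:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Sibling elements
from pyquery import PyQuery as pq

# Assumed sample markup: the original tags did not survive extraction,
# so the structure, ids, classes, and hrefs below are a reconstruction.
html = """
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Title</a>
<a href="http://example.com/title" class="sister" id="link4">Title</a>
</p>
<p class="story">...</p>
</body>
</html>
"""

doc = pq(html)
items = doc('#link1')
print(items)
print(items.siblings('#link2'))   # siblings of #link1 that match #link2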


That covers the methods for finding elements. Next, let's look at how to iterate over elements.

Iteration

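#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Iteration
from pyquery import PyQuery as pq

# Assumed sample markup: the original tags did not survive extraction,
# so the structure, ids, classes, and hrefs below are a reconstruction.
html = """
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Title</a>
<a href="http://example.com/title" class="sister" id="link4">Title</a>
</p>
<p class="story">...</p>
</body>
</html>
"""

doc = pq(html)
items = doc('a')
# items() yields each matched element as its own PyQuery object
for index, item in enumerate(items.items()):
    print(index, item)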


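Getting information:

　　Getting attributes:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Getting attributes
from pyquery import PyQuery as pq

# Assumed sample markup: the original tags did not survive extraction,
# so the structure, ids, classes, and hrefs below are a reconstruction.
html = """
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Title</a>
<a href="http://example.com/title" class="sister" id="link4">Title</a>
</p>
<p class="story">...</p>
</body>
</html>
"""

doc = pq(html)
items = doc('a')
print(items)
print(items.attr('href'))   # attribute of the first matched element
print(items.attr.href)      # equivalent attribute-style access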


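　　Getting text:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Getting text
from pyquery import PyQuery as pq

# Assumed sample markup: the original tags did not survive extraction,
# so the structure, ids, classes, and hrefs below are a reconstruction.
html = """
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Title</a>
<a href="http://example.com/title" class="sister" id="link4">Title</a>
</p>
<p class="story">...</p>
</body>
</html>
"""

doc = pq(html)
items = doc('a')
print(items)
print(items.text())         # text of all matched elements, joined into one string
print(type(items.text()))   # <class 'str'>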


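　　Getting HTML:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Getting HTML
from pyquery import PyQuery as pq

# Assumed sample markup: the original tags did not survive extraction,
# so the structure, ids, classes, and hrefs below are a reconstruction.
html = """
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Title</a>
<a href="http://example.com/title" class="sister" id="link4">Title</a>
</p>
<p class="story">...</p>
</body>
</html>
"""

doc = pq(html)
items = doc('a')
print(items.html())   # inner HTML of the first matched element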


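DOM operations:

addClass, removeClass

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# DOM operations: addClass, removeClass
from pyquery import PyQuery as pq

# Assumed sample markup: the original tags did not survive extraction,
# so the structure, ids, classes, and hrefs below are a reconstruction.
html = """
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Title</a>
<a href="http://example.com/title" class="sister" id="link4">Title</a>
</p>
<p class="story">...</p>
</body>
</html>
"""

doc = pq(html)
items = doc('#link2')
print(items)
items.addClass('addStyle')     # same as add_class('addStyle')
print(items)
items.remove_class('sister')   # same as removeClass('sister')
print(items)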


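attr, css:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# DOM operations: attr, css
from pyquery import PyQuery as pq

# Assumed sample markup: the original tags did not survive extraction,
# so the structure, ids, classes, and hrefs below are a reconstruction.
html = """
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Title</a>
<a href="http://example.com/title" class="sister" id="link4">Title</a>
</p>
<p class="story">...</p>
</body>
</html>
"""

doc = pq(html)
items = doc('#link2')
items.attr('name', 'addname')   # set (or overwrite) the name attribute
print(items)
items.css('width', '100px')     # written into the element's style attribute
print(items)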

attr can add a new attribute; if the attribute already exists, its value is overwritten.


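remove:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# DOM operations: remove
from pyquery import PyQuery as pq

# Assumed sample markup: the original tags did not survive extraction,
# so the structure below is a reconstruction.
html = """
<div class="wrap">
    Hello World
    <p>This is a paragraph.</p>
</div>
"""

doc = pq(html)
wrap = doc('.wrap')
print(wrap.text())
wrap.find('p').remove()   # delete the <p> element from the tree
print("Data after remove:")
print(wrap)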


There are many other DOM methods; if you want to learn more, read the official pyQuery documentation.

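Pseudo-class selectors:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Pseudo-class selectors
from pyquery import PyQuery as pq

# Assumed sample markup: the original tags did not survive extraction,
# so the structure, ids, classes, and hrefs below are a reconstruction.
html = """
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Title</a>
<a href="http://example.com/title" class="sister" id="link4">Title</a>
</p>
<p class="story">...</p>
</body>
</html>
"""

doc = pq(html)
# print(doc)
wrap = doc('a:first-child')      # <a> elements that are the first child of their parent
print(wrap)
wrap = doc('a:last-child')       # <a> elements that are the last child of their parent
print(wrap)
wrap = doc('a:nth-child(2)')     # <a> elements that are the second child of their parent
print(wrap)
wrap = doc('a:gt(2)')            # matched <a> elements with index greater than 2 (indexes start at 0, not 1)
print(wrap)
wrap = doc('a:nth-child(2n)')    # <a> elements in even child positions (2, 4, ...)
print(wrap)
wrap = doc('a:contains(Lacie)')  # <a> elements whose text contains "Lacie"
print(wrap)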

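I won't list every selector here. To learn more about CSS selectors, see the documentation provided by W3C.

That covers the basic usage of pyQuery. For more detail, read the official documentation.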


Thank you for reading. If anything here is incorrect, corrections are welcome. Thanks!

Reposted from: https://www.cnblogs.com/cxiaocai/p/10939829.html
