harry's blog

1. The requests library

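The Requests library provides a simple interface for sending HTTP requests, including GET, POST, PUT, and DELETE. Its main strengths are ease of use and thorough documentation.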

Installation: pip install requests
Basic usage:

import requests

url = 'http://baidu.com'
response = requests.get(url)  # Send a GET request

if response.status_code == 200:  # Check the response status code
    print(response.text)  # Print the response body
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")
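
The status check above can be folded into a small helper. The following is a minimal sketch rather than a definitive implementation: the function name fetch_text and the 10-second default timeout are assumptions chosen for illustration, while timeout=, raise_for_status(), and requests.RequestException are part of the Requests API.

```python
import requests


def fetch_text(url, timeout=10):
    """Return the response body as text, or None if the request fails."""
    try:
        # A timeout keeps the call from hanging on an unresponsive server.
        response = requests.get(url, timeout=timeout)
        # raise_for_status() raises requests.HTTPError for 4xx/5xx responses.
        response.raise_for_status()
    except requests.RequestException as exc:
        # RequestException is the base class for connection errors,
        # timeouts, invalid URLs, and HTTP error statuses.
        print(f"Request failed: {exc}")
        return None
    return response.text
```

Calling fetch_text('http://baidu.com') would return the page text on success and None otherwise, so the caller only has to check for None.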

Sending a POST request:

data = {'key1': 'value1', 'key2': 'value2'}
response = requests.post('http://example.com/post', data=data)
print(response.text)
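
To see what data= actually puts on the wire, a request can be built and inspected without being sent. This sketch uses requests.Request(...).prepare() for that purpose and contrasts data= (form encoding) with json= (JSON encoding); the URL and payload are the placeholders from the example above.

```python
import requests

data = {'key1': 'value1', 'key2': 'value2'}

# Build the requests without sending them, so the encoding can be inspected.
form_request = requests.Request('POST', 'http://example.com/post', data=data).prepare()
json_request = requests.Request('POST', 'http://example.com/post', json=data).prepare()

print(form_request.headers['Content-Type'])  # content type chosen for data=
print(form_request.body)                     # form-encoded body
print(json_request.headers['Content-Type'])  # content type chosen for json=
print(json_request.body)                     # JSON-encoded body
```

Use data= when the server expects an HTML-form style submission and json= when it expects a JSON document.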

Setting request headers and query parameters:

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
params = {'key1': 'value1', 'key2': 'value2'}
response = requests.get('http://example.com', headers=headers, params=params)
print(response.text)
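
The params dictionary is encoded into the URL's query string, and headers are attached to the request as given. The sketch below prepares a request without sending it to make that visible; the /search path, the User-Agent string, and the parameter names are hypothetical values chosen for illustration.

```python
import requests

headers = {'User-Agent': 'my-crawler/0.1'}  # hypothetical User-Agent string
params = {'q': 'python', 'page': 2}         # hypothetical query parameters

# Prepare the request without sending it to inspect the final URL.
prepared = requests.Request(
    'GET', 'http://example.com/search', headers=headers, params=params
).prepare()

print(prepared.url)                    # URL with the encoded query string
print(prepared.headers['User-Agent'])  # header as it would be sent
```

Letting Requests build the query string avoids hand-written URL concatenation and takes care of escaping.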

Requests is often paired with BeautifulSoup4: Requests fetches the page, and BeautifulSoup4 parses it.

import requests
from bs4 import BeautifulSoup

url = 'http://example.com'
response = requests.get(url)

if response.status_code == 200:  # Check the response status code
    soup = BeautifulSoup(response.text, 'html.parser')  # Parse the HTML

    title = soup.title.text  # Look up elements
    links = soup.find_all('a')

    for link in links:
        print(link.get('href'))
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")
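
Links on real pages are often relative, and some pages have no title at all. A minimal sketch of a more defensive version of the parsing step follows; the function name extract_links and the sample HTML are hypothetical, and the HTML string stands in for response.text so the example runs without a network connection.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return the page title and a list of absolute link URLs."""
    soup = BeautifulSoup(html, 'html.parser')
    # soup.title is None when the page has no <title> tag.
    title = soup.title.get_text(strip=True) if soup.title else None
    links = []
    for anchor in soup.find_all('a', href=True):  # skip <a> tags without href
        # Resolve relative links against the page URL.
        links.append(urljoin(base_url, anchor['href']))
    return title, links


# Hypothetical HTML standing in for response.text.
sample_html = """
<html><head><title>Demo page</title></head>
<body>
  <a href="/about">About</a>
  <a href="https://example.org/docs">Docs</a>
  <a name="top">No href here</a>
</body></html>
"""

title, links = extract_links(sample_html, 'http://example.com')
print(title)
print(links)
```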

2. BeautifulSoup4

BeautifulSoup4 is a library for parsing HTML and XML documents that makes it easy to extract data from web pages. It is well suited to static page content.

Installing BeautifulSoup4: pip install beautifulsoup4
Basic usage:

import requests
from bs4 import BeautifulSoup

url = 'http://example.com'
response = requests.get(url)  # Send the HTTP request
soup = BeautifulSoup(response.text, 'html.parser')  # Parse the page content

title = soup.title.text  # Look up elements
links = soup.find_all('a')

for link in links:
    print(link.get('href'))
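
Beyond find_all, BeautifulSoup4 also accepts CSS selectors through select and select_one. The sketch below pulls structured records out of a list; the HTML snippet, class names, and field names are hypothetical and exist only to illustrate the calls.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration.
html = """
<ul id="posts">
  <li class="post"><a href="/p/1">First post</a><span class="date">2024-01-05</span></li>
  <li class="post"><a href="/p/2">Second post</a><span class="date">2024-02-11</span></li>
</ul>
"""

soup = BeautifulSoup(html, 'html.parser')

posts = []
for item in soup.select('#posts li.post'):   # CSS selector: every matching <li>
    link = item.select_one('a')              # first <a> inside this item
    date = item.find('span', class_='date')  # keyword filter on the class attribute
    posts.append({
        'title': link.get_text(strip=True),
        'href': link['href'],
        'date': date.get_text(strip=True),
    })

print(posts)
```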

3. Selenium

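Selenium is a powerful tool for automating browsers and simulating user actions such as clicking and typing. It is especially well suited to pages whose content is loaded dynamically.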

Installing Selenium: pip install selenium
Installing a WebDriver: install the WebDriver for Chrome (chromedriver) and add its location to the system PATH environment variable.
Basic usage:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()  # Initialize the WebDriver
driver.get('http://example.com')  # Open the page

search_box = driver.find_element(By.NAME, 'q')  # Look up an element (the page must contain an input named 'q')
search_box.send_keys('Hello, World!')
search_box.send_keys(Keys.RETURN)

content = driver.page_source  # Get the page content
driver.quit()  # Close the browser

4. Differences and relationships among the three

4.1 Requests vs. Selenium

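Requests: sends HTTP requests and suits static pages and API calls. It is fast and simple to use.
Selenium: automates a browser and suits dynamic pages and simulated user interaction, including content rendered by JavaScript (such as data loaded through AJAX requests), form submission, and other interactive scenarios. It is slower because it drives a real browser and waits for the page to render, and it takes more setup: some familiarity with browser automation and WebDriver configuration is needed.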

4.2 How Requests and BeautifulSoup4 relate

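Requests: fetches the HTML content of a page.
BeautifulSoup4: parses HTML and extracts the data you need, usually from a page fetched with Requests. It does not fetch pages or run JavaScript itself, so it works on static content and cannot see dynamically loaded content. It is relatively fast because it parses the HTML document directly, with no browser rendering to wait for. That makes it a good fit for quickly extracting static page content when no user interaction has to be simulated. It is also relatively simple: basic knowledge of HTML and CSS selectors is enough.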

4.3 Summary

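Requests: sends HTTP requests to fetch web pages or API data.
BeautifulSoup4: parses HTML and extracts content from it.
Selenium: automates a browser to handle dynamic content and simulate user interaction.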

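Choose a tool, or a combination, based on the task and the characteristics of the page. For example, fetch a page with Requests and parse it with BeautifulSoup4; for dynamic content, load the page with Selenium and then extract the data with BeautifulSoup4.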
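
The Selenium plus BeautifulSoup4 combination can be sketched as a single function. This is a minimal illustration, not a definitive implementation: the name links_from_rendered_page is hypothetical, and the function assumes driver is a Selenium WebDriver (or any object exposing get() and page_source).

```python
from bs4 import BeautifulSoup


def links_from_rendered_page(driver, url):
    """Load url in a browser via Selenium, then parse the rendered HTML."""
    driver.get(url)  # the browser loads the page and runs its JavaScript
    # page_source reflects the DOM at the moment it is read, so content that
    # arrives later may need an explicit wait before this line.
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    return [a['href'] for a in soup.find_all('a', href=True)]
```

With a real session, the call would look like links_from_rendered_page(webdriver.Chrome(), 'http://example.com'), followed by driver.quit() once the data has been extracted.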

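This article was written by Yonghui Wang and is licensed under the Creative Commons Attribution 4.0 International License.
Unless marked as reposted or attributed to another source, articles on this site are original works or translations by this site; please credit the author when reposting.
Last edited: Dec 19, 2024 12:13 pm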
