Scraping Flight Information from Ctrip
utils
Word count: 860 | Reading time ≈ 4 min


This post describes a crawler, written on Linux, that scrapes flight and related information from Ctrip in real time.

1. BrowserMob Proxy

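BrowserMob Proxy (BMP) is an HTTP proxy service that can intercept HTTP requests and responses. It is written in Java, so make sure Java is installed on the system. The installation steps are as follows: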

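  1. Install Java: sudo apt update && sudo apt install default-jdk
  2. Install the BrowserMob Proxy Python client: pip install browsermob-proxy
  3. Download the browsermob-proxy binary: download the browsermob-proxy release from GitHub; it is used to start BrowserMob Proxy
  4. Test whether the installation succeeded with the following script: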
from browsermobproxy import Server

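# Start the proxy server; change the path to where you extracted the download
server = Server(r'/blog/xiecheng/browsermob-proxy-2.1.4/bin/browsermob-proxy', options={"port": 8090})
server.start()
proxy = server.create_proxy()
print('proxy', proxy.proxy)

# Stop the Java process so it does not keep port 8090 occupied after the test
server.stop()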

If the program prints proxy localhost:xxxx, the installation succeeded. If an error occurs, check server.log for details. In my case I hit the following error:

(python) (base) root@iZuf67mlx4ftb5vohxeobnZ:/blog/xiecheng# python test.py 
Traceback (most recent call last):
  File "/blog/xiecheng/test.py", line 7, in <module>
    proxy = server.create_proxy()
  File "/root/miniconda3/envs/python/lib/python3.10/site-packages/browsermobproxy/server.py", line 40, in create_proxy
    client = Client(self.url[7:], params)
  File "/root/miniconda3/envs/python/lib/python3.10/site-packages/browsermobproxy/client.py", line 38, in __init__
    self.port = jcontent['port']
KeyError: 'port'

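I could not find a solution online. It turned out that port 8090 was not open (or was already occupied by another program); opening the port fixed it.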
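To rule out the "port already occupied" case before starting the server, a quick check along the following lines may help. This is only a minimal sketch: port_is_free is a hypothetical helper name, it only detects a local listener on the port, and it does not detect firewall rules that block the port.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is currently accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 when the connection succeeds, i.e. the port is taken
        return sock.connect_ex((host, port)) != 0


if __name__ == "__main__":
    # 8090 is the port passed to Server(...) in the test script above
    print("port 8090 free:", port_is_free(8090))
```

If this reports that the port is not free, either stop the process that holds it or pass a different port in the options dictionary.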

2. Install the browser

# Linux
unzip chromedriver_linux64.zip
cp xx/chromedriver /usr/bin/chromedriver
chmod +x /usr/bin/chromedriver

# macOS (Apple Silicon)
unzip chromedriver-mac-arm64.zip
cp xx/chromedriver /usr/local/bin/chromedriver
3. Start scraping flight information

First, open a new terminal window and run browsermob-proxy:

(python) (base) root@iZuf67mlx4ftb5vohxeobnZ:/blog/xiecheng# ./browsermob-proxy-2.1.4/bin/browsermob-proxy
Running BrowserMob Proxy using LittleProxy implementation. To revert to the legacy implementation, run the proxy with the command-line option '--use-littleproxy false'.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by com.google.inject.internal.cglib.core.$ReflectUtils$2 (file:/blog/xiecheng/browsermob-proxy-2.1.4/lib/browsermob-dist-2.1.4.jar) to method java.lang.ClassLoader.defineClass(java.lang.String,byte[],int,int,java.security.ProtectionDomain)
WARNING: Please consider reporting this to the maintainers of com.google.inject.internal.cglib.core.$ReflectUtils$2
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
[INFO  2024-06-18T16:24:48,907 net.lightbody.bmp.proxy.Main] (main) Starting BrowserMob Proxy version 2.1.4 
[INFO  2024-06-18T16:24:48,974 org.eclipse.jetty.util.log] (main) jetty-7.x.y-SNAPSHOT 
[INFO  2024-06-18T16:24:49,068 org.eclipse.jetty.util.log] (main) started o.e.j.s.ServletContextHandler{/,null} 
[INFO  2024-06-18T16:24:49,403 org.eclipse.jetty.util.log] (main) Started SelectChannelConnector@0.0.0.0:8080 
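The log above shows the server's control API listening on port 8080. As a rough sketch of how a script could ask that running server for a proxy port, the snippet below calls BrowserMob Proxy's REST endpoint POST /proxy, which replies with a JSON object containing a port field. The helper names create_proxy and parse_proxy_port are hypothetical, and the address http://localhost:8080 is an assumption taken from the log; this is not necessarily the approach the original crawler uses.

```python
import json
import urllib.request


def parse_proxy_port(body: str) -> int:
    """Extract the proxy port from the JSON reply of POST /proxy."""
    data = json.loads(body)
    if "port" not in data:
        # A reply without "port" is what surfaces as KeyError: 'port' in the Python client
        raise RuntimeError(f"unexpected reply from BrowserMob Proxy: {body!r}")
    return int(data["port"])


def create_proxy(api: str = "http://localhost:8080") -> int:
    """Ask a running BrowserMob Proxy server to open a new proxy; return its port."""
    request = urllib.request.Request(f"{api}/proxy", data=b"", method="POST")
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_proxy_port(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires the browsermob-proxy server from the previous step to be running
    print("proxy listening on port", create_proxy())
```

The returned port is the one a browser would be pointed at as its HTTP proxy, while port 8080 remains the control API only.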

Then
