feat(v3.0): done

dylanyang17 · Dec 7, 2024 · 453189c
1 parent 897b92c
Showing 8 changed files with 188 additions and 151 deletions.
diff --git a/.gitignore b/.gitignore
@@ -1,4 +1,4 @@
-download/
+downloads/
 *~
 *.swp
 .idea/

diff --git a/README.md b/README.md
@@ -4,68 +4,71 @@
 
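 With the pandemic making textbooks hard to buy, I wrote a Python script that crawls the Tsinghua course-reserves platform so that everyone can study online. I still had plenty of other things to do (Orz, my Mao Zedong Thought coursework isn't finished yet), so the script is fairly simple.
 
-It crawls every page image of a book with multiple processes (the resolution is high; see the example below) and merges them into a PDF automatically. Downloads are resumable, so images that already exist are not downloaded again. The student ID and password the script asks for are required by Tsinghua University's electronic identity authentication service and are used only to authenticate against official Tsinghua services in order to access books on the reserves platform. They are never stored locally, let alone uploaded anywhere; I give my word on this, and anyone with doubts is welcome to read the code.
+It crawls every page image of a book with multiple processes (the resolution is high; see the example below) and merges them into a PDF automatically. Downloads are resumable, so images that already exist are not downloaded again. Because of two-factor authentication, the script no longer asks for a username and password; a token must be obtained manually instead.
 
 Note that this script is **intended only to help Tsinghua students and staff study**. Please do not distribute the downloaded e-books (especially to unauthorized people outside the university); bulk downloading of books is a violation and is firmly opposed. Please respect copyright and use the resources reasonably. The author accepts no responsibility for any consequences of misusing this script.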
 
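 ## Instructions
 
 ### Environment
 
-Requires python3. Install the dependencies from requirements.txt in one step: ``pip install -r requirements.txt``.
+Requires python3 (tested with python 3.7-3.13). Install the dependencies from requirements.txt in one step: ``pip install -r requirements.txt``.
 
 ### Usage
 
-Downloads PDF versions of e-books from the Tsinghua reserves platform: http://reserves.lib.tsinghua.edu.cn . After finding the book you need, open its reading page and copy the URL. https also works, but the platform's certificate has expired, which causes many warnings to be printed.
+Downloads PDF versions of e-books from the Tsinghua reserves platform: https://ereserves.lib.tsinghua.edu.cn/ .
 
 Run ``python main.py -h`` to print the help message: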
 
 ```
-usage: main.py [-h] [-n N] [-q Q] [-p] [-r] url
+usage: main.py [-h] -t TOKEN [-n N] [-q Q] [-d] [-r] url
 
-Version: v2.1.3. Download e-book from http://reserves.lib.tsinghua.edu.cn. By
-default, the number of processes is four and the temporary images will not be
-preserved. For example, "python main.py http://reserves.lib.tsinghua.edu.cn/bo
-ok5//00004634/00004634000/mobile/index.html".
+Version: v3.0. Download e-book from http://ereserves.lib.tsinghua.edu.cn. By default, the
+number of processes is four and the temporary images WILL BE preserved. For example, "python
+main.py https://ereserves.lib.tsinghua.edu.cn/bookDetail/c01e1db11c4041a39db463e810bac8f9
+4af518935a1ec46ef --token eyJhb...". Note that you need to manually login the ereserves
+website and obtain the token from the FIRST request after login, like "/index?token=xxx", due
+to two-factor authentication (2FA).
 
 positional arguments:
   url
 
-optional arguments:
-  -h, --help      show this help message and exit
-  -n N            Optional, [1~16] (4 by default). The number of processes.
-  -q Q            Optional, [3~10] (10 by default). The quality of the
-                  generated PDF. The bigger the value, the higher the
-                  resolution.
-  -p, --preserve  Optional. Preserve the temporary images.
-  -r, --auto-resize  Optional. Automatically unify page sizes.
+options:
+  -h, --help            show this help message and exit
+  -t TOKEN, --token TOKEN
+                        Required. The token from the "/index?token=xxx".
+  -n N                  Optional, [1~16] (4 by default). The number of processes.
+  -q Q                  Optional, [3~10] (10 by default). The quality of the generated PDF.    
+                        The bigger the value, the higher the resolution.
+  -d, --del-img         Optional. Delete the temporary images.
+  -r, --auto-resize     Optional. Automatically unify page sizes.
 ```
 
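-Normally no extra arguments are needed; the default number of processes is 4. As in the example in the help message above, run the following from the directory containing main.py: ``python main.py http://reserves.lib.tsinghua.edu.cn/book5//00004634/00004634000/mobile/index.html``. After you are prompted for the username and password (the password is not echoed) and the number of chapters, the book is downloaded into the download subdirectory.
+Normally only the --token argument is needed; the default number of processes is 4. As in the example in the help message above, run the following from the directory containing main.py: ``python main.py https://ereserves.lib.tsinghua.edu.cn/bookDetail/c01e1db11c4041a39db463e810bac8f9
+4af518935a1ec46ef --token eyJhb...``, and the book is downloaded into the download subdirectory.
+
+Required arguments:
+* Link (url): the URL of the book's detail page, e.g. https://ereserves.lib.tsinghua.edu.cn/bookDetail/c01e1db11c4041a39db463e810bac8f9
+* token: press F12 in the browser to open the developer tools and select "Network". Then log in to the reserves platform (the developer tools must be open before you log in). A request of the form "index?token=eyJh..." appears; everything after the equals sign is the token. Copy it and pass it to the program as an argument.
 
-For ordinary single-chapter books, just press Enter to skip the chapter-count prompt.
 
 ### Advanced
 
 #### Automatically unifying page sizes (beta)
 
 Some books have pages of different sizes, which looks untidy, so the optional -r/--auto-resize argument was added to unify page sizes automatically.
 
-For example, the book ``http://reserves.lib.tsinghua.edu.cn/book5//00005348/00005348000/mobile/index.html`` has this size problem, which the -r argument fixes. The full command is: ``python main.py -r http://reserves.lib.tsinghua.edu.cn/book5//00005348/00005348000/mobile/index.html``.
-
-#### Downloading multi-link books: the chapter-count argument
-
-The chapter count is a feature added in v1.2; it actually refers to the number of links. It mainly makes it easier to download the few titles that are split across several links. Pass only the first link as url and enter the actual number of links when prompted for the chapter count.
-
-For example, for the book ``http://reserves.lib.tsinghua.edu.cn/Search/BookDetail?bookId=3cf9814a-33ce-4489-b025-c58140c26263``, find its first link, run ``python main.py http://reserves.lib.tsinghua.edu.cn/book5//00001044/00001044000/index.html``, and enter 5 when prompted for the chapter count.
+## Features
 
-#### About image quality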
+### v3.0 —— 2024/11/7
 
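-v2.0 and later: because the new interface is used, there is only one version of each image, which so far appears to be the highest resolution available. If something looks wrong or you find an interface with higher resolution, please contact the author; thank you. In addition, since v2.1 the size of the PDF at nearly the same quality has been reduced substantially, and the quality can be lowered to shrink the generated PDF further with ``-q [3~10]``: the higher the value, the better the quality, and the default is 10. If you expect to generate the PDF several times to pick a suitable quality, add ``-p`` so the images are not downloaded repeatedly.
+The reserves platform was updated again (it is now ereserves rather than the original reserves), hence the v3.0 release. Because federated authentication requires two-factor authentication (2FA), the script no longer prompts for a username and password; a manually obtained token is passed as an argument instead.
 
-Before v2.0: ``-s {1, 2, 3}`` sets the quality explicitly. In general 1, 2 and 3 correspond to increasing quality, but there are exceptions, so v1.2.1 added automatic quality selection (instead of defaulting to ``-s 3``): when no quality is specified, the highest available one is found and downloaded.
+Other changes in this release:
 
-## Features
+* Fixed the Ctrl+C problem while downloading images;
+* Corrected requirements.txt and the multiprocessing handling code for compatibility with different python versions;
+* Downloaded images are now kept by default; add -d to delete them.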
 
 ### v2.1.3 —— 2023/11/17
 
@@ -144,8 +147,10 @@ v2.0 版本以上：由于使用新接口，只有唯一版本图片，目前测
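 * Thanks to zhaofeng-shu33 and HongYurui for reporting the PyMuPDF version problem, which led to v2.1.1.
 * Thanks to Tsingshanyuan for reporting that some books' page metadata lacks book_name, and to Long-Miao for reporting that some books have non-consecutive chapter numbers, which led to v2.1.2.
 * Thanks to Long-Miao for reporting the page-size consistency problem and submitting an initial PR, which led to v2.1.3.
+* Thanks to baron0426 for analyzing the API of the new ereserves platform and writing the corresponding core code, on which v3.0 is based.
+
 
-Time has been limited lately, so many thanks to everyone who sent feedback, especially zhaofeng-shu33, Tsingshanyuan and Long-Miao, who fixed the problems directly via PRs.
+Time has been limited lately, so many thanks to everyone who sent feedback, especially zhaofeng-shu33, Tsingshanyuan and Long-Miao, who fixed the problems directly via PRs. Special thanks also go to baron0426 for contributing to v3.0.
 
 ## Notes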
 

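The token workflow described in the README ends up, in download_imgs.py, as a `BotuReadKernel` cookie attached to each image request. The sketch below only builds the URL and the cookie dictionary and sends nothing. `build_image_request` and the sample path are hypothetical names used for illustration; whether the cookie value is the `--token` value itself or something derived from it is not visible in this part of the diff.

```python
from urllib.parse import quote

# URL template as it appears in download_imgs.py in this commit.
URL_TEMPLATE = ('https://ereserves.lib.tsinghua.edu.cn/readkernel/JPGFile/'
                'DownJPGJsNetPage?filePath={img_path}')


def build_image_request(img_path, botu_read_kernel):
    """Return (url, cookies) for one page image; no network call is made.

    Hypothetical helper: the commit builds these values inline in download_one.
    """
    # quote() percent-encodes the path but leaves '/' untouched by default.
    url = URL_TEMPLATE.format(img_path=quote(img_path))
    cookies = {'BotuReadKernel': botu_read_kernel}
    return url, cookies


if __name__ == '__main__':
    # Both arguments are placeholders, not real values from the platform.
    url, cookies = build_image_request('/hypothetical dir/0001.jpg',
                                       'placeholder-token')
    print(url)
    print(cookies)
```

The returned pair could then be passed to `requests.get(url, cookies=cookies, timeout=15)`, which mirrors the call in `download_one` with an explicit timeout.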
diff --git a/auth_get.py b/auth_get.py
@@ -1,3 +1,5 @@
+""" DEPRECATED """
+
 # coding:utf-8
 import requests
 import re

diff --git a/download_imgs.py b/download_imgs.py
@@ -2,13 +2,17 @@
 import os
 import sys
 import random
+import signal
 
 import requests
 from multiprocessing.pool import Pool
-from urllib.parse import urljoin
+from multiprocessing import Value
+from urllib.parse import urljoin, quote
 from auth_get import auth_get
 
 
+terminate_flag = Value('b', False)
+
 def randstr(num):
     H = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'
     ret = ''
@@ -25,19 +29,27 @@ def get_tmpname():
     return '.tmp' + randstr(16)
 
 
-def download_one(session, username, password, url, save_dir, filename):
+def download_one(botu_read_kernel, img_path, save_dir, filename):
     """
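     Download a single image.
-    :param session: a Session object
-    :param url: URL to download
+    :param botu_read_kernel: token sent as the BotuReadKernel cookie
+    :param img_path: image path, URL-quoted into the filePath query parameter
     :param save_dir: directory to save into
     :param filename: file name
+    Returns immediately if the module-level terminate_flag is set.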
     :return:
     """
+    with terminate_flag.get_lock(): 
+        if terminate_flag.value:
+            return
+
+    url_template = 'https://ereserves.lib.tsinghua.edu.cn/readkernel/JPGFile/DownJPGJsNetPage?filePath={img_path}'
+
     try:
         save_path = os.path.join(save_dir, filename)
         try:
-            res = auth_get(url, session, username, password, timeout=15)
+            res = requests.get(url_template.format(img_path=quote(img_path)), cookies={'BotuReadKernel': botu_read_kernel}, timeout=15)
+            print(url_template.format(img_path=quote(img_path)))
         except requests.exceptions.Timeout:
             print('请求超时:', filename)
             return
@@ -52,42 +64,65 @@ def download_one(session, username, password, url, save_dir, filename):
         os.rename(tmp_path, save_path)
         print('下载图片成功：' + filename)
     except KeyboardInterrupt:
-        pid = os.getpid()
-        print('子进程 %d 被终止...' % pid)
+        with terminate_flag.get_lock():
+            terminate_flag.value = True
     except Exception as e:
         print(e)
 
 
-def download_imgs(session, username, password, img_urls, page_count, save_dir, processing_num):
+def download_imgs(botu_read_kernel, page_urls, save_dir, processing_num):
     """
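     Download all images of a book.
-    :param session: a Session object
-    :param username: username
-    :param password: password
-    :param img_urls: all image paths to download
-    :param page_count: number of pages
+    :param botu_read_kernel: download token
+    :param page_urls: image paths to download, one list per chapter
     :param save_dir: directory to save into
     :param processing_num: number of processes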
     :return:
     """
     os.makedirs(save_dir, exist_ok=True)
+
+    pool = None
+
+    def terminate_pool(sig, frame):
+        print('terminating')
+        with terminate_flag.get_lock():
+            terminate_flag.value = True
+        if pool is None:
+            sys.exit(0)
+
+
+    signal.signal(signal.SIGINT, terminate_pool)
+    signal.signal(signal.SIGTERM, terminate_pool)
+
     fail = True
-    img_fmt = img_urls[0][img_urls[0].rfind('.')+1:]
+
     try:
         while fail:
-            p = Pool(processing_num)
+            pool = Pool(processing_num)
             fail = False
-            for i, img_url in enumerate(img_urls):
-                filename = '%d.%s' % (i+1, img_fmt)
-                path = os.path.join(save_dir, filename)
-                if os.path.exists(path):
-                    print('已下载：%s, 跳过' % filename)
-                    continue
-                fail = True
-                p.apply_async(download_one, args=(session, username, password, img_url, save_dir, filename))
-            p.close()
-            p.join()
-    except KeyboardInterrupt:
-        print('父进程被终止')
-        pid = os.getpid()
-        os.popen('taskkill.exe /f /pid:%d' % pid)
+            download_names = []
+            for chap_num, img_urls in enumerate(page_urls):
+                for page_num, img_path in enumerate(img_urls):
+                    filename = str(chap_num) + '_' + str(page_num) + '.' + img_path.split('/')[-1].split('.')[-1]
+                    path = os.path.join(save_dir, filename)
+                    if os.path.exists(path):
+                        print('已下载：%s, 跳过' % filename)
+                        continue
+                    fail = True
+                    download_names.append(filename)
+                    pool.apply_async(download_one, args=(botu_read_kernel, img_path, save_dir, filename), error_callback=lambda x: print(x))
+            if len(download_names) != 0:
+                print('即将下载：', end='')
+                for name in download_names:
+                    print(name, end=' ')
+                print(f'\n共需下载图片数：{len(download_names)}')
+            pool.close()
+            pool.join()
+            pool = None
+
+            with terminate_flag.get_lock():
+                if terminate_flag.value:
+                    sys.exit(0)
+    finally:
+        if pool:
+            pool.terminate()
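
The resume behaviour in `download_imgs` comes from the `chapter_page.ext` naming scheme plus the `os.path.exists` check. A minimal sketch of that selection step, pulled out as a pure function so it can be tried without a pool or network access; `pending_downloads` is a hypothetical helper and is not part of this commit.

```python
import os
import tempfile


def pending_downloads(page_urls, save_dir):
    """List (filename, img_path) pairs whose target file does not exist yet.

    page_urls is assumed to be a list of chapters, each a list of image paths,
    matching how download_imgs iterates over it.
    """
    pending = []
    for chap_num, img_urls in enumerate(page_urls):
        for page_num, img_path in enumerate(img_urls):
            # Same naming rule as the commit: "<chapter>_<page>.<extension>".
            ext = img_path.split('/')[-1].split('.')[-1]
            filename = f'{chap_num}_{page_num}.{ext}'
            if not os.path.exists(os.path.join(save_dir, filename)):
                pending.append((filename, img_path))
    return pending


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as save_dir:
        # Placeholder paths; pretend the first page was downloaded earlier.
        page_urls = [['/a/p1.jpg', '/a/p2.jpg'], ['/b/p1.png']]
        open(os.path.join(save_dir, '0_0.jpg'), 'w').close()
        print(pending_downloads(page_urls, save_dir))
```

Because `download_one` writes to a temporary name and renames on success, a file that exists under its final name can be treated as complete, which is what makes this existence check sufficient for resuming.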
diff --git a/img2pdf.py b/img2pdf.py
@@ -26,7 +26,6 @@ def img2pdf(imgs: list, pdf_path: str, quality: int, auto_resize: bool):
         sample_size.setdefault(key, 0)
         sample_size[key] += 1
 
-    # print(sample_size)
     common_size = max(sample_size, key=sample_size.get)
 
     with fitz.open() as doc: