feat(v2.1): greatly reduce PDF size at nearly the same quality, and add a quality option to shrink the PDF further

dylanyang17 · Mar 4, 2021 · d0e00e6
1 parent c969031
Showing 4 changed files with 62 additions and 23 deletions.
diff --git a/README.md b/README.md
@@ -12,7 +12,9 @@
 
 ### Environment
 
-Requires Python 3 and pymupdf: ``pip install pymupdf``.
+Requires Python 3 with pymupdf, requests, and Pillow (PIL): ``pip install pymupdf requests pillow``.
+
+Alternatively, install everything in one step from requirements.txt: ``pip install -r requirements.txt``.
 
 ### Usage
 
@@ -21,9 +23,9 @@ Requires Python 3 and pymupdf: ``pip install pymupdf``.
 Run ``python main.py -h`` to print the help message:
 
 ```
-usage: main.py [-h] [-n N] [-p] url
+usage: main.py [-h] [-n N] [-q Q] [-p] url
 
-Version: v2.0. Download e-book from http://reserves.lib.tsinghua.edu.cn. By
+Version: v2.1. Download e-book from http://reserves.lib.tsinghua.edu.cn. By
 default, the number of processes is four and the temporary images will not be
 preserved. For example, "python main.py http://reserves.lib.tsinghua.edu.cn/bo
 ok5//00004634/00004634000/mobile/index.html".
@@ -34,12 +36,15 @@ positional arguments:
 optional arguments:
   -h, --help      show this help message and exit
   -n N            Optional, [1~16] (4 by default). The number of processes.
+  -q Q            Optional, [3~10] (10 by default). The quality of the
+                  generated PDF. The bigger the value, the higher the
+                  resolution.
   -p, --preserve  Optional. Preserve the temporary images.
 ```
 
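 In most cases no options are needed; the default number of processes is 4. As the help message above shows, run the following from the directory containing main.py: ``python main.py http://reserves.lib.tsinghua.edu.cn/book5//00004634/00004634000/mobile/index.html``. After you enter your username and password (the password is not echoed) and the number of chapters at the prompts, the book is downloaded automatically into the download subdirectory.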
 
-For ordinary books, just press Enter to skip the chapter-count prompt.
+For ordinary single-chapter books, just press Enter to skip the chapter-count prompt.
 
 ### Advanced
 
@@ -49,16 +54,21 @@ optional arguments:
 
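 The chapter count is a feature added in v1.2 and actually refers to the number of links. It exists mainly to make it easier to download the small number of books that are split across several links. In that case, pass only the first link as url and enter the actual number of links when prompted for the chapter count.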
 
-For example, for the book ``http://reserves.lib.tsinghua.edu.cn/Search/BookDetail?bookId=3cf9814a-33ce-4489-b025-c58140c26263``, find its first link, run ``python main.py http://reserves.lib.tsinghua.edu.cn/book5//00004634/00004634000/mobile/index.html``, and enter 5 when prompted for the chapter count.
+For example, for the book ``http://reserves.lib.tsinghua.edu.cn/Search/BookDetail?bookId=3cf9814a-33ce-4489-b025-c58140c26263``, find its first link, run ``python main.py http://reserves.lib.tsinghua.edu.cn/book5//00001044/00001044000/index.html``, and enter 5 when prompted for the chapter count.
 
 #### About image quality
 
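-v2.0: because the new interface is used, only one version of each image is available. Testing so far suggests it is the highest resolution; if something goes wrong or you find an interface serving higher-resolution images, please contact the author. Thanks.
+v2.0 and later: because the new interface is used, only one version of each image is available. Testing so far suggests it is the highest resolution; if something goes wrong or you find an interface serving higher-resolution images, please contact the author. Thanks. In addition, since v2.1 the PDF size at nearly the same quality has been greatly reduced, and the resolution can be lowered to shrink the generated PDF further: use ``-q [3~10]``, where a higher value gives better quality and the default is 10. If you expect to generate the PDF several times to find a suitable quality, add ``-p`` so the image files are not downloaded repeatedly.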
 
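 For versions below v2.0: ``-s {1, 2, 3}`` sets the resolution explicitly. In general 1, 2, and 3 correspond to increasing resolution, but there are exceptions. v1.2.1 therefore added automatic resolution selection (instead of defaulting to ``-s 3``): when no resolution is specified, the highest available one is found and downloaded.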
 
 ## Features
 
+### v2.1 —— 2021/3/4
+
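+* Greatly reduced the size of PDF files generated at nearly the same quality;
+* Added the quality option ``-q [3~10]``, 10 by default (highest quality). Lowering the value reduces the PDF file size at the cost of resolution; if you need to try several values to find a suitable resolution, enable ``-p`` to avoid downloading the image files repeatedly.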
+
 ### v2.0 —— 2021/2/21
 
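 Because the course reserves platform changed its interface, this script has moved to v2.0. Testing so far shows little impact; the affected features are: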
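
A note on the chapter handling described under "Advanced" in the README: main.py derives each chapter's base URL from the first link by incrementing a zero-padded number (the `chap0 + i` loop in the main.py diff). The sketch below illustrates that idea in isolation. It is not the script's actual code: `chapter_base_urls` is a hypothetical helper, and it assumes that the chapter number is the last all-digit path segment of the first link and that chapters are numbered consecutively.

```python
import re


def chapter_base_urls(first_url, count):
    """Return `count` chapter base URLs, starting from the chapter in first_url.

    Hypothetical helper: assumes the chapter number is the last path segment
    that consists only of digits, and that it increases by one per chapter.
    """
    # Collect every all-digit path segment that is followed by another '/'
    matches = list(re.finditer(r'/(\d+)(?=/)', first_url))
    if not matches:
        raise ValueError('no numeric path segment found in %r' % first_url)
    last = matches[-1]
    digits = last.group(1)
    prefix = first_url[:last.start(1)]  # everything up to the chapter number
    width = len(digits)                 # keep the original zero padding
    return [prefix + str(int(digits) + i).zfill(width) + '/'
            for i in range(count)]
```

With the 00001044000 link from the README and a chapter count of 5, this would produce base URLs ending in 00001044000/ through 00001044004/.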

diff --git a/img2pdf.py b/img2pdf.py
@@ -1,20 +1,35 @@
 # coding:utf-8
+import os
 import fitz
+import shutil
+from PIL import Image
 
 
-def img2pdf(imgs, pdf_path):
+def img2pdf(imgs, pdf_path, quality):
     """
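     Generate a PDF from images
     :param imgs: list of image paths (list)
     :param pdf_path: path of the PDF to save (including the file name)
+    :param quality: quality parameter (the CLI passes 3 to 10, 10 by default); each image is scaled to quality/10 of its original width and height
     :return: True on success, False on failure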
     """
+    intermediate_dir = os.path.join(os.path.dirname(pdf_path), 'intermediate')
+    if not os.path.exists(intermediate_dir):
+        os.mkdir(intermediate_dir)
     with fitz.open() as doc:
         page_count = len(imgs)
         for i, img in enumerate(imgs):
-            print('Converting: %d/%d' % (i, page_count))
-            imgdoc = fitz.open(img)
+            print('Converting: %d/%d' % (i+1, page_count))
+            # Write a temporary file at the requested quality
+            tmp_img = os.path.join(intermediate_dir, os.path.basename(img))
+            img_obj = Image.open(img)
+            w, h = img_obj.size
+            img_obj.resize((int(w / 10 * quality), int(h / 10 * quality))).save(tmp_img, "JPEG")
+
+            # Insert into the PDF
+            imgdoc = fitz.open(tmp_img)
             pdfbytes = imgdoc.convertToPDF()
             imgpdf = fitz.open("pdf", pdfbytes)
             doc.insertPDF(imgpdf)
         doc.save(pdf_path)
+    shutil.rmtree(intermediate_dir)
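
The resize step in img2pdf.py scales both dimensions by quality/10, so the pixel count falls with the square of that ratio. A small sketch of the arithmetic follows. `scaled_size` and `pixel_ratio` are hypothetical helpers, not part of img2pdf.py, and the sketch uses integer arithmetic where the diff uses a float expression.

```python
def scaled_size(width, height, quality):
    """Return the (width, height) an image is resized to for a given -q value."""
    if not 3 <= quality <= 10:
        raise ValueError('quality must be in [3, 10]')
    # Integer counterpart of int(w / 10 * quality) from the diff
    return width * quality // 10, height * quality // 10


def pixel_ratio(quality):
    """Fraction of the original pixel count that remains at this quality."""
    return (quality / 10) ** 2


if __name__ == '__main__':
    # Print the target size and remaining pixel fraction for a 1200x1600 page
    for q in (10, 7, 5, 3):
        print(q, scaled_size(1200, 1600, q), round(pixel_ratio(q), 2))
```

For instance, ``-q 5`` halves each dimension and keeps about a quarter of the pixels. The resulting file size also depends on JPEG compression, so treat this only as a rough guide.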
diff --git a/main.py b/main.py
@@ -14,25 +14,32 @@
 def get_input():
     """
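     Get the input parameters
-    :return: [username, password, url, processing_num, del_img, size, links_cnt]
-    where username is the student ID, password the password, url the first link to crawl, processing_num the number of processes, del_img whether to delete the temporary images, and links_cnt the number of links (i.e. chapters)
+    :return: [username, password, url, processing_num, quality, del_img, links_cnt]
+    where username is the student ID, password the password, url the first link to crawl, processing_num the number of processes, quality the PDF quality (higher means a sharper but larger PDF),
+    del_img whether to delete the temporary images, and links_cnt the number of links (i.e. chapters)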
     """
-    parser = argparse.ArgumentParser(description='Version: v2.0. Download e-book from http://reserves.lib.tsinghua.edu.cn. '
+    parser = argparse.ArgumentParser(description='Version: v2.1. Download e-book from http://reserves.lib.tsinghua.edu.cn. '
                                                  'By default, the number of processes is four and the temporary images '
                                                  'will not be preserved. \nFor example, '
                                                  '"python main.py http://reserves.lib.tsinghua.edu.cn/book5//00004634/00004634000/mobile/index.html".')
     parser.add_argument('url')
     parser.add_argument('-n', help='Optional, [1~16] (4 by default). The number of processes.', type=int, default=4)
+    parser.add_argument('-q', help='Optional, [3~10] (10 by default). The quality of the generated PDF. The bigger the value, the higher the resolution.', type=int, default=10)
     parser.add_argument('-p', '--preserve', help='Optional. Preserve the temporary images.', action='store_true')
 
     args = parser.parse_args()
     url = args.url
     processing_num = args.n
+    quality = args.q
     del_img = not args.preserve
     if processing_num not in list(range(1, 17)):
         print('Please check your parameter: -n [1~16]')
         parser.print_usage()
         sys.exit()
+    if quality not in list(range(3, 11)):
+        print('Please check your parameter: -q [3~10]')
+        parser.print_usage()
+        sys.exit()
     print('Student ID:', end='')
     username = input()
     password = getpass.getpass('Password:')
@@ -43,11 +50,11 @@ def get_input():
     if links_cnt <= 0:
         print('There must be one chapter to download at least.')
         sys.exit()
-    return [username, password, url, processing_num, del_img, links_cnt]
+    return [username, password, url, processing_num, quality, del_img, links_cnt]
 
 
 if __name__ == '__main__':
-    username, password, url0, processing_num, del_img, links_cnt = get_input()
+    username, password, url0, processing_num, quality, del_img, links_cnt = get_input()
     js_relpath = 'mobile/javascript/config.js'
     img_relpath = 'files/mobile/'
     candi_fmts = ['jpg', 'png']
@@ -64,7 +71,6 @@ def get_input():
     for i in range(links_cnt):
         url = url0[:st] + ''.join(['0' for _ in range(zero_len)]) + str(chap0 + i) + '/'
         urls.append(url)
-        print('lala:', url)
 
     # Collect the URLs of all images to download and store them in img_urls
     book_name = ''
@@ -91,21 +97,26 @@ def get_input():
 
     print('Title: %s  Total pages: %d' % (book_name, page_cnt))
     save_dir = os.path.join('download', book_name)
-    if os.path.exists(os.path.join(save_dir, book_name + '.pdf')):
-        print('This book has already been downloaded; stopping.')
+    pdf_path = os.path.join(save_dir, book_name + '.pdf')
+    if os.path.exists(pdf_path):
+        print('This book has already been downloaded; stopping.')
         sys.exit()
 
     download_imgs(session, username, password, img_urls, page_cnt, save_dir,
                   processing_num=processing_num)
-    print('Image download complete, starting conversion..')
-    pdf_path = os.path.join(save_dir, book_name + '.pdf')
+    print('Image download complete')
+
+    print('Converting images to PDF... quality: %d' % quality)
     imgs = [os.path.join(save_dir, '%d.%s' % (i, img_fmt)) for i in range(1, page_cnt + 1)]
     if os.path.exists(pdf_path):
         print('PDF already generated, skipping conversion')
     else:
-        img2pdf(imgs, pdf_path)
-        print('PDF generated successfully: ' + book_name + '.pdf')
+        img2pdf(imgs, pdf_path, quality)
+        print('PDF generated successfully: ' + os.path.basename(pdf_path))
+
     if del_img:
-        print('Temporary images cleaned up')
         for img in imgs:
-            os.remove(img)
+            if os.path.exists(img):
+                os.remove(img)
+
+        print('Temporary images cleaned up')
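
The range checks for ``-n`` and ``-q`` in get_input() run after parsing. One possible alternative, sketched here and not taken from the commit, is to let argparse reject out-of-range values through its `type` callback. `int_in_range` and `build_parser` are hypothetical names introduced for illustration; the option names and ranges mirror the ones in the diff.

```python
import argparse


def int_in_range(lo, hi):
    """Build an argparse `type` callable that accepts integers in [lo, hi]."""
    def parse(text):
        value = int(text)  # a ValueError here is reported by argparse as well
        if not lo <= value <= hi:
            raise argparse.ArgumentTypeError(
                'expected an integer in [%d, %d], got %d' % (lo, hi, value))
        return value
    return parse


def build_parser():
    """Hypothetical parser with the same options as main.py's get_input()."""
    parser = argparse.ArgumentParser()
    parser.add_argument('url')
    parser.add_argument('-n', type=int_in_range(1, 16), default=4,
                        help='Optional, [1~16] (4 by default). The number of processes.')
    parser.add_argument('-q', type=int_in_range(3, 10), default=10,
                        help='Optional, [3~10] (10 by default). The quality of the generated PDF.')
    parser.add_argument('-p', '--preserve', action='store_true',
                        help='Optional. Preserve the temporary images.')
    return parser
```

With this approach an invalid value such as ``-q 11`` makes argparse print the usage line plus the error message and exit, so the range lives in one place and the printed hint cannot drift from the actual check.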
diff --git a/requirements.txt b/requirements.txt
@@ -0,0 +1,3 @@
+Pillow==8.1.1
+requests==2.25.1
+PyMuPDF==1.18.9