pdftotext

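DocHub is free and open source. Nothing is sold, so there is no after-sales support. If you run into a problem, please open an issue on GitHub or Gitee so that it is archived and can be reviewed and investigated when time permits. Support requests through any other channel are not accepted. Working hours are taken up by work, and the time after work goes to daily life, rest, study, and improving open-source projects. Thank you for your understanding.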

Purpose

Extracts the text content of PDF files.

Installation

Windows

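No installation is needed on Windows, because I have not found a Windows version so far.

Running without this tool does affect the program, but only slightly, because plain text can also be extracted from PDFs with calibre.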
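
The fallback described above can be sketched as a small shell script. This is only an illustration, not DocHub's actual logic: `pick_extractor` and `extract_text` are hypothetical helper names, and it assumes calibre is used through its `ebook-convert` command-line tool, which picks the output format from the file extension.

```shell
#!/bin/sh
# Hypothetical helper: report which extractor is available on PATH.
# Prefers pdftotext and falls back to calibre's ebook-convert.
pick_extractor() {
    if command -v pdftotext >/dev/null 2>&1; then
        echo "pdftotext"
    elif command -v ebook-convert >/dev/null 2>&1; then
        echo "ebook-convert"
    else
        echo "none"
    fi
}

# Hypothetical helper: extract text from a PDF with whichever tool was found.
# $1: input PDF file, $2: output text file
extract_text() {
    case "$(pick_extractor)" in
        pdftotext)     pdftotext "$1" "$2" ;;
        ebook-convert) ebook-convert "$1" "$2" ;;
        *)             echo "no PDF text extractor found" >&2; return 1 ;;
    esac
}
```

For example, `extract_text example.pdf example.txt` would use pdftotext when it is installed and calibre otherwise.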

Linux

[sudo] apt install poppler-utils

Mac

brew install poppler

Verifying the installation

Run the following command:

pdftotext --help

If you see output like the following, the installation succeeded.

pdftotext --help
------
pdftotext version 0.41.0
Copyright 2005-2016 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
Usage: pdftotext [options] <PDF-file> [<text-file>]
  -f <int>             : first page to convert
  -l <int>             : last page to convert
  -r <fp>              : resolution, in DPI (default is 72)
  -x <int>             : x-coordinate of the crop area top left corner
  -y <int>             : y-coordinate of the crop area top left corner
  -W <int>             : width of crop area in pixels (default is 0)
  -H <int>             : height of crop area in pixels (default is 0)
  -layout              : maintain original physical layout
  -fixed <fp>          : assume fixed-pitch (or tabular) text
  -raw                 : keep strings in content stream order
  -htmlmeta            : generate a simple HTML file, including the meta information
  -enc <string>        : output text encoding name
  -listenc             : list available encodings
  -eol <string>        : output end-of-line convention (unix, dos, or mac)
  -nopgbrk             : don't insert page breaks between pages
  -bbox                : output bounding box for each word and page size to html.  Sets -htmlmeta
  -bbox-layout         : like -bbox but with extra layout bounding box data.  Sets -htmlmeta
  -opw <string>        : owner password (for encrypted files)
  -upw <string>        : user password (for encrypted files)
  -q                   : don't print any messages or errors
  -v                   : print copyright and version info
  -h                   : print usage information
  -help                : print usage information
  --help               : print usage information
  -?                   : print usage information

Testing

Use the following command to test text extraction.

pdftotext -f 1 -l 5 example.pdf example.txt

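If the text extracted into the txt file contains no garbled characters, the extraction succeeded. If it is garbled, check the character encoding and the Chinese fonts.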
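
A rough way to start that troubleshooting is sketched below. `is_valid_utf8` and `check_extraction` are hypothetical helper names. The sketch assumes `iconv` is available and that `pdffonts`, which ships with poppler-utils, is on the PATH. The `-enc` option is the one listed in the help output above.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if the file decodes cleanly as UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Hypothetical helper: re-run the test extraction with an explicit encoding,
# then check the result.
# $1: input PDF file, $2: output text file
check_extraction() {
    pdf="$1"
    txt="$2"
    # Ask pdftotext to write UTF-8 for pages 1 to 5.
    pdftotext -f 1 -l 5 -enc UTF-8 "$pdf" "$txt" || return 1
    if is_valid_utf8 "$txt"; then
        echo "output is valid UTF-8"
    else
        echo "output is not valid UTF-8" >&2
    fi
    # List the fonts the PDF uses so font problems can be inspected.
    pdffonts "$pdf"
}
```

A file that passes the UTF-8 check can still look garbled when the fonts in the PDF carry no usable Unicode mapping, which is why the sketch always prints the `pdffonts` listing. `pdftotext -listenc` shows the encoding names that the installed build accepts for `-enc`.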
