Appendix 2: Considerations on Embedded Files
This chapter provides some background on embedded files support in PyMuPDF.
General
Starting with version 1.4, PDF supports embedding arbitrary files as part (“Embedded File Streams”) of a PDF document file (see chapter “7.11.4 Embedded File Streams”, pp. 103 of the Adobe PDF References).
In many aspects, this is comparable to concepts also found in ZIP files or the OLE technique in MS Windows. PDF embedded files do, however, not support directory structures as does the ZIP format. An embedded file can in turn contain embedded files itself.
Advantages of this concept are that embedded files are under the PDF umbrella, benefitting from its permissions / password protection and integrity aspects: all data, which a PDF may reference or even may be dependent on, can be bundled into it and so form a single, consistent unit of information.
In addition to embedded files, PDF 1.7 adds collections to its support range. This is an advanced way of storing and presenting meta information (i.e. arbitrary and extensible properties) of embedded files.
MuPDF Support
After adding initial support for collections (portfolios) and /EmbeddedFiles in MuPDF version 1.11, this support was dropped again in version 1.15.
As a consequence, the cli utility mutool no longer offers access to embedded files.
PyMuPDF – having implemented an /EmbeddedFiles API in response in its version 1.11.0 – was therefore forced to change gears starting with its version 1.16.0 (we never published a MuPDF v1.15.x compatible PyMuPDF).
We are now maintaining our own code basis supporting embedded files. This code makes use of basic MuPDF dictionary and array functions only.
PyMuPDF Support
We continue to support the full old API with respect to embedded files – with only minor, cosmetic changes.
There even also is a new function, which delivers a list of all names under which embedded data are resgistered in a PDF, Document.embfile_names().