OleFileIO Module

The OleFileIO module reads Microsoft OLE2 files (also called Structured Storage or Microsoft Compound Document File Format), such as Microsoft Office documents, Image Composer and FlashPix files, and Outlook messages.

This module is the OleFileIO_PL project by Philippe Lagadec, v0.42, merged back into Pillow.

How to use this module

For more information, see also the file PIL/OleFileIO.py, the sample code at the end of the module itself, and the docstrings within the code.

About the structure of OLE files

An OLE file can be seen as a mini file system or a Zip archive: it contains streams of data that look like files embedded within the OLE file. Each stream has a name. For example, the main stream of an MS Word document containing its text is named “WordDocument”.

An OLE file can also contain storages. A storage is a folder that contains streams or other storages. For example, an MS Word document with VBA macros has a storage called “Macros”.

Special streams can contain properties. A property is a specific value that can be used to store information such as the metadata of a document (title, author, creation date, etc.). Property stream names usually start with the character ‘\x05’.

For example, a typical MS Word document may look like this:

  \x05DocumentSummaryInformation (stream)
  \x05SummaryInformation (stream)
  WordDocument (stream)
  Macros (storage)
      PROJECT (stream)
      PROJECTwm (stream)
      VBA (storage)
          Module1 (stream)
          ThisDocument (stream)
          _VBA_PROJECT (stream)
          dir (stream)
  ObjectPool (storage)

Test if a file is an OLE container

Use isOleFile to check whether the first bytes of the file contain the magic bytes for OLE files before opening it. isOleFile returns True if it is an OLE file, False otherwise.

  assert OleFileIO.isOleFile('myfile.doc')
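As a rough illustration of what such a check involves, the sketch below compares the first eight bytes of a file-like object against the OLE2 signature. The helper name looks_like_ole is hypothetical and not part of the module; in real code, isOleFile should be preferred.

```python
import io

# The 8-byte signature found at the start of an OLE2 compound file.
OLE_MAGIC = b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1"


def looks_like_ole(fileobj):
    """Return True if a seekable binary file-like object starts with the OLE2 magic.

    Hypothetical helper for illustration only.
    """
    position = fileobj.tell()
    try:
        fileobj.seek(0)
        header = fileobj.read(len(OLE_MAGIC))
    finally:
        fileobj.seek(position)  # leave the caller's read position unchanged
    return header == OLE_MAGIC


fake_ole = io.BytesIO(OLE_MAGIC + b"\x00" * 504)
not_ole = io.BytesIO(b"PK\x03\x04 zip-like header")
print(looks_like_ole(fake_ole), looks_like_ole(not_ole))
```

Restoring the read position means the same object can still be handed to a parser afterwards.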

Open an OLE file from disk

Create an OleFileIO object with the file path as parameter:

  ole = OleFileIO.OleFileIO('myfile.doc')

Open an OLE file from a file-like object

This is useful if the file is not on disk, e.g. already stored in a string or as a file-like object.

  ole = OleFileIO.OleFileIO(f)

For example, the code below reads a file into a string, then uses BytesIO to turn it into a file-like object.

  import io
  with open('myfile.doc', 'rb') as fp:  # close the file once it is read
      data = fp.read()
  f = io.BytesIO(data)  # or StringIO.StringIO for Python 2.x
  ole = OleFileIO.OleFileIO(f)

How to handle malformed OLE files

By default, the parser is configured to be as robust and permissive as possible, so that it can parse most malformed OLE files. Only fatal errors will raise an exception. It is possible to tell the parser to be more strict in order to raise exceptions for files that do not fully conform to the OLE specifications, using the raise_defects option:

  ole = OleFileIO.OleFileIO('myfile.doc', raise_defects=OleFileIO.DEFECT_INCORRECT)

When the parsing is done, the list of non-fatal issues detected is available as a list in the parsing_issues attribute of the OleFileIO object:

  print('Non-fatal issues raised during parsing:')
  if ole.parsing_issues:
      for exctype, msg in ole.parsing_issues:
          print('- %s: %s' % (exctype.__name__, msg))
  else:
      print('None')
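The interaction between raise_defects and parsing_issues can be pictured with a small stand-in class. This is only a sketch: TolerantParser and its report method are hypothetical, and the numeric levels 10, 20 and 30 are assumptions (only the default of 40 appears in the class signature in this document).

```python
# Assumed numeric levels; only the default value 40 is shown in this document.
DEFECT_UNSURE = 10
DEFECT_POTENTIAL = 20
DEFECT_INCORRECT = 30
DEFECT_FATAL = 40


class TolerantParser:
    """Hypothetical stand-in showing how a raise_defects threshold can work."""

    def __init__(self, raise_defects=DEFECT_FATAL):
        self.raise_defects = raise_defects
        self.parsing_issues = []

    def report(self, defect_level, message, exception_type=IOError):
        # At or above the threshold the defect is raised as an exception;
        # below it, the defect is only recorded for later inspection.
        if defect_level >= self.raise_defects:
            raise exception_type(message)
        self.parsing_issues.append((exception_type, message))


permissive = TolerantParser()
permissive.report(DEFECT_INCORRECT, "incorrect sector count")
for exctype, msg in permissive.parsing_issues:
    print("- %s: %s" % (exctype.__name__, msg))

strict = TolerantParser(raise_defects=DEFECT_INCORRECT)
try:
    strict.report(DEFECT_INCORRECT, "incorrect sector count")
except IOError as exc:
    print("raised:", exc)
```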

Syntax for stream and storage path

Two different syntaxes are allowed for methods that need or return the path of streams and storages:

  • Either a list of strings including all the storages from the root up to the stream/storage name. For example, a stream called “WordDocument” at the root has [‘WordDocument’] as its full path. A stream called “ThisDocument” located in the storage “Macros/VBA” is [‘Macros’, ‘VBA’, ‘ThisDocument’]. This is the original syntax from PIL. While hard to read and not very convenient, this syntax works in all cases.
  • Or a single string with slashes to separate storage and stream names (similar to the Unix path syntax). The previous examples would be ‘WordDocument’ and ‘Macros/VBA/ThisDocument’. This syntax is easier, but may fail if a stream or storage name contains a slash.

Both are case-insensitive.

Switching between the two is easy:

  slash_path = '/'.join(list_path)
  list_path = slash_path.split('/')
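When a function has to accept either syntax, the two conversions can be wrapped in small helpers. The names to_list_path and to_slash_path below are hypothetical and only illustrate the idea; they are not part of the module.

```python
def to_list_path(path):
    """Return the list form of a path given in either syntax (hypothetical helper)."""
    if isinstance(path, str):
        return path.split("/")
    return list(path)


def to_slash_path(path):
    """Return the slash-separated form of a path given in either syntax."""
    if isinstance(path, str):
        return path
    return "/".join(path)


print(to_list_path("Macros/VBA/ThisDocument"))
print(to_slash_path(["Macros", "VBA", "ThisDocument"]))
```

As noted above, a name that itself contains a slash cannot survive the round trip through the string form, so the list form is the safer internal representation.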

Get the list of streams

listdir() returns a list of all the streams contained in the OLE file, including those stored in storages. Each stream is itself listed as a list, as described above.

  print(ole.listdir())

Sample result:

  [['\x01CompObj'], ['\x05DocumentSummaryInformation'], ['\x05SummaryInformation'],
   ['1Table'], ['Macros', 'PROJECT'], ['Macros', 'PROJECTwm'],
   ['Macros', 'VBA', 'Module1'], ['Macros', 'VBA', 'ThisDocument'],
   ['Macros', 'VBA', '_VBA_PROJECT'], ['Macros', 'VBA', 'dir'],
   ['ObjectPool'], ['WordDocument']]
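A flat result like this can be folded back into a tree for display. The sketch below works on a hand-written subset of the sample result rather than on a real file; build_tree and render are hypothetical helper names.

```python
def build_tree(paths):
    """Fold a list of list-style paths into nested dictionaries."""
    tree = {}
    for path in paths:
        node = tree
        for name in path:
            node = node.setdefault(name, {})
    return tree


def render(tree, depth=0):
    """Return one indented line per entry, children listed below their parent."""
    lines = []
    for name in sorted(tree):
        lines.append("    " * depth + name)
        lines.extend(render(tree[name], depth + 1))
    return lines


# Subset of the sample result above, written out by hand.
listing = [
    ["1Table"],
    ["Macros", "PROJECT"],
    ["Macros", "VBA", "Module1"],
    ["Macros", "VBA", "dir"],
    ["WordDocument"],
]
print("\n".join(render(build_tree(listing))))
```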

As an option, it is possible to choose whether storages should also be listed, with or without streams:

  ole.listdir(streams=False, storages=True)

Test if known streams/storages exist

exists(path) checks if a given stream or storage exists in the OLE file.

  if ole.exists('worddocument'):
      print("This is a Word document.")
      if ole.exists('macros/vba'):
          print("This document seems to contain VBA macros.")
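To make the case-insensitive matching concrete, the sketch below performs a comparable lookup over a listdir()-style result. It is not the module's implementation; path_exists is a hypothetical helper that treats a storage as existing when its path is a prefix of some stream path.

```python
def path_exists(listing, path):
    """Case-insensitive lookup of a slash-separated path in a listdir()-style list."""
    wanted = [part.lower() for part in path.split("/")]
    for entry in listing:
        candidate = [part.lower() for part in entry]
        # Matches the stream itself, or any storage on the way to it.
        if candidate[:len(wanted)] == wanted:
            return True
    return False


listing = [["Macros", "VBA", "dir"], ["WordDocument"]]
print(path_exists(listing, "worddocument"))
print(path_exists(listing, "macros/vba"))
print(path_exists(listing, "macros/missing"))
```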

Read data from a stream

openstream(path) opens a stream as a file-like object.

The following example extracts the “Pictures” stream from a PPT file:

  pics = ole.openstream('Pictures')
  data = pics.read()
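For large streams, reading in fixed-size chunks avoids holding the whole stream in memory twice. The sketch below uses a BytesIO object as a stand-in for the object returned by openstream; copy_in_chunks is a hypothetical helper.

```python
import io


def copy_in_chunks(source, destination, chunk_size=8192):
    """Copy a file-like object to another in chunks and return the byte count."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        destination.write(chunk)
        total += len(chunk)
    return total


stream = io.BytesIO(b"\x00" * 20000)  # stand-in for ole.openstream('Pictures')
output = io.BytesIO()                 # stand-in for a file opened with 'wb'
copied = copy_in_chunks(stream, output)
print(copied)
```

The standard library function shutil.copyfileobj performs the same loop when the byte count is not needed.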

Get information about a stream/storage

Several methods can provide the size, type and timestamps of a given stream/storage:

get_size(path) returns the size of a stream in bytes:

  s = ole.get_size('WordDocument')

get_type(path) returns the type of a stream/storage, as one of the following constants: STGTY_STREAM for a stream, STGTY_STORAGE for a storage, STGTY_ROOT for the root entry, and False for a non-existent path.

  t = ole.get_type('WordDocument')

getctime(path) and getmtime(path) return the creation and modification timestamps of a stream/storage, as a Python datetime object with UTC timezone. Please note that these timestamps are only present if the application that created the OLE file explicitly stored them, which is rarely the case. When not present, these methods return None.

  c = ole.getctime('WordDocument')
  m = ole.getmtime('WordDocument')

The root storage is a special case: you can get its creation and modification timestamps using the OleFileIO.root attribute:

  c = ole.root.getctime()
  m = ole.root.getmtime()

Extract metadata

get_metadata() checks whether the standard property streams exist, parses all the properties they contain, and returns an OleMetadata object with the properties found as attributes.

  meta = ole.get_metadata()
  print('Author:', meta.author)
  print('Title:', meta.title)
  print('Creation date:', meta.create_time)
  # print all metadata:
  meta.dump()

Available attributes include:

  codepage, title, subject, author, keywords, comments, template,
  last_saved_by, revision_number, total_edit_time, last_printed, create_time,
  last_saved_time, num_pages, num_words, num_chars, thumbnail,
  creating_application, security, codepage_doc, category, presentation_target,
  bytes, lines, paragraphs, slides, notes, hidden_slides, mm_clips,
  scale_crop, heading_pairs, titles_of_parts, manager, company, links_dirty,
  chars_with_spaces, unused, shared_doc, link_base, hlinks, hlinks_changed,
  version, dig_sig, content_type, content_status, language, doc_version

See the source code of the OleMetadata class for more information.

Parse a property stream

getproperties(path) can be used to parse any property stream that is not handled by get_metadata. It returns a dictionary indexed by integers. Each integer is the index of the property, pointing to its value. For example, in the standard property stream ‘\x05SummaryInformation’, the document title is property #2, and the subject is #3.

  p = ole.getproperties('specialprops')

By default, as in the original PIL version, timestamp properties are converted into a number of seconds since Jan 1, 1601. With the option convert_time, you can obtain more convenient Python datetime objects (UTC timezone). If some time properties should not be converted (such as total editing time in ‘\x05SummaryInformation’), the list of indexes can be passed as no_conversion:

  p = ole.getproperties('specialprops', convert_time=True, no_conversion=[10])
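The relationship between the two representations can be sketched as follows. A Windows FILETIME counts 100-nanosecond intervals since Jan 1, 1601 (UTC). The helper names are hypothetical, and attaching an explicit UTC tzinfo is a choice made for this sketch, not a statement about what the module returns.

```python
from datetime import datetime, timedelta, timezone

# Start of the FILETIME epoch, expressed as an aware UTC datetime.
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def filetime_to_seconds(filetime):
    """Whole seconds since Jan 1, 1601 for a FILETIME value (100 ns units)."""
    return filetime // 10_000_000


def filetime_to_datetime(filetime):
    """Convert a FILETIME value to a datetime, truncated to microseconds."""
    return FILETIME_EPOCH + timedelta(microseconds=filetime // 10)


unix_epoch_as_filetime = 116444736000000000
print(filetime_to_seconds(unix_epoch_as_filetime))
print(filetime_to_datetime(unix_epoch_as_filetime))
```

A duration such as total editing time is stored in the same units but is not a point in time, which is why it is excluded through no_conversion.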

Close the OLE file

Unless your application is a simple script that terminates after processing an OLE file, do not forget to close each OleFileIO object after parsing to close the file on disk.

  ole.close()
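One way to make sure close() is always called, even when parsing raises an exception, is contextlib.closing from the standard library, which works with any object that has a close() method. The FakeOle class below is a hypothetical stand-in so that the sketch runs without a real file.

```python
from contextlib import closing


class FakeOle:
    """Hypothetical stand-in for an OleFileIO object; it only tracks close()."""

    def __init__(self):
        self.closed = False

    def listdir(self):
        return [["WordDocument"]]

    def close(self):
        self.closed = True


# In real code the argument would be OleFileIO.OleFileIO('myfile.doc').
with closing(FakeOle()) as ole:
    streams = ole.listdir()

print(streams, ole.closed)
```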

Use OleFileIO as a script

OleFileIO can also be used as a script from the command line to display the structure of an OLE file and its metadata, for example:

  PIL/OleFileIO.py myfile.doc

You can use the option -c to check that all streams can be read fully, and -d to generate very verbose debugging information.

How to contribute

The code is available in a Mercurial repository on Bitbucket. You may use it to submit enhancements or to report any issue.

If you would like to help us improve this module, or simply provide feedback, please contact me. You can help in many ways:

  • test this module on different platforms / Python versions
  • find and report bugs
  • improve documentation, code samples, docstrings
  • write unittest test cases
  • provide tricky malformed files

How to report bugs

To report a bug, for example a normal file which is not parsed correctly, please use the issue reporting page, or if you prefer to do it privately, use this contact form. Please provide all the information about the context and how to reproduce the bug.

If possible, please attach the debugging output of OleFileIO. To produce it, run the following command:

  PIL/OleFileIO.py -d -c file >debug.txt

Classes and Methods

  • class PIL.OleFileIO.OleFileIO(filename=None, raise_defects=40, write_mode=False, debug=False, path_encoding=None)
  • Bases: object

OLE container object

This class encapsulates the interface to an OLE 2 structured storage file. Use the listdir() and openstream() methods to access the contents of this file.

Object names are given as a list of strings, one for each subentry level. The root entry should be omitted. For example, the following code extracts all image streams from a Microsoft Image Composer file:

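  from PIL.OleFileIO import OleFileIO

  ole = OleFileIO("fan.mic")

  for entry in ole.listdir():
      # entry is a list of names, so compare the slice against a list
      if entry[1:2] == ["Image"]:
          fin = ole.openstream(entry)
          # name the output file after the top-level storage
          with open(entry[0], "wb") as fout:
              while True:
                  s = fin.read(8192)
                  if not s:
                      break
                  fout.write(s)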

You can use the viewer application provided with the Python Imaging Library to view the resulting files (which happen to be standard TIFF files).

  • close()
  • Close the OLE file, to release the file object.

  • dumpdirectory()
  • Dump directory (for debugging only).

  • dumpfat(fat, firstindex=0)
  • Displays a part of the FAT in human-readable form, for debugging purposes.

  • dumpsect(sector, firstindex=0)
  • Displays a sector in human-readable form, for debugging purposes.

  • exists(filename)
  • Test if given filename exists as a stream or a storage in the OLE container. Note: filename is case-insensitive.

Parameters: filename – path of stream in storage tree (see openstream for syntax).
Returns: True if the object exists, else False.

  • get_metadata()
  • Parse standard properties streams, return an OleMetadata object containing all the available metadata (also stored in the metadata attribute of the OleFileIO object).

New in version 0.25.

  • get_rootentry_name()
  • Return root entry name. Should usually be ‘Root Entry’ or ‘R’ in most implementations.

  • get_size(filename)
  • Return size of a stream in the OLE container, in bytes.

Parameters: filename – path of stream in storage tree (see openstream for syntax).
Returns: size in bytes (long integer).
Raises:

  - IOError if file not found
  - TypeError if this is not a stream

  • get_type(filename)
  • Test if given filename exists as a stream or a storage in the OLE container, and return its type.

Parameters: filename – path of stream in storage tree (see openstream for syntax).
Returns: False if object does not exist, its entry type (>0) otherwise:

  - STGTY_STREAM: a stream
  - STGTY_STORAGE: a storage
  - STGTY_ROOT: the root entry

  • getctime(filename)
  • Return creation time of a stream/storage.

Parameters: filename – path of stream/storage in storage tree (see openstream for syntax).
Returns: None if creation time is null, a Python datetime object otherwise (UTC timezone).

New in version 0.26.

  • getmtime(filename)
  • Return modification time of a stream/storage.

Parameters: filename – path of stream/storage in storage tree (see openstream for syntax).
Returns: None if modification time is null, a Python datetime object otherwise (UTC timezone).

New in version 0.26.

  • getproperties(filename, convert_time=False, no_conversion=None)
  • Return properties described in substream.

Parameters:

  - filename – path of stream in storage tree (see openstream for syntax)
  - convert_time – bool, if True timestamps will be converted to Python datetime
  - no_conversion – None or list of int, timestamps not to be converted (for example, total editing time is not a real timestamp)

Returns: a dictionary of values indexed by id (integer).

  • getsect(sect)
  • Read given sector from file on disk.

Parameters: sect – int, sector index.
Returns: a string containing the sector data.

  • listdir(streams=True, storages=False)
  • Return a list of streams and/or storages stored in this file.

Parameters:

  - streams – bool, include streams if True (True by default); new in v0.26
  - storages – bool, include storages if True (False by default); new in v0.26 (note: the root storage is never included)

Returns: list of stream and/or storage paths.

  • loaddirectory(sect)
  • Load the directory.

Parameters: sect – sector index of directory stream.

  • loadfat(header)
  • Load the FAT table.

  • loadfat_sect(sect)
  • Adds the indexes of the given sector to the FAT.

Parameters: sect – string containing the first FAT sector, or array of long integers.
Returns: index of last FAT sector.

  • loadminifat()
  • Load the MiniFAT table.

  • open(filename, write_mode=False)
  • Open an OLE2 file in read-only or read/write mode. Read and parse the header, FAT and directory.

Parameters:

  - filename – string-like or file-like object, OLE file to parse:
      - if filename is a string smaller than 1536 bytes, it is the path of the file to open (bytes or unicode string)
      - if filename is a string longer than 1535 bytes, it is parsed as the content of an OLE file in memory (bytes type only)
      - if filename is a file-like object (with read, seek and tell methods), it is parsed as-is
  - write_mode – bool, if True the file is opened in read/write mode instead of read-only by default (ignored if filename is not a path)
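
The three cases described for the filename argument amount to a small dispatch on type and length. The sketch below only classifies its input and does not open anything; classify_source and the constant name are hypothetical, with the 1536-byte threshold taken from the description above.

```python
import io

# Threshold quoted in the parameter description above.
MINIMAL_OLEFILE_SIZE = 1536


def classify_source(filename):
    """Return which of the three documented cases an argument falls into."""
    if hasattr(filename, "read"):
        return "file-like"
    if isinstance(filename, bytes) and len(filename) >= MINIMAL_OLEFILE_SIZE:
        return "content"
    return "path"


print(classify_source(io.BytesIO(b"")))
print(classify_source(b"\x00" * 2048))
print(classify_source("myfile.doc"))
```

A consequence of this rule is that in-memory data shorter than 1536 bytes is treated as a path, so wrapping such data in BytesIO is the unambiguous way to pass it.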

  • openstream(filename)
  • Open a stream as a read-only file object (BytesIO). Note: filename is case-insensitive.

Parameters: filename – path of stream in storage tree (except root entry), either:

  - a string using Unix path syntax, for example: ‘storage_1/storage_1.2/stream’
  - or a list of storage filenames, path to the desired stream/storage. Example: [‘storage_1’, ‘storage_1.2’, ‘stream’]

Returns: file object (read-only).
Raises: IOError if filename not found, or if this is not a stream.

  • raise_defect(defect_level, message, exception_type=IOError)
  • This method should be called for any defect found during file parsing. It may raise an IOError exception according to the minimal level chosen for the OleFileIO object.

Parameters:

  - defect_level – defect level, possible values are:
      - DEFECT_UNSURE: a case which looks weird, but not sure it's a defect
      - DEFECT_POTENTIAL: a potential defect
      - DEFECT_INCORRECT: an error according to specifications, but parsing can go on
      - DEFECT_FATAL: an error which cannot be ignored, parsing is impossible
  - message – string describing the defect, used with raised exception
  - exception_type – exception class to be raised, IOError by default

  • sect2array(sect)
  • Convert a sector to an array of 32-bit unsigned integers, swapping bytes on big-endian CPUs such as PowerPC (old Macs).

  • write_sect(sect, data, padding=b'\x00')
  • Write given sector to file on disk.

Parameters:

  - sect – int, sector index
  - data – bytes, sector data
  - padding – single byte, padding character if data < sector size

  • write_stream(stream_name, data)
  • Write a stream to disk. For now, it is only possible to replace an existing stream by data of the same size.

Parameters:

  - stream_name – path of stream in storage tree (except root entry), either:
      - a string using Unix path syntax, for example: ‘storage_1/storage_1.2/stream’
      - or a list of storage filenames, path to the desired stream/storage. Example: [‘storage_1’, ‘storage_1.2’, ‘stream’]
  - data – bytes, data to be written, must be the same size as the original stream

  • class PIL.OleFileIO.OleMetadata
  • Bases: object

Class to parse and store metadata from standard properties of OLE files.

Available attributes: codepage, title, subject, author, keywords, comments, template, last_saved_by, revision_number, total_edit_time, last_printed, create_time, last_saved_time, num_pages, num_words, num_chars, thumbnail, creating_application, security, codepage_doc, category, presentation_target, bytes, lines, paragraphs, slides, notes, hidden_slides, mm_clips, scale_crop, heading_pairs, titles_of_parts, manager, company, links_dirty, chars_with_spaces, unused, shared_doc, link_base, hlinks, hlinks_changed, version, dig_sig, content_type, content_status, language, doc_version

Note: an attribute is set to None when not present in the properties of the OLE file.

References for SummaryInformation stream:

  - https://msdn.microsoft.com/en-us/library/dd942545.aspx
  - https://msdn.microsoft.com/en-us/library/dd925819%28v=office.12%29.aspx
  - https://msdn.microsoft.com/en-us/library/windows/desktop/aa380376%28v=vs.85%29.aspx
  - https://msdn.microsoft.com/en-us/library/aa372045.aspx
  - http://sedna-soft.de/articles/summary-information-stream/
  - http://poi.apache.org/apidocs/org/apache/poi/hpsf/SummaryInformation.html

References for DocumentSummaryInformation stream:

  - https://msdn.microsoft.com/en-us/library/dd945671%28v=office.12%29.aspx
  - https://msdn.microsoft.com/en-us/library/windows/desktop/aa380374%28v=vs.85%29.aspx
  - http://poi.apache.org/apidocs/org/apache/poi/hpsf/DocumentSummaryInformation.html

New in version 0.25.

  • DOCSUM_ATTRIBS = ['codepage_doc', 'category', 'presentation_target', 'bytes', 'lines', 'paragraphs', 'slides', 'notes', 'hidden_slides', 'mm_clips', 'scale_crop', 'heading_pairs', 'titles_of_parts', 'manager', 'company', 'links_dirty', 'chars_with_spaces', 'unused', 'shared_doc', 'link_base', 'hlinks', 'hlinks_changed', 'version', 'dig_sig', 'content_type', 'content_status', 'language', 'doc_version']
  • SUMMARY_ATTRIBS = ['codepage', 'title', 'subject', 'author', 'keywords', 'comments', 'template', 'last_saved_by', 'revision_number', 'total_edit_time', 'last_printed', 'create_time', 'last_saved_time', 'num_pages', 'num_words', 'num_chars', 'thumbnail', 'creating_application', 'security']

  • dump()
  • Dump all metadata, for debugging purposes.

  • parse_properties(olefile)
  • Parse standard properties of an OLE file, from the streams “SummaryInformation” and “DocumentSummaryInformation”, if present. Properties are converted to strings, integers or Python datetime objects. If a property is not present, its value is set to None.

  • PIL.OleFileIO.debug(msg)

  • PIL.OleFileIO.filetime2datetime(filetime)
  • Convert FILETIME (64-bit int) to Python datetime.datetime.

  • PIL.OleFileIO.i16(c, o=0)
  • Converts a 2-byte (16-bit) string to an integer.

c: string containing bytes to convert
o: offset of bytes to convert in string

  • PIL.OleFileIO.i32(c, o=0)
  • Converts a 4-byte (32-bit) string to an integer.

c: string containing bytes to convert
o: offset of bytes to convert in string
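
Equivalent conversions can be written with the standard struct module. This is a sketch of the idea, not the module's own code; it assumes little-endian byte order, which is what OLE files use, and unsigned results.

```python
import struct


def i16(data, offset=0):
    """Read an unsigned 16-bit little-endian integer at the given offset."""
    return struct.unpack_from("<H", data, offset)[0]


def i32(data, offset=0):
    """Read an unsigned 32-bit little-endian integer at the given offset."""
    return struct.unpack_from("<I", data, offset)[0]


header = b"\x34\x12\x78\x56\x34\x12"
print(hex(i16(header)), hex(i32(header, 2)))
```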

  • PIL.OleFileIO.isOleFile(filename)
  • Test if a file is an OLE container (according to the magic bytes in its header).

Parameters: filename – string-like or file-like object, OLE file to parse:

  - if filename is a string smaller than 1536 bytes, it is the path of the file to open (bytes or unicode string)
  - if filename is a string longer than 1535 bytes, it is parsed as the content of an OLE file in memory (bytes type only)
  - if filename is a file-like object (with read and seek methods), it is parsed as-is

Returns: True if OLE, False otherwise.

  • PIL.OleFileIO.set_debug_mode(debug_mode)
  • Set debug mode on or off, to control display of debugging messages.

Parameters: debug_mode – True or False.