==========
openxmllib
==========
openxmllib is a set of tools for working with the ECMA 376 office file
formats known as OpenXML.
http://www.ecma-international.org/publications/standards/Ecma-376.htm
OpenXML is the format currently used by Microsoft Office 2007. Apple iWork '08
and OpenOffice 2.2 also have filters for this format.
Features
========
Tested features
---------------
* Extract words from a document for indexing purposes.
* Get metadata from a document.
* Add OpenXML MIME types to the standard ``mimetypes`` module.
Planned features
----------------
* Transform a document to HTML
Public API
==========
These examples say it all::
>>> import openxmllib
>>> doc = openxmllib.openXmlDocument(path='office.docx')
>>> # Raises a ValueError on unsupported office files.
>>> doc.mimeType
'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
>>> doc.coreProperties # Keys may depend on application
{'title': u'blah...', u'creator': u'John Doe', ...}
>>> doc.extendedProperties # Keys may depend on application
{'Words': u'312', 'Application': u'Your favorite word processor', ...}
>>> doc.customProperties # May return an empty mapping
{'My property': u'My value', ...}
>>> doc.allProperties # Merges core+extended+custom properties (see above)
{...}
>>> doc.indexableText(include_properties=False)
u'all the words of that document body'
>>> doc.indexableText(include_properties=True)
u'all the words of that document body and all properties values'
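For readers curious about what these properties correspond to on disk: an
OpenXML file is a ZIP package whose parts are XML files. The sketch below is
not openxmllib's implementation (which relies on lxml). It uses only the
standard library and assumes the core properties live at the conventional
``docProps/core.xml`` location and the document body at ``word/document.xml``.
The helper names ``read_core_properties`` and ``read_body_text`` are
hypothetical, and the package is a tiny in-memory stand-in, not a real
``.docx`` file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Conventional part names inside a WordprocessingML package (assumption:
# a real package may place them elsewhere via its relationship parts).
CORE_PART = 'docProps/core.xml'
BODY_PART = 'word/document.xml'
W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'


def read_core_properties(file_obj):
    """Return a {local tag name: text} dict for the core properties part."""
    with zipfile.ZipFile(file_obj) as package:
        root = ET.fromstring(package.read(CORE_PART))
    props = {}
    for child in root:
        # '{namespace}title' -> 'title'
        name = child.tag.rsplit('}', 1)[-1]
        props[name] = child.text or ''
    return props


def read_body_text(file_obj):
    """Join the text of every w:t element of the document body."""
    with zipfile.ZipFile(file_obj) as package:
        root = ET.fromstring(package.read(BODY_PART))
    runs = root.iter('{%s}t' % W_NS)
    return ' '.join(node.text for node in runs if node.text)


# Build a minimal stand-in package in memory for demonstration purposes.
CORE_XML = (
    '<cp:coreProperties xmlns:cp="http://schemas.openxmlformats.org/'
    'package/2006/metadata/core-properties" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/">'
    '<dc:title>blah</dc:title><dc:creator>John Doe</dc:creator>'
    '</cp:coreProperties>'
)
BODY_XML = (
    '<w:document xmlns:w="%s"><w:body>'
    '<w:p><w:r><w:t>all the words</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>of that document</w:t></w:r></w:p>'
    '</w:body></w:document>' % W_NS
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as package:
    package.writestr(CORE_PART, CORE_XML)
    package.writestr(BODY_PART, BODY_XML)
data = buffer.getvalue()

core = read_core_properties(io.BytesIO(data))
text = read_body_text(io.BytesIO(data))
```

In real use, openxmllib hides these details behind ``coreProperties`` and
``indexableText()``, and also covers spreadsheet and presentation documents.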
Standard ``mimetypes`` package extensions ::
>>> import mimetypes
>>> mimetypes.guess_type('somedoc.docx')
('application/vnd.openxmlformats-officedocument.wordprocessingml.document', None)
>>> mimetypes.guess_type('somecalc.xlsx')
('application/vnd.openxmlformats-officedocument.spreadsheetml.sheet', None)
>>> mimetypes.guess_type('someslides.pptx')
('application/vnd.openxmlformats-officedocument.presentationml.presentation', None)
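The same kind of registration can be sketched with the standard library alone.
This is only an illustration of ``mimetypes.add_type``, not necessarily how
openxmllib performs the registration, and the ``OPENXML_TYPES`` mapping is a
name made up for this example.

```python
import mimetypes

# Hypothetical mapping, limited to the three types shown above.
OPENXML_TYPES = {
    '.docx': 'application/vnd.openxmlformats-officedocument'
             '.wordprocessingml.document',
    '.xlsx': 'application/vnd.openxmlformats-officedocument'
             '.spreadsheetml.sheet',
    '.pptx': 'application/vnd.openxmlformats-officedocument'
             '.presentationml.presentation',
}

# Register each extension with the module-level MIME type database.
for extension, mime_type in OPENXML_TYPES.items():
    mimetypes.add_type(mime_type, extension)

# guess_type() returns a (type, encoding) tuple.
guessed_type, guessed_encoding = mimetypes.guess_type('somecalc.xlsx')
```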
Document factory signatures::
>>> # We have the path for the office file
>>> doc = openxmllib.openXmlDocument(path='office.docx')
>>> # We have a file object for the office file
>>> fh = open('office.docx', 'rb')
>>> doc = openxmllib.openXmlDocument(file_=fh)
>>> # We have the URL for the office file
>>> doc = openxmllib.openXmlDocument(url='http://domain.tld/office.docx')
>>> # We have the raw data of the office file
>>> import mimetypes
>>> # guess_type() returns a (type, encoding) tuple; keep only the type
>>> docx_mimetype = mimetypes.guess_type('office.docx')[0]
>>> body = open('office.docx', 'rb').read()
>>> doc = openxmllib.openXmlDocument(data=body, mime_type=docx_mimetype)
Note that if you are not writing a Python application, you can still get the
indexable text from a document with the `openxmlinfo.py` console utility. Just type::
$ openxmlinfo --help
Copying and License
===================
Copyright (c) 2008 Gilles Lenfant
This software is subject to the provisions of the GNU General Public
License, Version 2.0 (GPL). A copy of the GPL should accompany this
distribution. THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL
EXPRESS OR IMPLIED WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF TITLE, MERCHANTABILITY,
AGAINST INFRINGEMENT, AND FITNESS FOR A PARTICULAR PURPOSE
More details in the ``COPYING`` file included in this package.
Status
======
This software is of production quality. It has been tested on Mac OS X, Linux
and Windows with Python 2.4 and Python 2.5, and with lxml 1.3.6 and lxml 2.2.
It should work on other platforms and with Python 2.6, and perhaps with
other versions of lxml.
Installation
============
Using the usual setuptools command::
$ [sudo] easy_install openxmllib
Note that this will also install the excellent `lxml` egg if it is not already
installed.
From now on you can "import openxmllib" in your Python apps and use the
"openxmlinfo" command line utility.
Gotchas
=======
Be aware that most text data returned by the various openxmllib
services may be either us-ascii byte strings or Unicode strings. This is
a side effect of lxml (bug or feature?). It is up to your application to
convert this text to the appropriate charset.
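One way to cope with this is to normalize every value before using it. The
helper below is a minimal sketch and not part of openxmllib: ``to_text`` is a
hypothetical name, and the default encoding assumes that any byte string
returned is plain us-ascii, as described above.

```python
def to_text(value, encoding='us-ascii'):
    """Return value as a Unicode string, decoding it if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Both kinds of input end up as the same Unicode type.
from_bytes = to_text(b'312')
from_unicode = to_text(u'\xe9t\xe9')
```

Once everything is Unicode, the application can encode it to whatever charset
it needs, for example with ``to_text(value).encode('utf-8')``.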
openxmllib does not currently handle exceptions caused by malformed XML or
unexpected document structures. You should guard against these (potential)
problems with a ``try ... except`` block in your application.
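A defensive wrapper could look like the sketch below. It is only an
illustration: ``safe_indexable_text`` is a hypothetical helper, the factory is
passed in as a parameter so that the example runs without openxmllib
installed, and the broad ``except Exception`` clause is an assumption, since
the exact exception types raised on malformed documents are not documented
here. Only the ``ValueError`` for unsupported files comes from the API
description above.

```python
import logging

logger = logging.getLogger(__name__)


def safe_indexable_text(open_document, path):
    """Return the indexable text of the file at path, or None on failure.

    open_document is any callable accepting path=... and returning an
    object with an indexableText() method.
    """
    try:
        doc = open_document(path=path)
        return doc.indexableText(include_properties=True)
    except ValueError:
        # Documented behaviour for unsupported office files.
        logger.warning('Not a supported office file: %s', path)
    except Exception:
        # Malformed XML or unexpected structure; exact types are unknown.
        logger.exception('Could not process %s', path)
    return None


# Stand-ins used only to exercise the wrapper in this example.
class FakeDocument(object):
    def indexableText(self, include_properties=True):
        return u'some words'


def fake_opener(path):
    if not path.endswith('.docx'):
        raise ValueError('unsupported file')
    return FakeDocument()


ok_text = safe_indexable_text(fake_opener, 'office.docx')
bad_text = safe_indexable_text(fake_opener, 'notes.txt')
```

With the library installed, ``openxmllib.openXmlDocument`` would be passed as
``open_document`` in place of the stand-in.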
Developing and testing
======================
You can check out openxmllib with your Subversion client from its `repository at
Google Code <http://code.google.com/p/openxmllib/source/checkout>`_.
Then::
$ cd /where/you/installed/openxmllib
$ python setup.py develop
Note that running the tests does not require installation::
$ cd tests
$ python runalltests.py
Support
=======
Use the issue tracker on the `project site
<http://code.google.com/p/openxmllib/>`_.
Credits
=======
* Gilles Lenfant [gilles.lenfant] <gilles dot lenfant at gmail dot com>
* Kevin Deldycke [kevin.deldycke] <kevin at deldycke dot com>
* Hugo Lopes Tavares [hltbra] <hltbra at gmail dot com>