<!DOCTYPE html>
<html>
<head>
<title>attoparser: powerful and easy java parser for XML and HTML markup</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<meta name="description" content="powerful and easy java parser for XML and HTML markup" />
<meta name="author" content="Attoparser" />
<!-- Le styles -->
<link href="css/bootstrap.css" rel="stylesheet" />
<style type="text/css">
body {
padding-top: 60px;
padding-bottom: 40px;
}
.sidebar-nav {
padding: 9px 0;
}
</style>
<link href="css/bootstrap-responsive.css" rel="stylesheet" />
<link href="css/google-code-prettify/prettify.css" rel="stylesheet" />
</head>
<body lang="en" dir="ltr" onload="prettyPrint()">
<div class="navbar navbar-inverse navbar-fixed-top">
<div class="navbar-inner">
<div class="container-fluid">
<a class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse">
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</a>
<a class="brand" href="index.html"><img src="img/attoparser.png" alt="attoparser" /></a>
<div class="nav-collapse collapse">
<p class="navbar-text pull-right">
<img src="img/attoparser_motto.png" alt="powerful and easy java parser for XML and HTML markup" />
</p>
<ul class="nav">
<li><a href="index.html">home</a></li>
<li><a href="download.html">download</a></li>
<li class="active"><a href="usingattoparser.html">using attoparser</a></li>
<li><a href="javadoc.html">javadoc</a></li>
</ul>
</div>
</div>
</div>
</div>
<div class="container-fluid">
<div class="row-fluid">
<!-- --------------------------------------------------------------------- -->
<!-- SIDEBAR -->
<!-- --------------------------------------------------------------------- -->
<div class="span2">
<div class="well sidebar-nav">
<ul class="nav nav-list">
<li class="nav-header">ATTOPARSER</li>
<li><a href="index.html">home</a></li>
<li><a href="download.html">download</a></li>
<li class="nav-header">DOCS &amp; HELP</li>
<li class="active"><a href="usingattoparser.html">using attoparser</a></li>
<li><a href="javadoc.html">javadoc API</a></li>
<li><a href="issuetracking.html">issue tracking</a></li>
<li><a href="license.html">license</a></li>
<li><a href="faq.html">faq</a></li>
<li><a href="team.html">team</a></li>
<li class="nav-header">SOURCE REPOSITORIES</li>
<li><a href="https://github.com/attoparser/attoparser">attoparser @GitHub</a></li>
</ul>
</div>
</div>
<!-- --------------------------------------------------------------------- -->
<!-- CONTENT -->
<!-- --------------------------------------------------------------------- -->
<div class="span10">
<h3>Glossary</h3>
<p>
In order to fully understand how attoparser works, you will first need to know some basic
concepts:
</p>
<table class="table table-striped">
<tbody>
<tr>
<td class="span3"><strong>markup</strong></td>
<td class="span9">
For the sake of conciseness, in attoparser we will consider this term as a synonym of <em>"XML and/or HTML"</em>.
</td>
</tr>
<tr>
<td><strong>structure</strong></td>
<td>
A <em>structure</em> is an artifact in the parsed document that is not simply <em>text</em>, that
is, some kind of directive, format or metadata. For example: elements, DOCTYPE clauses, comments, etc.
</td>
</tr>
<tr>
<td><strong>element</strong></td>
<td>
The term <em>element</em> is just the official standard name for a markup <em>tag</em>.
</td>
</tr>
<tr>
<td><strong><kbd>(offset,len)</kbd> pair</strong></td>
<td>
An <kbd>(offset,len)</kbd> pair is a pair of integers that specifies a subsequence of the items
in an array. The first component (<em>offset</em>) is the first position in the array to be
included in the subsequence, and the second component (<em>len</em>) is the length of the
subsequence. <br />
These pairs are used extensively in attoparser in order to delimit parsed
artifacts on the original <kbd>char[]</kbd> buffer. Converting an <kbd>(offset,len)</kbd> pair
into a <kbd>String</kbd> object is easy: just call <code>new String(buffer, offset, len)</code>.
</td>
</tr>
<tr>
<td><strong>attoDOM</strong></td>
<td>
A DOM-style interface offered by attoparser similar to the standard DOM, but implemented
with classes from the <kbd>org.attoparser.markup.dom</kbd> package.
</td>
</tr>
</tbody>
</table>
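<p>
As an illustration of how an <kbd>(offset,len)</kbd> pair delimits an artifact, the following
minimal sketch uses only standard JDK classes. The class name, the sample buffer and the pair
values are made up for this example; in real use the pairs would be received as arguments of
the handler methods.
</p>
<pre class="prettyprint linenums language-java">
final class OffsetLenExample {

    // Converts an (offset,len) pair over a char[] buffer into a String.
    static String extract(final char[] buffer, final int offset, final int len) {
        return new String(buffer, offset, len);
    }

    // Appends several (offset,len) pairs to a StringBuilder without
    // creating an intermediate String for each of them.
    static String join(final char[] buffer, final int[][] pairs) {
        final StringBuilder strBuilder = new StringBuilder();
        for (final int[] pair : pairs) {
            strBuilder.append(buffer, pair[0], pair[1]);
        }
        return strBuilder.toString();
    }

    public static void main(final String[] args) {
        final char[] buffer = "title: attoparser".toCharArray();
        // offset 0, len 5 delimits "title"; offset 7, len 10 delimits "attoparser"
        System.out.println(extract(buffer, 0, 5));
        System.out.println(extract(buffer, 7, 10));
        System.out.println(join(buffer, new int[][] { {7, 4}, {0, 5} }));
    }
}
</pre>
<p>
Working with pairs instead of <kbd>String</kbd> objects means that no text is copied out of the
buffer until the handler actually needs it.
</p>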
<h3>Handlers</h3>
<p>
The first thing we need to do to use attoparser is create an <em>event handler</em>.
This event handler will be an implementation of the <kbd>IAttoHandler</kbd> interface,
but we will not normally use this interface directly; instead, we will create a subclass of one
of the several abstract base classes already provided by attoparser.
</p>
<p>
Each of these abstract base classes provides a set of overridable methods (all of them
with a default empty implementation), and each of these sets of methods
offers us a different level of detail:
</p>
<h6>Example general handlers</h6>
<table class="table table-striped">
<tbody>
<tr>
<td class="span4"><strong><kbd>AbstractAttoHandler</kbd></strong></td>
<td class="span8">
Basic implementation only differentiating between <i>text</i> and <i>structures</i>.
</td>
</tr>
<tr>
<td><strong><kbd>AbstractBasicMarkupAttoHandler</kbd></strong></td>
<td>
Abstract handler able to differentiate among different types of markup structures:
elements, comments, CDATA, DOCTYPE, etc. without breaking them down (for example,
elements will be offered as a whole, without differentiating name and attributes).
</td>
</tr>
<tr>
<td><strong><kbd>AbstractDetailedMarkupAttoHandler</kbd></strong></td>
<td>
Abstract handler able not only to differentiate among different types of markup structures,
but also to report lower-level detail inside elements (name, attributes, inner
whitespace) and DOCTYPE clauses (keyword, root element name, public and system ID, etc.).
</td>
</tr>
<tr>
<td><strong><kbd>AbstractStandardMarkupAttoHandler</kbd></strong></td>
<td>
Higher-level abstract handler that offers an interface
closer to that of the standard SAX <kbd>ContentHandler</kbd> (fewer
events, use of <kbd>String</kbd> instead of <kbd>char[]</kbd>,
attributes reported as <kbd>Map&lt;String,String&gt;</kbd>, etc.).
</td>
</tr>
</tbody>
</table>
<h6>Example XML handlers</h6>
<table class="table table-striped">
<tbody>
<tr>
<td class="span4"><strong><kbd>AbstractDetailedXmlAttoHandler</kbd></strong></td>
<td class="span8">
Abstract handler with the same level of detail as <kbd>AbstractDetailedMarkupAttoHandler</kbd>,
using specific XML configuration.
</td>
</tr>
<tr>
<td><strong><kbd>AbstractStandardXmlAttoHandler</kbd></strong></td>
<td>
Higher-level abstract handler similar to <kbd>AbstractStandardMarkupAttoHandler</kbd>,
using specific XML configuration.
</td>
</tr>
<tr>
<td><strong><kbd>DOMXmlAttoHandler</kbd></strong></td>
<td>
Specialized handler that converts SAX-style events into an attoDOM tree.
</td>
</tr>
</tbody>
</table>
<h6>Example HTML handlers</h6>
<table class="table table-striped">
<tbody>
<tr>
<td class="span4"><strong><kbd>AbstractDetailedNonValidatingHtmlAttoHandler</kbd></strong></td>
<td class="span8">
Abstract handler with the same level of detail as <kbd>AbstractDetailedMarkupAttoHandler</kbd>,
using specific HTML configuration and intelligence.
</td>
</tr>
<tr>
<td><strong><kbd>AbstractStandardNonValidatingHtmlAttoHandler</kbd></strong></td>
<td>
Higher-level abstract handler similar to <kbd>AbstractStandardMarkupAttoHandler</kbd>,
using specific HTML configuration and intelligence.
</td>
</tr>
</tbody>
</table>
<h5>Creating a handler</h5>
<p>
For example, we could choose <kbd>AbstractStandardXmlAttoHandler</kbd> and create a
very simple handler for counting the number of standalone elements in our parsed documents, like:
</p>
<pre class="prettyprint linenums language-java">
public class StandaloneCountingAttoHandler extends AbstractStandardXmlAttoHandler {

    // Let's count the number of standalone elements in our document!
    private int standaloneCount = 0;

    public StandaloneCountingAttoHandler() {
        super();
    }

    public int getStandaloneCount() {
        return this.standaloneCount;
    }

    @Override
    public void handleXmlStandaloneElement(
            final String elementName, final Map&lt;String, String&gt; attributes,
            final int line, final int col)
            throws AttoParseException {
        this.standaloneCount++;
    }

}
</pre>
<p>
Looking at the code above, note that most attoparser handlers offer separate events for
<em>opening elements</em> and <em>standalone elements</em>, a distinction that is
not easy to make using standard SAX parsers.
</p>
<p>
Also note the <code>line</code> and <code>col</code> arguments, specifying the exact position
of these standalone elements in the document.
</p>
<p>
And finally, note the fact that we are using an <i>XML-specific</i> handler,
which instructs attoparser to require the parsed document to be well-formed from an
XML standpoint. This means a well-formed prolog, balanced tags, correctly formatted
attribute values, etc.
</p>
<p>
If our document were HTML instead of XML, we could have created our handler as a subclass
of, for example, <kbd>AbstractStandardNonValidatingHtmlAttoHandler</kbd>, which
would offer events similar to those of its XML counterpart, but relaxing many
format restrictions (on attributes, for instance) and adding some HTML-specific
intelligence, like knowing that <code>&lt;img src="..."&gt;</code> is a standalone tag
(and not an <i>open tag</i>) even if it isn't written as <code>&lt;img src="..." /&gt;</code>.
</p>
<h5>Just one more...</h5>
<p>
What if we wanted to strip our document of markup tags, leaving only the text? We could easily
create a handler for this by extending <kbd>AbstractBasicMarkupAttoHandler</kbd>:
</p>
<pre class="prettyprint linenums language-java">
public class TagStrippingAttoHandler extends AbstractBasicMarkupAttoHandler {

    private final StringBuilder strBuilder;

    public TagStrippingAttoHandler() {
        super();
        this.strBuilder = new StringBuilder();
    }

    public String getTagStrippedText() {
        return this.strBuilder.toString();
    }

    @Override
    public void handleText(
            final char[] buffer, final int offset, final int len,
            final int line, final int col)
            throws AttoParseException {
        // Only text events are handled, so every markup structure is dropped
        this.strBuilder.append(buffer, offset, len);
    }

}
</pre>
<p>
Quite easy, right?
</p>
<h3>Parsers</h3>
<p>
attoparser offers a parser interface called <kbd>IAttoParser</kbd>, and only one implementation
for it: <kbd>MarkupAttoParser</kbd>.
</p>
<p>
This <kbd>MarkupAttoParser</kbd> class should be directly used (without extending) and its
instances are <em>thread-safe</em>, so they can be safely reused without synchronization.
Also note that this <em>thread-safety</em> feature usually does not apply to <em>handlers</em>.
</p>
<h5>Parsing our document</h5>
<p>
<kbd>MarkupAttoParser</kbd> allows us to specify the document to be parsed in several useful
ways: as a <kbd>java.io.Reader</kbd>, a <kbd>String</kbd> or a <kbd>char[]</kbd>.
</p>
<p>
Let's say we have a document in our classpath and we want to parse it using our recently created
<em>handler</em> in order to count the number of standalone elements it contains. For the sake
of simplicity, we will ignore the <code>try..finally</code> code required to adequately close
the streams:
</p>
<pre class="prettyprint linenums language-java">
// fileName is the classpath location of the document to be parsed
final InputStream is = Thread.currentThread().getContextClassLoader().getResourceAsStream(fileName);
// We know our file's encoding is ISO-8859-1, and we need that info to create a Reader
final Reader reader = new BufferedReader(new InputStreamReader(is, "ISO-8859-1"));
// Parser instances are thread-safe, so this one could be shared and reused
final IAttoParser parser = new MarkupAttoParser();
final StandaloneCountingAttoHandler handler = new StandaloneCountingAttoHandler();
parser.parse(reader, handler);
final int standaloneCount = handler.getStandaloneCount();
</pre>
<p>
And we are done!
</p>
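<p>
The example above leaves out the code that closes the streams. A minimal sketch of that part,
assuming Java 7 or later, could use <code>try</code>-with-resources instead of an explicit
<code>try..finally</code>. It uses only standard JDK classes: the class name is made up, the
<code>consume(...)</code> method is a hypothetical stand-in for the call to
<code>parser.parse(reader, handler)</code>, and an in-memory stream replaces the classpath resource.
</p>
<pre class="prettyprint linenums language-java">
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class ReaderClosingExample {

    // Stand-in for parser.parse(reader, handler): it only counts the chars read.
    static int consume(final Reader reader) throws IOException {
        int count = 0;
        while (reader.read() != -1) {
            count++;
        }
        return count;
    }

    static int readAndClose(final InputStream is) {
        // try-with-resources closes the Reader, and with it the wrapped
        // InputStream, whether consume(...) returns normally or throws.
        try (Reader reader = new BufferedReader(
                new InputStreamReader(is, StandardCharsets.ISO_8859_1))) {
            return consume(reader);
        } catch (final IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(final String[] args) {
        // In-memory stand-in for the classpath resource, encoded as ISO-8859-1
        final byte[] bytes = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(readAndClose(new ByteArrayInputStream(bytes)));
    }
}
</pre>
<p>
In real code it is also worth checking that <code>getResourceAsStream(...)</code> did not return
<code>null</code>, which is what happens when the resource cannot be found.
</p>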
<h3>Using the DOM features</h3>
<p>
In addition to its main SAX-style parsing capabilities, attoparser offers us a DOM-style
interface that enables us to handle a document as an attoDOM tree.
Note that, currently, only an XML version of the DOM conversion facilities is offered
out-of-the-box.
</p>
<p>
Using it is easy: we just need to use the prebuilt <kbd>DOMXmlAttoHandler</kbd>:
</p>
<pre class="prettyprint linenums language-java">
final Reader reader = ...
final IAttoParser parser = new MarkupAttoParser();
final DOMXmlAttoHandler handler = new DOMXmlAttoHandler();
parser.parse(reader, handler);
// Once parsing has finished, the handler holds the resulting attoDOM tree
final Document doc = handler.getDocument();
final DocType docType = doc.getDocType();
final List&lt;Element&gt; elements = doc.getRootElement().getElementChildren();
...
</pre>
<h5>Writing markup from an attoDOM tree</h5>
<p>
attoparser provides, out-of-the-box, a writer object capable of writing an attoDOM tree as
markup code again. It's the <kbd>XmlDOMWriter</kbd> class:
</p>
<pre class="prettyprint linenums language-java">
final Document doc = ...
// Modify our document if we wish
...
final StringWriter stringWriter = new StringWriter();
final XmlDOMWriter domWriter = new XmlDOMWriter();
// Execute the writer
domWriter.writeDocument(doc, stringWriter);
// Obtain the result of executing the writer
final String markup = stringWriter.toString();
</pre>
</div>
</div>
<hr />
<footer>
<p>Copyright © <a href="team.html">Attoparser</a>.</p>
</footer>
</div>
<script src="https://code.jquery.com/jquery-latest.js"></script>
<script src="js/bootstrap.js"></script>
<script src="js/google-code-prettify/prettify.js"></script>
</body>
</html>