usingattoparser.html

<!DOCTYPE html>
<html>

  <head>
  
	<title>attoparser: powerful and easy java parser for XML and HTML markup</title>
    
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <meta name="description" content="powerful and easy java parser for XML and HTML markup" />
    <meta name="author" content="Attoparser" />

    <!-- Le styles -->
    <link href="css/bootstrap.css" rel="stylesheet" />
    <style type="text/css">
      body {
        padding-top: 60px;
        padding-bottom: 40px;
      }
      .sidebar-nav {
        padding: 9px 0;
      }
    </style>
    <link href="css/bootstrap-responsive.css" rel="stylesheet" />
    <link href="css/google-code-prettify/prettify.css" rel="stylesheet" />
        

  </head>


  <body lang="en" dir="ltr" onload="prettyPrint()">


    <div class="navbar navbar-inverse navbar-fixed-top">
      <div class="navbar-inner">
        <div class="container-fluid">
          <a class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse">
            <span class="icon-bar"></span>
            <span class="icon-bar"></span>
            <span class="icon-bar"></span>
          </a>
          <a class="brand" href="index.html"><img src="img/attoparser.png" alt="attoparser" /></a>
          <div class="nav-collapse collapse">
            <p class="navbar-text pull-right">
              <img src="img/attoparser_motto.png" alt="powerful and easy java parser for XML and HTML markup" />
            </p>
            <ul class="nav">
              <li><a href="index.html">home</a></li>
              <li><a href="download.html">download</a></li>
              <li class="active"><a href="usingattoparser.html">using attoparser</a></li>
              <li><a href="javadoc.html">javadoc</a></li>
            </ul>
          </div>
        </div>
      </div>
    </div>


    <div class="container-fluid">
    
      <div class="row-fluid">

        <!-- --------------------------------------------------------------------- -->      
        <!--   SIDEBAR                                                             -->      
        <!-- --------------------------------------------------------------------- -->
              
        <div class="span2">
          <div class="well sidebar-nav">
            <ul class="nav nav-list">
            
              <li class="nav-header">ATTOPARSER</li>
            
              <li><a href="index.html">home</a></li>
              <li><a href="download.html">download</a></li>
            
              <li class="nav-header">DOCS &amp; HELP</li>
            
              <li class="active"><a href="usingattoparser.html">using attoparser</a></li>
              <li><a href="javadoc.html">javadoc API</a></li>
              <li><a href="issuetracking.html">issue tracking</a></li>
              <li><a href="license.html">license</a></li>
              <li><a href="faq.html">faq</a></li>
              <li><a href="team.html">team</a></li>
            
              <li class="nav-header">SOURCE REPOSITORIES</li>
              
              <li><a href="https://github.com/attoparser/attoparser">attoparser @GitHub</a></li>
              
            </ul>
          </div>
        </div>

        
        <!-- --------------------------------------------------------------------- -->      
        <!--   CONTENT                                                             -->      
        <!-- --------------------------------------------------------------------- -->
              
        <div class="span10">
        
          <h3>Glossary</h3>
          
          <p>
            In order to fully understand how attoparser works, you will first need to know some basic
            concepts:
          </p>
          
          <table class="table table-striped">
            <tbody>
              <tr>
                <td class="span3"><strong>markup</strong></td>
                <td class="span9">
                  For the sake of conciseness, in attoparser we will consider this term as a synonym of <em>"XML and/or HTML"</em>.
                </td>
              </tr>
              <tr>
                <td><strong>structure</strong></td>
                <td>
                  A <em>structure</em> is an artifact in the parsed document that is not simply <em>text</em>, this
                  is, some kind of directive, format, metadata... for example: elements, DOCTYPE clauses, comments, etc.
                </td>
              </tr>
              <tr>
                <td><strong>element</strong></td>
                <td>
                  The term <em>element</em> is just the official standard name for a markup <em>tag</em>.  
                </td>
              </tr>
              <tr>
                <td><strong><kbd>(offset,len)</kbd> pair</strong></td>
                <td>
                  An <kbd>(offset,len)</kbd> pair is a couple of integer numbers that specify a subsequence of elements
                  in an array. The first component (<em>offset</em>) signals the first position in the array to be
                  included in the subsequence, and the second component (<em>len</em>) indicates the length of the
                  subsequence. <br /> 
                  These pairs are extensively used in attoparser in order to delimit parsed
                  artifacts on the original <kbd>char[]</kbd> buffer. Converting an <kbd>(offset,len)</kbd> pair
                  into a <kbd>String</kbd> object is easy, just do <code>new String(buffer, offset, len)</code>.  
                </td>
              </tr>
              <tr>
                <td><strong>attoDOM</strong></td>
                <td>
                  A DOM-style interface offered by attoparser similar to the standard DOM, but implemented
                  with classes from the <kbd>org.attoparser.markup.dom</kbd> package.  
                </td>
              </tr>
            </tbody>
          </table>
        
          <h3>Handlers</h3>

          <p>
            The first thing we need to do for using attoparser is creating an <em>event handler</em>.
            This event handler will be an implementation of the <kbd>IAttoHandler</kbd> interface,
            but we will normally not use this interface directly, creating instead a subclass of one
            of the several abstract base classes already provided by attoparser.
          </p>
          <p>
            Each of these abstract base classes provide a set of overriddable methods &mdash;all of them
            having a default empty implementation&mdash;, and each of these sets of methods will
            offer a different level of detail to us:
          </p>
        
          <h6>Example general handlers</h6>

          
          <table class="table table-striped">
            <tbody>
              <tr>
                <td class="span4"><strong><kbd>AbstractAttoHandler</kbd></strong></td>
                <td class="span8">
                  Basic implementation only differentiating between <i>text</i> and <i>structures</i>.
                </td>
              </tr>
              <tr>
                <td><strong><kbd>AbstractBasicMarkupAttoHandler</kbd></strong></td>
                <td>
                  Abstract handler able to differentiate among different types of markup structures: 
                  elements, comments, CDATA, DOCTYPE, etc. without breaking them down (for example, 
                  elements will be offered as a whole, without differentiating name and attributes).
                </td>
              </tr>
              <tr>
                <td><strong><kbd>AbstractDetailedMarkupAttoHandler</kbd></strong></td>
                <td>
                  Abstract handler able not only to differentiate among different types of markup structures, 
                  but also of reporting lowel-level detail inside elements (name, attributes, inner 
                  whitespace) and DOCTYPE clauses (keyword, root element name, public and system ID, etc.).  
                </td>
              </tr>
              <tr>
                <td><strong><kbd>AbstractStandardMarkupAttoHandler</kbd></strong></td>
                <td>
                  Higher-level abstract handler that offers an interface
                  more similar to the Standard SAX <kbd>ContentHandler</kbd>s (fewer
                  events, use of <kbd>String</kbd> instead of <kbd>char[]</kbd>, 
                  attributes reported as <kbd>Map&lt;String,String&gt;</kbd>, etc).  
                </td>
              </tr>
            </tbody>
          </table>

        
          <h6>Example XML handlers</h6>

          
          <table class="table table-striped">
            <tbody>
              <tr>
                <td class="span4"><strong><kbd>AbstractDetailedXmlAttoHandler</kbd></strong></td>
                <td class="span8">
                  Abstract handler with the same level of detail as <kbd>AbstractDetailedMarkupAttoHandler</kbd>,
                  using specific XML configuration.
                </td>
              </tr>
              <tr>
                <td><strong><kbd>AbstractStandardXmlAttoHandler</kbd></strong></td>
                <td>
                  Higher-level abstract handler similar to <kbd>AbstractStandardMarkupAttoHandler</kbd>,
                  using specific XML configuration.
                </td>
              </tr>
              <tr>
                <td><strong><kbd>DOMXmlAttoHandler</kbd></strong></td>
                <td>
                  Specialized handler that converts SAX-style events into an attoDOM tree.
                </td>
              </tr>
            </tbody>
          </table>

        
          <h6>Example HTML handlers</h6>

          
          <table class="table table-striped">
            <tbody>
              <tr>
                <td class="span4"><strong><kbd>AbstractDetailedNonValidatingHtmlAttoHandler</kbd></strong></td>
                <td class="span8">
                  Abstract handler with the same level of detail as <kbd>AbstractDetailedMarkupAttoHandler</kbd>,
                  using specific HTML configuration and intelligence.
                </td>
              </tr>
              <tr>
                <td><strong><kbd>AbstractStandardNonValidatingHtmlAttoHandler</kbd></strong></td>
                <td>
                  Higher-level abstract handler similar to <kbd>AbstractStandardMarkupAttoHandler</kbd>,
                  using specific HTML configuration and intelligence.
                </td>
              </tr>
            </tbody>
          </table>
        
          <h5>Creating a handler</h5>
        
          <p>
            For example, we could choose <kbd>AbstractStandardMarkupAttoHandler</kbd> and create a
            very simple handler for counting the number of standalone elements in our parsed documents, like:
          </p>

<pre class="prettyprint linenums language-java">
public class StandaloneCountingAttoHandler extends AbstractStandardXmlAttoHandler {

    // Let's count the number of standalone elements in our document!
    private int standaloneCount = 0;
    
    public StandaloneCountingAttoHandler() {
        super();
    }
    
    public int getStandaloneCount() {
        return this.standaloneCount;
    }

    @Override
    public void handleXmlStandaloneElement(
            final String elementName, final Map&lt;String, String&gt; attributes, 
            final int line, final int col)
            throws AttoParseException {
        this.standaloneCount++;
    }

}
</pre>

          <p>
            Looking at the code above, note that most attoparser handlers offer different handlers for
            <em>opening elements</em> and <em>standalone elements</em>, a differentiation that is
            not easy to achieve using standard SAX parsers. 
          </p>
          <p>
            Also note the <code>line</code> and <code>col</code> arguments, specifying the exact position 
            of these standalone elements in the document.
          </p>
          <p>
            And finally, note the fact that we are using an <i>XML-specific</i> handler,
            which instructs attoparser to require the parsed document to be well-formed from an
            XML standpoint. This means a well-formed prolog, balanced tags, correctly formatted 
            attribute values, etc.
          </p>
          <p>
            If our code was HTML instead of XML, we could have created our handler as an implementation
            of, for example, <kbd>AbstractStandardNonValidatingHtmlMarkupAttoHandler</kbd>, which 
            would offer similar events to those of its XML counterpart, but removing a lot of 
            restrictions of format (of attributes, for instance) and adding some HTML-specific
            intelligence like knowing that a <code>&lt;img src="..."&gt;</code> is a standalone tag
            (and not an <i>open tag</i>) even if it isn't written like <code>&lt;img src="..." /&gt;</code>.
          </p>
        
        
          <h5>Just one more...</h5>
        
          <p>
            What if we wanted to strip our document of markup tags, leaving only the text? We could easily
            create a handler for this by extending <kbd>AbstractBasicMarkupAttoHandler</kbd>:
          </p>
        
<pre class="prettyprint linenums language-java">
public class TagStrippingAttoHandler extends AbstractBasicMarkupAttoHandler {

    private final StringBuilder strBuilder;
    
    public TagStrippingAttoHandler() {
        super();
        this.strBuilder = new StringBuilder();
    }
    
    public String getTagStrippedText() {
        return this.strBuilder.toString();
    }

    @Override
    public void handleText(
            final char[] buffer, final int offset, final int len, 
            final int line, final int col)
            throws AttoParseException {
        this.strBuilder.append(buffer, offset, len);
    }
    

}
</pre>
          <p>
            Quite easy, right?
          </p>
        
          <h3>Parsers</h3>

          <p>
            attoparser offers a parser interface called <kbd>IAttoParser</kbd>, and only one implementation
            for it: <kbd>MarkupAttoParser</kbd>.
          </p>
          <p>
            This <kbd>MarkupAttoParser</kbd> class should be directly used (without extending) and its 
            instances are <em>thread-safe</em>, so they can be safely reused without synchronization.
            Also note that this <em>thread-safety</em> feature usually does not apply to <em>handlers</em>.  
          </p>
          
          <h5>Parsing our document</h5>
          
          <p>
            <kbd>MarkupAttoParser</kbd> allows us to specify the document to be parsed in several useful
            ways: as a <kbd>java.io.Reader</kbd>, a <kbd>String</kbd> or a <kbd>char[]</kbd>. 
          </p>
          
          <p>
            Let's say we have a document in our classpath and we want to parse it using our recently created
            <em>handler</em> in order to count the number of standalone elements it contains. For the sake
            of simplicity, we will ignore the <code>try..finally</code> code required to adequately close 
            the streams: 
          </p>

<pre class="prettyprint linenums language-java">
final InputStream is = Thread.currentThread().getContextClassLoader().getResourceAsStream(fileName);
// We know our file's encoding is ISO-8859-1, and we need that info to create a Reader
final Reader reader = new BufferedReader(new InputStreamReader(is, "ISO-8859-1"));
                
final StandaloneCountingAttoHandler handler = new StandaloneCountingAttoHandler();
parser.parse(reader, handler);

final int standaloneCount = handler.getStandaloneCount();
</pre>          
        
          <p>
            And we are done!
          </p>
        
          <h3>Using the DOM features</h3>

          <p>
            As a plus to its main SAX-style parsing capabilities, attoparser offers us a DOM-style
            interface that enables us to handle a document as an attoDOM tree.
            Note that, currently, only an XML version of the DOM conversion facilities is offered 
            out-of-the-box.
          </p>
          <p>
            Using it is easy: we just need to use the prebuilt <kbd>DOMXmlAttoHandler</kbd>:
          </p>

<pre class="prettyprint linenums language-java">
final Reader reader = ...
                
final DOMXmlAttoHandler handler = new DOMXmlAttoHandler();
parser.parse(reader, handler);

final Document doc = handler.getDocument();
final DocType docType = doc.getDocType();
final List&lt;Element&gt; elements = doc.getRootElement().getElementChildren();
...
</pre>          
        
        <h5>Writing markup from an attoDOM tree</h5>
        
        <p>
          attoparser provides, out-of-the-box, a writer object capable of writing an attoDOM tree as
          markup code again. It's the <kbd>XmlDOMWriter</kbd> class:
        </p>
          
<pre class="prettyprint linenums language-java">
final Document doc = ...

// Modify our document if we wish 
...
                
final StringWriter stringWriter = new StringWriter();
final XmlDOMWriter domWriter = new XmlDOMWriter(); 

// Execute the writer
domWriter.writeDocument(doc, stringWriter);

// Obtain the result of executing the visitor            
final String markup = stringWriter.toString();
</pre>
        
        </div>
        
      </div>

      <hr />
            
      <footer>
        <p>Copyright &copy; <a href="team.html">Attoparser</a>.</p>
      </footer>
      
    </div>
    
    <script src="https://code.jquery.com/jquery-latest.js"></script>
    <script src="js/bootstrap.js"></script>
    <script src="js/google-code-prettify/prettify.js"></script>
    
  </body>
  
</html>