Skip to content

Latest commit

 

History

History
26 lines (19 loc) · 828 Bytes

README.md

File metadata and controls

26 lines (19 loc) · 828 Bytes

=== Rule Engine ===

A simple system which can work and process work items based on rules define for them. The idea was to alieviate the pain of site specific parsing to extract relevant content Three types of rules can be defined.

  • CSS based extraction
  • Regex based extraction
  • Executing JS on the HTML to extract content ( Not done yet - WIP)

The Rule engine would return results based on the best fit for a particular URL For ex., if Rules were defined as below:

*: title: h1 comments: div#comments, div#comment, div#user_comment user: div#comment span.user

xyz.com: title: div#name comments: div#comment_bar span.comments

Now, if a user wanted to scrape title, comments & user for xyz.com the title, comments would xyz.com would take effect. And, the rule for user at * will take effect. (WIP)