Notes from Lucene Eurocon

Notes taken at Lucene Eurocon 2011.

Page 1

  • Faceting provides very similar functionality to grouping; it could be used instead.
  • Do we need query logs? "Discovery analytics"
  • Carrot2 is a clustering engine that integrates with Lucene.
  • There is an eBay/SIGIR paper about debugging strategies and good UX for zero-result queries. (link?)
  • The Solr stats module can perform descriptive statistics over fields (see the first sketch after this list).
  • Better "join" support is coming in Lucene/Solr 4.0. (Simulated via implicit chained queries.)
  • HCatalog is an Apache project that provides homogeneous access to various heterogeneous tabular datastores (HBase, Pig, etc.). It may someday support Solr and traditional relational datastores.
  • Subtracting the results of two faceted queries can provide an effective way of illustrating the differences between result sets (see the second sketch after this list).
  • Ajax-Solr is a JavaScript library that talks to Solr directly over HTTP, useful for quickly building internal UIs.
  • A possible domain with similar problems to ours: text message search.
  • Lucene has a multi-index reader (MultiReader) that will span queries across indexes.
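
A minimal sketch of hitting the stats component over the JSON API, referenced in the stats bullet above. The Solr URL and the "salary" field are assumptions, not actual PANDA fields:

```python
# Descriptive statistics over a numeric field via Solr's stats component.
import requests

SOLR_URL = "http://localhost:8983/solr/select"  # assumed local Solr core

params = {
    "q": "*:*",
    "rows": 0,                # only the stats are wanted, not documents
    "stats": "true",
    "stats.field": "salary",  # hypothetical numeric field
    "wt": "json",
}

response = requests.get(SOLR_URL, params=params).json()
field_stats = response["stats"]["stats_fields"]["salary"]
print(field_stats["min"], field_stats["max"], field_stats["mean"])
```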

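A sketch of the facet-subtraction idea: facet the same field over two filtered result sets and diff the counts to see what distinguishes them. The URL, facet field and filter queries are illustrative assumptions.

```python
# Facet the same field under two different filter queries and subtract.
import requests

SOLR_URL = "http://localhost:8983/solr/select"

def facet_counts(fq):
    params = {
        "q": "*:*",
        "fq": fq,
        "rows": 0,
        "facet": "true",
        "facet.field": "category",  # hypothetical facet field
        "wt": "json",
    }
    raw = requests.get(SOLR_URL, params=params).json()
    flat = raw["facet_counts"]["facet_fields"]["category"]
    # Solr returns [term1, count1, term2, count2, ...]; pair them up.
    return dict(zip(flat[::2], flat[1::2]))

a = facet_counts("year:2010")
b = facet_counts("year:2011")
diff = {term: a.get(term, 0) - b.get(term, 0) for term in set(a) | set(b)}
for term, delta in sorted(diff.items(), key=lambda kv: kv[1], reverse=True):
    print(term, delta)
```
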
Page 2

  • Tika is an Apache project for extracting metadata from text, e.g. language detection.
  • It would be possible to auto-generate a complete Solr schema using csvkit's type inference (see the first sketch after this list).
  • PANDA: Boost search results by dataset creation date?
  • Sort by drop-down in search results? (relevance, date, etc.)
  • Interesting idea for search performance testing: randomly generate queries from terms in the index (see the second sketch after this list).
  • It can sometimes be effective to build different indexes for different datasets as the smaller size can improve performance and the cost of duplicating data may be negligible.
  • NLP for queries? "police reports about Rod Blagojevich".
  • Tools for doing NLP with Lucene: Apache UIMA.
  • SolrMeter is a stress-testing tool for Solr.
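
A sketch of schema auto-generation from inferred column types, referenced above. It assumes an upstream step (e.g. csvkit's type inference) has already produced a mapping of column names to Python types; the type mapping and Solr field type names are assumptions to adapt to the real schema:

```python
# Render Solr <field> declarations from inferred column types.
SOLR_TYPES = {
    int: "tint",
    float: "tfloat",
    bool: "boolean",
    str: "text_general",
}

def solr_fields(inferred_columns):
    """Build <field> elements for a generated schema.xml."""
    lines = []
    for name, py_type in inferred_columns.items():
        solr_type = SOLR_TYPES.get(py_type, "text_general")
        lines.append(
            '<field name="%s" type="%s" indexed="true" stored="true"/>'
            % (name, solr_type)
        )
    return "\n".join(lines)

print(solr_fields({"precinct": str, "arrests": int, "rate": float}))
```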

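A sketch of random query generation for performance testing: pull a sample of terms from the index via the TermsComponent and OR a few together per query. It assumes a /terms handler is configured; the URL and field name are assumptions.

```python
# Generate random test queries from terms already in the index.
import random
import requests

TERMS_URL = "http://localhost:8983/solr/terms"

def sample_terms(field, limit=500):
    params = {"terms.fl": field, "terms.limit": limit, "wt": "json"}
    raw = requests.get(TERMS_URL, params=params).json()
    flat = raw["terms"][field]  # [term, count, term, count, ...]
    return flat[::2]

terms = sample_terms("full_text")  # hypothetical catch-all text field
for _ in range(10):
    query = " OR ".join(random.sample(terms, random.randint(2, 3)))
    print(query)
```
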
Page 3

  • Lucene 4.0 will make it possible to specify custom index encoding codecs. For some applications this could be a significant performance gain.
  • TF/IDF scoring is not appropriate for metadata, which is essentially what 90% of PANDA's content is. Terms are unlikely to appear more than once in a document (TF), and even when they do it shouldn't matter. The relative scarcity of terms (IDF) is also completely irrelevant, since the documents span unrelated domains.
  • Lucene "payloads" allow for per-field value boosts for multivalued fields.
  • One way to post-filter search results: a custom Query that implements Solr's PostFilter interface.
  • The stupidest thing that works for security with Lucene: OR together the datasets the user has access to and pass that as a filter query (see the sketch after this list). This has the side benefit that it can/will be cached until the user's security changes.
  • Etsy takes this sort of security even further by applying a security bitset to each document which is then compared against the user's security.
  • PANDA: GUIDs are probably unnecessarily complex as unique IDs. Use an auto-incrementing integer key instead?
  • In Lucene 4.0, DocValues may allow selected fields to be updatable (exclusive of searchable fields). It's likely this feature will not make it into Solr 4.0.
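
A sketch of the filter-query approach to security described above: OR together the datasets a user may see and pass that as fq, which Solr caches until the list changes. The field name and IDs are illustrative.

```python
# Restrict search results to the datasets a user is allowed to see.
import requests

SOLR_URL = "http://localhost:8983/solr/select"

def search_as_user(q, allowed_dataset_ids):
    # The OR'd filter query is cached by Solr independently of the main query.
    fq = "dataset_id:(%s)" % " OR ".join(str(i) for i in allowed_dataset_ids)
    params = {"q": q, "fq": fq, "wt": "json"}
    return requests.get(SOLR_URL, params=params).json()

results = search_as_user("blagojevich", [1, 7, 19])
print(results["response"]["numFound"])
```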

Page 4

  • facet.range can be used with date math (see the first sketch after this list).
  • PANDA: datasets should be immutable while an import or delete is in progress.
  • PANDA: it would probably be a perf/code win to punt on Sunburnt and use Solr's read/write JSON API directly (&wt=json); see the second sketch after this list.
  • explain.solr.pl will parse and render an explain query in a more digestible format.
  • PANDA: Dataset metadata search can simply be implemented as another core.
  • Random thought: it would be awesome to surface Solr explain details in django-debug-toolbar.
  • PANDA: Reordering of search results (grouping) may be more effectively done at the API layer.
  • PANDA: Should create "constraints documentation" for various server configs. Example: max # of docs effectively indexed, max file size, basic perf expectations, available disk space, etc. for each of the main EC2 instance sizes.
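
A sketch of a date-range facet using Solr date math, referenced above. The date field and the 30-day window are assumptions:

```python
# Facet documents into one-day buckets over the last 30 days.
import requests

SOLR_URL = "http://localhost:8983/solr/select"

params = {
    "q": "*:*",
    "rows": 0,
    "facet": "true",
    "facet.range": "creation_date",         # hypothetical date field
    "facet.range.start": "NOW/DAY-30DAYS",  # date math: midnight, 30 days ago
    "facet.range.end": "NOW/DAY+1DAY",
    "facet.range.gap": "+1DAY",
    "wt": "json",
}

raw = requests.get(SOLR_URL, params=params).json()
print(raw["facet_counts"]["facet_ranges"]["creation_date"]["counts"])
```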

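A sketch of talking to Solr's JSON API directly instead of going through Sunburnt: reads via wt=json, writes via the JSON update handler. The URLs (which differ between Solr 3.x and 4.x) and document fields are assumptions.

```python
# Read and write against Solr using plain HTTP and JSON.
import json
import requests

SELECT_URL = "http://localhost:8983/solr/select"
UPDATE_URL = "http://localhost:8983/solr/update/json"

# Read: a normal select with wt=json.
hits = requests.get(
    SELECT_URL, params={"q": "full_text:chicago", "wt": "json"}
).json()
print(hits["response"]["numFound"])

# Write: POST a JSON list of documents and commit in the same request.
docs = [{"id": "dataset-1-row-42", "full_text": "example row contents"}]
requests.post(
    UPDATE_URL,
    params={"commit": "true"},
    data=json.dumps(docs),
    headers={"Content-Type": "application/json"},
)
```
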
Page 5

  • Advice from Simon: a RAM buffer larger than 512 MB never increases performance; 256 MB is a good default that will keep Solr from hitting disk too frequently.
  • Simon: on light hardware, use SerialMergeScheduler to ensure only a single index merge occurs at a time (minimizes concurrent threads).
  • Simon: commit as infrequently as possible.
  • Use omitNorms for all fields since document length is irrelevant to scoring our results.
  • Write a custom Similarity implementation that disables TF/IDF computations.
  • Solr's WordDelimiterFilterFactory has tokenization options that will generate the "maximum spread" of letter/number combinations. Example: "FY-09" can have indexed terms for "FY", "09", "FY09" and "FY-09".

Page 6

  • Do we even need metadata fields (Name) if full-text works well? What is the value-add?
  • What about phrase queries? How are these computed? Does killing TF/IDF scoring affect this?