Notes from Lucene Eurocon
onyxfish edited this page Oct 24, 2011 · 4 revisions
Notes taken at Lucene Eurocon 2011.
- Faceting provides very similar functionality to grouping and could be used instead.
- Do we need query logs? "Discovery analytics"
- Carrot2 is a clustering engine that integrates with Lucene.
- There is an eBay/SIGIR paper about debugging strategies and good UX for zero-result queries. (link?)
- The Solr stats module can perform descriptive statistics over fields.
- Better "join" support is coming in Lucene/Solr 4.0. (Simulated via implicit chained queries.)
- HCatalog is an Apache project to provide homogeneous access to various heterogeneous tabular datastores (HBase, Pig, etc.). May someday support Solr and traditional relational datastores.
- Subtracting the results of two faceted queries can provide an effective way of illustrating the differences between result sets.
- Ajax-Solr is a JavaScript library that talks to Solr directly over plain HTTP, for quickly building internal UIs.
- A possible domain with similar problems to ours: text message search.
- Lucene has a MultiReader that will span queries across indexes.
- Tika is an Apache project for extracting metadata from text, e.g. language detection.
- It would be possible to auto-generate a complete Solr schema using csvkit's type inference.
- PANDA: Boost search results by dataset creation date?
- Sort by drop-down in search results? (relevance, date, etc.)
- Interesting idea for search performance testing: randomly generate queries from terms in the index.
- It can sometimes be effective to build different indexes for different datasets as the smaller size can improve performance and the cost of duplicating data may be negligible.
- NLP for queries? "police reports about Rod Blagojevich".
- Tools for doing NLP with Lucene: Apache UIMA.
- SolrMeter is a testing tool for Solr.
- Lucene 4.0 will make it possible to specify custom index encoding codecs. For some applications this could be a significant performance gain.
- TF/IDF scoring is not appropriate for metadata, which is essentially what 90% of PANDA's data is. Terms are unlikely to appear more than once in a document (TF), and even if they do it shouldn't matter. The relative scarcity of terms (IDF) is completely irrelevant since the documents span unrelated domains.
- Lucene "payloads" allow for per-field value boosts for multivalued fields.
- One way to post filter search results: a CustomQuery that implements the post_filter method.
- The stupidest thing that works for security with Lucene: OR together what datasets the user has access to and pass it as a filter-query. This has the side benefit that it can/will be cached until the user's security changes.
- Etsy takes this sort of security even further by applying a security bitset to each document which is then compared against the user's security.
- PANDA: GUIDs are probably unnecessarily complex unique IDs. Use an auto-incrementing integer key instead?
- In Lucene 4.0 DocValues may allow for selected updatable fields (exclusive of searchable fields). It's likely this feature will not make it into Solr 4.0.
- facet.range can be used with date math.
- PANDA: datasets should be immutable while an import/delete is in process.
- PANDA: it would probably be a perf/code win to punt on Sunburnt and use Solr's R/W JSON API. (&wt=json)
- explain.solr.pl will parse and render an explain query in a more digestible format.
- PANDA: Dataset metadata search can simply be implemented as another core.
- Random thought: it would be awesome to surface Solr explain details in django-debug-toolbar.
- PANDA: Reordering of search results (grouping) may be more effective at API layer.
- PANDA: Should create "constraints documentation" for various server configs. Example: max # of docs effectively indexed, max file size, basic perf expectations, available disk space, etc. for each of the main EC2 instance sizes.
- Advice from Simon: A RAM buffer larger than 512 MB never increases performance; 256 MB is a good default that will keep Solr from hitting disk too frequently.
- Simon: On light hardware use SerialMergeScheduler to ensure only a single index merge is occurring at once (minimize concurrent threads).
- Simon: commit as infrequently as possible.
- Use omitNorms for all fields since document length is irrelevant to scoring our results.
- Write a custom Similarity implementation that disables TF/IDF computations.
- Solr's WordDelimiterFilterFactory has tokenization options that will generate the "maximum spread" of letter/number combinations. Example: "FY-09" can have indexed terms for "FY", "09", "FY09" and "FY-09".
- Do we even need metadata fields (Name) if full-text works well? What is the value-add?
- What about phrase queries? How are these computed? Does killing TF/IDF scoring affect this?
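
The "stupidest thing that works" security note above could be sketched roughly as follows. This is a minimal illustration, not PANDA code: the `dataset_id` field name and both function names are assumptions, and dataset IDs are assumed to be integers (string IDs would need quoting and escaping).

```python
from urllib.parse import urlencode


def build_security_filter(dataset_ids, field="dataset_id"):
    """Build an fq value that matches any dataset the user may read.

    `field` is a hypothetical schema field holding each document's dataset ID.
    """
    ids = sorted({int(i) for i in dataset_ids})
    if not ids:
        # An empty OR group is not a valid filter; the caller should
        # skip the search entirely for a user with no access.
        raise ValueError("user has no readable datasets")
    # Sorting makes the string identical for identical permissions,
    # so Solr's filter cache can reuse the same entry across requests.
    return "{}:({})".format(field, " OR ".join(str(i) for i in ids))


def search_params(query, dataset_ids):
    """Return a URL-encoded query string for Solr's select handler."""
    return urlencode({
        "q": query,
        "fq": build_security_filter(dataset_ids),
        "wt": "json",
    })
```

For example, `build_security_filter([3, 1, 2])` yields `dataset_id:(1 OR 2 OR 3)`. The filter string only changes when the user's permissions change, which is what makes it cacheable.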
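
The idea of subtracting two faceted queries could look something like this once both responses are in hand. It assumes Solr's default JSON facet layout, a flat list alternating term and count; the function names and sample terms are made up for illustration.

```python
def facet_counts(flat):
    """Turn Solr's flat [term, count, term, count, ...] list into a dict."""
    return dict(zip(flat[::2], flat[1::2]))


def facet_difference(counts_a, counts_b):
    """Return (term, count_a - count_b) pairs, largest differences first.

    Terms with identical counts in both result sets are dropped, since
    they say nothing about how the sets differ.
    """
    terms = set(counts_a) | set(counts_b)
    diffs = {t: counts_a.get(t, 0) - counts_b.get(t, 0) for t in terms}
    return sorted(
        ((t, d) for t, d in diffs.items() if d != 0),
        key=lambda pair: (-abs(pair[1]), pair[0]),
    )


# Hypothetical facet_fields values from two different queries.
a = facet_counts(["chicago", 40, "springfield", 10, "peoria", 5])
b = facet_counts(["chicago", 25, "springfield", 10, "joliet", 8])
print(facet_difference(a, b))
```

Positive numbers mark terms over-represented in the first result set, negative numbers terms over-represented in the second.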
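
The performance-testing idea above (randomly generating queries from terms in the index) might be sketched like this. It assumes the term list has already been pulled out of the index by some other means; `random_queries` and its parameters are hypothetical names.

```python
import random


def random_queries(terms, n, max_terms=3, seed=None):
    """Generate n queries, each made of 1..max_terms distinct index terms.

    Passing a seed makes a test run repeatable, so two Solr
    configurations can be compared against the same query stream.
    """
    terms = list(terms)
    if not terms:
        raise ValueError("need at least one term")
    rng = random.Random(seed)
    upper = min(max_terms, len(terms))
    queries = []
    for _ in range(n):
        k = rng.randint(1, upper)
        queries.append(" ".join(rng.sample(terms, k)))
    return queries
```

Because every term comes from the index, each single-term query is guaranteed at least one hit, which exercises scoring and result fetching rather than only the zero-result path.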