Notes from Lucene Eurocon

Notes taken at Lucene Eurocon 2011.

Page 1

  • Faceting provides very similar functionality to grouping; it could be used instead.
  • Do we need query logs? "Discovery analytics"
  • Carrot2 is a clustering engine that integrates with Lucene.
  • There is an eBay/SIGIR paper about debugging strategies and good UX for zero-result queries. (link?)
  • The Solr stats module can perform descriptive statistics over fields (see the first sketch after this list).
  • Better "join" support is coming in Lucene/Solr 4.0. (Simulated via implicit chained queries.)
  • HCatalog is an Apache project that provides homogeneous access to various heterogeneous tabular datastores (HBase, Pig, etc.). It may someday support Solr and traditional relational datastores.
  • Subtracting the results of two faceted queries can provide an effective way of illustrating the differences between result sets (see the second sketch after this list).
  • Ajax-Solr is a JavaScript library that talks to Solr directly over HTTP, useful for quickly building internal UIs.
  • A possible domain with similar problems to ours: text message search.
  • Lucene has a multi-index reader (MultiReader) that will span queries across indexes.
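
A minimal sketch of hitting the stats component over the JSON API, referenced in the stats bullet above. The Solr URL and the "salary" field are assumptions, not actual PANDA fields:

```python
# Descriptive statistics over a numeric field via Solr's stats component.
import requests

SOLR_URL = "http://localhost:8983/solr/select"  # assumed local Solr core

params = {
    "q": "*:*",
    "rows": 0,                # only the stats are wanted, not documents
    "stats": "true",
    "stats.field": "salary",  # hypothetical numeric field
    "wt": "json",
}

response = requests.get(SOLR_URL, params=params).json()
field_stats = response["stats"]["stats_fields"]["salary"]
print(field_stats["min"], field_stats["max"], field_stats["mean"])
```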

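A sketch of the facet-subtraction idea: facet the same field over two filtered result sets and diff the counts to see what distinguishes them. The URL, facet field and filter queries are illustrative assumptions.

```python
# Facet the same field under two different filter queries and subtract.
import requests

SOLR_URL = "http://localhost:8983/solr/select"

def facet_counts(fq):
    params = {
        "q": "*:*",
        "fq": fq,
        "rows": 0,
        "facet": "true",
        "facet.field": "category",  # hypothetical facet field
        "wt": "json",
    }
    raw = requests.get(SOLR_URL, params=params).json()
    flat = raw["facet_counts"]["facet_fields"]["category"]
    # Solr returns [term1, count1, term2, count2, ...]; pair them up.
    return dict(zip(flat[::2], flat[1::2]))

a = facet_counts("year:2010")
b = facet_counts("year:2011")
diff = {term: a.get(term, 0) - b.get(term, 0) for term in set(a) | set(b)}
for term, delta in sorted(diff.items(), key=lambda kv: kv[1], reverse=True):
    print(term, delta)
```
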
Page 2

  • Tika is an Apache project for extracting metadata from text, e.g. language detection.
  • It would be possible to auto-generate a complete Solr schema using csvkit's type inference (see the first sketch after this list).
  • PANDA: Boost search results by dataset creation date?
  • Sort by drop-down in search results? (relevance, date, etc.)
  • Interesting idea for search performance testing: randomly generate queries from terms in the index (see the second sketch after this list).
  • It can sometimes be effective to build different indexes for different datasets as the smaller size can improve performance and the cost of duplicating data may be negligible.
  • NLP for queries? "police reports about Rod Blagojevich".
  • Tools for doing NLP with Lucene: Apache UIMA.
  • SolrMeter is a stress-testing tool for Solr.
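
A sketch of schema auto-generation from inferred column types, referenced above. It assumes an upstream step (e.g. csvkit's type inference) has already produced a mapping of column names to Python types; the type mapping and Solr field type names are assumptions to adapt to the real schema:

```python
# Render Solr <field> declarations from inferred column types.
SOLR_TYPES = {
    int: "tint",
    float: "tfloat",
    bool: "boolean",
    str: "text_general",
}

def solr_fields(inferred_columns):
    """Build <field> elements for a generated schema.xml."""
    lines = []
    for name, py_type in inferred_columns.items():
        solr_type = SOLR_TYPES.get(py_type, "text_general")
        lines.append(
            '<field name="%s" type="%s" indexed="true" stored="true"/>'
            % (name, solr_type)
        )
    return "\n".join(lines)

print(solr_fields({"precinct": str, "arrests": int, "rate": float}))
```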

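A sketch of random query generation for performance testing: pull a sample of terms from the index via the TermsComponent and OR a few together per query. It assumes a /terms handler is configured; the URL and field name are assumptions.

```python
# Generate random test queries from terms already in the index.
import random
import requests

TERMS_URL = "http://localhost:8983/solr/terms"

def sample_terms(field, limit=500):
    params = {"terms.fl": field, "terms.limit": limit, "wt": "json"}
    raw = requests.get(TERMS_URL, params=params).json()
    flat = raw["terms"][field]  # [term, count, term, count, ...]
    return flat[::2]

terms = sample_terms("full_text")  # hypothetical catch-all text field
for _ in range(10):
    query = " OR ".join(random.sample(terms, random.randint(2, 3)))
    print(query)
```
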
Page 3

  • Lucene 4.0 will make it possible to specify custom index encoding codecs. For some applications this could be a significant performance gain.
  • TF/IDF scoring is not appropriate for metadata, which is essentially what 90% of PANDA's content is. Terms are unlikely to appear more than once in a document (TF), and even when they do it shouldn't matter. The relative scarcity of terms (IDF) is also completely irrelevant, since the documents span unrelated domains.
  • Lucene "payloads" allow for per-field value boosts for multivalued fields.
  • One way to post-filter search results: a custom Query that implements Solr's PostFilter interface.
  • The stupidest thing that works for security with Lucene: OR together the datasets the user has access to and pass that as a filter query (see the sketch after this list). This has the side benefit that it can/will be cached until the user's security changes.
  • Etsy takes this sort of security even further by applying a security bitset to each document which is then compared against the user's security.
  • PANDA: GUIDs are probably unnecessarily complex as unique IDs. Use an auto-incrementing integer key instead?
  • In Lucene 4.0, DocValues may allow selected fields to be updatable (exclusive of searchable fields). It's likely this feature will not make it into Solr 4.0.
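
A sketch of the filter-query approach to security described above: OR together the datasets a user may see and pass that as fq, which Solr caches until the list changes. The field name and IDs are illustrative.

```python
# Restrict search results to the datasets a user is allowed to see.
import requests

SOLR_URL = "http://localhost:8983/solr/select"

def search_as_user(q, allowed_dataset_ids):
    # The OR'd filter query is cached by Solr independently of the main query.
    fq = "dataset_id:(%s)" % " OR ".join(str(i) for i in allowed_dataset_ids)
    params = {"q": q, "fq": fq, "wt": "json"}
    return requests.get(SOLR_URL, params=params).json()

results = search_as_user("blagojevich", [1, 7, 19])
print(results["response"]["numFound"])
```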

Page 4

  • facet.range can be used with date math (see the first sketch after this list).
  • PANDA: datasets should be immutable while an import or delete is in progress.
  • PANDA: it would probably be a perf/code win to punt on Sunburnt and use Solr's read/write JSON API directly (&wt=json); see the second sketch after this list.
  • explain.solr.pl will parse and render an explain query in a more digestible format.
  • PANDA: Dataset metadata search can simply be implemented as another core.
  • Random thought: it would be awesome to surface Solr explain details in django-debug-toolbar.
  • PANDA: Reordering of search results (grouping) may be more effectively done at the API layer.
  • PANDA: Should create "constraints documentation" for various server configs. Example: max # of docs effectively indexed, max file size, basic perf expectations, available disk space, etc. for each of the main EC2 instance sizes.
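
A sketch of a date-range facet using Solr date math, referenced above. The date field and the 30-day window are assumptions:

```python
# Facet documents into one-day buckets over the last 30 days.
import requests

SOLR_URL = "http://localhost:8983/solr/select"

params = {
    "q": "*:*",
    "rows": 0,
    "facet": "true",
    "facet.range": "creation_date",         # hypothetical date field
    "facet.range.start": "NOW/DAY-30DAYS",  # date math: midnight, 30 days ago
    "facet.range.end": "NOW/DAY+1DAY",
    "facet.range.gap": "+1DAY",
    "wt": "json",
}

raw = requests.get(SOLR_URL, params=params).json()
print(raw["facet_counts"]["facet_ranges"]["creation_date"]["counts"])
```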

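A sketch of talking to Solr's JSON API directly instead of going through Sunburnt: reads via wt=json, writes via the JSON update handler. The URLs (which differ between Solr 3.x and 4.x) and document fields are assumptions.

```python
# Read and write against Solr using plain HTTP and JSON.
import json
import requests

SELECT_URL = "http://localhost:8983/solr/select"
UPDATE_URL = "http://localhost:8983/solr/update/json"

# Read: a normal select with wt=json.
hits = requests.get(
    SELECT_URL, params={"q": "full_text:chicago", "wt": "json"}
).json()
print(hits["response"]["numFound"])

# Write: POST a JSON list of documents and commit in the same request.
docs = [{"id": "dataset-1-row-42", "full_text": "example row contents"}]
requests.post(
    UPDATE_URL,
    params={"commit": "true"},
    data=json.dumps(docs),
    headers={"Content-Type": "application/json"},
)
```
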
Page 5

  • Advice from Simon: a RAM buffer larger than 512 MB never increases performance; 256 MB is a good default that will keep Solr from hitting disk too frequently.
  • Simon: on light hardware, use SerialMergeScheduler to ensure only a single index merge occurs at a time (minimizes concurrent threads).
  • Simon: commit as infrequently as possible.
  • Use omitNorms for all fields since document length is irrelevant to scoring our results.
  • Write a custom Similarity implementation that disables TF/IDF computations.
  • Solr's WordDelimiterFilterFactory has tokenization options that will generate the "maximum spread" of letter/number combinations. Example: "FY-09" can have indexed terms for "FY", "09", "FY09" and "FY-09".

Page 6

  • Do we even need metadata fields (Name) if full-text works well? What is the value-add?
  • What about phrase queries? How are these computed? Does killing TF/IDF scoring affect this?