diff --git a/010_Intro/45_Distributed.asciidoc b/010_Intro/45_Distributed.asciidoc index d119ed2ed..903a57fdc 100644 --- a/010_Intro/45_Distributed.asciidoc +++ b/010_Intro/45_Distributed.asciidoc @@ -34,8 +34,9 @@ the operations happening automatically under the hood include: As you read through this book, you'll encounter supplemental chapters about the distributed nature of Elasticsearch. These chapters will teach you about how the cluster scales and deals with failover (<>), -handles document storage (<>) and executes distributed search -(<>). +handles document storage (<>), executes distributed search +(<>), and what a shard is and how it works +(<>). These chapters are not required reading -- you can use Elasticsearch without understanding these internals -- but they will provide insight that will make diff --git a/300_Aggregations/110_docvalues.asciidoc b/300_Aggregations/110_docvalues.asciidoc index ee1c26137..128f3c236 100644 --- a/300_Aggregations/110_docvalues.asciidoc +++ b/300_Aggregations/110_docvalues.asciidoc @@ -1,4 +1,4 @@ - +[[doc-values]] === Doc Values The default data structure for field data is called _paged-bytes_, and it is @@ -11,13 +11,13 @@ There is an alternative format known as _doc values_. Doc values are special data structures which are built at index-time and written to disk. They are then loaded to memory and accessed in place of the standard paged-bytes implementation. -The main benefit of doc values is lower memory footprint. With the default +The main benefit of doc values is lower memory footprint. With the default paged-bytes format, if you attempt to load more field data to memory than available heap space...you'll get an OutOfMemoryException. -By contrast, doc values can stream from disk efficiently and do not require +By contrast, doc values can stream from disk efficiently and do not require processing at query-time (unlike paged-bytes, which must be generated). 
This -allows you to work with field data that would normally be too large to fit in +allows you to work with field data that would normally be too large to fit in memory. The trade-off is a larger index size and potentially slower field data access. @@ -35,7 +35,7 @@ tradeoff for truly massive data. ==== Enabling Doc Values Doc values can be enabled for numeric fields, geopoints and `not_analyzed` string fields. -They do not currently work with `analyzed` string fields. Doc values are +They do not currently work with `analyzed` string fields. Doc values are enabled in the mapping of a particular field, which means that some fields can use doc values while the rest use the default paged-bytes. @@ -56,7 +56,7 @@ PUT /fielddata/filtering/_mapping } } ---- -<1> Doc values can only be enabled on `not_analyzed` string fields, numerics and +<1> Doc values can only be enabled on `not_analyzed` string fields, numerics and geopoints <2> Doc values are enabled by setting the `"fielddata.format"` parameter to `doc_values` diff --git a/310_Geolocation.asciidoc b/310_Geolocation.asciidoc index 74b2b0a3b..e99077056 100644 --- a/310_Geolocation.asciidoc +++ b/310_Geolocation.asciidoc @@ -1,32 +1,56 @@ -[[geoloc]] -== Geolocation (TODO) +:ref: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/ -The web is increasingly location aware – users expect to see local results, -or to be able to filter results by their position on a map. +include::310_Geolocation/10_Intro.asciidoc[] -This chapter explains how to use geolocation in Elasticsearch, including -optimization tips. 
+include::310_Geolocation/20_Geopoints.asciidoc[] +include::310_Geolocation/30_Filter_by_geopoint.asciidoc[] -=== Adding geolocation to your documents -* Mapping the geo-point type -* Indexing documents with geo-points +include::310_Geolocation/32_Bounding_box.asciidoc[] -[[geoloc-filters]] -=== Geolocation-aware search -* geo-distance and geo-distance-range filters -* geo-bounding-box filter -* geo-polygon filter +include::310_Geolocation/34_Geo_distance.asciidoc[] -=== Sorting by distance -. +include::310_Geolocation/36_Caching_geofilters.asciidoc[] +include::310_Geolocation/38_Reducing_memory.asciidoc[] -=== Geo-shapes -. +include::310_Geolocation/40_Geohashes.asciidoc[] +include::310_Geolocation/50_Sorting_by_distance.asciidoc[] -=== Optimizing geo-queries -. +include::310_Geolocation/60_Geo_aggs.asciidoc[] +include::310_Geolocation/62_Geo_distance_agg.asciidoc[] +include::310_Geolocation/64_Geohash_grid_agg.asciidoc[] + +include::310_Geolocation/66_Geo_bounds_agg.asciidoc[] + +include::310_Geolocation/70_Geoshapes.asciidoc[] + +include::310_Geolocation/72_Mapping_geo_shapes.asciidoc[] + +include::310_Geolocation/74_Indexing_geo_shapes.asciidoc[] + +include::310_Geolocation/76_Querying_geo_shapes.asciidoc[] + +include::310_Geolocation/78_Indexed_geo_shapes.asciidoc[] + +include::310_Geolocation/80_Caching_geo_shapes.asciidoc[] + + +//////// + + + +geo_shape: + mapping + tree + precision + type of shapes + indexing + indexed shapes + filters + geoshape + +//////// diff --git a/310_Geolocation/10_Intro.asciidoc b/310_Geolocation/10_Intro.asciidoc new file mode 100644 index 000000000..334fb74c5 --- /dev/null +++ b/310_Geolocation/10_Intro.asciidoc @@ -0,0 +1,33 @@ +[[geoloc]] +== Geolocation + +Gone are the days when we wander around a city with paper maps. Thanks to +smartphones, we now know exactly where we are all of the time, and we expect +websites to use that information. 
I'm not interested in restaurants in +Greater London -- I want to know about restaurants within 5 minutes walk of my +current location. + +But geolocation is only one part of the puzzle. The beauty of Elasticsearch +is that it allows you to combine geolocation with full text search, structured +search, and analytics. + +For instance: show me restaurants that mention _vitello tonnato_, are within 5 +minutes walk, and are open at 11pm, and rank them by a combination of user +rating, distance and price. Another example: show me a map of holiday rental +properties available in August throughout the city, and calculate the average +price per zone. + +Elasticsearch offers two ways of representing geolocations: latitude-longitude +points using the `geo_point` field type, and complex shapes defined in +http://en.wikipedia.org/wiki/GeoJSON[GeoJSON], using the `geo_shape` field +type. + +Geo-points allow you to find points within a certain distance of another +point, to calculate distances between two points for sorting or relevance +scoring, or to aggregate into a grid to display on a map. Geo-shapes, on the +other hand, are used purely for filtering. They can be used to decide whether +two shapes overlap or not, or whether one shape completely contains other +shapes. + + + diff --git a/310_Geolocation/20_Geopoints.asciidoc b/310_Geolocation/20_Geopoints.asciidoc new file mode 100644 index 000000000..55117ca3a --- /dev/null +++ b/310_Geolocation/20_Geopoints.asciidoc @@ -0,0 +1,76 @@ +[[indexing-geopoints]] +=== Indexing geo-points + +Geo-points cannot be automatically detected with +<>. 
Instead, geo-points fields should be +mapped explicitly: + +[source,json] +----------------------- +PUT /attractions +{ + "mappings": { + "restaurant": { + "properties": { + "name": { + "type": "string" + }, + "location": { + "type": "geo_point" + } + } + } + } +} +----------------------- + +[[lat-lon-formats]] +==== Lat/Lon formats + +With the `location` field defined as a `geo_point`, we can proceed to index +documents containing latitude/longitude pairs, which can be formatted as +strings, arrays, or objects: + +[source,json] +----------------------- +PUT /attractions/restaurant/1 +{ + "name": "Chipotle Mexican Grill", + "location": "40.715, -74.011" <1> +} + +PUT /attractions/restaurant/2 +{ + "name": "Pala Pizza", + "location": { <2> + "lat": 40.722, + "lon": -73.989 + } +} + +PUT /attractions/restaurant/3 +{ + "name": "Mini Munchies Pizza", + "location": [ -73.983, 40.719 ] <3> +} +----------------------- +<1> A string representation, with `"lat,lon"`. +<2> An object representation with `lat` and `lon` explicitly named. +<3> An array representation with `[lon,lat]`. + +[IMPORTANT] +======================== + +Everybody gets caught at least once: string geo-points are +`"latitude,longitude"`, while array geo-points are `[longitude,latitude]` -- +the opposite order! + +Originally, both strings and arrays in Elasticsearch used latitude followed by +longitude. However, it was decided early on to switch the order for arrays in +order to conform with GeoJSON. + +The result is a bear trap that captures all unsuspecting users on their +journey to full geo-location nirvana. 
+ +======================== + diff --git a/310_Geolocation/30_Filter_by_geopoint.asciidoc b/310_Geolocation/30_Filter_by_geopoint.asciidoc new file mode 100644 index 000000000..765d20a3e --- /dev/null +++ b/310_Geolocation/30_Filter_by_geopoint.asciidoc @@ -0,0 +1,44 @@ +[[filter-by-geopoint]] +=== Filtering by geo-point + +Four geo-filters can be used to include or exclude documents by +geo-location: + +<>:: + + Find geo-points which fall within the specified rectangle. + +<>:: + + Find geo-points within the specified distance of a central point. + +<>:: + + Find geo-points within a specified minimum and maximum distance from a + central point. + +`geo_polygon`:: + + Find geo-points which fall within the specified polygon. *This filter is + very expensive*. If you find yourself wanting to use it, you should be + looking at <> instead. + +All of these filters work in a similar way: the `lat/lon` values are loaded +into memory for *all documents in the index*, not just the documents which +match the query (see <>). Each filter performs a slightly +different calculation to check whether a point falls into the containing area +or not. + +[TIP] +============================ + +Geo-filters are expensive -- they should be used on as few documents as +possible. First remove as many documents as you can with cheaper filters, like +`term` or `range` filters, and apply the geo-filters last. + +The <> will do this for you automatically. First it +applies any bitset-based filters (see <>) to exclude as many +documents as it can as cheaply as possible. Then it applies the more +expensive geo or script filters to each remaining document in turn.
+ +============================ diff --git a/310_Geolocation/32_Bounding_box.asciidoc b/310_Geolocation/32_Bounding_box.asciidoc new file mode 100644 index 000000000..f4bf71ec8 --- /dev/null +++ b/310_Geolocation/32_Bounding_box.asciidoc @@ -0,0 +1,96 @@ +[[geo-bounding-box]] +=== `geo_bounding_box` filter + +This is by far the most performant geo-filter because its calculation is very +simple. You provide it with the `top`, `bottom`, `left`, and `right` +coordinates of a rectangle and all it does is compare the latitude with the +top and bottom coordinates, and the longitude with the left and right +coordinates. + +[source,json] +--------------------- +GET /attractions/restaurant/_search +{ + "query": { + "filtered": { + "filter": { + "geo_bounding_box": { + "location": { <1> + "top": 40.8, + "bottom": 40.7, + "left": -74.0, + "right": -73.0 + } + } + } + } + } +} +--------------------- +<1> These coordinates can also be specified as `top_left` and `bottom_right` + pairs, or `bottom_left` and `top_right` pairs. + +[[optimize-bounding-box]] +==== Optimizing bounding boxes + +The `geo_bounding_box` is the one geo-filter which doesn't require all +geo-points to be loaded into memory. Because all it has to do is to check +whether the `lat` and `lon` values fall within the specified ranges, it can +use the inverted index to do a glorified `range` filter. + +In order to use this optimization, the `geo_point` field must be mapped to +index the `lat` and `lon` values separately: + +[source,json] +----------------------- +PUT /attractions +{ + "mappings": { + "restaurant": { + "properties": { + "name": { + "type": "string" + }, + "location": { + "type": "geo_point", + "lat_lon": true <1> + } + } + } + } +} +----------------------- +<1> The `location.lat` and `location.lon` fields will be indexed separately. + These fields can be used for searching, but their values cannot be retrieved.
+ +Now, when we run our query, we have to tell Elasticsearch to use the indexed +`lat` and `lon` values: + +[source,json] +--------------------- +GET /attractions/restaurant/_search +{ + "query": { + "filtered": { + "filter": { + "geo_bounding_box": { + "type": "indexed", <1> + "location": { + "top": 40.8, + "bottom": 40.7, + "left": -74.0, + "right": -73.0 + } + } + } + } + } +} +--------------------- +<1> Setting the `type` parameter to `indexed` (instead of the default + `memory`) tells Elasticsearch to use the inverted index for this filter. + +IMPORTANT: While a `geo_point` field can contain multiple geo-points, the +`lat_lon` optimization can only be used on fields which contain a single +geo-point. + diff --git a/310_Geolocation/34_Geo_distance.asciidoc b/310_Geolocation/34_Geo_distance.asciidoc new file mode 100644 index 000000000..83656fff6 --- /dev/null +++ b/310_Geolocation/34_Geo_distance.asciidoc @@ -0,0 +1,130 @@ +[[geo-distance]] +=== `geo_distance` filter + +The `geo_distance` filter draws a circle around the specified location and +finds all documents that have a geo-point within that circle: + +[source,json] +--------------------- +GET /attractions/restaurant/_search +{ + "query": { + "filtered": { + "filter": { + "geo_distance": { + "distance": "1km", <1> + "location": { <2> + "lat": 40.715, + "lon": -73.988 + } + } + } + } + } +} +--------------------- +<1> Find all `location` fields within `1km` of the specified point. + See {ref}common-options.html#distance-units[Distance Units] for + a list of the accepted units. +<2> The central point can be specified as a string, an array, or (as in this + example) as an object. See <>. + +A geo-distance calculation is expensive. To optimize performance, +Elasticsearch draws a box around the circle and first uses the less expensive +bounding-box calculation to exclude as many documents as it can. It only runs +the geo-distance calculation on those points that fall within the bounding +box. 
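
The two-step check described above (cheap bounding box first, exact distance only for the survivors) can be sketched in Python. This is only an illustration of the idea, not Elasticsearch's implementation: the function names, the `RESTAURANTS` sample data (taken from the documents indexed earlier), and the mean Earth radius are assumptions of this sketch, and it ignores the poles and the date line.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch

# Sample data: the three restaurants indexed earlier, as (lat, lon).
RESTAURANTS = {
    "Chipotle Mexican Grill": (40.715, -74.011),
    "Pala Pizza": (40.722, -73.989),
    "Mini Munchies Pizza": (40.719, -73.983),
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def bounding_box(lat, lon, radius_km):
    """Return (top, bottom, left, right) of the box enclosing the circle."""
    r = radius_km / EARTH_RADIUS_KM  # angular radius in radians
    dlat = math.degrees(r)
    dlon = math.degrees(math.asin(math.sin(r) / math.cos(math.radians(lat))))
    return lat + dlat, lat - dlat, lon - dlon, lon + dlon


def within_distance(points, lat, lon, radius_km):
    """Names of points within radius_km of (lat, lon)."""
    top, bottom, left, right = bounding_box(lat, lon, radius_km)
    hits = []
    for name, (plat, plon) in points.items():
        # Cheap test: two range comparisons exclude most points.
        if not (bottom <= plat <= top and left <= plon <= right):
            continue
        # Expensive test: only for points that survived the box.
        if haversine_km(lat, lon, plat, plon) <= radius_km:
            hits.append(name)
    return hits


print(within_distance(RESTAURANTS, 40.715, -73.988, 1.0))
```
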
+ +TIP: Do your users really require an accurate circular filter to be applied to +their results? Using a rectangular <> is much +more efficient than geo-distance and will usually serve their purposes just as +well. + +==== Faster geo-distance calculations + +The distance between two points can be calculated using different algorithms, +which trade performance for accuracy. + +`arc`:: + +The slowest but most accurate is the `arc` calculation, which treats the world +as a sphere. Accuracy is still limited because the world isn't really a sphere. + +`plane`:: + +The `plane` calculation, which treats the world as if it were flat, is faster +but less accurate. It is most accurate at the equator and becomes less +accurate towards the poles. + +`sloppy_arc`:: + +So called because it uses the `SloppyMath` Lucene class to trade accuracy for speed, +the `sloppy_arc` calculation uses the +http://en.wikipedia.org/wiki/Haversine_formula[Haversine formula] to calculate +distance. It is four to five times as fast as `arc`, and distances are 99.9% accurate. +This is the default calculation. + +You can specify a different calculation as follows: + +[source,json] +--------------------- +GET /attractions/restaurant/_search +{ + "query": { + "filtered": { + "filter": { + "geo_distance": { + "distance": "1km", + "distance_type": "plane", <1> + "location": { + "lat": 40.715, + "lon": -73.988 + } + } + } + } + } +} +--------------------- +<1> Use the faster but less accurate `plane` calculation. + +TIP: Will your users really care if a restaurant is a few metres outside of +their specified radius? While some geo applications require great accuracy, +less accurate but faster calculations will suit the majority of use cases just +fine. + +[[geo-distance-range]] +==== `geo_distance_range` filter + +The only difference between the `geo_distance` and `geo_distance_range` +filters is that the latter has a doughnut shape and excludes documents within +the central hole. 
+ +Instead of specifying a single `distance` from the centre, you specify a +minimum distance (with `gt` or `gte`) and maximum distance (with `lt` or +`lte`), just like a `range` filter: + +[source,json] +--------------------- +GET /attractions/restaurant/_search +{ + "query": { + "filtered": { + "filter": { + "geo_distance_range": { + "gte": "1km", <1> + "lt": "2km", <1> + "location": { + "lat": 40.715, + "lon": -73.988 + } + } + } + } + } +} +--------------------- +<1> Matches locations which are at least `1km` from the centre, and less than + `2km` from the centre. + + diff --git a/310_Geolocation/36_Caching_geofilters.asciidoc b/310_Geolocation/36_Caching_geofilters.asciidoc new file mode 100644 index 000000000..cd10802b5 --- /dev/null +++ b/310_Geolocation/36_Caching_geofilters.asciidoc @@ -0,0 +1,71 @@ +[[geo-caching]] +=== Caching geo-filters + +The results of geo-filters are not cached by default, for two reasons: + +1. Geo-filters are usually used to find entities that are near to a user's + current location. The problem is that users move, and that no two users + are in exactly the same location. A cached filter would have little + chance of being reused. + +2. Filters are cached as bitsets which represent all documents in a + <>. Imagine that our query excludes all + documents but one in a particular segment. An uncached geo-filter just + needs to check the one remaining document, but a cached geo-filter would + need to check all of the documents in the segment. + +That said, caching can be used to good effect with geo-filters. Imagine that +your index contains restaurants from all over the United States. A user in New +York is not interested in restaurants in San Francisco. We can treat New York +as a ``hot spot'' and draw a big bounding box around the city and neighbouring +areas. + +This `geo_bounding_box` filter can be cached and reused whenever we have a +user within the city limits of New York. 
It will exclude all restaurants +from the rest of the country. We can then use an uncached more specific +`geo_bounding_box` or `geo_distance` filter to narrow the remaining results +down to those which are close to the user. + +[source,json] +--------------------- +GET /attractions/restaurant/_search +{ + "query": { + "filtered": { + "filter": { + "bool": { + "must": [ + { + "geo_bounding_box": { + "type": "indexed", + "_cache": true, <1> + "location": { + "top": 40.8, + "bottom": 40.4, + "left": -74.1, + "right": -73.7 + } + } + }, + { + "geo_distance": { <2> + "distance": "1km", + "location": { + "lat": 40.715, + "lon": -73.988 + } + } + } + ] + } + } + } + } +} +--------------------- +<1> The cached bounding box filter reduces all results down to those in the + greater New York area. +<2> The more costly `geo_distance` filter narrows down the results to those + within 1km of the user. + + diff --git a/310_Geolocation/38_Reducing_memory.asciidoc b/310_Geolocation/38_Reducing_memory.asciidoc new file mode 100644 index 000000000..0a2eabedc --- /dev/null +++ b/310_Geolocation/38_Reducing_memory.asciidoc @@ -0,0 +1,62 @@ +[[geo-memory]] +=== Reducing memory usage + +Each `lat/lon` pair requires 16 bytes of memory, memory which is in short +supply. It needs this much memory in order to provide very accurate results. +But as we have commented before, such exacting precision is seldom required. + +You can reduce the amount of memory that is used by switching to a +`compressed` fielddata format and by specifying how precise you need your geo- +points to be. Even reducing precision to `1mm` reduces memory usage by a +third. A more realistic setting of `3m` reduces usage by 62%, and `1km` saves +a massive 75%! 
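
As a rough worked example of these savings, the following Python sketch applies the percentages quoted above to a hypothetical index of 10 million geo-points. The point count and the helper name `fielddata_bytes` are assumptions made for illustration; only the 16-byte figure and the percentages come from the text.

```python
BYTES_PER_POINT = 16  # uncompressed lat/lon pair, as stated above

# Approximate savings quoted in the text for each precision setting.
SAVINGS = {"1mm": 1 / 3, "3m": 0.62, "1km": 0.75}


def fielddata_bytes(num_points, precision=None):
    """Estimated fielddata size in bytes for num_points geo-points."""
    saving = SAVINGS.get(precision, 0.0)  # no precision: no saving
    return num_points * BYTES_PER_POINT * (1 - saving)


# Hypothetical index size, chosen only to make the numbers concrete.
for precision in (None, "1mm", "3m", "1km"):
    mb = fielddata_bytes(10_000_000, precision) / 1024 ** 2
    print(f"{precision or 'default'}: {mb:.1f} MB")
```

These figures only illustrate the effect of the `precision` setting in the `compressed` fielddata format.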
+ +This setting can be changed on a live index with the `update-mapping` API: + +[source,json] +---------------------------- +POST /attractions/_mapping/restaurant +{ + "location": { + "type": "geo_point", + "fielddata": { + "format": "compressed", + "precision": "1km" <1> + } + } +} +---------------------------- +<1> Each `lat/lon` pair will only require 4 bytes, instead of 16. + +Alternatively, you can avoid using memory for geo-points altogether, either by +using the technique described in <>, or by storing +geo-points as <>. + +[source,json] +---------------------------- +PUT /attractions +{ + "mappings": { + "restaurant": { + "properties": { + "name": { + "type": "string" + }, + "location": { + "type": "geo_point", + "doc_values": true <1> + } + } + } + } +} +---------------------------- +<1> Geo-points will not be loaded into memory, but instead stored on disk. + +Mapping a geo-point to use doc values can only be done when the field is first +created. There is a small performance cost in using doc values instead of +fielddata, but with memory in such short supply, it is often worth doing. + + + + diff --git a/310_Geolocation/40_Geohashes.asciidoc b/310_Geolocation/40_Geohashes.asciidoc new file mode 100644 index 000000000..7a9081750 --- /dev/null +++ b/310_Geolocation/40_Geohashes.asciidoc @@ -0,0 +1,177 @@ +[[geohashes]] +=== Geohashes + +http://en.wikipedia.org/wiki/Geohash[Geohashes] are a way of encoding +`lat/lon` points as strings. The original intention was to have a +URL-friendly way of specifying geolocations, but geohashes have turned out to +be a useful way of indexing geo-points and geo-shapes in databases. + +Geohashes divide the world up into a grid of 32 cells -- 4 rows and 8 columns +-- each represented by a letter or number. The `g` cell covers half of +Greenland, all of Iceland and most of Great Britain. Each cell can be further +divided into another 32 cells, which can be divided into another 32 cells, +and so on.
The `gc` cell covers Ireland and England, `gcp` covers most of +London and part of Southern England, and `gcpuuz94k` is the entrance to +Buckingham Palace, accurate to about 5 metres. + +In other words, the longer the geohash string, the more accurate it is. If +two geohashes share a prefix -- `gcpuux` and `gcpuuz` -- then it implies that +they are near to each other. The longer the shared prefix, the closer they +are. + +That said, two locations that are right next to each other may have completely +different geohashes. For instance, the +http://en.wikipedia.org/wiki/Millennium_Dome[Millennium Dome] in London has +geohash `u10hbp`, because it falls into the `u` cell, the next top-level cell +to the east of the `g` cell. + +Geo-points can index their associated geohashes automatically, but more +importantly, they can also index all geohash *prefixes*. Indexing the location +of the entrance to Buckingham Palace -- latitude `51.501568` and longitude +`-0.141257` -- would index all of the geohashes listed in the table below, +along with the approximate dimensions of each geohash cell: + +[cols="1m,1m,3d",options="header"] +|============================================= +|Geohash |Level| Dimensions +|g |1 | ~ 5,004km x 5,004km +|gc |2 | ~ 1,251km x 625km +|gcp |3 | ~ 156km x 156km +|gcpu |4 | ~ 39km x 19.5km +|gcpuu |5 | ~ 4.9km x 4.9km +|gcpuuz |6 | ~ 1.2km x 0.61km +|gcpuuz9 |7 | ~ 152.8m x 152.8m +|gcpuuz94 |8 | ~ 38.2m x 19.1m +|gcpuuz94k |9 | ~ 4.78m x 4.78m +|gcpuuz94kk |10 | ~ 1.19m x 0.60m +|gcpuuz94kkp |11 | ~ 14.9cm x 14.9cm +|gcpuuz94kkp5 |12 | ~ 3.7cm x 1.8cm +|============================================= + +The {ref}query-dsl-geohash-cell-filter.html[`geohash_cell` filter] can use +these geohash prefixes to find locations near a specified `lat/lon` point. + +[[geohash-mapping]] +==== Mapping geohashes + +The first step is to decide just how much precision you need.
While you could +index all geo-points with the default full 12 levels of precision, do you +really need to be accurate to within a few centimeters? You can save yourself +a lot of space in the index by reducing your precision requirements to +something more realistic, such as `1km`. + +[source,json] +---------------------------- +PUT /attractions +{ + "mappings": { + "restaurant": { + "properties": { + "name": { + "type": "string" + }, + "location": { + "type": "geo_point", + "geohash_prefix": true, <1> + "geohash_precision": "1km" <2> + } + } + } + } +} +---------------------------- +<1> Setting `geohash_prefix` to `true` tells Elasticsearch to index + all geohash prefixes, up to the specified precision. +<2> The precision can be specified as an absolute number, representing the + length of the geohash, or as a distance. A precision of `1km` corresponds + to a geohash of length `7`. + +With this mapping in place, geohash prefixes of lengths 1 to 7 will be indexed, +providing geohashes accurate to about 150 meters. + +[[geohash-cell-filter]] +==== `geohash_cell` filter + +The `geohash_cell` filter simply translates a `lat/lon` location into a +geohash with the specified precision and finds all locations which contain +that geohash -- a very efficient filter indeed. + +[source,json] +---------------------------- +GET /attractions/restaurant/_search +{ + "query": { + "filtered": { + "filter": { + "geohash_cell": { + "location": { + "lat": 40.718, + "lon": -73.983 + }, + "precision": "2km" <1> + } + } + } + } +} +---------------------------- +<1> The `precision` cannot be more precise than that specified in the + `geohash_precision` mapping. + +This filter translates the `lat/lon` point into a geohash of the appropriate +length -- in this example `dr5rsk` -- and looks for all locations that contain +that exact term. + +However, the filter as written above may not return all restaurants within 5km
Remember that a geohash is just a rectangle, and the +point may fall anywhere within that rectangle. If the point happens to fall +near the edge of a geohash cell, then the filter may well exclude any +restaurants in the adjacent cell. + +To fix that, we can tell the filter to include the neighbouring cells, by +setting `neighbors` to `true`: + +[source,json] +---------------------------- +GET /attractions/restaurant/_search +{ + "query": { + "filtered": { + "filter": { + "geohash_cell": { + "location": { + "lat": 40.718, + "lon": -73.983 + }, + "neighbors": true, <1> + "precision": "2km" + } + } + } + } +} +---------------------------- + +<1> This filter will look for the resolved geohash and all of the surrounding + geohashes. + +Clearly, looking for a geohash with precision `2km` plus all the neighbouring +cells results in quite a large search area. This filter is not built for +accuracy, but it is very efficient and can be used as a pre-filtering step +before applying a more accurate geo-filter. + +TIP: Specifying the `precision` as a distance can be misleading. A `precision` +of `2km` is converted to a geohash of length 6, which actually has dimensions +of about 1.2km x 0.6km. You may find it more understandable to specify an +actual length like `5` or `6`. + +The other advantage that this filter has over a `geo_bounding_box` filter is +that it supports multiple locations per field. The `lat_lon` option that we +discussed in <> is very efficient, but only when there +is a single `lat/lon` point per field. + + + + + + diff --git a/310_Geolocation/50_Sorting_by_distance.asciidoc b/310_Geolocation/50_Sorting_by_distance.asciidoc new file mode 100644 index 000000000..0bdab809f --- /dev/null +++ b/310_Geolocation/50_Sorting_by_distance.asciidoc @@ -0,0 +1,105 @@ +[[sorting-by-distance]] +=== Sorting by distance + +Search results can be sorted by distance from a point: + +TIP: While you *can* sort by distance, <> is usually a +better solution.
+ +[source,json] +---------------------------- +GET /attractions/restaurant/_search +{ + "query": { + "filtered": { + "filter": { + "geo_bounding_box": { + "type": "indexed", + "location": { + "top": 40.8, + "bottom": 40.7, + "left": -74, + "right": -73 + } + } + } + } + }, + "sort": [ + { + "_geo_distance": { + "location": { <1> + "lat": 40.715, + "lon": -73.998 + }, + "order": "asc", + "unit": "km", <2> + "distance_type": "plane" <3> + } + } + ] +} +---------------------------- +<1> Calculate the distance between the specified `lat/lon` point and the + geo-point in the `location` field of each document. +<2> Return the distance in `km` in the `sort` keys for each result. +<3> Use the faster but less accurate `plane` calculation. + +You may ask yourself: why do we specify the distance `unit`? For sorting, it +doesn't matter whether we compare distances in miles, kilometres or light +years. The reason is that the actual value used for sorting is returned with +each result, in the `sort` element: + +[source,json] +---------------------------- +... + "hits": [ + { + "_index": "attractions", + "_type": "restaurant", + "_id": "2", + "_score": null, + "_source": { + "name": "New Malaysia", + "location": { + "lat": 40.715, + "lon": -73.997 + } + }, + "sort": [ + 0.08425653647614346 <1> + ] + }, +... +---------------------------- +<1> This restaurant is 0.084km from the location we specified. + +You can set the `unit` to return these values in whatever form makes sense for +your application. + +.Multi-location sorting +**************************** + +Geo-distance sorting can also handle multiple geo-points, both in the document +and in the sort parameters. Use the `sort_mode` to specify whether it should +use the `min`, `max`, or `avg` distance between each combination of locations. +This can be used to return ``friends nearest to my work and home locations''. 
+ +**************************** + +[[scoring-by-distance]] +==== Scoring by distance + +It may be that distance is the only important factor in deciding the order in +which results are returned, but more frequently we need to combine distance +with other factors, such as full text relevance, popularity, and price. + +In these situations we should reach for the +<> which allows us to blend all +of these factors into an overall score. See <> for an +example which uses geo-distance to influence scoring. + +The other drawback of sorting by distance is performance: the distance has to +be calculated for all matching documents. The `function_score` query, on the +other hand, can be executed during the <>, +limiting the number of calculations to just the top _N_ results. diff --git a/310_Geolocation/60_Geo_aggs.asciidoc b/310_Geolocation/60_Geo_aggs.asciidoc new file mode 100644 index 000000000..470c1ed15 --- /dev/null +++ b/310_Geolocation/60_Geo_aggs.asciidoc @@ -0,0 +1,25 @@ +[[geo-aggs]] +=== Geo-aggregations + +While filtering or scoring results by geolocation is useful, it is often more +useful to be able to present information to the user on a map. A search may +return way too many results to be able to display each geo-point individually, +but geo-aggregations can be used to cluster geo-points into more manageable +buckets. + +There are three aggregations which work with fields of type `geo_point`: + +<>:: + + Buckets documents into concentric circles around a central point. + +<>:: + + Buckets documents by geohash cell, for display on a map. + +<>:: + + Returns the `lat/lon` coordinates of a bounding box that would + encompass all of the geo-points. This is useful for choosing + the correct zoom level when displaying a map. 
+ diff --git a/310_Geolocation/62_Geo_distance_agg.asciidoc b/310_Geolocation/62_Geo_distance_agg.asciidoc new file mode 100644 index 000000000..b73a861fb --- /dev/null +++ b/310_Geolocation/62_Geo_distance_agg.asciidoc @@ -0,0 +1,117 @@ +[[geo-distance-agg]] +=== `geo_distance` aggregation + +The `geo_distance` agg is useful for the types of searches where a user wants +to find ``all pizza restaurants within 1km of me''. The search results +should, indeed, be limited to the 1km radius specified by the user, but we can +add: ``Another result found within 2km'': + +[source,json] +---------------------------- +GET /attractions/restaurant/_search +{ + "query": { + "filtered": { + "query": { + "match": { <1> + "name": "pizza" + } + }, + "filter": { + "geo_bounding_box": { + "location": { <2> + "top": 40.8, + "bottom": 40.4, + "left": -74.1, + "right": -73.7 + } + } + } + } + }, + "aggs": { + "per_ring": { + "geo_distance": { <3> + "field": "location", + "unit": "km", + "origin": { + "lat": 40.712, + "lon": -73.988 + }, + "ranges": [ + { "from": 0, "to": 1 }, + { "from": 1, "to": 2 } + ] + } + } + }, + "post_filter": { <4> + "geo_distance": { + "distance": "1km", + "location": { + "lat": 40.712, + "lon": -73.988 + } + } + } +} +---------------------------- +<1> The main query looks for restaurants with `pizza` in the name. +<2> The bounding box filters these results down to just those in + the greater New York area. +<3> The `geo_distance` agg counts the number of results within + 1km of the user, and between 1km and 2km from the user. +<4> Finally, the `post_filter` reduces the search results to just + those restaurants within 1km of the user. 
+ +The response from the above request is as follows: + +[source,json] +---------------------------- +"hits": { + "total": 1, + "max_score": 0.15342641, + "hits": [ <1> + { + "_index": "attractions", + "_type": "restaurant", + "_id": "3", + "_score": 0.15342641, + "_source": { + "name": "Mini Munchies Pizza", + "location": [ + -73.983, + 40.719 + ] + } + } + ] +}, +"aggregations": { + "per_ring": { <2> + "buckets": [ + { + "key": "*-1.0", + "from": 0, + "to": 1, + "doc_count": 1 + }, + { + "key": "1.0-2.0", + "from": 1, + "to": 2, + "doc_count": 1 + } + ] + } +} +---------------------------- +<1> The `post_filter` has reduced the search hits to just the single + pizza restaurant within 1km of the user. +<2> The aggregation includes the search result plus the other pizza + restaurant within 2km of the user. + +In this example, we have just counted the number of restaurants which fall +into each concentric ring. Of course, we could nest sub-aggregations under +the `per_ring` aggregation to calculate the average price per ring, the +maximum popularity, etc. diff --git a/310_Geolocation/64_Geohash_grid_agg.asciidoc b/310_Geolocation/64_Geohash_grid_agg.asciidoc new file mode 100644 index 000000000..5fe498b74 --- /dev/null +++ b/310_Geolocation/64_Geohash_grid_agg.asciidoc @@ -0,0 +1,90 @@ +[[geohash-grid-agg]] +=== `geohash_grid` aggregation + +The number of results returned by a query may be far too many to display each +geo-point individually on a map. The `geohash_grid` aggregation buckets nearby +geo-points together by calculating the geohash for each point, at the level of +precision that you define. + +The result is a grid of cells -- one cell per geohash -- which can be +displayed on a map. By changing the precision of the geohash, you can +summarise information across the whole world, by country, or by city block. + +The aggregation is _sparse_ -- it only returns cells that contain documents. 
+If your geohashes are too precise and too many buckets are generated, it will +return, by default, the 10,000 most populous cells -- those containing the +most documents. However, it still needs to generate *all* of the buckets in +order to figure out which are the most populous 10,000. You need to control +the number of buckets generated by: + +1. limiting the result with a `geo_bounding_box` filter. +2. choosing an appropriate `precision` for the size of your bounding box. + +[source,json] +---------------------------- +GET /attractions/restaurant/_search?search_type=count +{ + "query": { + "filtered": { + "filter": { + "geo_bounding_box": { + "location": { <1> + "top": 40.8, + "bottom": 40.4, + "left": -74.1, + "right": -73.7 + } + } + } + } + }, + "aggs": { + "new_york": { + "geohash_grid": { <2> + "field": "location", + "precision": 5 + } + } + } +} +---------------------------- +<1> The bounding box limits the scope of the search to the greater New York area. +<2> Geohashes of precision `5` are approximately 5km x 5km. + +Geohashes with precision `5` measure about 25km^2^ each, so 10,000 cells at +this precision would cover 250,000km^2^. The bounding box that we specified +measures approximately 44km x 33km, or about 1,452km^2^, so we are well within +safe limits -- we definitely won't create too many buckets in memory. + +The response from the above request looks like this: + +[source,json] +---------------------------- +... +"aggregations": { + "new_york": { + "buckets": [ <1> + { + "key": "dr5rs", + "doc_count": 2 + }, + { + "key": "dr5re", + "doc_count": 1 + } + ] + } +} +... +---------------------------- +<1> Each bucket contains the geohash as the `key`. + +Again, we didn't specify any sub-aggregations so all we got back was the +document count, but we could have asked for popular restaurant types, average +price, etc. 
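Each bucket `key` in the response is a geohash, which has to be turned back into coordinates before it can be drawn on a map. The following is a minimal sketch of standard geohash decoding in Python. The function name `geohash_bounds` and the shape of its return value are assumptions made for illustration; in practice an existing geohash library would normally do this job.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet


def geohash_bounds(geohash):
    """Return the bounding box of a geohash cell as a dict of degrees."""
    lat = [-90.0, 90.0]
    lon = [-180.0, 180.0]
    is_lon = True  # bits alternate between longitude and latitude, longitude first
    for char in geohash:
        value = BASE32.index(char)
        for shift in range(4, -1, -1):  # each character carries 5 bits
            bit = (value >> shift) & 1
            rng = lon if is_lon else lat
            mid = (rng[0] + rng[1]) / 2
            if bit:
                rng[0] = mid  # keep the upper half of the range
            else:
                rng[1] = mid  # keep the lower half of the range
            is_lon = not is_lon
    return {"top": lat[1], "bottom": lat[0], "left": lon[0], "right": lon[1]}


# "dr5rs" is one of the bucket keys from the response above.
print(geohash_bounds("dr5rs"))
```

The centre of a cell, if a single marker is preferred over a rectangle, is simply the midpoint of `top`/`bottom` and `left`/`right`.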
+ +TIP: In order to plot these buckets on a map, you will need a library that +understands how to convert a geohash into the equivalent bounding box or +central point. A number of libraries exist in JavaScript and other languages +which will perform this conversion for you, but you can also use the +<> to perform a similar job. diff --git a/310_Geolocation/66_Geo_bounds_agg.asciidoc b/310_Geolocation/66_Geo_bounds_agg.asciidoc new file mode 100644 index 000000000..10f2d492b --- /dev/null +++ b/310_Geolocation/66_Geo_bounds_agg.asciidoc @@ -0,0 +1,136 @@ +[[geo-bounds-agg]] +=== `geo_bounds` aggregation + +In our <>, we filtered our results using a +bounding box that covered the greater New York area. However, our results +were all located in downtown Manhattan. When displaying a map for our user, it +makes sense to zoom in to the area of the map that contains the data -- there +is no point in showing lots of empty space. + +The `geo_bounds` aggregation does exactly this -- it calculates the smallest +bounding box that is needed to encapsulate all of the geo-points: + +[source,json] +---------------------------- +GET /attractions/restaurant/_search?search_type=count +{ + "query": { + "filtered": { + "filter": { + "geo_bounding_box": { + "location": { + "top": 40.8, + "bottom": 40.7, + "left": -74.1, + "right": -73.9 + } + } + } + } + }, + "aggs": { + "new_york": { + "geohash_grid": { + "field": "location", + "precision": 5 + } + }, + "map_zoom": { <1> + "geo_bounds": { + "field": "location" + } + } + } +} +---------------------------- + +The response now includes a bounding box which we can use to zoom our map: + +[source,json] +---------------------------- +... +"aggregations": { + "map_zoom": { + "bounds": { + "top_left": { + "lat": 40.722, + "lon": -74.011 + }, + "bottom_right": { + "lat": 40.715, + "lon": -73.983 + } + } + }, +... 
+---------------------------- + +In fact, we could even use the `geo_bounds` aggregation inside each geohash +cell, in case the geo-points inside a cell are clustered in just a part of the +cell: + +[source,json] +---------------------------- +GET /attractions/restaurant/_search?search_type=count +{ + "query": { + "filtered": { + "filter": { + "geo_bounding_box": { + "location": { + "top": 40.8, + "bottom": 40.7, + "left": -74.1, + "right": -73.9 + } + } + } + } + }, + "aggs": { + "new_york": { + "geohash_grid": { + "field": "location", + "precision": 5 + }, + "aggs": { + "cell": { <1> + "geo_bounds": { + "field": "location" + } + } + } + } + } +} +---------------------------- +<1> The `cell` sub-aggregation is calculated for every geohash cell. + +Now the points in each cell have a bounding box: + +[source,json] +---------------------------- +... +"aggregations": { + "new_york": { + "buckets": [ + { + "key": "dr5rs", + "doc_count": 2, + "cell": { + "bounds": { + "top_left": { + "lat": 40.722, + "lon": -73.989 + }, + "bottom_right": { + "lat": 40.719, + "lon": -73.983 + } + } + } + }, +... +---------------------------- + + diff --git a/310_Geolocation/70_Geoshapes.asciidoc b/310_Geolocation/70_Geoshapes.asciidoc new file mode 100644 index 000000000..c793e592d --- /dev/null +++ b/310_Geolocation/70_Geoshapes.asciidoc @@ -0,0 +1,47 @@ +[[geo-shapes]] +=== Geo-shapes + +Geo-shapes use a completely different approach to geo-points. A circle on a +computer screen does not consist of a perfect continuous line. Instead it is +drawn by colouring adjacent pixels as an approximation of a circle. Geo-shapes +work in much the same way. + +Complex shapes -- points, lines, polygons, multi-polygons, polygons with +holes, etc -- are ``painted'' onto a grid of geohash cells, and the shape is +converted into a list of the geohashes of all the cells that it touches. 
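The idea of painting a shape onto geohash cells can be illustrated with a rough Python sketch. It is not how Elasticsearch indexes shapes internally; it simply samples points along a straight segment and collects the geohash of each sample. The helper names `geohash_encode` and `paint_line`, and the `samples` parameter, are hypothetical.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet


def geohash_encode(lat, lon, precision):
    """Encode a lat/lon point as a geohash of the given length."""
    lat_rng = [-90.0, 90.0]
    lon_rng = [-180.0, 180.0]
    chars = []
    value = 0
    bits = 0
    is_lon = True  # bits alternate between longitude and latitude, longitude first
    while len(chars) < precision:
        rng, coord = (lon_rng, lon) if is_lon else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        if coord >= mid:
            value = (value << 1) | 1
            rng[0] = mid
        else:
            value = value << 1
            rng[1] = mid
        is_lon = not is_lon
        bits += 1
        if bits == 5:  # five bits make one base-32 character
            chars.append(BASE32[value])
            value = 0
            bits = 0
    return "".join(chars)


def paint_line(start, end, precision, samples=100):
    """Approximate the cells touched by a (lat, lon) segment by sampling it."""
    cells = set()
    for i in range(samples + 1):
        t = i / samples
        lat = start[0] + t * (end[0] - start[0])
        lon = start[1] + t * (end[1] - start[1])
        cells.add(geohash_encode(lat, lon, precision))
    return cells


# An illustrative segment in Amsterdam, given as (lat, lon) pairs.
print(sorted(paint_line((52.37356, 4.89218), (52.37276, 4.89205), precision=8)))
```

Because a shorter geohash is always a prefix of the longer geohashes inside it, lowering `precision` yields fewer, larger cells for the same segment.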
+ +.Quad trees +*************************************** + +Actually, there are two types of grids that can be used with geo-shapes: +geohashes, which we have already discussed and which are the default encoding, +and _quad trees_. Quad trees are similar to geohashes except that there are +only four cells at each level, instead of 32. The difference comes down to a +choice of encoding. + +*************************************** + + +All of the geohashes that comprise a shape are indexed as if they were terms. +With this information in the index, it is easy to determine whether one shape +intersects with another, as they will share the same geohash terms. + +That is the extent of what you can do with geo-shapes: determine the +relationship between a query shape and a shape in the index. The `relation` +can be one of: + +`intersects`:: + + The query shape overlaps with the indexed shape. (default) + +`disjoint`:: + + The query shape does *not* overlap at all with the indexed shape. + +`within`:: + + The indexed shape is entirely within the query shape. + +Geo-shapes cannot be used to calculate distance, they cannot be used for +sorting or scoring, and they cannot be used in aggregations. 
+ diff --git a/310_Geolocation/72_Mapping_geo_shapes.asciidoc b/310_Geolocation/72_Mapping_geo_shapes.asciidoc new file mode 100644 index 000000000..6964bb4f5 --- /dev/null +++ b/310_Geolocation/72_Mapping_geo_shapes.asciidoc @@ -0,0 +1,65 @@ +[[mapping-geo-shapes]] +=== Mapping geo-shapes + +Like fields of type `geo_point`, geo-shapes have to be mapped explicitly +before they can be used: + +[source,json] +----------------------- +PUT /attractions +{ + "mappings": { + "landmark": { + "properties": { + "name": { + "type": "string" + }, + "location": { + "type": "geo_shape" + } + } + } + } +} +----------------------- + +There are two important settings that you should consider changing: + +==== `precision` + +The `precision` parameter controls the maximum length of the geohashes that +are generated. It defaults to a precision of `9`, which equates to a +<> with dimensions of about 5m x 5m. That is probably far +more precise than you actually need. + +The lower the precision, the fewer terms that will be indexed and the faster +search will be. But of course, the lower the precision, the less accurate are +your geo-shapes. Consider just how accurate you need your shapes to be -- +even 1 or 2 levels of precision can represent a significant saving. + +You can specify precisions using distances -- e.g. `50m` or `2km` -- but +ultimately these distances are converted to the same levels as described in +<>. + +==== `distance_error_pct` + +When indexing a polygon, the big central continuous part can be represented +cheaply by a short geohash. It is the edges that matter. Edges require much +smaller geohashes to represent them with any accuracy. + +If you're indexing a small landmark, you want the edges to be quite accurate. +It wouldn't be good to have the one monument overlapping with the next. When +indexing an entire country, you don't need quite as much precision. Fifty +meters here or there isn't likely to start any wars. 
+ +The `distance_error_pct` specifies the maximum allowable error based on the +size of the shape. It defaults to `0.025` or 2.5%. In other words, big shapes +like countries are allowed to have fuzzier edges than small shapes, like +monuments. + +The default of `0.025` is a good starting point but the more error that is +allowed, the fewer terms that are required to index a shape. + + + + diff --git a/310_Geolocation/74_Indexing_geo_shapes.asciidoc b/310_Geolocation/74_Indexing_geo_shapes.asciidoc new file mode 100644 index 000000000..c8f2e0a6f --- /dev/null +++ b/310_Geolocation/74_Indexing_geo_shapes.asciidoc @@ -0,0 +1,62 @@ +[[indexing-geo-shapes]] +=== Indexing geo-shapes + +Shapes are represented using http://geojson.org/[GeoJSON], a simple open +standard for encoding two dimensional shapes in JSON. Each shape definition +contains the type of shape -- `point`, `line`, `polygon`, `envelope`, etc. -- +and one or more arrays of longitude/latitude points. + +IMPORTANT: In GeoJSON, coordinates are always written as *longitude* followed +by *latitude*. + +For instance, we can index a polygon representing Dam Square in Amsterdam as +follows: + +[source,json] +----------------------- +PUT /attractions/landmark/dam_square +{ + "name" : "Dam Square, Amsterdam", + "location" : { + "type" : "polygon", <1> + "coordinates" : [[ <2> + [ 4.89218, 52.37356 ], + [ 4.89205, 52.37276 ], + [ 4.89301, 52.37274 ], + [ 4.89392, 52.37250 ], + [ 4.89431, 52.37287 ], + [ 4.89331, 52.37346 ], + [ 4.89305, 52.37326 ], + [ 4.89218, 52.37356 ] + ]] + } +} +----------------------- +<1> The `type` parameter indicates the type of shape that the coordinates + represent. +<2> The list of `lon/lat` points which describe the polygon. + +The excess of square brackets in the example may look confusing, but the +GeoJSON syntax is quite simple: + +1. Each `lon/lat` point is represented as an array: ++ + [lon,lat] + +2. 
A list of points is wrapped in an array to represent a polygon: ++ + [[lon,lat],[lon,lat], ... ] + +3. A shape of type `polygon` can optionally contain several polygons: the + first represents the polygon proper while any subsequent polygons represent + holes in the first: ++ + [ + [[lon,lat],[lon,lat], ... ], # main polygon + [[lon,lat],[lon,lat], ... ], # hole in main polygon + ... + ] + +See the {ref}mapping-geo-shape-type.html[Geo-shape mapping documentation] for +more details about the supported shapes. + diff --git a/310_Geolocation/76_Querying_geo_shapes.asciidoc b/310_Geolocation/76_Querying_geo_shapes.asciidoc new file mode 100644 index 000000000..ea287f058 --- /dev/null +++ b/310_Geolocation/76_Querying_geo_shapes.asciidoc @@ -0,0 +1,77 @@ +[[querying-geo-shapes]] +=== Querying geo-shapes + +The unusual thing about the {ref}query-dsl-geo-shape-query.html[`geo_shape` +query] and {ref}query-dsl-geo-shape-filter.html[`geo_shape` filter] is that +they allow us to query using shapes, rather than just points. + +For instance, if our user steps out of the central train station in Amsterdam, +we could find all landmarks within a 1km radius with a query like this: + +[source,json] +----------------------- +GET /attractions/landmark/_search +{ + "query": { + "geo_shape": { + "location": { <1> + "shape": { <2> + "type": "circle", <3> + "radius": "1km", + "coordinates": [ <4> + 4.89994, + 52.37815 + ] + } + } + } + } +} +----------------------- +<1> The query looks at geo-shapes in the `location` field. +<2> The `shape` key indicates that the shape is specified inline in the query. +<3> The shape is a circle, with a radius of 1km. +<4> This point is situated at the entrance of the central train station in + Amsterdam. + +By default, the query (or filter -- they do the same job) looks for indexed +shapes which intersect with the query shape. 
The `relation` parameter can be +set to `disjoint` to find indexed shapes which don't intersect with the query +shape, or `within` to find indexed shapes that are completely contained by the +query shape. + +For instance, we could find all landmarks in the centre of Amsterdam with this +query: + +[source,json] +----------------------- +GET /attractions/landmark/_search +{ + "query": { + "geo_shape": { + "location": { + "relation": "within", <1> + "shape": { + "type": "polygon", + "coordinates": [[ <2> + [4.88330,52.38617], + [4.87463,52.37254], + [4.87875,52.36369], + [4.88939,52.35850], + [4.89840,52.35755], + [4.91909,52.36217], + [4.92656,52.36594], + [4.93368,52.36615], + [4.93342,52.37275], + [4.92690,52.37632], + [4.88330,52.38617] + ]] + } + } + } + } +} +----------------------- +<1> Only match indexed shapes that are completely within the query shape. +<2> This polygon represents the centre of Amsterdam. + diff --git a/310_Geolocation/78_Indexed_geo_shapes.asciidoc b/310_Geolocation/78_Indexed_geo_shapes.asciidoc new file mode 100644 index 000000000..e32373990 --- /dev/null +++ b/310_Geolocation/78_Indexed_geo_shapes.asciidoc @@ -0,0 +1,102 @@ +[[indexed-geo-shapes]] +=== Querying with indexed shapes + +With shapes that are often used in queries, it can be more convenient to store +them in the index and to refer to them by name in the query. Take our example +of central Amsterdam in the previous example. We could store it as a document +of type `neighbourhood`. 
+ +First, we set up the mapping in the same way as we did for `landmark`: + +[source,json] +----------------------- +PUT /attractions/_mapping/neighbourhood +{ + "properties": { + "name": { + "type": "string" + }, + "location": { + "type": "geo_shape" + } + } +} +----------------------- + +Then we can index a shape for central Amsterdam: + +[source,json] +----------------------- +PUT /attractions/neighbourhood/central_amsterdam +{ + "name" : "Central Amsterdam", + "location" : { + "type" : "polygon", + "coordinates" : [[ + [4.88330,52.38617], + [4.87463,52.37254], + [4.87875,52.36369], + [4.88939,52.35850], + [4.89840,52.35755], + [4.91909,52.36217], + [4.92656,52.36594], + [4.93368,52.36615], + [4.93342,52.37275], + [4.92690,52.37632], + [4.88330,52.38617] + ]] + } +} +----------------------- + +Once indexed, we can refer to this shape by `index`, `type`, and `id` in the +query itself: + +[source,json] +----------------------- +GET /attractions/landmark/_search +{ + "query": { + "geo_shape": { + "location": { + "relation": "within", + "indexed_shape": { <1> + "index": "attractions", + "type": "neighbourhood", + "id": "central_amsterdam", + "path": "location" + } + } + } + } +} +----------------------- +<1> By specifying `indexed_shape` instead of `shape`, Elasticsearch knows that + it needs to retrieve the query shape from the specified document and + `path`. + +There is nothing special about the shape for central Amsterdam. We could +equally use our existing shape for Dam Square in queries. 
This query finds +neighbourhoods which intersect with Dam Square: + +[source,json] +----------------------- +GET /attractions/neighbourhood/_search +{ + "query": { + "geo_shape": { + "location": { + "indexed_shape": { + "index": "attractions", + "type": "landmark", + "id": "dam_square", + "path": "location" + } + } + } + } +} +----------------------- + + + diff --git a/310_Geolocation/80_Caching_geo_shapes.asciidoc b/310_Geolocation/80_Caching_geo_shapes.asciidoc new file mode 100644 index 000000000..d7f4cd048 --- /dev/null +++ b/310_Geolocation/80_Caching_geo_shapes.asciidoc @@ -0,0 +1,38 @@ +[[geo-shape-caching]] +=== Geo-shape filters and caching + +The `geo_shape` query and filter perform the same function. The query simply +acts as a filter -- any matching documents receive a relevance `_score` of +`1`. Query results cannot be cached, but filter results can be. + +The results are not cached by default. Just as with geo-points, any +change in the coordinates in a shape are likely to produce a different set of +geohashes, so there is little point in caching filter results. That said, if +you filter using the same shapes repeatedly, it can be worth caching the +results, by setting `_cache` to `true`: + +[source,json] +----------------------- +GET /attractions/neighbourhood/_search +{ + "query": { + "filtered": { + "filter": { + "geo_shape": { + "_cache": true, <1> + "location": { + "indexed_shape": { + "index": "attractions", + "type": "landmark", + "id": "dam_square", + "path": "location" + } + } + } + } + } + } +} +----------------------- +<1> The results of this `geo_shape` filter will be cached. +