Commit

Added geolocation chapter
clintongormley committed Aug 15, 2014
1 parent 0e79bfa commit 8f1ed17
Showing 22 changed files with 1,606 additions and 28 deletions.
5 changes: 3 additions & 2 deletions 010_Intro/45_Distributed.asciidoc
Original file line number Diff line number Diff line change
@@ -34,8 +34,9 @@ the operations happening automatically under the hood include:
As you read through this book, you'll encounter supplemental chapters about the
distributed nature of Elasticsearch. These chapters will teach you about
how the cluster scales and deals with failover (<<distributed-cluster>>),
handles document storage (<<distributed-docs>>) and executes distributed search
(<<distributed-search>>).
handles document storage (<<distributed-docs>>), executes distributed search
(<<distributed-search>>), and what a shard is and how it works
(<<inside-a-shard>>).

These chapters are not required reading -- you can use Elasticsearch without
understanding these internals -- but they will provide insight that will make
12 changes: 6 additions & 6 deletions 300_Aggregations/110_docvalues.asciidoc
@@ -1,4 +1,4 @@

[[doc-values]]
=== Doc Values

The default data structure for field data is called _paged-bytes_, and it is
@@ -11,13 +11,13 @@ There is an alternative format known as _doc values_. Doc values are special
data structures which are built at index-time and written to disk. They are then
loaded to memory and accessed in place of the standard paged-bytes implementation.

The main benefit of doc values is a lower memory footprint. With the default
paged-bytes format, if you attempt to load more field data into memory than the
available heap space allows, you'll get an `OutOfMemoryError`.

By contrast, doc values can stream from disk efficiently and do not require
processing at query-time (unlike paged-bytes, which must be generated). This
allows you to work with field data that would normally be too large to fit in
memory.

The trade-off is a larger index size and potentially slower field data access.
@@ -35,7 +35,7 @@ tradeoff for truly massive data.
==== Enabling Doc Values

Doc values can be enabled for numeric fields, geo-points, and `not_analyzed` string fields.
They do not currently work with `analyzed` string fields. Doc values are
enabled in the mapping of a particular field, which means that some fields can
use doc values while the rest use the default paged-bytes.

@@ -56,7 +56,7 @@ PUT /fielddata/filtering/_mapping
}
}
----
<1> Doc values can only be enabled on `not_analyzed` string fields, numerics and
geopoints
<2> Doc values are enabled by setting the `"fielddata.format"` parameter to
`doc_values`
64 changes: 44 additions & 20 deletions 310_Geolocation.asciidoc
@@ -1,32 +1,56 @@
[[geoloc]]
== Geolocation (TODO)
:ref: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/

The web is increasingly location aware – users expect to see local results,
or to be able to filter results by their position on a map.
include::310_Geolocation/10_Intro.asciidoc[]

This chapter explains how to use geolocation in Elasticsearch, including
optimization tips.
include::310_Geolocation/20_Geopoints.asciidoc[]

include::310_Geolocation/30_Filter_by_geopoint.asciidoc[]

=== Adding geolocation to your documents
* Mapping the geo-point type
* Indexing documents with geo-points
include::310_Geolocation/32_Bounding_box.asciidoc[]

[[geoloc-filters]]
=== Geolocation-aware search
* geo-distance and geo-distance-range filters
* geo-bounding-box filter
* geo-polygon filter
include::310_Geolocation/34_Geo_distance.asciidoc[]

=== Sorting by distance
.
include::310_Geolocation/36_Caching_geofilters.asciidoc[]

include::310_Geolocation/38_Reducing_memory.asciidoc[]

=== Geo-shapes
.
include::310_Geolocation/40_Geohashes.asciidoc[]

include::310_Geolocation/50_Sorting_by_distance.asciidoc[]

=== Optimizing geo-queries
.
include::310_Geolocation/60_Geo_aggs.asciidoc[]

include::310_Geolocation/62_Geo_distance_agg.asciidoc[]

include::310_Geolocation/64_Geohash_grid_agg.asciidoc[]

include::310_Geolocation/66_Geo_bounds_agg.asciidoc[]

include::310_Geolocation/70_Geoshapes.asciidoc[]

include::310_Geolocation/72_Mapping_geo_shapes.asciidoc[]

include::310_Geolocation/74_Indexing_geo_shapes.asciidoc[]

include::310_Geolocation/76_Querying_geo_shapes.asciidoc[]

include::310_Geolocation/78_Indexed_geo_shapes.asciidoc[]

include::310_Geolocation/80_Caching_geo_shapes.asciidoc[]


////////



geo_shape:
mapping
tree
precision
type of shapes
indexing
indexed shapes
filters
geoshape

////////
33 changes: 33 additions & 0 deletions 310_Geolocation/10_Intro.asciidoc
@@ -0,0 +1,33 @@
[[geoloc]]
== Geolocation

Gone are the days when we wandered around a city with paper maps. Thanks to
smartphones, we now know exactly where we are all of the time, and we expect
websites to use that information. I'm not interested in restaurants in
Greater London -- I want to know about restaurants within a 5-minute walk of my
current location.

But geolocation is only one part of the puzzle. The beauty of Elasticsearch
is that it allows you to combine geolocation with full text search, structured
search, and analytics.

For instance: show me restaurants that mention _vitello tonnato_, are within a
5-minute walk, and are open at 11pm, and rank them by a combination of user
rating, distance, and price. Another example: show me a map of holiday rental
properties available in August throughout the city, and calculate the average
price per zone.

Elasticsearch offers two ways of representing geolocations: latitude-longitude
points using the `geo_point` field type, and complex shapes defined in
http://en.wikipedia.org/wiki/GeoJSON[GeoJSON], using the `geo_shape` field
type.

Geo-points allow you to find points within a certain distance of another
point, to calculate distances between two points for sorting or relevance
scoring, or to aggregate into a grid to display on a map. Geo-shapes, on the
other hand, are used purely for filtering. They can be used to decide whether
two shapes overlap or not, or whether one shape completely contains other
shapes.



76 changes: 76 additions & 0 deletions 310_Geolocation/20_Geopoints.asciidoc
@@ -0,0 +1,76 @@
[[indexing-geopoints]]
=== Indexing geo-points

Geo-points cannot be automatically detected with
<<dynamic-mapping,dynamic mapping>>. Instead, `geo_point` fields should be
mapped explicitly:

[source,json]
-----------------------
PUT /attractions
{
"mappings": {
"restaurant": {
"properties": {
"name": {
"type": "string"
},
"location": {
"type": "geo_point"
}
}
}
}
}
-----------------------

[[lat-lon-formats]]
==== Lat/Lon formats

With the `location` field defined as a `geo_point`, we can proceed to index
documents containing latitude/longitude pairs, which can be formatted as
strings, arrays, or objects:

[source,json]
-----------------------
PUT /attractions/restaurant/1
{
"name": "Chipotle Mexican Grill",
"location": "40.715, -74.011" <1>
}
PUT /attractions/restaurant/2
{
"name": "Pala Pizza",
"location": { <2>
"lat": 40.722,
"lon": -73.989
}
}
PUT /attractions/restaurant/3
{
"name": "Mini Munchies Pizza",
"location": [ -73.983, 40.719 ] <3>
}
-----------------------
<1> A string representation, with `"lat,lon"`.
<2> An object representation with `lat` and `lon` explicitly named.
<3> An array representation with `[lon,lat]`.

[IMPORTANT]
========================
Everybody gets caught at least once: string geo-points are
`"latitude,longitude"`, while array geo-points are `[longitude,latitude]` --
the opposite order!
Originally, both strings and arrays in Elasticsearch used latitude followed by
longitude. However, it was decided early on to switch the order for arrays in
order to conform with GeoJSON.
The result is a bear trap that captures all unsuspecting users on their
journey to full geo-location nirvana.
========================
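
To make the ordering trap concrete, here is a minimal Python sketch that normalizes the three formats shown above into a `(lat, lon)` pair. The function name `parse_geo_point` is hypothetical and is not part of Elasticsearch; the sketch only covers the string, object, and array forms discussed in this section.

```python
def parse_geo_point(value):
    """Normalize a geo-point written as a string, object, or array to (lat, lon).

    Illustrative helper only; it is not an Elasticsearch API.
    """
    if isinstance(value, str):
        # String form is "lat,lon"; float() tolerates surrounding whitespace.
        lat_text, lon_text = value.split(",")
        return float(lat_text), float(lon_text)
    if isinstance(value, dict):
        # Object form names both values explicitly.
        return float(value["lat"]), float(value["lon"])
    if isinstance(value, (list, tuple)):
        # Array form is [lon, lat]: the opposite order to the string form.
        lon, lat = value
        return float(lat), float(lon)
    raise TypeError("unsupported geo-point format: %r" % (value,))


chipotle = parse_geo_point("40.715, -74.011")
pala = parse_geo_point({"lat": 40.722, "lon": -73.989})
mini_munchies = parse_geo_point([-73.983, 40.719])
```

All three calls return latitude first, so swapping the values of an array by mistake would silently move a restaurant to a different part of the world.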

44 changes: 44 additions & 0 deletions 310_Geolocation/30_Filter_by_geopoint.asciidoc
@@ -0,0 +1,44 @@
[[filter-by-geopoint]]
=== Filtering by geo-point

Four geo-filters can be used to include or exclude documents by
geo-location:

<<geo-bounding-box,`geo_bounding_box`>>::

Find geo-points which fall within the specified rectangle.

<<geo-distance,`geo_distance`>>::

Find geo-points within the specified distance of a central point.

<<geo-distance-range,`geo_distance_range`>>::

Find geo-points within a specified minimum and maximum distance from a
central point.

`geo_polygon`::

Find geo-points which fall within the specified polygon. *This filter is
very expensive*. If you find yourself wanting to use it, you should be
looking at <<geo-shapes,geo-shapes>> instead.

All of these filters work in a similar way: the `lat/lon` values are loaded
into memory for *all documents in the index*, not just the documents which
match the query (see <<fielddata-intro>>). Each filter performs a slightly
different calculation to check whether a point falls into the containing area
or not.
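
As a rough illustration of the kind of per-document calculation involved, the following Python sketch implements a distance check and a distance-range check with the haversine formula. It assumes a spherical Earth with a mean radius of 6371 km; the function names are hypothetical, and the calculation Elasticsearch actually performs may differ.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_distance(point, center, max_km):
    """Distance-style check: is `point` within `max_km` of `center`?"""
    return haversine_km(point[0], point[1], center[0], center[1]) <= max_km


def within_distance_range(point, center, min_km, max_km):
    """Distance-range-style check: is `point` between `min_km` and `max_km` away?"""
    distance = haversine_km(point[0], point[1], center[0], center[1])
    return min_km <= distance <= max_km
```

Every call involves several trigonometric functions, which hints at why running such a check against every document is costly.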

[TIP]
============================
Geo-filters are expensive -- they should be used on as few documents as
possible. First remove as many documents as you can with cheaper filters, like
`term` or `range` filters, and apply the geo filters last.
The <<bool-filter,`bool` filter>> will do this for you automatically. First it
applies any bitset-based filters (see <<filter-caching>>) to exclude as many
documents as it can as cheaply as possible. Then it applies the more
expensive geo or script filters to each remaining document in turn.
============================
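
The idea of running cheap filters first can be sketched in a few lines of Python. This is not how the `bool` filter is implemented (the text above describes it as working with bitsets); it only illustrates the ordering. The `cuisine` field, the helper names, and the simplified "near" check are all invented for this example.

```python
def filter_cheap_first(docs, cheap_filters, expensive_filter):
    """Run cheap predicates on every document, the expensive one only on survivors."""
    survivors = [doc for doc in docs if all(check(doc) for check in cheap_filters)]
    return [doc for doc in survivors if expensive_filter(doc)]


# Hypothetical documents; the `cuisine` field is invented for this example.
docs = [
    {"name": "Chipotle Mexican Grill", "cuisine": "mexican", "location": (40.715, -74.011)},
    {"name": "Pala Pizza", "cuisine": "pizza", "location": (40.722, -73.989)},
    {"name": "Mini Munchies Pizza", "cuisine": "pizza", "location": (40.719, -73.983)},
]

geo_calls = {"count": 0}  # counts how often the expensive check runs


def is_pizza(doc):
    # Cheap check, comparable in spirit to a `term` filter.
    return doc["cuisine"] == "pizza"


def near_point(doc):
    # Stand-in for a real geo calculation: a small box around (40.72, -73.99).
    geo_calls["count"] += 1
    lat, lon = doc["location"]
    return abs(lat - 40.72) < 0.005 and abs(lon + 73.99) < 0.005


matches = filter_cheap_first(docs, [is_pizza], near_point)
```

Here the geo check runs on two documents instead of three, because the cheap `cuisine` check has already discarded one of them.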
96 changes: 96 additions & 0 deletions 310_Geolocation/32_Bounding_box.asciidoc
@@ -0,0 +1,96 @@
[[geo-bounding-box]]
=== `geo_bounding_box` filter

This is by far the most performant geo-filter because its calculation is very
simple. You provide it with the `top`, `bottom`, `left`, and `right`
coordinates of a rectangle, and all it does is compare the longitude with the
left and right coordinates, and the latitude with the top and bottom
coordinates.

[source,json]
---------------------
GET /attractions/restaurant/_search
{
"query": {
"filtered": {
"filter": {
"geo_bounding_box": {
"location": { <1>
"top": 40.8,
"bottom": 40.7,
"left": -74.0,
"right": -73.0
}
}
}
}
}
}
---------------------
<1> These coordinates can also be specified as `top_left` and `bottom_right`
pairs, or `bottom_left` and `top_right` pairs.
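
The comparison itself fits in a single expression. The Python sketch below is only an illustration of the check described above, with a hypothetical function name; it assumes the box does not cross the 180° meridian, where `left` would be greater than `right`.

```python
def in_bounding_box(lat, lon, top, bottom, left, right):
    """Return True if the point lies inside the rectangle.

    Latitude is compared with top/bottom, longitude with left/right.
    Assumes left <= right (the box does not cross the 180° meridian).
    """
    return bottom <= lat <= top and left <= lon <= right


box = {"top": 40.8, "bottom": 40.7, "left": -74.0, "right": -73.0}

chipotle_inside = in_bounding_box(40.715, -74.011, **box)  # lon is west of `left`
pala_inside = in_bounding_box(40.722, -73.989, **box)
```

With the three restaurants indexed earlier, only the first one falls outside the box, because its longitude of -74.011 lies just west of the left edge.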

[[optimize-bounding-box]]
==== Optimizing bounding boxes

The `geo_bounding_box` is the one geo-filter which doesn't require all
geo-points to be loaded into memory. Because all it has to do is to check
whether the `lat` and `lon` values fall within the specified ranges, it can
use the inverted index to do a glorified `range` filter.

In order to use this optimization, the `geo_point` field must be mapped to
index the `lat` and `lon` values separately:

[source,json]
-----------------------
PUT /attractions
{
"mappings": {
"restaurant": {
"properties": {
"name": {
"type": "string"
},
"location": {
"type": "geo_point",
"lat_lon": true <1>
}
}
}
}
}
-----------------------
<1> The `location.lat` and `location.lon` fields will be indexed separately.
These fields can be used for searching, but their values cannot be retrieved.

Now, when we run our query, we have to tell Elasticsearch to use the indexed
`lat` and `lon` values:

[source,json]
---------------------
GET /attractions/restaurant/_search
{
"query": {
"filtered": {
"filter": {
"geo_bounding_box": {
"type": "indexed", <1>
"location": {
"top": 40.8,
"bottom": 40.7,
"left": -74.0,
"right": -73.0
}
}
}
}
}
}
---------------------
<1> Setting the `type` parameter to `indexed` (instead of the default
`memory`) tells Elasticsearch to use the inverted index for this filter.

IMPORTANT: While a `geo_point` field can contain multiple geo-points, the
`lat_lon` optimization can only be used on fields which contain a single
geo-point.
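
To give a feel for why separately indexed `lat` and `lon` values help, here is a minimal Python sketch in which each value is kept in a sorted list and the bounding box becomes the intersection of two range lookups. This is a simplification for illustration, not the actual index structure Elasticsearch uses; all names are hypothetical, and each document holds a single geo-point.

```python
from bisect import bisect_left, bisect_right


def build_index(values_by_doc):
    """Return (sorted values, matching doc ids) as a stand-in for an indexed field."""
    pairs = sorted((value, doc_id) for doc_id, value in values_by_doc.items())
    values = [value for value, _ in pairs]
    doc_ids = [doc_id for _, doc_id in pairs]
    return values, doc_ids


def range_lookup(index, low, high):
    """Return the ids of documents whose value lies in [low, high]."""
    values, doc_ids = index
    start = bisect_left(values, low)
    end = bisect_right(values, high)
    return set(doc_ids[start:end])


def bounding_box_indexed(lat_index, lon_index, top, bottom, left, right):
    """A bounding box is two range lookups whose results are intersected."""
    return range_lookup(lat_index, bottom, top) & range_lookup(lon_index, left, right)


lat_index = build_index({1: 40.715, 2: 40.722, 3: 40.719})
lon_index = build_index({1: -74.011, 2: -73.989, 3: -73.983})

hits = bounding_box_indexed(lat_index, lon_index,
                            top=40.8, bottom=40.7, left=-74.0, right=-73.0)
```

Each lookup is a binary search over sorted values rather than a scan of every document's coordinates, which is the essence of treating the filter as a pair of `range` filters.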
