Add classify_language #11

Open
wants to merge 4 commits into
base: master
4 changes: 4 additions & 0 deletions .formatter.exs
@@ -0,0 +1,4 @@
# Used by "mix format"
[
inputs: ["{mix,.formatter}.exs", "{config,lib,test}/**/*.{ex,exs}"]
]
25 changes: 20 additions & 5 deletions README.md
@@ -15,7 +15,9 @@ The package can be installed by adding `expostal` to your list of dependencies i

```elixir
def deps do
[{:expostal, "~> 0.2.0"}]
[
{:expostal, "~> 0.3.0"}
]
end
```

@@ -27,17 +29,22 @@ Depends on [system-wide installation of libpostal](https://github.com/openvenues

## Usage

Parsing an address:
Parsing an address:

```
iex> Expostal.parse_address("615 Rene Levesque Ouest, Montreal, QC, Canada")
%Expostal.Address{
city: "montreal",
country: "canada",
house_number: "615",
road: "rene levesque ouest",
state: "qc"
}

%{city: "montreal", country: "canada", house_number: "615",
road: "rene levesque ouest", state: "qc"}

```

Expanding an address:
Expanding an address:

```
iex> Expostal.expand_address("781 Franklin Ave Crown Hts Brooklyn NY")
@@ -46,6 +53,14 @@ iex> Expostal.expand_address("781 Franklin Ave Crown Hts Brooklyn NY")
"781 franklin avenue crown heights brooklyn ny"]
```

Classifying language:
Returns a tuple containing the probability of the most probable language for the given address and a list of languages.

```
iex> Expostal.classify_language("agricola pl.")
{0.508300861587544, ["en", "fr", "es", "de"]}

```
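
As a rough illustration of how a caller might consume that tuple, the sketch below accepts the first listed language only when the reported probability clears a threshold. `LanguageGuess`, `top_language/2`, and the `0.5` threshold are hypothetical names and values chosen for this example, not part of Expostal. It also assumes the first list entry is the most probable language, which the example output suggests but the README does not state.

```elixir
defmodule LanguageGuess do
  # Hypothetical helper for illustration; not part of Expostal.

  # Pass error tuples through unchanged.
  def top_language({:error, reason}, _threshold), do: {:error, reason}

  # Accept the first listed language when the probability is high enough.
  # Assumes the list is ordered with the most probable language first.
  def top_language({probability, [language | _rest]}, threshold)
      when is_float(probability) and probability >= threshold do
    {:ok, language}
  end

  # Anything else (low probability or an empty list) is treated as uncertain.
  def top_language({_probability, _languages}, _threshold), do: :uncertain
end

# The tuple below is copied from the README example output.
result = {0.508300861587544, ["en", "fr", "es", "de"]}
IO.inspect(LanguageGuess.top_language(result, 0.5))
```

The error clause comes first because `{:error, :argument_error}` is itself a two-element tuple and would otherwise fall through to the catch-all clause.
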
## Documentation

View the docs on [https://hexdocs.pm/expostal](https://hexdocs.pm/expostal), or
2 changes: 1 addition & 1 deletion config/config.exs
@@ -1,6 +1,6 @@
# This file is responsible for configuring your application
# and its dependencies with the aid of the Mix.Config module.
use Mix.Config
# use Mix.Config

# This configuration is loaded before any dependency and is restricted
# to this project. If another project depends on this project, this
52 changes: 52 additions & 0 deletions lib/address.ex
@@ -0,0 +1,52 @@
defmodule Expostal.Address do
@shortdoc "Struct for holding the results returned by libpostal."
@moduledoc """
Struct for holding the results returned by libpostal.
From the upstream README.md:
https://github.com/openvenues/libpostal

* house: venue name e.g. "Brooklyn Academy of Music", and building names e.g. "Empire State Building"
* category: for category queries like "restaurants", etc.
* near: phrases like "in", "near", etc. used after a category phrase to help with parsing queries like "restaurants in Brooklyn"
* house_number: usually refers to the external (street-facing) building number. In some countries this may be a compound, hyphenated number which also includes an apartment number, or a block number (a la Japan), but libpostal will just call it the house_number for simplicity.
* road: street name(s)
* unit: an apartment, unit, office, lot, or other secondary unit designator
* level: expressions indicating a floor number e.g. "3rd Floor", "Ground Floor", etc.
* staircase: numbered/lettered staircase
* entrance: numbered/lettered entrance
* po_box: post office box: typically found in non-physical (mail-only) addresses
* postcode: postal codes used for mail sorting
* suburb: usually an unofficial neighborhood name like "Harlem", "South Bronx", or "Crown Heights"
* city_district: these are usually boroughs or districts within a city that serve some official purpose e.g. "Brooklyn" or "Hackney" or "Bratislava IV"
* city: any human settlement including cities, towns, villages, hamlets, localities, etc.
* island: named islands e.g. "Maui"
* state_district: usually a second-level administrative division or county.
* state: a first-level administrative division. Scotland, Northern Ireland, Wales, and England in the UK are mapped to "state" as well (convention used in OSM, GeoPlanet, etc.)
* country_region: informal subdivision of a country without any political status
* country: sovereign nations and their dependent territories, anything with an ISO-3166 code.
* world_region: currently only used for appending “West Indies” after the country name, a pattern frequently used in the English-speaking Caribbean e.g. “Jamaica, West Indies”
"""

defstruct [
:house,
:category,
:near,
:house_number,
:road,
:unit,
:level,
:staircase,
:entrance,
:po_box,
:postcode,
:suburb,
:city_district,
:city,
:island,
:state_district,
:state,
:country_region,
:country,
:world_region
]
end
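
Every field in the `defstruct` above stays `nil` until it is set. A minimal sketch of how `struct/2` fills such a struct from a plain map follows. `AddressSketch` is a hypothetical, cut-down stand-in for `Expostal.Address` so that the snippet runs without libpostal installed, and the `not_a_field` key is invented to show that `struct/2` silently drops keys the struct does not define.

```elixir
defmodule AddressSketch do
  # Hypothetical, shortened stand-in for Expostal.Address.
  defstruct [:house_number, :road, :city, :state, :country]
end

# A map shaped like a parser result, plus one key the struct does not know.
parsed = %{city: "montreal", house_number: "615", not_a_field: "ignored"}

# struct/2 copies known keys and discards unknown ones.
address = struct(AddressSketch, parsed)

IO.inspect(address.city)
IO.inspect(address.road)
IO.inspect(Map.has_key?(address, :not_a_field))
```

Fields missing from the input map, such as `road` here, remain `nil`, so callers should be prepared to handle `nil` for any component.
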
76 changes: 60 additions & 16 deletions lib/expostal.ex
@@ -3,19 +3,19 @@ defmodule Expostal do
Address parsing and expansion module for OpenVenues' libpostal, which parses addresses.
"""

@compile { :autoload, false }
@on_load { :init, 0 }
@compile {:autoload, false}
@on_load {:init, 0}

app = Mix.Project.config[:app]
app = Mix.Project.config()[:app]

defp init do
path = :filename.join(:code.priv_dir(unquote(app)), 'expostal')
def init do
path = :filename.join(:code.priv_dir(unquote(app)), ~c"expostal")
:ok = :erlang.load_nif(path, 0)
end

@doc """
Loads the large dataset from disk for libpostal and prepares it for future calls.

Loads the large dataset from disk for libpostal and prepares it for future calls.
If you do not run this explicitly, then it will be run by `parse_address/1` or `expand_address/1`
on their first run. This is a very slow process (it can take tens of seconds), so if you value
the responsiveness of your application, you can spawn a separate process to run this bootstrap
@@ -38,39 +38,83 @@ defmodule Expostal do
end

@doc """

Parse given address into a map of address components

## Examples

iex> Expostal.parse_address("615 Rene Levesque Ouest, Montreal, QC, Canada")
%{city: "montreal", country: "canada", house_number: "615",
road: "rene levesque ouest", state: "qc"}
%Expostal.Address{
city: "montreal",
country: "canada",
house_number: "615",
road: "rene levesque ouest",
state: "qc"
}

"""
@spec parse_address(address :: String.t) :: map
def parse_address(address)
def parse_address(_) do
@spec parse_address(address :: String.t()) :: map
def parse_address(address) do
parsed = _parse_address(address)
struct(Expostal.Address, parsed)
end

defp _parse_address(_address) do
case :erlang.phash2(1, 1) do
0 -> raise "Nif not loaded"
1 -> %{}
end
end

@doc """

Expand given address into a list of expansions

## Examples

iex> Expostal.expand_address("781 Franklin Ave Crown Hts Brooklyn NY")
["781 franklin avenue crown heights brooklyn new york",
"781 franklin avenue crown heights brooklyn ny"]
["781 franklin avenue crown heights brooklyn ny",
"781 franklin avenue crown heights brooklyn new york"]

"""
@spec expand_address(address :: String.t) :: [String.t]
def expand_address(address) do
@spec expand_address(address :: String.t()) :: [String.t()]
def expand_address(address), do: _expand_address(address)

defp _expand_address(_address) do
case :erlang.phash2(1, 1) do
0 -> raise "Nif not loaded"
1 -> [address]
1 -> []
end
end

@doc """

Returns a tuple containing the probability of the most probable language
for the given address and a list of languages

## Examples

iex> Expostal.classify_language("agricola pl.")
{0.508300861587544, ["en", "fr", "es", "de"]}

"""
@spec classify_language(address :: String.t()) ::
{float, [String.t()]} | {:error, :argument_error}
def classify_language(address) when is_binary(address) do
try do
_classify_language(address)
rescue
_ ->
{:error, :argument_error}
end
end

defp _classify_language(""), do: {:error, :argument_error}

defp _classify_language(_address) do
case :erlang.phash2(1, 1) do
0 -> raise "Nif not loaded"
1 -> {0.0, []}
end
end
end
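
The `:erlang.phash2(1, 1)` pattern in the stubs above is an idiom sometimes used for NIF placeholders. The call always returns `0`, so the stub raises whenever native code has not replaced it, while the unreachable second branch presumably keeps static analysis tools from concluding that the function never returns. A self-contained sketch, with `NifStubSketch` and `stub/1` as hypothetical names that are not part of this PR:

```elixir
defmodule NifStubSketch do
  # Hypothetical stand-in for a function that a NIF would replace at load time.
  def stub(_address) do
    # phash2(term, 1) hashes into the range 0..0, so the result is always 0.
    case :erlang.phash2(1, 1) do
      0 -> raise "Nif not loaded"
      1 -> []
    end
  end
end

# Without a loaded NIF, calling the stub raises a RuntimeError.
outcome =
  try do
    NifStubSketch.stub("781 Franklin Ave")
  rescue
    error in RuntimeError -> {:not_loaded, error.message}
  end

IO.inspect(outcome)
```

One consequence worth checking in review: because `classify_language/1` rescues every exception, a missing NIF would appear to surface there as `{:error, :argument_error}` rather than as a "Nif not loaded" error.
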
39 changes: 21 additions & 18 deletions mix.exs
@@ -1,14 +1,15 @@
defmodule Mix.Tasks.Compile.Libpostal do
def run(_) do
if match? {:win32, _}, :os.type do
if match?({:win32, _}, :os.type()) do
# libpostal does not support Windows unfortunately.
IO.warn("Windows is not supported.")
exit(1)
else
File.mkdir_p("priv")
{result, _error_code} = System.cmd("make", ["priv/expostal.so"], stderr_to_stdout: true)
IO.binwrite result
IO.binwrite(result)
end

:ok
end
end
@@ -17,20 +18,21 @@ defmodule Expostal.Mixfile do
use Mix.Project

def project do
[app: :expostal,
version: "0.2.0",
elixir: "~> 1.4",
build_embedded: Mix.env == :prod,
start_permanent: Mix.env == :prod,
compilers: [:libpostal, :elixir, :app],
docs: [main: "readme",
extras: ["README.md"]],
deps: deps(),
description: description(),
package: package(),
name: "Expostal",
source_url: "https://github.com/SweetIQ/expostal",
homepage_url: "https://github.com/SweetIQ/expostal"]
[
app: :expostal,
version: "0.3.0",
elixir: "~> 1.13",
build_embedded: Mix.env() == :prod,
start_permanent: Mix.env() == :prod,
compilers: [:libpostal, :elixir, :app],
docs: [main: "readme", extras: ["README.md"]],
deps: deps(),
description: description(),
package: package(),
name: "Expostal",
source_url: "https://github.com/SweetIQ/expostal",
homepage_url: "https://github.com/SweetIQ/expostal"
]
end

# Configuration for the OTP application
@@ -52,8 +54,8 @@ defmodule Expostal.Mixfile do
# Type "mix help deps" for more examples and options
defp deps do
[
{:ex_doc, "~> 0.14", only: :dev, runtime: false},
{:dialyxir, "~> 0.5", only: :dev, runtime: false}
{:ex_doc, "~> 0.30", only: :dev, runtime: false},
{:dialyxir, "~> 1.3", only: :dev, runtime: false}
]
end

@@ -63,6 +65,7 @@ defmodule Expostal.Mixfile do
Expostal parses street addresses and expands address acronyms with high accuracy.
"""
end

defp package do
# These are the default files included in the package
[
14 changes: 10 additions & 4 deletions mix.lock
@@ -1,4 +1,10 @@
%{"dialyxir": {:hex, :dialyxir, "0.5.0", "5bc543f9c28ecd51b99cc1a685a3c2a1a93216990347f259406a910cf048d1d7", [:mix], [], "hexpm"},
"earmark": {:hex, :earmark, "1.2.2", "f718159d6b65068e8daeef709ccddae5f7fdc770707d82e7d126f584cd925b74", [:mix], [], "hexpm"},
"ex_doc": {:hex, :ex_doc, "0.16.1", "b4b8a23602b4ce0e9a5a960a81260d1f7b29635b9652c67e95b0c2f7ccee5e81", [:mix], [{:earmark, "~> 1.1", [hex: :earmark, repo: "hexpm", optional: false]}], "hexpm"},
"libpostal": {:git, "https://github.com/openvenues/libpostal.git", "8dd84b71bad4c70150e60c7bda8071b9bd8902f8", []}}
%{
"dialyxir": {:hex, :dialyxir, "1.3.0", "fd1672f0922b7648ff9ce7b1b26fcf0ef56dda964a459892ad15f6b4410b5284", [:mix], [{:erlex, ">= 0.2.6", [hex: :erlex, repo: "hexpm", optional: false]}], "hexpm", "00b2a4bcd6aa8db9dcb0b38c1225b7277dca9bc370b6438715667071a304696f"},
"earmark_parser": {:hex, :earmark_parser, "1.4.33", "3c3fd9673bb5dcc9edc28dd90f50c87ce506d1f71b70e3de69aa8154bc695d44", [:mix], [], "hexpm", "2d526833729b59b9fdb85785078697c72ac5e5066350663e5be6a1182da61b8f"},
"erlex": {:hex, :erlex, "0.2.6", "c7987d15e899c7a2f34f5420d2a2ea0d659682c06ac607572df55a43753aa12e", [:mix], [], "hexpm", "2ed2e25711feb44d52b17d2780eabf998452f6efda104877a3881c2f8c0c0c75"},
"ex_doc": {:hex, :ex_doc, "0.30.3", "bfca4d340e3b95f2eb26e72e4890da83e2b3a5c5b0e52607333bf5017284b063", [:mix], [{:earmark_parser, "~> 1.4.31", [hex: :earmark_parser, repo: "hexpm", optional: false]}, {:makeup_elixir, "~> 0.14", [hex: :makeup_elixir, repo: "hexpm", optional: false]}, {:makeup_erlang, "~> 0.1", [hex: :makeup_erlang, repo: "hexpm", optional: false]}], "hexpm", "fbc8702046c1d25edf79de376297e608ac78cdc3a29f075484773ad1718918b6"},
"makeup": {:hex, :makeup, "1.1.0", "6b67c8bc2882a6b6a445859952a602afc1a41c2e08379ca057c0f525366fc3ca", [:mix], [{:nimble_parsec, "~> 1.2.2 or ~> 1.3", [hex: :nimble_parsec, repo: "hexpm", optional: false]}], "hexpm", "0a45ed501f4a8897f580eabf99a2e5234ea3e75a4373c8a52824f6e873be57a6"},
"makeup_elixir": {:hex, :makeup_elixir, "0.16.1", "cc9e3ca312f1cfeccc572b37a09980287e243648108384b97ff2b76e505c3555", [:mix], [{:makeup, "~> 1.0", [hex: :makeup, repo: "hexpm", optional: false]}, {:nimble_parsec, "~> 1.2.3 or ~> 1.3", [hex: :nimble_parsec, repo: "hexpm", optional: false]}], "hexpm", "e127a341ad1b209bd80f7bd1620a15693a9908ed780c3b763bccf7d200c767c6"},
"makeup_erlang": {:hex, :makeup_erlang, "0.1.2", "ad87296a092a46e03b7e9b0be7631ddcf64c790fa68a9ef5323b6cbb36affc72", [:mix], [{:makeup, "~> 1.0", [hex: :makeup, repo: "hexpm", optional: false]}], "hexpm", "f3f5a1ca93ce6e092d92b6d9c049bcda58a3b617a8d888f8e7231c85630e8108"},
"nimble_parsec": {:hex, :nimble_parsec, "1.3.1", "2c54013ecf170e249e9291ed0a62e5832f70a476c61da16f6aac6dca0189f2af", [:mix], [], "hexpm", "2682e3c0b2eb58d90c6375fc0cc30bc7be06f365bf72608804fb9cffa5e1b167"},
}