
urns and language models #9

Merged: 25 commits merged into master on Sep 22, 2020

Conversation

@balmas (Member) commented on Sep 21, 2020

- Fixes #2
- Adds direction parameter to the API for #4
- Fixes #6
- Beginning of a docker-compose setup for production deployment
- Ports functionality and tests from llt-tokenizer to the new Ancient Greek and Latin language models (incomplete for Latin, see #7)
- Adds support for other available spaCy language models (see the sketch below)
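As a rough illustration of that last point (not code from this PR, and the model names below are hypothetical), dynamic spaCy model support generally comes down to mapping a language code to an installed model package and loading it on demand:

```python
# Hypothetical sketch only -- illustrative, not the repository's code.
import importlib

import spacy

# Minimal stand-in for the PR's much larger LANGUAGE mapping.
SPACY_MODELS = {
    "en": "en_core_web_sm",
    "de": "de_core_news_sm",
}

def load_spacy_model(lang: str):
    """Return a loaded spaCy pipeline for a supported language code."""
    package = SPACY_MODELS.get(lang)
    if package is None:
        raise ValueError(f"No spaCy model configured for language '{lang}'")
    # Importing the package first gives a clearer error when it isn't installed.
    importlib.import_module(package)
    return spacy.load(package)
```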

@balmas changed the title from "urns and language modles" to "urns and language models" on Sep 21, 2020
Comment on lines +25 to +30
direction = fields.Str(
    required=False,
    description=gettext("Text direction."),
    missing="ltr",
    validate=validate.OneOf(["ltr", "rtl"])
)
Member Author (balmas):
It's not really used at the moment, but it's in the API in case we need it in the future.
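For illustration only, a hypothetical client call showing how the parameter could be supplied once it is used; the endpoint, port, and payload shape are assumptions, and only the direction field with its ltr/rtl values comes from the schema excerpt above:

```python
import requests

# Hypothetical endpoint and payload; only "direction" (ltr/rtl) is taken
# from the schema excerpt shown in the review comment.
response = requests.post(
    "http://localhost:5000/tokenize",
    json={"text": "example text to tokenize", "direction": "rtl"},
)
response.raise_for_status()
print(response.json())
```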

@balmas requested a review from irina060981 on September 21, 2020 at 20:58
@irina060981 (Member) left a comment:

Looks ok

@@ -0,0 +1,169 @@
import importlib

LANGUAGE = {
Member:

I think it would be more flexible to place such long lists in separate JSON files.

Member Author (balmas):

probably true :-) I will enter an item for that.
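A minimal sketch of the suggestion, assuming a hypothetical languages.json file next to the module (not the repository's actual layout): the long mapping is plain data, so it can live in JSON and be loaded at import time.

```python
import json
from pathlib import Path

# Hypothetical file location; the LANGUAGE mapping itself stays unchanged,
# it just moves out of the Python source into a JSON file.
_CONFIG_PATH = Path(__file__).parent / "languages.json"

with _CONFIG_PATH.open(encoding="utf-8") as config_file:
    LANGUAGE = json.load(config_file)
```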

@@ -179,8 +189,7 @@ def _segmentize(
                 'index': tokenIndex,
                 'docIndex': token.i,
                 'text': token.text,
-                'punct': token.is_punct,
-                'metadata': {}
+                'punct': token.is_punct
Member:

Did you decide to map the metadata to the other data model, rather than to a segment?

Member Author (balmas):

No, it's just that originally I had some of the metadata in a sub-object. In the end I made all metadata direct properties of the segment. I went back and forth on how to do this, so the code wasn't quite the same everywhere due to my wishy-washy refactoring.
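For clarity, a hypothetical example of the resulting shape; the token fields come from the diff above, while the segment-level names and values are illustrative assumptions:

```python
# Illustrative only: metadata values sit directly on the segment dict
# rather than in a nested 'metadata' sub-object.
segment = {
    "index": 0,
    "lang": "grc",        # hypothetical metadata, now a direct property
    "direction": "ltr",   # hypothetical metadata, now a direct property
    "tokens": [
        {
            "index": 0,    # token fields as shown in the diff
            "docIndex": 0,
            "text": "μῆνιν",
            "punct": False,
        },
    ],
}
```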

@balmas merged commit 3d05d5f into master on Sep 22, 2020
balmas pushed a commit that referenced this pull request Dec 7, 2020
balmas pushed a commit that referenced this pull request Dec 7, 2020

Successfully merging this pull request may close these issues.

- citation is not applied correctly without the first empty line
- add token exception for cts urns and uris