Commit c7d9309: update orientation

daniel committed Jan 15, 2016 (1 parent: 79c077d)
Showing 1 changed file with 193 additions and 28 deletions: orientation/orientation.ipynb
@@ -19,9 +19,10 @@
"- [What's in here?](#whats-in-here)\n",
" - [`Corpus()`](#corpus)\n",
" - [`interrogate()` method](#interrogate-method)\n",
" - [`concordance()` method](#concordance-method)\n",
" - [`Interrogation()`](#interrogation)\n",
" - [`edit() method`](#edit-method)\n",
" - [`plot() method`](#plot-method)\n",
" - [`edit()` method](#edit-method)\n",
" - [`plot()` method](#plot-method)\n",
" - [Functions, lists, etc.](#functions-lists-etc)\n",
"- [Installation](#installation)\n",
" - [By downloading the repository](#by-downloading-the-repository)\n",
@@ -32,12 +33,16 @@
"- [More detailed examples](#more-detailed-examples)\n",
" - [Building corpora](#building-corpora)\n",
" - [Speaker IDs](#speaker-ids)\n",
" - [Getting general stats](#getting-general-stats)\n",
" - [Concordancing](#concordancing)\n",
" - [Systemic functional stuff](#systemic-functional-stuff)\n",
" - [Keywording](#keywording)\n",
" - [Plotting keywords](#plotting-keywords)\n",
" - [Traditional reference corpora](#traditional-reference-corpora)\n",
" - [Parallel processing](#parallel-processing)\n",
" - [Multiple corpora](#multiple-corpora)\n",
" - [Multiple speakers](#multiple-speakers)\n",
" - [Multiple queries](#multiple-queries)\n",
" - [More complex queries and plots](#more-complex-queries-and-plots)\n",
" - [Visualisation options](#visualisation-options)\n",
"- [More information](#more-information)\n",
@@ -76,8 +81,8 @@
"\n",
"| Attribute | Purpose |\n",
"|-----------|---------|\n",
"| `corpus.subcorpora` | list of subcorpus objects |\n",
"| `corpus.files` | list of corpus file objects |\n",
"| `corpus.subcorpora` | list of subcorpus objects with indexing/slicing methods |\n",
"| `corpus.files` | list of corpus file objects with indexing/slicing methods |\n",
"| `corpus.structure` | `dict` containing subcorpora and their files |\n",
"| `corpus.features` | Where feature counting will be stored, `None` initially |\n",
"\n",
@@ -97,14 +102,22 @@
"\n",
"* Use [Tregex](http://nlp.stanford.edu/~manning/courses/ling289/Tregex.html) or regular expressions to search parse trees, dependencies or plain text for complex lexicogrammatical phenomena\n",
"* Search for, exclude and show word, lemma, POS tag, semantic role, governor, dependent, index (etc) of a token matching a regular expression or wordlist\n",
"* Return words or phrases, POS/group/phrase tags, raw counts, or all three.\n",
"* N-gramming options\n",
"* N-gramming\n",
"* Two-way UK-US spelling conversion, and the ability to add words manually\n",
"* Output Pandas DataFrames that can be easily edited and visualised\n",
"* Use parallel processing to search for a number of patterns, or search for the same pattern in multiple corpora\n",
"* Restrict searches to particular speakers in a corpus\n",
"* Quickly save to and load from disk with `save()` and `load()`\n",
"\n",
"The code below demonstrates the complex kinds of queries that can be handled by the `interrogate()` (and `concordance()`) methods:"
"<a name=\"concordance-method\"></a>\n",
"#### `concordance()` method\n",
"\n",
"* Equivalent API to `interrogate()`, but return DataFrame of concordance lines\n",
"* Return any combination and order of words, lemmas, indices, functions, or POS tags\n",
"* Editable and saveable\n",
"* Output to LaTeX, CSV or string with `format()`\n",
"\n",
"The code below demonstrates the complex kinds of queries that can be handled by the `interrogate()` and `concordance()` methods:"
]
},
{
@@ -113,17 +126,27 @@
"metadata": {},
"outputs": [],
"source": [
"# select parsed corpus\n",
">>> corpus = Corpus('data/postcounts-parsed')\n",
"\n",
"# import process type lists and closed class wordlists\n",
">>> from dictionaries.process_types import processes\n",
">>> from dictionaries.wordlists import wordlists\n",
"\n",
"# match tokens with governor that is in relational process wordlist, \n",
"# and whose function is `nsubj(pass)` or `csubj(pass)`:\n",
">>> criteria = {'g': processes.relational, 'f': r'^.subj'}\n",
"\n",
"# exclude tokens whose part-of-speech is verbal, \n",
"# or whose word is in a list of pronouns\n",
">>> exc = {'p': r'^V', 'w': wordlists.pronouns}\n",
"# return slash delimited function/lemma\n",
">>> data = corpus.interrogate(criteria, exclude = exc, show = ['f', 'l'])"
"\n",
"# interrogate, returning slash-delimited function/lemma\n",
">>> data = corpus.interrogate(criteria, exclude = exc, show = ['f', 'l'])\n",
">>> lines = corpus.concordance(criteria, exclude = exc, show = ['f', 'l'])\n",
"\n",
"# show results\n",
">>> print data, lines.format(n = 10, window = 40, columns = ['l', 'm', 'r'])"
]
},
{
@@ -144,7 +167,18 @@
"02 233 147 88 70 70 \n",
"03 250 160 95 80 67 \n",
"04 247 205 88 93 71 \n",
"05 275 193 68 75 61 "
"05 275 193 68 75 61 \n",
"\n",
"0 nk nsubj/it cop/be ccomp/sad advmod/when nsubj/person aux/do neg/not advcl/look ./at prep_at/w\n",
"1 /my dobj/Fluoxetine advmod/now mark/that nsubj/spring ccomp/be advmod/here ./, ./but nsubj/I a\n",
"2 y mark/because expl/there advcl/be det/a nsubj/woman ./across det/the prep_across/hall ./from\n",
"3 num/114 ccomp/pound ./, mark/so det/any nsubj/med nsubj/I rcmod/take aux/can advcl/have de\n",
"4 nsubj/Kat ./, root/be nsubj/you dep/taper ./off ./\n",
"5 /to xcomp/explain prep_from/what det/the nsubj/mark ./on poss/my prep_on/arm ./, conj_and/ne\n",
"6 det/the amod/first ./and conj_and/third nsubj/hospital nsubj/I rcmod/be advmod/at root/have num\n",
"7 e dobj/tv mark/while det/the amod/second nsubj/hospital nsubj/I cop/be rcmod/IP prep/at pcomp/in\n",
"8 nsubj/Ben ./, mark/if nsubj/you cop/be advcl/unhap\n",
"9 h ./of prep_of/sleep advmod/when det/the nsubj/reality advcl/be ./, nsubj/everyone ccomp/need n\n"
]
},
{
@@ -174,7 +208,7 @@
"These methods have been monkey-patched to Pandas' DataFrame and Series objects, as well.\n",
"\n",
"<a name=\"edit-method\"></a>\n",
"#### `edit() method`\n",
"#### `edit()` method\n",
"\n",
"* Remove, keep or merge interrogation results or subcorpora using indices, words or regular expressions (see below)\n",
"* Sort results by name or total frequency\n",
@@ -189,7 +223,7 @@
"* Plot more advanced kinds of relative frequency: for example, find all proper nouns that are subjects of clauses, and plot each word as a percentage of all instances of that word in the corpus (see below)\n",
"\n",
"<a name=\"plot-method\"></a>\n",
"#### `plot() method`\n",
"#### `plot()` method\n",
"\n",
"* Plot using *Matplotlib*\n",
"* Interactive plots (hover-over text, interactive legends) using *mpld3* (examples in the [*Risk Semantics* notebook](https://github.com/interrogator/risk/blob/master/risk.ipynb))\n",
@@ -390,8 +424,8 @@
"# parse it, return the new parsed corpus object\n",
">>> corpus = unparsed.parse()\n",
"\n",
"# search nyt for modal auxiliaries:\n",
">>> interroplot(corpus, r'MD')"
"# search corpus for modal auxiliaries:\n",
">>> corpus.interroplot('MD')"
]
},
{
@@ -457,15 +491,18 @@
"metadata": {},
"outputs": [],
"source": [
">>> corpus = unparsed.parse(parse = True, tokenise = True,\n",
"... corenlppath = 'Downloads/corenlp', nltk_data_path = 'Downloads/nltk_data')"
"# to parse, you can set a path to corenlp\n",
">>> corpus = unparsed.parse(corenlppath = 'Downloads/corenlp')\n",
"\n",
"# to tokenise, turn parsing off, and point to nltk:\n",
"# >>> corpus = unparsed.parse(parse = False, tokenise = True, nltk_data_path = 'Downloads/nltk_data')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"which creates the parsed corpora, and returns `Corpus()` objects representing them. You can also optionally pass in a string of annotators:"
"which creates the parsed/tokenised corpora, and returns `Corpus()` objects representing them. You can also optionally pass in a string of annotators:"
]
},
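{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# a sketch only: the exact keyword for the annotator string is an assumption\n",
">>> corpus = unparsed.parse(operations = 'tokenize,ssplit,pos,lemma,parse')"
]
},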
{
@@ -545,6 +582,58 @@
"source": [
"This makes it possible to not only investigate individual speakers, but to form an understanding of the overall tenor/tone of the text as well: *Who does most of the talking? Who is asking the questions? Who issues commands?*\n",
"\n",
"<a name=\"getting-general-stats\"></a>\n",
"### Getting general stats\n",
"\n",
"Once you have a parsed `Corpus()` object, you can use `corpus.get_stats()` to fill `corpus.features` with data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
">>> corpus = Corpus('data/sessions-parsed')\n",
">>> corpus.get_stats()\n",
">>> corpus.features"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Output:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
" Characters Tokens Words Closed class words Open class words Clauses Sentences Unmodalised declarative Mental processes Relational processes Interrogative Passives Verbal processes Modalised declarative Open interrogative Imperative Closed interrogative \n",
"01 26873 8513 7308 4809 3704 2212 577 280 156 98 76 35 39 26 8 2 3 \n",
"02 25844 7933 6920 4313 3620 2270 266 130 195 109 29 19 35 11 5 1 3 \n",
"03 18376 5683 4877 3067 2616 1640 330 174 132 68 30 40 29 8 12 6 1 \n",
"04 20066 6354 5366 3587 2767 1775 319 174 176 83 33 30 20 9 9 4 1 \n",
"05 23461 7627 6217 4400 3227 1978 479 245 154 93 45 51 28 20 5 3 1 \n",
"06 19164 6777 5200 4151 2626 1684 298 111 165 83 43 56 14 10 6 6 2 \n",
"07 22349 7039 5951 4012 3027 1947 343 183 195 82 29 30 38 12 5 5 0 \n",
"08 26494 8760 7124 4960 3800 2379 545 263 170 87 66 36 32 10 6 5 4 \n",
"09 23073 7747 6193 4524 3223 2056 310 149 164 88 21 26 22 10 5 3 0 \n",
"10 20648 6789 5608 3817 2972 1795 437 265 139 101 34 34 39 18 5 3 2 \n",
"11 25366 8533 6899 4925 3608 2207 457 230 203 116 39 48 47 15 10 4 0 \n",
"12 16976 5742 4624 3274 2468 1567 258 135 183 72 23 43 22 4 3 1 6 \n",
"13 25807 8546 6966 4768 3778 2345 477 257 200 124 45 50 36 15 12 3 2 "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This data can be very helpful when using `edit()` to generate relative frequencies, for example.\n",
"\n",
"<a name=\"concordancing\"></a>\n",
"### Concordancing\n",
"\n",
@@ -557,7 +646,9 @@
"metadata": {},
"outputs": [],
"source": [
">>> subcorpus = corpus.subcorpora['2005']\n",
">>> subcorpus = corpus.subcorpora.c2005\n",
"# can also be accessed as corpus.subcorpora['2005']\n",
"# or corpus.subcorpora[index]\n",
">>> query = r'/JJ.?/ > (NP <<# (/NN.?/ < /\\brisk/))'\n",
"# 't' option for tree searching\n",
">>> lines = subcorpus.concordance('t', query, window = 50, n = 10, random = True)"
@@ -646,10 +737,10 @@
"metadata": {},
"outputs": [],
"source": [
"r_query = r'fr?iends?'\n",
"r_query = r'^fr?iends?$'\n",
"l_query = ['friend', 'friends', 'fiend', 'fiends']\n",
">>> lines = subcorpus.concordance(r_query)\n",
">>> lines = subcorpus.concordance(l_query)"
">>> lines = subcorpus.concordance({'w': r_query})\n",
">>> lines = subcorpus.concordance({'w': l_query})"
]
},
{
@@ -784,6 +875,8 @@
"outputs": [],
"source": [
"# sort with edit()\n",
"# use scipy.linregress to sort by 'increase', 'decrease', 'static', 'turbulent' or 'p'\n",
"# other sort_by options: 'name', 'total', 'infreq'\n",
">>> sayers_no_prp = sayers_no_prp.edit('%', sayers.totals, sort_by = 'increase')\n",
"\n",
"# make an area chart with custom y label\n",
@@ -818,6 +911,9 @@
">>> sayers = sayers.edit(merge_subcorpora = merges)\n",
"\n",
"# now, get relative frequencies for he and she\n",
"# 'self' calculates percentage after merging/removing etc has been performed,\n",
"# so that he and she will sum to 100%.\n",
"# pass in `sayers.totals` to calculate he/she as percentage of all sayers\n",
">>> genders = sayers.edit('%', 'self', just_entries = ['he', 'she'])\n",
"\n",
"# and plot it as a series of pie charts, showing totals on the slices:\n",
@@ -1040,8 +1136,7 @@
"source": [
"# arbitrary list of common/boring words\n",
">>> from dictionaries.stopwords import stopwords\n",
">>> print p.results.ix['2013'].edit('k', 'bnc.p', \n",
"... skip_entries = stopwords).results\n",
">>> print p.results.ix['2013'].edit('k', 'bnc.p', skip_entries = stopwords).results\n",
">>> print p.results.ix['2013'].edit('k', 'bnc.p', calc_all = False).results"
]
},
@@ -1079,13 +1174,83 @@
"<a name=\"parallel-processing\"></a>\n",
"### Parallel processing\n",
"\n",
"`interrogate()` can also parallel-process multiple queries or corpora. Parallel processing will be automatically enabled if you pass in:\n",
"`interrogate()` can also parallel-process multiple corpora, speaker IDs, or queries.\n",
"\n",
"<a name=\"multiple-corpora\"></a>\n",
"#### Multiple corpora\n",
"\n",
"To parallel-process multiple corpora, first, wrap them up as a `Corpora()` object:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
">>> import os\n",
">>> from corpkit.corpus import Corpora\n",
"\n",
"# make a list of Corpus objects, then pass it to Corpora()\n",
">>> corpus_list = [Corpus(os.path.join(datadir, d)) for d in os.listdir(datadir)]\n",
">>> corpora = Corpora(corpus_list)\n",
"\n",
"1. a `list` of paths as `path` (i.e. `['path/to/corpus1', 'path/to/corpus2']`)\n",
"2. a `dict` as `query` (i.e. `{'Noun phrases': r'NP', 'Verb phrases': r'VP'}`)\n",
"3. A `list` of speakers, with speaker-segmented data (i.e. `['LEAR', 'KENT', 'FOOL']`)\n",
"# interrogate by parallel processing, 4 at a time\n",
">>> output = corpora.interrogate('t', r'/NN.?/ < /(?i)^h/', show = 'l', num_proc = 4)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`num_proc` dictates the number of parallel processes to start. If omitted, you'll get as many processes as your machine has cores.\n",
"\n",
"The output of a multiprocessed interrogation will generally be a `dict` with corpus/speaker/query names as keys. The only exception to this is if you use `show = 'count'`, which will concatenate results from each query into a single `Interrogation()` object, using corpus/speaker/query names as column names.\n",
"\n",
"<a name=\"multiple-speakers\"></a>\n",
"#### Multiple speakers\n",
"\n",
"Let's look at different risk processes (e.g. *risk*, *take risk*, *run risk*, *pose risk*, *put at risk*) using constituency parses:"
"Passing in a list of speaker names will also trigger multiprocessing:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
">>> from dictionary.wordlists import wordlists\n",
">>> spkrs = ['MEYER', 'JAY']\n",
">>> each_speaker = corpus.interrogate('w', wordlists.closedclass, just_speakers = spkrs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There is also `just_speakers = 'each'`, which will be automatically expanded to include every speaker name found in the corpus.\n",
"\n",
"<a name=\"multiple-queries\"></a>\n",
"#### Multiple queries\n",
"\n",
"You can also run a number of queries over the same corpus in parallel."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"query = {'Noun phrases': r'NP', 'Verb phrases': r'VP'}`}\n",
"phrases = corpus.interrogate('trees', query, show = 'c')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's try multiprocessing with multiple queries, showing count (i.e. returning a single results DataFrame). We can look at different risk processes (e.g. *risk*, *take risk*, *run risk*, *pose risk*, *put at risk*) using constituency parses:"
]
},
{
