Commit c7d9309: update orientation

daniel committed Jan 15, 2016 (1 parent: 79c077d)
Showing 1 changed file with 193 additions and 28 deletions: orientation/orientation.ipynb
@@ -19,9 +19,10 @@
"- [What's in here?](#whats-in-here)\n",
" - [`Corpus()`](#corpus)\n",
" - [`interrogate()` method](#interrogate-method)\n",
" - [`concordance()` method](#concordance-method)\n",
" - [`Interrogation()`](#interrogation)\n",
" - [`edit() method`](#edit-method)\n",
" - [`plot() method`](#plot-method)\n",
" - [`edit()` method](#edit-method)\n",
" - [`plot()` method](#plot-method)\n",
" - [Functions, lists, etc.](#functions-lists-etc)\n",
"- [Installation](#installation)\n",
" - [By downloading the repository](#by-downloading-the-repository)\n",
@@ -32,12 +33,16 @@
"- [More detailed examples](#more-detailed-examples)\n",
" - [Building corpora](#building-corpora)\n",
" - [Speaker IDs](#speaker-ids)\n",
" - [Getting general stats](#getting-general-stats)\n",
" - [Concordancing](#concordancing)\n",
" - [Systemic functional stuff](#systemic-functional-stuff)\n",
" - [Keywording](#keywording)\n",
" - [Plotting keywords](#plotting-keywords)\n",
" - [Traditional reference corpora](#traditional-reference-corpora)\n",
" - [Parallel processing](#parallel-processing)\n",
" - [Multiple corpora](#multiple-corpora)\n",
" - [Multiple speakers](#multiple-speakers)\n",
" - [Multiple queries](#multiple-queries)\n",
" - [More complex queries and plots](#more-complex-queries-and-plots)\n",
" - [Visualisation options](#visualisation-options)\n",
"- [More information](#more-information)\n",
@@ -76,8 +81,8 @@
"\n",
"| Attribute | Purpose |\n",
"|-----------|---------|\n",
"| `corpus.subcorpora` | list of subcorpus objects |\n",
"| `corpus.files` | list of corpus file objects |\n",
"| `corpus.subcorpora` | list of subcorpus objects with indexing/slicing methods |\n",
"| `corpus.files` | list of corpus file objects with indexing/slicing methods |\n",
"| `corpus.structure` | `dict` containing subcorpora and their files |\n",
"| `corpus.features` | Where feature counting will be stored, `None` initially |\n",
"\n",
@@ -97,14 +102,22 @@
"\n",
"* Use [Tregex](http://nlp.stanford.edu/~manning/courses/ling289/Tregex.html) or regular expressions to search parse trees, dependencies or plain text for complex lexicogrammatical phenomena\n",
"* Search for, exclude and show word, lemma, POS tag, semantic role, governor, dependent, index (etc) of a token matching a regular expression or wordlist\n",
"* Return words or phrases, POS/group/phrase tags, raw counts, or all three.\n",
"* N-gramming options\n",
"* N-gramming\n",
"* Two-way UK-US spelling conversion, and the ability to add words manually\n",
"* Output Pandas DataFrames that can be easily edited and visualised\n",
"* Use parallel processing to search for a number of patterns, or search for the same pattern in multiple corpora\n",
"* Restrict searches to particular speakers in a corpus\n",
"* Quickly save to and load from disk with `save()` and `load()`\n",
"\n",
"The code below demonstrates the complex kinds of queries that can be handled by the `interrogate()` (and `concordance()`) methods:"
"<a name=\"concordance-method\"></a>\n",
"#### `concordance()` method\n",
"\n",
"* Equivalent API to `interrogate()`, but return DataFrame of concordance lines\n",
"* Return any combination and order of words, lemmas, indices, functions, or POS tags\n",
"* Editable and saveable\n",
"* Output to LaTeX, CSV or string with `format()`\n",
"\n",
"The code below demonstrates the complex kinds of queries that can be handled by the `interrogate()` and `concordance()` methods:"
]
},
{
@@ -113,17 +126,27 @@
"metadata": {},
"outputs": [],
"source": [
"# select parsed corpus\n",
">>> corpus = Corpus('data/postcounts-parsed')\n",
"\n",
"# import process type lists and closed class wordlists\n",
">>> from dictionaries.process_types import processes\n",
">>> from dictionaries.wordlists import wordlists\n",
"\n",
"# match tokens with governor that is in relational process wordlist, \n",
"# and whose function is `nsubj(pass)` or `csubj(pass)`:\n",
">>> criteria = {'g': processes.relational, 'f': r'^.subj'}\n",
"\n",
"# exclude tokens whose part-of-speech is verbal, \n",
"# or whose word is in a list of pronouns\n",
">>> exc = {'p': r'^V', 'w': wordlists.pronouns}\n",
"# return slash delimited function/lemma\n",
">>> data = corpus.interrogate(criteria, exclude = exc, show = ['f', 'l'])"
"\n",
"# interrogate, returning slash-delimited function/lemma\n",
">>> data = corpus.interrogate(criteria, exclude = exc, show = ['f', 'l'])\n",
">>> lines = corpus.concordance(criteria, exclude = exc, show = ['f', 'l'])\n",
"\n",
"# show results\n",
">>> print data, lines.format(n = 10, window = 40, columns = ['l', 'm', 'r'])"
]
},
{
@@ -144,7 +167,18 @@
"02 233 147 88 70 70 \n",
"03 250 160 95 80 67 \n",
"04 247 205 88 93 71 \n",
"05 275 193 68 75 61 "
"05 275 193 68 75 61 \n",
"\n",
"0 nk nsubj/it cop/be ccomp/sad advmod/when nsubj/person aux/do neg/not advcl/look ./at prep_at/w\n",
"1 /my dobj/Fluoxetine advmod/now mark/that nsubj/spring ccomp/be advmod/here ./, ./but nsubj/I a\n",
"2 y mark/because expl/there advcl/be det/a nsubj/woman ./across det/the prep_across/hall ./from\n",
"3 num/114 ccomp/pound ./, mark/so det/any nsubj/med nsubj/I rcmod/take aux/can advcl/have de\n",
"4 nsubj/Kat ./, root/be nsubj/you dep/taper ./off ./\n",
"5 /to xcomp/explain prep_from/what det/the nsubj/mark ./on poss/my prep_on/arm ./, conj_and/ne\n",
"6 det/the amod/first ./and conj_and/third nsubj/hospital nsubj/I rcmod/be advmod/at root/have num\n",
"7 e dobj/tv mark/while det/the amod/second nsubj/hospital nsubj/I cop/be rcmod/IP prep/at pcomp/in\n",
"8 nsubj/Ben ./, mark/if nsubj/you cop/be advcl/unhap\n",
"9 h ./of prep_of/sleep advmod/when det/the nsubj/reality advcl/be ./, nsubj/everyone ccomp/need n\n"
]
},
{
@@ -174,7 +208,7 @@
"These methods have been monkey-patched to Pandas' DataFrame and Series objects, as well.\n",
"\n",
"<a name=\"edit-method\"></a>\n",
"#### `edit() method`\n",
"#### `edit()` method\n",
"\n",
"* Remove, keep or merge interrogation results or subcorpora using indices, words or regular expressions (see below)\n",
"* Sort results by name or total frequency\n",
@@ -189,7 +223,7 @@
"* Plot more advanced kinds of relative frequency: for example, find all proper nouns that are subjects of clauses, and plot each word as a percentage of all instances of that word in the corpus (see below)\n",
"\n",
"<a name=\"plot-method\"></a>\n",
"#### `plot() method`\n",
"#### `plot()` method\n",
"\n",
"* Plot using *Matplotlib*\n",
"* Interactive plots (hover-over text, interactive legends) using *mpld3* (examples in the [*Risk Semantics* notebook](https://github.com/interrogator/risk/blob/master/risk.ipynb))\n",
@@ -390,8 +424,8 @@
"# parse it, return the new parsed corpus object\n",
">>> corpus = unparsed.parse()\n",
"\n",
"# search nyt for modal auxiliaries:\n",
">>> interroplot(corpus, r'MD')"
"# search corpus for modal auxiliaries:\n",
">>> corpus.interroplot('MD')"
]
},
{
@@ -457,15 +491,18 @@
"metadata": {},
"outputs": [],
"source": [
">>> corpus = unparsed.parse(parse = True, tokenise = True,\n",
"... corenlppath = 'Downloads/corenlp', nltk_data_path = 'Downloads/nltk_data')"
"# to parse, you can set a path to corenlp\n",
">>> corpus = unparsed.parse(corenlppath = 'Downloads/corenlp')\n",
"\n",
"# to tokenise, turn parsing off, and point to nltk:\n",
"# >>> corpus = unparsed.parse(parse = False, tokenise = True, nltk_data_path = 'Downloads/nltk_data')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"which creates the parsed corpora, and returns `Corpus()` objects representing them. You can also optionally pass in a string of annotators:"
"which creates the parsed/tokenised corpora, and returns `Corpus()` objects representing them. You can also optionally pass in a string of annotators:"
]
},
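{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# a sketch only: the exact keyword for the annotator string is an assumption\n",
">>> corpus = unparsed.parse(operations = 'tokenize,ssplit,pos,lemma,parse')"
]
},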
{
@@ -545,6 +582,58 @@
"source": [
"This makes it possible to not only investigate individual speakers, but to form an understanding of the overall tenor/tone of the text as well: *Who does most of the talking? Who is asking the questions? Who issues commands?*\n",
"\n",
"<a name=\"getting-general-stats\"></a>\n",
"### Getting general stats\n",
"\n",
"Once you have a parsed `Corpus()` object, you can use `corpus.get_stats()` to fill `corpus.features` with data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
">>> corpus = Corpus('data/sessions-parsed')\n",
">>> corpus.get_stats()\n",
">>> corpus.features"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Output:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
" Characters Tokens Words Closed class words Open class words Clauses Sentences Unmodalised declarative Mental processes Relational processes Interrogative Passives Verbal processes Modalised declarative Open interrogative Imperative Closed interrogative \n",
"01 26873 8513 7308 4809 3704 2212 577 280 156 98 76 35 39 26 8 2 3 \n",
"02 25844 7933 6920 4313 3620 2270 266 130 195 109 29 19 35 11 5 1 3 \n",
"03 18376 5683 4877 3067 2616 1640 330 174 132 68 30 40 29 8 12 6 1 \n",
"04 20066 6354 5366 3587 2767 1775 319 174 176 83 33 30 20 9 9 4 1 \n",
"05 23461 7627 6217 4400 3227 1978 479 245 154 93 45 51 28 20 5 3 1 \n",
"06 19164 6777 5200 4151 2626 1684 298 111 165 83 43 56 14 10 6 6 2 \n",
"07 22349 7039 5951 4012 3027 1947 343 183 195 82 29 30 38 12 5 5 0 \n",
"08 26494 8760 7124 4960 3800 2379 545 263 170 87 66 36 32 10 6 5 4 \n",
"09 23073 7747 6193 4524 3223 2056 310 149 164 88 21 26 22 10 5 3 0 \n",
"10 20648 6789 5608 3817 2972 1795 437 265 139 101 34 34 39 18 5 3 2 \n",
"11 25366 8533 6899 4925 3608 2207 457 230 203 116 39 48 47 15 10 4 0 \n",
"12 16976 5742 4624 3274 2468 1567 258 135 183 72 23 43 22 4 3 1 6 \n",
"13 25807 8546 6966 4768 3778 2345 477 257 200 124 45 50 36 15 12 3 2 "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This data can be very helpful when using `edit()` to generate relative frequencies, for example.\n",
"\n",
"<a name=\"concordancing\"></a>\n",
"### Concordancing\n",
"\n",
@@ -557,7 +646,9 @@
"metadata": {},
"outputs": [],
"source": [
">>> subcorpus = corpus.subcorpora['2005']\n",
">>> subcorpus = corpus.subcorpora.c2005\n",
"# can also be accessed as corpus.subcorpora['2005']\n",
"# or corpus.subcorpora[index]\n",
">>> query = r'/JJ.?/ > (NP <<# (/NN.?/ < /\\brisk/))'\n",
"# 't' option for tree searching\n",
">>> lines = subcorpus.concordance('t', query, window = 50, n = 10, random = True)"
@@ -646,10 +737,10 @@
"metadata": {},
"outputs": [],
"source": [
"r_query = r'fr?iends?'\n",
"r_query = r'^fr?iends?$'\n",
"l_query = ['friend', 'friends', 'fiend', 'fiends']\n",
">>> lines = subcorpus.concordance(r_query)\n",
">>> lines = subcorpus.concordance(l_query)"
">>> lines = subcorpus.concordance({'w': r_query})\n",
">>> lines = subcorpus.concordance({'w': l_query})"
]
},
{
@@ -784,6 +875,8 @@
"outputs": [],
"source": [
"# sort with edit()\n",
"# use scipy.linregress to sort by 'increase', 'decrease', 'static', 'turbulent' or 'p'\n",
"# other sort_by options: 'name', 'total', 'infreq'\n",
">>> sayers_no_prp = sayers_no_prp.edit('%', sayers.totals, sort_by = 'increase')\n",
"\n",
"# make an area chart with custom y label\n",
@@ -818,6 +911,9 @@
">>> sayers = sayers.edit(merge_subcorpora = merges)\n",
"\n",
"# now, get relative frequencies for he and she\n",
"# 'self' calculates percentage after merging/removing etc has been performed,\n",
"# so that he and she will sum to 100%.\n",
"# pass in `sayers.totals` to calculate he/she as percentage of all sayers\n",
">>> genders = sayers.edit('%', 'self', just_entries = ['he', 'she'])\n",
"\n",
"# and plot it as a series of pie charts, showing totals on the slices:\n",
@@ -1040,8 +1136,7 @@
"source": [
"# arbitrary list of common/boring words\n",
">>> from dictionaries.stopwords import stopwords\n",
">>> print p.results.ix['2013'].edit('k', 'bnc.p', \n",
"... skip_entries = stopwords).results\n",
">>> print p.results.ix['2013'].edit('k', 'bnc.p', skip_entries = stopwords).results\n",
">>> print p.results.ix['2013'].edit('k', 'bnc.p', calc_all = False).results"
]
},
@@ -1079,13 +1174,83 @@
"<a name=\"parallel-processing\"></a>\n",
"### Parallel processing\n",
"\n",
"`interrogate()` can also parallel-process multiple queries or corpora. Parallel processing will be automatically enabled if you pass in:\n",
"`interrogate()` can also parallel-process multiple corpora, speaker IDs, or queries.\n",
"\n",
"<a name=\"multiple-corpora\"></a>\n",
"#### Multiple corpora\n",
"\n",
"To parallel-process multiple corpora, first, wrap them up as a `Corpora()` object:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
">>> import os\n",
">>> from corpkit.corpus import Corpora\n",
"\n",
"# make a list of Corpus objects, then pass it to Corpora()\n",
">>> corpus_list = [Corpus(os.path.join(datadir, d)) for d in os.listdir(datadir)]\n",
">>> corpora = Corpora(corpus_list)\n",
"\n",
"1. a `list` of paths as `path` (i.e. `['path/to/corpus1', 'path/to/corpus2']`)\n",
"2. a `dict` as `query` (i.e. `{'Noun phrases': r'NP', 'Verb phrases': r'VP'}`)\n",
"3. A `list` of speakers, with speaker-segmented data (i.e. `['LEAR', 'KENT', 'FOOL']`)\n",
"# interrogate by parallel processing, 4 at a time\n",
">>> output = corpora.interrogate('t', r'/NN.?/ < /(?i)^h/', show = 'l', num_proc = 4)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`num_proc` dictates the number of parallel processes to start. If omitted, you'll get as many processes as your machine has cores.\n",
"\n",
"The output of a multiprocessed interrogation will generally be a `dict` with corpus/speaker/query names as keys. The only exception to this is if you use `show = 'count'`, which will concatenate results from each query into a single `Interrogation()` object, using corpus/speaker/query names as column names.\n",
"\n",
"<a name=\"multiple-speakers\"></a>\n",
"#### Multiple speakers\n",
"\n",
"Let's look at different risk processes (e.g. *risk*, *take risk*, *run risk*, *pose risk*, *put at risk*) using constituency parses:"
"Passing in a list of speaker names will also trigger multiprocessing:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
">>> from dictionary.wordlists import wordlists\n",
">>> spkrs = ['MEYER', 'JAY']\n",
">>> each_speaker = corpus.interrogate('w', wordlists.closedclass, just_speakers = spkrs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There is also `just_speakers = 'each'`, which will be automatically expanded to include every speaker name found in the corpus.\n",
"\n",
"<a name=\"multiple-queries\"></a>\n",
"#### Multiple queries\n",
"\n",
"You can also run a number of queries over the same corpus in parallel."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"query = {'Noun phrases': r'NP', 'Verb phrases': r'VP'}`}\n",
"phrases = corpus.interrogate('trees', query, show = 'c')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's try multiprocessing with multiple queries, showing count (i.e. returning a single results DataFrame). We can look at different risk processes (e.g. *risk*, *take risk*, *run risk*, *pose risk*, *put at risk*) using constituency parses:"
]
},
{
