Releases: interrogator/corpkit
Fixed memory problems
There were some issues with large XML file processing that have now been resolved.
Have fun!
Fast, efficient, documented, factored
- Speed increases, especially for feature counting
- Multiprocessing for parsing, very useful when you have access to a big machine
- Improved searching for CoreNLP (looking in all paths), automating download and installation
- Simpler backend implementation of keywords and ngrams
- Better documentation, especially at ReadTheDocs
- Code has been refactored and made largely PEP8 compliant, aiding collaboration
- Can now sort by subcorpus name in the interrogation.edit() method
Very little difference to the API, however!
Major release
In this major release, stability and performance have been improved in dozens of ways:
- Python 2/3 compatibility
- Smart multiprocessing
- Useful documentation, ReadTheDocs site generation
- Much smaller repository size
- Compatible with multiple versions of CoreNLP
- Increased object orientation generally
- Nose tests
- Travis CI integration
- Faster save/load via cPickle
- Countless bugfixes
Levels of abstraction have been added beyond Corpus (Corpora) and Interrogation (Interrodict), with useful methods attached to each. Interrogation and concordancing have become two sides of the same coin, rather than separate tasks, helping to build computational workflows that investigate functional linguistic notions of probabilistic grammar and lexis as delicate grammar.
One emerging part of corpkit is the configurations() method, which automatically analyses the behaviour of a lexical item or items in the corpus. This will be very useful in automated workflows that seek to identify key participants and processes, and then to generate an overview of how each behaves. A little more work is still needed here, however. Also on the horizon are multilingual support and the use of spaCy ... but perhaps some of this needs to wait until I've made peace with my thesis.
corpkit plus ReadTheDocs
The main thing going on now is some decent docstrings, which allow for some decent documentation via http://corpkit.readthedocs.org/en/latest/. Since the last release, things have also gotten more stable. The Corpus class and its subclasses are working really nicely: it's now easy to search particular subcorpora, multiprocess, or treat files as subcorpora. The interrogate method has also improved a lot, and conc has been subsumed within interrogate. All is well.
Classes, methods, improved concordancing
This release marks a transition to a class-and-method structure, rather than a collection of functions. Users now instantiate a Corpus object with methods for parsing, interrogating and concordancing. Interrogations output Interrogation objects, which have methods for editing, plotting, saving, etc.
Another major update is that the concordance() method takes the same core arguments as the interrogate() method. This means that users can quickly check that their interrogation is counting what they think it is.
There have also been some bugfixes, documentation updates, and that kind of usual stuff.
New interrogation options
This release is designed to reflect a change from purpose-built interrogator() search functions to the search and show arguments, which are much more powerful. Users can construct a dict object with one or more dependency criteria to match, and elect to match all criteria or any criterion with searchmode = 'any'/'all'.
>>> criteria = {'lemma': ['think', 'feel', 'want'],
... 'pos': r'^V',
... 'function': 'root'}
>>> r = interrogator(corpus, search = criteria, show = ['word'], searchmode = 'all')
>>> list(r.results.columns)[:5]
might return:
['think', 'thinking', 'want', 'wants', 'feel']
Passing in a longer list for the show argument will set what is given in the output, as well as its order:
>>> r = interrogator(corpus, search = criteria, show = ['f', 'p', 'l'], searchmode = 'all')
>>> list(r.results.columns)[:3]
will produce column names with concatenated function, pos and lemma:
['root/vbp/think', 'root/vbg/thinking', 'root/vb/want']
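As a rough illustration of how such column names could be assembled (a sketch only, not corpkit's actual code), the function below joins the requested token attributes with a slash, in the order given by show. The column_name helper, the ATTRIBUTES mapping and the token dict layout are all hypothetical; only the 'f', 'p' and 'l' codes come from the example above.

```python
# Hypothetical mapping from the short codes used in `show` to token attributes.
ATTRIBUTES = {'w': 'word', 'l': 'lemma', 'p': 'pos', 'f': 'function'}


def column_name(token, show):
    """Join the requested attributes with '/', lowercased, in the order given."""
    return '/'.join(token[ATTRIBUTES[code]].lower() for code in show)


# A single token, represented here as a plain dict (an assumption).
token = {'word': 'think', 'lemma': 'think', 'pos': 'VBP', 'function': 'root'}

name = column_name(token, ['f', 'p', 'l'])
reordered = column_name(token, ['l', 'p'])
print(name)
print(reordered)
```

Changing the order of the list changes the order of the pieces in the name, which is the behaviour described above.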
Another improvement is the exclude argument, which takes the place of blacklist, function_filter and pos_filter. Alongside excludemode = 'any'/'all', it operates just like search, allowing the user to exclude results matching one or more criteria:
>>> excs = {'pos': r'^V', 'word': r'ing$'}
>>> r = interrogator(corpus, search = criteria, show = ['f', 'p', 'l'],
... searchmode = 'all', exclude = excs, excludemode = 'all')
would remove any verbal token ending in ing. Changing excludemode to 'any' would remove all verbs and all words ending in ing.
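The difference between the two modes can be sketched in plain Python. This is a minimal illustration of the any/all logic, not the toolkit's implementation: the matches helper and the token dicts are hypothetical, and only regular-expression criteria are handled (not the word lists that search also accepts).

```python
import re


def matches(token, criteria, mode='all'):
    """Return True if the token meets all criteria, or any one of them."""
    combine = all if mode == 'all' else any
    return combine(re.search(pattern, token[attr])
                   for attr, pattern in criteria.items())


# Toy tokens standing in for parsed corpus data (an assumption).
tokens = [
    {'word': 'thinking', 'pos': 'VBG'},
    {'word': 'wants', 'pos': 'VBZ'},
    {'word': 'building', 'pos': 'NN'},
    {'word': 'idea', 'pos': 'NN'},
]

excs = {'pos': r'^V', 'word': r'ing$'}

# 'all': drop only tokens that are verbs AND end in ing.
kept_all = [t['word'] for t in tokens if not matches(t, excs, 'all')]
# 'any': drop tokens that are verbs OR end in ing.
kept_any = [t['word'] for t in tokens if not matches(t, excs, 'any')]

print(kept_all)
print(kept_any)
```

With excludemode = 'all', only thinking is dropped; with 'any', wants and building go too, leaving just idea.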
The release has various other bugfixes, code cleanup, and some miscellaneous bits and pieces, such as a function for turning results into Pandas Multi Index DataFrames. Full API documentation is forthcoming.
corpkit user interface
This release contains the beta version of an OSX .app version of corpkit.
First proper release
I'm releasing corpkit today as 1.0 mostly so that it can get a DOI and be cited.
The toolkit's interrogator(), editor(), plotter(), conc() and keywords() functions are now in a fairly usable state, though documentation of some options may still be lacking. I also haven't really tested the toolkit on single subcorpora and plain text files, because the main aim is to work with parsed and structured corpora.
A major issue at present is that dependency querying is quite slow, though I think it could be sped up by multiprocessing, and by parsing CoreNLP output with lxml / corenlp_xml. Because Knuth warns against premature optimisation, and because I have a thesis to finish, I'm going to try not to spend too much time on this issue yet.
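To give a feel for what XML-based dependency querying might look like, here is a small sketch. It uses the standard library's xml.etree.ElementTree rather than lxml or corenlp_xml, so that it runs without extra installs. The sample markup is a hand-written approximation of CoreNLP's XML output, and dependents_with_function is a hypothetical helper, not part of corpkit.

```python
import xml.etree.ElementTree as ET

# Hand-written approximation of CoreNLP XML output (an assumption).
SAMPLE = """<root><document><sentences>
<sentence id="1">
<tokens>
<token id="1"><word>I</word><lemma>I</lemma><POS>PRP</POS></token>
<token id="2"><word>think</word><lemma>think</lemma><POS>VBP</POS></token>
</tokens>
<dependencies type="basic-dependencies">
<dep type="root"><governor idx="0">ROOT</governor><dependent idx="2">think</dependent></dep>
<dep type="nsubj"><governor idx="2">think</governor><dependent idx="1">I</dependent></dep>
</dependencies>
</sentence>
</sentences></document></root>"""


def dependents_with_function(xml_text, function):
    """Return the lemma of every dependent whose relation type is `function`."""
    root = ET.fromstring(xml_text)
    found = []
    for sent in root.iter('sentence'):
        # Map token ids to lemmas so that dependents can be looked up by index.
        lemmas = {tok.get('id'): tok.findtext('lemma')
                  for tok in sent.iter('token')}
        for dep in sent.iter('dep'):
            if dep.get('type') == function:
                found.append(lemmas[dep.find('dependent').get('idx')])
    return found


roots = dependents_with_function(SAMPLE, 'root')
subjects = dependents_with_function(SAMPLE, 'nsubj')
print(roots, subjects)
```

Walking the tree like this avoids matching text patterns against the raw file, which is presumably where a speedup would come from, though I haven't measured it.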
This release also marks the start of my transition toward developing:
- Tools for getting data parsed and structured
- Tools for connecting concordance lines to HTML
Once these are done, I'd ideally like to wrap everything up as some kind of web service or application. These future goals, however, score me very few points for my thesis, so I'm not going to be developing them as furiously as I'd like to be.
Be in touch if you have any questions or comments!