Version 3.1.8
buzz is a linguistics tool for parsing and then exploring plain or metadata-rich text. This README provides an overview of functionality. Visit the full documentation for a more complete user guide.
buzz requires Python 3.6 or higher. A virtual environment is recommended.
pip install buzz[word]
# or
git clone http://github.com/interrogator/buzz
cd buzz
python setup.py install
buzz has an optional frontend, buzzword, currently under development, for exploring parsed corpora. To use it, install:
pip install buzz[word]
Documentation is emerging here, as well as from the main page of the app itself. When you run the app, a URL will be printed, which you can use to access it in your browser.
buzz models plain text or CONLL-U formatted files. The remainder of this guide will assume that you have plain text data, and want to process and analyse it on the command line using buzz.
First, you need to make sure that your corpus is in a format and structure that buzz can work with. Text files should be plain text, with a .txt extension. Put one or more in a folder called txt, inside a folder that will hold all corpus data (e.g. mycorpus/txt/e01.txt). You can have intermediate subdirectories inside txt to represent subcorpora, but this is now deprecated (use file-level metadata instead).
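As a rough illustration, here is a minimal sketch of setting up that layout in Python (the folder name mycorpus and the sample line are placeholders):

from pathlib import Path

# create the collection folder and its txt subfolder
txt_dir = Path("mycorpus") / "txt"
txt_dir.mkdir(parents=True, exist_ok=True)

# drop in a plain text file; your real data goes here
(txt_dir / "e01.txt").write_text("MELFI: My understanding is you collapsed?\n", encoding="utf-8")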
Importantly, your text files can be augmented with metadata, which can be stored in two ways. First, speaker names can be added by using capital letters and a colon, much like in a script. Second, you can use XML-style metadata markup. Here is an example file that uses both kinds of metadata annotation:
sopranos/s1/e01.txt:
<meta aired="10.01.1999" />
MELFI: My understanding from Dr. Cusamano, your family physician, is you collapsed? Possibly a panic attack? <meta exposition=true interrogative-type="intonation" move="info-request" />
TONY: <meta emph=true>They</meta> said it was a panic attack <meta move="refute" />
MELFI: You don't agree that you had a panic attack? <meta move="info-request" question-type="in" />
...
If you add a meta element at the start of the text file, it will be understood as file-level metadata. For sentence-specific metadata, the element should follow the sentence, ideally at the end of a line. Span- and token-level metadata should wrap the tokens you want to annotate. All metadata will be searchable later, so the more you can add, the more you can do with your corpus.
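As a taste of why the markup is worth the effort: once the corpus is parsed and loaded (both steps are covered below), every metadata key becomes a filterable column. A hedged sketch, reusing attribute values from the example file above and the just accessor introduced later in this guide:

# assumes "loaded" is a parsed and loaded corpus, as created later in this guide
requests = loaded.just.move("info-request")  # sentences annotated with move="info-request"
refutations = loaded.just.move("refute")     # sentences annotated with move="refute"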
To load corpora as buzz objects, use the Collection class:
from buzz import Collection
corpus = Collection("sopranos")
The plaintext corpus is now available at corpus.txt.
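For a quick sanity check you can read a file back from the plaintext side. A small sketch, assuming the files list and read() accessors shown later in this guide work the same way here:

# read the raw text of the first file in the collection
print(corpus.txt.files[0].read())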
You can also make virtual corpora from strings, optionally saving the corpus to disk.
from buzz import Corpus

corpus = Corpus.from_string("Some sentences here.", save_as="corpusname")
buzz uses spaCy to parse your text, saving the results as CONLL-U files to your hard drive. Parsing by default is only for dependencies, but constituency parsing can be added with a keyword argument:
# only dependency parsing
parsed = corpus.parse()
# if you also want constituency parsing
parsed = corpus.parse(constituencies=True)
# select language and parse with four cores
parsed = corpus.parse(language="en", multiprocess=4)
You can also parse text strings on-the-fly, optionally passing in a name under which to save the corpus:
from buzz import Parser
parser = Parser()
for text in list_of_strings:  # list_of_strings: your own iterable of text strings
    dataset = parser.run(text, save_as=False)
The main advantages of parsing with buzz are:
- Parse results are stored as valid CONLL-U 2.0
- Metadata is respected, and transferred into the output files
- You can do constituency and dependency parsing at the same time (with parse trees being stored as CONLL-U metadata)
The parse() method returns a Corpus object representing the newly created files. It also sets corpus.conllu, which points to the same Corpus object.
We can explore this corpus via accessors like:
corpus.conllu.subcorpora.s1.files.e01   # dot-style access to a file within a subcorpus
corpus.conllu.files[0]                  # index into the flat list of files
corpus.conllu.subcorpora.s1[:5]         # slice a subcorpus
corpus.conllu.subcorpora["s1"]          # bracket-style access to a subcorpus
You can also parse corpora without entering a Python session by using the parse command:
parse --language en --constituencies=true|false path/to/corpus --multiprocess=n
# or
python -m buzz.parse path/to/corpus
Both commands will create path/to/corpus/conllu, a folder containing CONLL-U files.
Once a corpus is parsed, you can use the load() method to load it, or parts of it, into memory. Loading corpora into memory creates a Dataset object, which extends the pandas DataFrame.
loaded = corpus.load()
# same as: corpus.conllu.load()
You can also load unparsed txt files using the read method: corpus.txt.file1.read().
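Because load() works on parts of a collection as well as the whole thing, you can keep memory use down by loading only what you need. A sketch, assuming subcorpus and file objects expose the same load() method:

# load a single subcorpus, or a single file, rather than everything
season_one = corpus.conllu.subcorpora.s1.load()
first_episode = corpus.conllu.files[0].load()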
You don't need to load corpora into memory to work on them, but it's great for smaller corpora. As a rule of thumb, datasets under a million words should be easily loadable on a personal computer.
The loaded corpus is a Dataset object, which is based on the pandas DataFrame. So, you can use pandas methods on it:
loaded.head()
| file | s | i | w | l | x | p | g | f | e | aired | emph | ent_id | ent_iob | ent_type | exposition | interrogative_type | move | question | sent_id | sent_len | speaker | text | _n |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| text | 1 | 1 | My | -PRON- | DET | PRP$ | 2 | poss | _ | 10.01.1999 | _ | 2 | O | _ | True | intonation | info-request | _ | 1 | 14 | MELFI | My understanding from Dr. Cusamano, your family physician, is you collapsed? | 0 |
| | | 2 | understanding | understanding | NOUN | NN | 13 | nsubjpass | _ | 10.01.1999 | _ | 2 | O | _ | True | intonation | info-request | _ | 1 | 14 | MELFI | My understanding from Dr. Cusamano, your family physician, is you collapsed? | 1 |
| | | 3 | from | from | ADP | IN | 2 | prep | _ | 10.01.1999 | _ | 2 | O | _ | True | intonation | info-request | _ | 1 | 14 | MELFI | My understanding from Dr. Cusamano, your family physician, is you collapsed? | 2 |
| | | 4 | Dr. | Dr. | PROPN | NNP | 5 | compound | _ | 10.01.1999 | _ | 2 | O | _ | True | intonation | info-request | _ | 1 | 14 | MELFI | My understanding from Dr. Cusamano, your family physician, is you collapsed? | 3 |
| | | 5 | Cusamano | Cusamano | PROPN | NNP | 3 | pobj | _ | 10.01.1999 | _ | 3 | B | PERSON | True | intonation | info-request | _ | 1 | 14 | MELFI | My understanding from Dr. Cusamano, your family physician, is you collapsed? | 4 |
You can also interactively explore the corpus with tabview using the view() method:
loaded.view()
The interactive view has a number of cool features, such as the ability to sort by row or column. Also, pressing enter on a given line will generate a concordance based on that line's contents. Neat!
A loaded corpus is a pandas DataFrame object. The index is a MultiIndex, comprising filename, sent_id and token as levels, so each token in the corpus is uniquely identifiable through this index. The columns of the loaded corpus are all the CONLL-U columns, plus anything included as metadata.
# get the first sentence using buzz.dataset.sent()
first = loaded.sent(0)
# using pandas syntax to get first 5 words
first.iloc[:5]["w"]
# join the wordclasses (column x) and words
print(" ".join(first.x.str.cat(first.w, sep="/")))
"DET/My NOUN/understanding ADP/from PROPN/Dr. PROPN/Cusamano PUNCT/, DET/your NOUN/family NOUN/physician PUNCT/, VERB/is PRON/you VERB/collapsed PUNCT/?"
You don't need to know pandas in order to use buzz, however, because buzz provides some more intuitive methods with linguistics in mind. For example, if you want to slice the corpus in some way, you can easily do this using the just and skip accessors, combined with the column/metadata feature you want to filter by:
tony = loaded.just.speaker.TONY
# you can use a bracket syntax too (e.g. for regular expressions):
no_punct = loaded.skip.lemmata("^[^a-zA-Z0-9]")
# or you can pass in a list/set/tuple:
end_in_s = loaded.just.pos(["NNS", "NNPS", "VBZ"])
Note that columns can be accessed by single letter or full names: loaded.skip.words.doctor is the same as loaded.skip.w.doctor.
Any Dataset object created by buzz has a .view() method, which launches a tabview interactive space where you can explore corpora, frequencies or concordances.
spaCy is used under the hood for dependency parsing and a couple of other things, and it brings with it a lot of state-of-the-art methods in NLP. You can access the spaCy representation of your data with:
corpus.to_spacy()
# or
loaded.to_spacy()
To search the dependency graph generated by spaCy during parsing, you can use the depgrep method.
# search dependencies for nominal subjects with definite articles
nsubj = loaded.depgrep('f/nsubj.*/ -> (w"the" & x"DET")')
depgrep is a query language that works by modelling nodes and the links between them. A node, like f/nsubj/, is specified by the feature you want to match (f for function), plus a query inside slashes (for regular expressions) or inside quotation marks (for literal matches). The arrow-like link specifies that the nsubj must govern the determiner. The & relation specifies that the two nodes are actually the same node. Brackets may be necessary to contain the query, since queries can be arbitrarily complex.
This language is based on Tgrep2 syntax, customised for dependencies. It is still a work in progress, but documentation should emerge here, with the repository here.
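To give one more hedged example of the same node and link syntax (this query is illustrative only and has not been checked against a real corpus):

# sentence roots that govern a pronominal subject
root_with_pronoun_subject = loaded.depgrep('f"ROOT" -> (f/nsubj/ & x"PRON")')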
When you search a Corpus or Dataset, the result is simply another Dataset, representing a subset of the Corpus. Therefore, rather than trying to construct one query string that gets everything you want, it is often easier to perform multiple small searches:
query = 'f/nsubj/ <- f/ROOT/' # get nominal subjects dependent on sentence root
tony_subjects = loaded.skip.wordclass.PUNCT.just.speaker.TONY.depgrep(query)
Note that for any searches that do not require traversal of the grammatical structure, you should use the skip and just methods. tgrep/depgrep only need to be used when your search involves the grammar, and not just token features.
Constituency parsing and searching are deprecated right now, due to lack of use (combined with requiring a lot of special handling). Make an issue if you really need this functionality and we can consider bringing it back, probably via BLLIP or Benepar. If you're making corpora with constituency parses, please use parse = (S ...) as sentence-level metadata to encode the parse.
An important principle in buzz is the separation of searching and viewing results. Unlike many other tools, you do not search for a concordance---instead, you search the corpus, and then visualise the result of the search as a concordance.
Concordancing is a nice way of looking at results. The main thing you have to do is tell buzz how you want the match column to look: it can be just the matching words, or any combination of token features. To show words and their parts of speech, you can do:
nsubj = loaded.just.function.nsubj
nsubj.conc(show=["w", "p"])
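The result of conc() can be explored like any other buzz object. A quick sketch, assuming the concordance is DataFrame-based like the other objects described above:

conc = nsubj.conc(show=["w", "p"])
conc.view()     # interactive exploration, as with datasets and tables
conc.head(20)   # or plain pandas methods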
You can turn your dataset into frequency tables, either before or after searching or filtering. Tabling takes a show argument similar to the show argument for concordancing, as well as an additional subcorpora argument. show determines how the columns will be formatted, and subcorpora is used as the index. Below we create a frequency table of nsubj tokens, in lemma form, organised by speaker.
nsubj = loaded.just.function.nsubj
tab = nsubj.table(show="l", subcorpora=["speaker"])
Possible keyword arguments for the .table() method are as follows:
| Argument | Description | Default |
|---|---|---|
| subcorpora | Feature(s) to use as the index of the table. Passing in a list of multiple features will create a multiindex | ['file'] |
| show | Feature(s) to use as the columns of the table. Passing a list will join the features with a slash, so ['w', 'p'] results in columns with names like 'friend/NN' | ['w'] |
| sort | How to sort the results: 'total'/'infreq', 'increase'/'decrease', 'static'/'turbulent', 'name'/'inverse' | 'total' |
| relative | Use relative, rather than absolute, frequencies with True. You can also pass in Series, DataFrame or buzz objects to calculate relative frequencies against the passed-in data. | False |
| remove_above_p | Sorting by increase/decrease/static/turbulent calculates the slope of the frequencies across each subcorpus, and p-values where the null hypothesis is no slope. If you pass in a float, entries with p-values above this float are dropped from the results. Passing in True will use 0.05. | False |
| keep_stats | If True, keep generated statistics related to the trajectory calculation | False |
| preserve_case | Keep the original case for show (column) values | False |
| multiindex_columns | When show is a list with multiple features, build a multiindex rather than joining show with slashes | False |
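For example, combining a few of these options (an illustrative sketch; the feature names come from earlier examples):

# relative frequencies of word/POS pairs per speaker, sorted by increasing frequency
tab = nsubj.table(show=["w", "p"], subcorpora=["speaker"], relative=True, sort="increase")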
This creates a Table object, which is also based on DataFrame. You can use its .view() method to quickly explore results. Pressing enter on a given frequency will bring up a concordance of instances of this entry.
You can also use buzz to create high-quality visualisations of frequency data. Once you have generated a frequency table, use table.plot(), which calls pandas' plotting method.
tab.plot(...)
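Since this hands off to pandas, the usual pandas plotting keywords apply. A minimal sketch (the keyword values are illustrative):

# a bar chart of the frequency table, using standard pandas plotting arguments
ax = tab.plot(kind="bar", figsize=(10, 4), title="nsubj lemmata by speaker")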
More experimentally, you can also use a purpose-built chart function, tailored for the kinds of frequency tables produced by buzz.table(). It takes any combination of the following arguments:
tab.chart(title=False, # title the figure
kind='line', # line/bar/hbar/heatmap/area/pie...
x_label=None, # label for x axis
y_label=None, # label for y axis
style='ggplot', # plot appearance (list below)
figsize=(8, 4), # x and y sizes
save=False, # path to save figure to
legend_pos='best', # 'upper right', 'outside right', 'lower right', etc
reverse_legend='guess', # should legend order be flipped
num_to_plot=6, # plot first n entries
tex='try', # render fonts with tex
colours='default', # colourmap name (e.g. viridis) or a LinearSegmentedColormap
cumulative=False, # show frequencies cumulatively, rather than separately
pie_legend=True, # turn off legend for pie charts
partial_pie=False, # allow pie slices when data does not sum to 100
show_totals=False, # print frequencies in legend
transparent=False,       # transparent backgrounds
output_format='png', # save file as type
black_and_white=False, # try to make a readable b+w figure
show_p_val=False, # print p values in legend
transpose=False, # transpose data before plotting
rot=False # rotate x tick labels
)
Supported plot styles: 'Solarize_Light2', '_classic_test_patch', 'bmh', 'classic', 'dark_background', 'fast', 'fivethirtyeight', 'ggplot', 'grayscale', 'seaborn', 'seaborn-bright', 'seaborn-colorblind', 'seaborn-dark', 'seaborn-dark-palette', 'seaborn-darkgrid', 'seaborn-deep', 'seaborn-muted', 'seaborn-notebook', 'seaborn-paper', 'seaborn-pastel', 'seaborn-poster', 'seaborn-talk', 'seaborn-ticks', 'seaborn-white', 'seaborn-whitegrid', 'tableau-colorblind10'
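Putting a few of those options together (again just a sketch; the filename and values are placeholders):

# a cumulative area chart of the ten most frequent entries, saved to disk
tab.chart(kind="area", num_to_plot=10, cumulative=True, save="nsubj-by-speaker.png")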
If you find bugs, feel free to create an issue. The project is open-source, so pull requests are also welcome. Code style is black, and versioning is handled by bump2version.