OSU Twitter NLP Tools

Example Usage:

UPDATED: Added support for reading from a file and writing to a tab-separated file, which can have text in any column.

export TWITTER_NLP=./
python python/ner/extractEntities.py test.1k.txt -o output.txt

If the input is a tab-separated file, use the --text-pos (-t) option to give the index (starting from 0) of the column containing the text. The output file will have that column's data replaced with the annotated text.

CAUTION: Make sure there are no newline characters in the text column. This will break the format.
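One way to guard against this is to clean the text column before writing the input file. The following is a minimal sketch, not part of the toolkit: the helper names `sanitize_rows` and `write_tsv` are hypothetical, and it assumes the rows are already available in memory as lists of strings.

```python
def sanitize_rows(rows, text_pos):
    """Yield copies of rows with whitespace runs in the text column
    (including newlines and tabs) collapsed to single spaces."""
    for row in rows:
        row = list(row)  # copy so the caller's data is not modified
        row[text_pos] = " ".join(row[text_pos].split())
        yield row


def write_tsv(rows, path, text_pos):
    """Write rows to a tab-separated file, one row per line.

    Only the text column is cleaned; other columns are assumed
    (hypothetically) to be free of tabs and newlines already.
    """
    with open(path, "w", encoding="utf-8") as f:
        for row in sanitize_rows(rows, text_pos):
            f.write("\t".join(row) + "\n")


if __name__ == "__main__":
    rows = [["42", "first line\nsecond line", "en"]]
    print(list(sanitize_rows(rows, text_pos=1)))
```

The resulting file could then be passed to extractEntities.py with -t set to the same column index.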

Shortened options for other features:

$ python/ner/extractEntities.py -h
usage: extractEntities.py [-h] [--text-pos TEXT_POS]
                          [--output-file OUTPUT_FILE] [--chunk] [--pos]
                          [--event] [--classify]
                          input_file

positional arguments:
  input_file            Path to the input file. Each line should have the
                        text.Optionally it can be a tab delimited file.

optional arguments:
  -h, --help            show this help message and exit
  --text-pos TEXT_POS, -t TEXT_POS
                        Column number (starting from 0) of the column
                        containing text
  --output-file OUTPUT_FILE, -o OUTPUT_FILE
                        Path to the output file
  --chunk, -k
  --pos, -p
  --event, -e
  --classify, -c

Alternate Usage (Reading from stdin):

export TWITTER_NLP=./
cat test.1k.txt | python python/ner/extractEntities2.py

Note: this takes a minute or so to read in the models from files.

To include classification, simply add the --classify switch:

cat test.1k.txt | python python/ner/extractEntities2.py --classify

For higher-quality but slower results, optionally include features based on POS and chunk tags (chunk tags require POS):

cat test.1k.txt | python python/ner/extractEntities2.py --classify --pos
cat test.1k.txt | python python/ner/extractEntities2.py --classify --pos --chunk

Event tags can also be included (requires POS):

cat test.1k.txt | python python/ner/extractEntities2.py --classify --pos --event

Output:

The output contains the tokenized and tagged words separated by spaces, with the tags for each word separated by a forward slash '/'. Example output:

The/B-movie/DT/B-NP/O Town/I-movie/NNP/I-NP/O might/O/MD/B-VP/O be/O/VB/I-VP/O one/O/CD/B-NP/O of/O/IN/B-PP/O the/O/DT/B-NP/O best/O/JJS/I-NP/O movies/O/NNS/I-NP/O I/O/PRP/B-NP/O have/O/VBP/B-VP/O seen/O/VBN/I-VP/O all/O/DT/B-NP/O year/O/NN/I-NP/O ./O/./O/O So/O/RB/O/O ,/O/,/O/O so/O/RB/B-ADJP/O good/O/JJ/I-ADJP/O ./O/./O/O And/O/CC/O/O don't/O/NN/B-NP/O worry/O/NN/I-NP/O Ben/B-person/NNP/I-NP/O ,/O/,/O/O we/O/PRP/B-NP/O already/O/RB/B-ADVP/O forgave/O/VBP/B-VP/B-EVENT you/O/PRP/B-NP/O for/O/IN/B-PP/O Gigli/B-movie/NNP/B-NP/O ./O/./O/O Really/O/RB/B-INTJ/O ./O/./I-INTJ/O

Looking at just one word:

The/B-movie/DT/B-NP/O

The fields are as follows:

Word:	The
Entity:	B-movie	Begins a named entity of type "movie"
POS:	DT	Determiner
Chunk:	B-NP	Begins a noun phrase
Event:	O	Not part of an event phrase
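To consume this output programmatically, each token can be split into its fields. The sketch below is an illustration, not part of the toolkit: `Token`, `parse_token` and `parse_line` are hypothetical names, and it assumes every token carries all four tags in the order shown in the example above (entity, POS, chunk, event). With fewer switches enabled, the number of fields will differ.

```python
from collections import namedtuple

# Assumed field order, taken from the example output: word/entity/POS/chunk/event
Token = namedtuple("Token", ["word", "entity", "pos", "chunk", "event"])


def parse_token(token):
    """Split one tagged token into its five fields.

    Splitting from the right keeps words that themselves contain '/'
    (for example URLs) intact.
    """
    word, entity, pos, chunk, event = token.rsplit("/", 4)
    return Token(word, entity, pos, chunk, event)


def parse_line(line):
    """Parse one space-separated line of tagged output into Token tuples."""
    return [parse_token(t) for t in line.split()]


if __name__ == "__main__":
    for tok in parse_line("The/B-movie/DT/B-NP/O Town/I-movie/NNP/I-NP/O"):
        print(tok)
```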

The BIO encoding is used for encoding phrases (Named Entities, event phrases, and chunks), for example:

The/B-movie Town/I-movie might/O ...

This indicates that the word "The" begins a named entity of type movie, "Town" continues that entity, and "might" is outside of an entity mention. For more details see: http://nltk.org/book/ch07.html
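As an illustration of how BIO tags map back to phrases, here is a minimal sketch that groups word/tag pairs into typed spans. The function name `extract_spans` is hypothetical and not part of the toolkit. It assumes each token has the form word/tag with a single BIO tag, as in the short example above, and it treats an I- tag that does not continue a matching B- tag as outside any phrase.

```python
def extract_spans(line):
    """Return a list of (type, phrase) pairs from a line of word/tag tokens."""
    spans = []
    current_type, current_words = None, []

    def flush():
        # Close the phrase currently being built, if there is one.
        if current_words:
            spans.append((current_type, " ".join(current_words)))

    for token in line.split():
        # Split on the last '/' so words containing '/' stay intact.
        word, tag = token.rsplit("/", 1)
        if tag.startswith("B-"):
            flush()
            current_type, current_words = tag[2:], [word]
        elif tag.startswith("I-") and current_words and tag[2:] == current_type:
            current_words.append(word)
        else:
            # 'O' or a stray I- tag: end any open phrase.
            flush()
            current_type, current_words = None, []
    flush()
    return spans


if __name__ == "__main__":
    print(extract_spans("The/B-movie Town/I-movie might/O"))
```

The same grouping applies to chunk and event tags, since they use the same encoding.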

Requirements:

Linux
Libraries and executables can be compiled with build.sh

Relevant papers:

@inproceedings{Ritter11,
  author = {Ritter, Alan and Clark, Sam and Mausam and Etzioni, Oren},
  title = {Named Entity Recognition in Tweets: An Experimental Study},
  booktitle = {EMNLP},
  year = {2011}
}

@inproceedings{Ritter12,
  author = {Ritter, Alan and Mausam and Etzioni, Oren and Clark, Sam},
  title = {Open Domain Event Extraction from Twitter},
  booktitle = {KDD},
  year = {2012}
}

Demo:

statuscalendar.com

Acknowledgments (bug fixes, etc...):

Junming Sui

Ming-Wei Chang

Tuan Anh Hoang Vu

sumant81

Yiye Ruan

Lu Wang

napsternxg
