This repo holds example datasets for testing Wukong; many of them are useful beyond that.
To keep the git repo from bloating, some datasets are offered as downloads rather than versioned directly.
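For the download-only datasets, a fetch-if-missing helper keeps local checkouts small while making the data easy to restore. The following is a minimal sketch, not part of this repo: the function name `ensure_dataset`, the `data` directory, and any URL passed to it are assumptions for illustration.

```python
from pathlib import Path
from urllib.request import urlretrieve


def ensure_dataset(name, url, data_dir="data", fetch=urlretrieve):
    """Return the local path for `name`, downloading it from `url` only if missing.

    `name`, `url`, and `data_dir` are hypothetical; substitute the real
    dataset path and download location. `fetch` can be swapped out for testing.
    """
    target = Path(data_dir) / name
    if not target.exists():
        # Create intermediate directories such as data/geo/ before fetching.
        target.parent.mkdir(parents=True, exist_ok=True)
        fetch(url, str(target))
    return target
```

A call might look like `ensure_dataset("geo/sample.tsv", "https://example.com/sample.tsv")`, where both arguments are placeholders rather than real download locations. A second call with the same name returns the existing file without fetching again.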
geo/wikigrounder_toponyms (5 MB/23 MB) -- grounded place names with rough categories (county, province, etc.); available as a download
- wikipedia
  - wikipedia_articles -- article text
  - wikipedia_pageinfos -- article metadata
  - wikipedia_pagelinks -- pagelinks
  - wikipedia_pageviews -- pageview counts by hour
  - (geolocated)
  - (geoimplied)
  - (dbpedia)
- geo
  - countries
  - timezones
  - geonames_places
  - geonames_postal
  - iso_currencies
  - iso_languages
  - iso_langscripts
  - natural_earth
  - https://github.com/zmaril/Visualization-Data[GeoJSON world boundaries]
  - cia_factbook
- scaffold
  - fakered_customer_data
  - integers
  - lorem
- airline flights
  - airline_flights
  - airline_airports
  - airline_airlines
  - airline_airfares
- weather
  - weather_hourly -- hourly global
  - weather_stations -- weather stations
- access logs
  - weblogs_waxydotorg
  - weblogs_worldcup
soon:
- words
  - twl/CSW(sowpods) -- lang/corpora/scrabble
  - BNC --
  - quackle -- misc/words_quackle
  - wordnet
  - dirty_words
  - nltk
  - stopwords
  - color_names
- text
  - short
  - gutenberg
- UFO sightings
- retrosheet game logs
  - parks
  - teams
  - franchises
  - players
  - games