
Wiktionary parses but may be missing some pages #56

Open
e501 opened this issue Aug 5, 2018 · 2 comments

Comments


e501 commented Aug 5, 2018

Many thanks again for building and sharing such a convenient (and fast) parser!

This afternoon, I decided to explore loading Wiktionary via dumpster-dive and was pleasantly surprised at how quickly the workers loaded the wiki pages (13.8 minutes in total).

The following is the summary provided at the end of the run:
#1 +1,000 pages - 27ms - "lautioris"
#0 +898 pages - 29ms - "meilėmis"

💪  worker #0 has finished 💪 
  - 1 workers still running -

#1 +140 pages - 4ms - "irascebare"

💪  worker #1 has finished 💪 
  - 0 workers still running -



  👍  closing down.

 -- final count is 5,702,608 pages --
   took 13.8 minutes
          🎉

When I checked the Wiktionary statistics page (https://en.wiktionary.org/wiki/Wiktionary:Statistics), the following statistics were listed for Wiktionary:

Number of entries: 5,721,450

Number of total pages: 6,322,904

Number of encoded languages: 8052

Number of uploaded files: 29

Number of user accounts: 3,446,188

Number of administrators: 98

It seems that approximately 19K (i.e. 5,721,450 - 5,702,608 = 18,842) entries were not parsed. For the current task this is not a pressing issue, but I realized that I should provide feedback.

Ideally, I would like to use MongoDB to extract various portions of the Wiktionary pages (e.g. synonyms), but the parsed results vary considerably in structure. For example, a quick spot check suggests there is no consistent mapping for the section titles. So my initial impression is that the parsed output is of limited value until I can figure out how to build the desired types of queries.
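The per-section lookup described above can be sketched in plain Node against a stand-in document. The field names here (`sections`, `title`, `sentences`) are assumptions modelled loosely on wtf_wikipedia's JSON output and should be checked against the actual documents dumpster-dive writes:

```javascript
// Sketch: pull any "Synonyms" section out of one parsed page.
// The `page` object below is a hypothetical stand-in for a MongoDB
// document — verify the real field names against your own dump.
const page = {
  title: 'example',
  sections: [
    { title: '', depth: 0 },
    { title: 'Synonyms', depth: 1, sentences: ['sample, instance'] },
    { title: 'Translations', depth: 1 },
  ],
};

function synonymSections(doc) {
  return (doc.sections || []).filter(
    (s) => (s.title || '').toLowerCase() === 'synonyms'
  );
}

const hits = synonymSections(page);
console.log(hits.length); // 1
```

If the documents do carry a `sections` array like this, the equivalent server-side filter would be something along the lines of `db.pages.find({'sections.title': 'Synonyms'})` — again, an assumption to verify against the real schema.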

I look forward to feedback/comments and suggestions on how I might utilize the parsed content.

Thanks again!

@spencermountain
Owner

wow! that's amazing!
man, if I were king of the world, I'd fix up wiktionary right away. I know wikipedia has quirks, but I've found wiktionary to be maybe twice as sloppy, and difficult to get around. It would be amazing if we could get a handle on it with this project.

I recommend putting a console.log() here to see which pages it is skipping. Maybe our disambiguation check is being too greedy or something.
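Without touching dumpster-dive's internals, one way to find the skipped pages is to diff the titles seen in the dump against what actually landed in the database. A minimal sketch, with stand-in arrays in place of the real title streams:

```javascript
// Sketch: diff dump titles against database titles to list skipped pages.
// `dumpTitles` and `dbTitles` are hypothetical stand-ins for titles
// streamed from the XML dump and read back from MongoDB, respectively.
const dumpTitles = ['cat', 'dog', 'lautioris', 'irascebare'];
const dbTitles = ['cat', 'lautioris', 'irascebare'];

const inDb = new Set(dbTitles);
const skipped = dumpTitles.filter((t) => !inDb.has(t));
console.log(skipped); // ['dog']
```

Using a `Set` keeps the membership check O(1) per title, which matters when diffing ~5.7 million entries.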

I'd also be curious to see which templates we are missing. It's become straightforward to add new templates to parse in wtf.

I love the initial success of the first-run. I'm happy to commit to getting it working.
🍰

@e501
Author

e501 commented Aug 7, 2018

Many thanks for your reply/feedback.

From an initial look at the parsed results, my thinking is that there may be limited value in trying to generate a consistent output to serve as a basis for MongoDB queries. After the page id and title, the sections do not seem to have any consistency in their number or structure (e.g. nesting). The Wikipedia pages at least maintained a consistent structure that reflects the table of contents. For my current task, the main goal was to capture as many of the page titles as feasible/practical.
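One way around the inconsistent nesting is a walker that collects every section title regardless of depth, so the query layer never has to assume a fixed table-of-contents shape. A sketch, assuming (hypothetically) that nested sections sit in a `children` array; adjust the field names to the real schema:

```javascript
// Sketch: flatten all section titles out of an arbitrarily nested
// document, so queries don't depend on a fixed section hierarchy.
// The `children` field is an assumption — rename to match your dump.
function allTitles(sections, out = []) {
  for (const s of sections || []) {
    if (s.title) out.push(s.title);
    allTitles(s.children, out); // recurse into any nested sections
  }
  return out;
}

const doc = {
  sections: [
    {
      title: 'English',
      children: [{ title: 'Noun', children: [{ title: 'Synonyms' }] }],
    },
    { title: 'Latin' },
  ],
};

const titles = allTitles(doc.sections);
console.log(titles); // ['English', 'Noun', 'Synonyms', 'Latin']
```

Flattened titles like these could then be stored as a top-level array field, which would make the "find pages with a Synonyms section" query a simple equality match.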
