
Wiktionary parses but may be missing some pages #56

Open
e501 opened this issue Aug 5, 2018 · 2 comments

Comments


e501 commented Aug 5, 2018

Many thanks again for building and sharing such a convenient (and fast) parser!

This afternoon, I decided to explore loading Wiktionary via dumpster-dive and was pleasantly surprised at how quickly the workers loaded the wiki pages (13.8 minutes in total).

The following is the summary provided at the end of the run:
#1 +1,000 pages - 27ms - "lautioris"
#0 +898 pages - 29ms - "meilėmis"

💪  worker #0 has finished 💪 
  - 1 workers still running -

#1 +140 pages - 4ms - "irascebare"

💪  worker #1 has finished 💪 
  - 0 workers still running -



  👍  closing down.

 -- final count is 5,702,608 pages --
   took 13.8 minutes
          🎉

When I checked the Wiktionary statistics page (https://en.wiktionary.org/wiki/Wiktionary:Statistics), the following statistics were listed for Wiktionary:

Number of entries: 5,721,450

Number of total pages: 6,322,904

Number of encoded languages: 8052

Number of uploaded files: 29

Number of user accounts: 3,446,188

Number of administrators: 98

It seems that approximately 19K (i.e. 5,721,450 - 5,702,608 = 18,842) entries were not parsed. For the current task this is not a pressing issue, but I realized that I should provide feedback.

Ideally, I would like to use MongoDB to extract various portions of the Wiktionary pages (e.g. synonyms), but the parsed results vary considerably in structure. For example, a quick spot check suggests there is no consistent mapping for the section titles. So my initial impression is that the parsed output is of limited value until I can figure out how to build the desired types of queries.
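The per-section lookup described above can be sketched in plain Node against a stand-in document. The field names here (`sections`, `title`, `sentences`) are assumptions modelled loosely on wtf_wikipedia's JSON output and should be checked against the actual documents dumpster-dive writes:

```javascript
// Sketch: pull any "Synonyms" section out of one parsed page.
// The `page` object below is a hypothetical stand-in for a MongoDB
// document — verify the real field names against your own dump.
const page = {
  title: 'example',
  sections: [
    { title: '', depth: 0 },
    { title: 'Synonyms', depth: 1, sentences: ['sample, instance'] },
    { title: 'Translations', depth: 1 },
  ],
};

function synonymSections(doc) {
  return (doc.sections || []).filter(
    (s) => (s.title || '').toLowerCase() === 'synonyms'
  );
}

const hits = synonymSections(page);
console.log(hits.length); // 1
```

If the documents do carry a `sections` array like this, the equivalent server-side filter would be something along the lines of `db.pages.find({'sections.title': 'Synonyms'})` — again, an assumption to verify against the real schema.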

I look forward to feedback/comments and suggestions on how I might utilize the parsed content.

Thanks again!

@spencermountain
Owner

wow! that's amazing!
man, if I were king of the world, I'd fix up wiktionary right away. I know wikipedia has quirks, but I've found wiktionary to be maybe twice as sloppy, and difficult to get around. It would be amazing if we could get a handle on it with this project.

I recommend putting a console.log() here to see which pages it is skipping. Maybe our disambiguation check is being too greedy or something.
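Without touching dumpster-dive's internals, one way to find the skipped pages is to diff the titles seen in the dump against what actually landed in the database. A minimal sketch, with stand-in arrays in place of the real title streams:

```javascript
// Sketch: diff dump titles against database titles to list skipped pages.
// `dumpTitles` and `dbTitles` are hypothetical stand-ins for titles
// streamed from the XML dump and read back from MongoDB, respectively.
const dumpTitles = ['cat', 'dog', 'lautioris', 'irascebare'];
const dbTitles = ['cat', 'lautioris', 'irascebare'];

const inDb = new Set(dbTitles);
const skipped = dumpTitles.filter((t) => !inDb.has(t));
console.log(skipped); // ['dog']
```

Using a `Set` keeps the membership check O(1) per title, which matters when diffing ~5.7 million entries.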

I'd also be curious to see which templates we are missing. It's become straightforward to add new templates to parse in wtf.

I love the initial success of the first-run. I'm happy to commit to getting it working.
🍰

@e501
Author

e501 commented Aug 7, 2018

Many thanks for your reply/feedback.

From an initial look at the parsed results, my thinking is that there may be limited value in trying to generate a consistent output to serve as a basis for MongoDB queries. After the page id and title, the sections do not seem to have any consistency in their number or structure (e.g. nesting). The Wikipedia pages at least maintained a consistent structure that reflects the table of contents. For my current task, the main goal was to capture as many of the page titles as feasible/practical.
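One way around the inconsistent nesting is a walker that collects every section title regardless of depth, so the query layer never has to assume a fixed table-of-contents shape. A sketch, assuming (hypothetically) that nested sections sit in a `children` array; adjust the field names to the real schema:

```javascript
// Sketch: flatten all section titles out of an arbitrarily nested
// document, so queries don't depend on a fixed section hierarchy.
// The `children` field is an assumption — rename to match your dump.
function allTitles(sections, out = []) {
  for (const s of sections || []) {
    if (s.title) out.push(s.title);
    allTitles(s.children, out); // recurse into any nested sections
  }
  return out;
}

const doc = {
  sections: [
    {
      title: 'English',
      children: [{ title: 'Noun', children: [{ title: 'Synonyms' }] }],
    },
    { title: 'Latin' },
  ],
};

const titles = allTitles(doc.sections);
console.log(titles); // ['English', 'Noun', 'Synonyms', 'Latin']
```

Flattened titles like these could then be stored as a top-level array field, which would make the "find pages with a Synonyms section" query a simple equality match.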
