Many thanks again for building and sharing such a convenient (and fast) parser !
This afternoon, I decided to explore loading Wiktionary via dumpster-dive and was pleasantly surprised at how quickly the workers loaded the respective wiki pages (i.e. 13.8 minutes).
The following is the summary provided at the end of the run:
#1 +1,000 pages - 27ms - "lautioris"
#0 +898 pages - 29ms - "meilėmis"
💪 worker #0 has finished 💪
- 1 workers still running -
When I checked the Wiktionary statistics page (https://en.wiktionary.org/wiki/Wiktionary:Statistics), the following statistics were listed:
Wiktionary:Statistics
Number of entries: 5,721,450
Number of total pages: 6,322,904
Number of encoded languages: 8052
Number of uploaded files: 29
Number of user accounts: 3,446,188
Number of administrators: 98
It seems that approximately 19K entries (5,721,450 - 5,702,608 = 18,842) were not parsed. For the current task this is not a pressing issue, but I realized I should provide the feedback.
Ideally, I would like to use MongoDB to easily extract various portions of Wiktionary pages (e.g. synonyms), but the parsed results vary in structure. For example, from a quick spot check, there does not seem to be a consistent mapping for the section titles. My initial thought is therefore that the parsed output is of limited value until I can figure out how to build the desired types of queries.
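As a rough illustration of the kind of query I have in mind, here is a minimal sketch that scans one parsed page for a "Synonyms" section. It assumes a wtf_wikipedia-style document shape with a flat `sections` array of `{ title, depth, ... }` objects; the sample page and helper name are made up for illustration, so the field names should be verified against the actual collection.

```javascript
// Hypothetical helper: find all sections with a given title on one parsed
// page, regardless of nesting depth. Assumes each document carries a flat
// `sections` array of { title, depth, ... } objects (verify this shape
// against your own MongoDB collection before relying on it).
function findSections(doc, title) {
  const want = title.toLowerCase();
  return (doc.sections || []).filter(
    (s) => (s.title || '').toLowerCase() === want
  );
}

// Hand-made example document, for illustration only
const page = {
  title: 'meilėmis',
  sections: [
    { title: 'Lithuanian', depth: 0 },
    { title: 'Noun', depth: 1 },
    { title: 'Synonyms', depth: 2 },
  ],
};

console.log(findSections(page, 'synonyms').length); // 1
```

Matching case-insensitively sidesteps at least some of the title inconsistency; the same predicate could be pushed into a MongoDB query filter once the stored field names are confirmed.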
I look forward to feedback, comments, and suggestions on how I might utilize the parsed content.
Thanks again !
wow! that's amazing!
man, if I was king of the world, I'd fix up wiktionary right away. I know wikipedia has quirks, but I've found wiktionary to be maybe twice as sloppy, and difficult to get around. It would be amazing if we could get a handle on it with this project.
I recommend putting a console.log() here to see which pages it is skipping. Maybe our disambiguation check is being too greedy or something.
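To show the shape of the logging being suggested, here is a small sketch. The `isDisambiguation` predicate below is a stand-in for dumpster-dive's real check (the actual function lives in the library's source), included only to mark where a console.log would reveal which titles get dropped.

```javascript
// Stand-in for the library's disambiguation check, for illustration only
function isDisambiguation(title) {
  return /\(disambiguation\)/i.test(title);
}

// Filter a batch of titles, logging every page that gets skipped so an
// overly greedy check becomes visible in the console output
function filterPages(titles) {
  return titles.filter((t) => {
    if (isDisambiguation(t)) {
      console.log('skipping:', t); // inspect these to spot false positives
      return false;
    }
    return true;
  });
}

console.log(filterPages(['water', 'Mercury (disambiguation)'])); // [ 'water' ]
```

Piping the run's output through `grep skipping` (or writing the skipped titles to a file) would make it easy to diff the ~19K missing entries against what the check discarded.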
I'd also be curious to see which templates we are missing. It's become straightforward to add new templates to parse in wtf.
I love the initial success of the first-run. I'm happy to commit to getting it working.
🍰
From an initial look at the parsed results, my thinking is that there may be limited value in trying to generate a consistent output that can serve as a basis for MongoDB queries. After the page id and title, the sections do not seem to have any consistency in their number or structure (e.g. nesting). The Wikipedia pages at least maintained a consistent structure reflecting the table of contents. For my current task, the main goal was to capture as many of the page titles as feasible/practical.
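One way to get a handle on the varying nesting might be to rebuild a table of contents per page from the flat section list, using each section's depth to indent. This is a sketch under the assumption of wtf_wikipedia-style `sections`/`title`/`depth` fields; the sample page is invented, and the field names should be checked against the real parsed output.

```javascript
// Rebuild an indented table of contents from a parsed page's flat
// section list, using `depth` to indent (assumed field names)
function tableOfContents(doc) {
  return (doc.sections || [])
    .filter((s) => s.title) // skip untitled lead sections
    .map((s) => '  '.repeat(s.depth || 0) + s.title);
}

// Invented example page, for illustration only
const page = {
  sections: [
    { title: 'English', depth: 0 },
    { title: 'Noun', depth: 1 },
    { title: 'Synonyms', depth: 2 },
  ],
};

console.log(tableOfContents(page).join('\n'));
// English
//   Noun
//     Synonyms
```

Dumping these reconstructed outlines for a sample of pages would make it quicker to see how much structural variety there really is before deciding on a query scheme.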