Multiple problems in scraping of multimedia content #890

Open
kelson42 opened this issue Jul 10, 2019 · 9 comments · May be fixed by #1821

@kelson42
Collaborator

I have created a small test page covering multiple types of content and ways to include them; everything on it is standard. It is available here: https://en.m.wikipedia.org/wiki/User:Kelson/MWoffliner_CI_reference.

I have scraped it with 1.9.4 and the result was a bit disappointing. There are many problems here, most of them being that the content is simply not made available. I think such a page should be tested properly to ensure that we no longer have big problems displaying multimedia content.

@kelson42 kelson42 added the bug label Jul 10, 2019
@kelson42 kelson42 added this to the 1.9-maintenance milestone Jul 10, 2019
@ISNIT0
Contributor

ISNIT0 commented Jul 12, 2019

Many things are broken because of the keepEmptyParagraphs issue, fixed in #886.

@ISNIT0
Contributor

ISNIT0 commented Jul 12, 2019

Just merged a few pull requests and things are looking much better :)

@ISNIT0 ISNIT0 closed this as completed Jul 15, 2019
@kelson42
Collaborator Author

kelson42 commented Jul 15, 2019

@ISNIT0 We need automated tests for this multimedia scraping... I have lost count of the tickets I have opened in the past for multimedia content not mirrored properly... and I had to open one a week ago. I don't want to open new ones in the future. This has to be secured.

BTW, I'm quite sure there is a way to inject wikicode into the Parsoid/MCS API and get the HTML back. So the automated tests should use that instead of starting directly from HTML (which offers no guarantee that this is the kind of HTML that MediaWiki still delivers).
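
As a rough illustration of that idea, here is a minimal sketch of how a test helper could send wikicode to a wiki and get HTML back. It assumes the Wikimedia REST API `transform/wikitext/to/html` endpoint is available on the target wiki; the names `buildTransformRequest`, `wikitextToHtml` and `HttpPost` are hypothetical and not part of mwoffliner.

```typescript
// Sketch only: every name here is hypothetical, not mwoffliner code.

interface TransformRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Minimal shape of a fetch-like function, so a test can pass in a fake.
type HttpPost = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string }
) => Promise<{ ok: boolean; status: number; text(): Promise<string> }>;

// Build the POST request without sending it.
// Assumption: restBase looks like "https://en.wikipedia.org/api/rest_v1".
function buildTransformRequest(
  restBase: string,
  title: string,
  wikitext: string
): TransformRequest {
  const base = restBase.replace(/\/+$/, ""); // drop trailing slashes
  return {
    // The title is percent-encoded as a single path segment, slashes included.
    url: `${base}/transform/wikitext/to/html/${encodeURIComponent(title)}`,
    method: "POST",
    headers: { "Content-Type": "application/x-www-form-urlencoded" },
    body: `wikitext=${encodeURIComponent(wikitext)}`,
  };
}

// Send the request and return the rendered HTML as a string.
async function wikitextToHtml(
  post: HttpPost,
  restBase: string,
  title: string,
  wikitext: string
): Promise<string> {
  const req = buildTransformRequest(restBase, title, wikitext);
  const res = await post(req.url, {
    method: req.method,
    headers: req.headers,
    body: req.body,
  });
  if (!res.ok) {
    throw new Error(`Transform failed with HTTP ${res.status}`);
  }
  return res.text();
}
```

A test could then pass a fetch-compatible function as `post`, feed in a wikicode snippet per multimedia type, and run the scraper's HTML processing on whatever the wiki currently returns rather than on a stored HTML fixture.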

@kelson42 kelson42 reopened this Jul 15, 2019
@ISNIT0
Contributor

ISNIT0 commented Jul 18, 2019

Testing this is not in 1.9 or 2.0

@ISNIT0 ISNIT0 removed this from the 1.9-maintenance milestone Jul 22, 2019
@kelson42 kelson42 assigned kelson42 and unassigned ISNIT0 Aug 2, 2019
@kelson42
Collaborator Author

kelson42 commented Aug 2, 2019

I will have a detailed look at this ticket to see if it works now.

@kelson42 kelson42 added this to the 1.9-maintenance milestone Aug 2, 2019
@stale

stale bot commented Oct 1, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

@stale stale bot added the stale label Oct 1, 2019
@kelson42 kelson42 modified the milestones: 1.9-maintenance, 2.0 Apr 9, 2020
@stale stale bot removed the stale label Apr 9, 2020
@stale

stale bot commented Jun 8, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

@stale stale bot added the stale label Jun 8, 2020
@kelson42 kelson42 modified the milestones: 2.0, 1.13.0 Feb 1, 2023
@stale stale bot removed the stale label Feb 1, 2023
@kelson42 kelson42 modified the milestones: 1.13.0, 1.14.0 Mar 5, 2023
@stale

stale bot commented May 28, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

3 participants