Extractor problem #54
Comments
Thanks for raising this issue. I was not aware of the problem with boilerpipe.
What about adding a built-in parser as another choice? For example: https://github.com/ndaidong/article-parser, https://github.com/Tjatse/node-readability, or https://github.com/mozilla/readability. An option to set a custom Mercury/Boilerpipe service URL for a self-hosted service might also help with the quota issue. (I don't know yet whether setting one up is easy.)
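To make the custom service URL idea concrete, here is a minimal sketch of how such an option could be resolved. Everything in it is an assumption for illustration: the `ExtractorOptions` shape, the `customServiceUrl` field, the `resolveServiceUrl` helper, and the placeholder default URLs are hypothetical and are not taken from fullyfeedly's code or from the real Mercury/Boilerpipe endpoints.

```typescript
// Hypothetical option shape; fullyfeedly's real settings may differ.
type ExtractorName = "mercury" | "boilerpipe";

interface ExtractorOptions {
  extractor: ExtractorName;
  // Optional user-supplied base URL of a self-hosted service (assumed field).
  customServiceUrl?: string;
}

// Placeholder defaults, NOT the real service endpoints.
const DEFAULT_SERVICE_URLS: Record<ExtractorName, string> = {
  mercury: "https://mercury.example.invalid/parser",
  boilerpipe: "https://boilerpipe.example.invalid/extract",
};

// Returns the base URL to call: the custom one if it looks like an
// http(s) URL, otherwise the built-in default for the chosen extractor.
function resolveServiceUrl(options: ExtractorOptions): string {
  const custom = (options.customServiceUrl ?? "").trim();
  if (/^https?:\/\/\S+$/i.test(custom)) {
    // Drop trailing slashes so callers can append paths consistently.
    return custom.replace(/\/+$/, "");
  }
  return DEFAULT_SERVICE_URLS[options.extractor];
}
```

Falling back to the default when the custom value is empty or malformed would keep existing users unaffected; whether a self-hosted instance also avoids the quota limit depends on the service itself.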
That's another possibility, but this change would significantly affect the permissions of this extension. See my comment on #38. If we really want to use a built-in parser, I would prefer to create a new extension and avoid disruption for existing users.
It looks like the boilerpipe GitHub repo is no longer maintained: the issues and pull requests have gone stale, and the web version has been returning 500 errors for a while. I also saw a number of tweets asking about boilerpipe that got no response. Maybe we should just consider removing it for now?
Thanks for the input! When we do that, we should make sure users are automatically and transparently moved to a different extractor.
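One way the automatic move could look is a small settings migration that runs when the extension starts. This is only a sketch under assumptions: the `StoredSettings` shape, the `migrateSettings` helper, and the choice of Mercury as the replacement are hypothetical and not taken from the extension's code.

```typescript
// Hypothetical stored-settings shape; the real extension may differ.
interface StoredSettings {
  extractor: string;
  [key: string]: unknown;
}

// Extractors assumed to be retired, mapped to an assumed replacement.
const RETIRED_EXTRACTORS: Record<string, string> = {
  boilerpipe: "mercury",
};

// Returns the (possibly updated) settings plus a flag saying whether
// anything changed, so the caller can persist only when needed.
function migrateSettings(
  settings: StoredSettings
): { settings: StoredSettings; migrated: boolean } {
  const isRetired = Object.prototype.hasOwnProperty.call(
    RETIRED_EXTRACTORS,
    settings.extractor
  );
  if (!isRetired) {
    return { settings, migrated: false };
  }
  // Copy instead of mutating, keeping every other preference intact.
  const updated = { ...settings, extractor: RETIRED_EXTRACTORS[settings.extractor] };
  return { settings: updated, migrated: true };
}
```

Returning a `migrated` flag lets the caller write back to storage only once, and leaves users who already selected another extractor untouched.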
Hi there,
With all due respect, fullyfeedly seems to be a really great browser extension: it saves time and helps me focus on the content I'm interested in, with less switching between browser tabs!
I just noticed an issue. Mercury, the default and recommended extractor, isn't powerful enough to work on many websites; it looks like it needs many custom extractors/parsers to handle different sites. The non-default Boilerpipe is very powerful, but besides the limited quota mentioned in the README.md, I also found that requests from fullyfeedly to the Boilerpipe web service fail with CORS errors, which means it isn't working right now. Taken together, this means fullyfeedly only works fully on a limited set of websites.
It may be a coincidence that the websites I visit frequently can't be parsed properly by Mercury, but I did compare the extracted results with Boilerpipe's: Boilerpipe works noticeably better, whereas Mercury sometimes extracts only meaningless HTML tags.
For the first part, I guess I can only write custom extractors and send pull requests to Mercury, but that could be really time-consuming and doesn't scale well.
For the second part, I've opened an issue at kohlschutter/boilerpipe#28. If anyone else is looking for a workaround, here is one: https://add0n.com/access-control.html (CORS Unblock).
I'm not sure whether there is anything we can do to help improve the situation. Would hosting a separate Boilerpipe web service be a viable option, or is it better to find alternatives?
Thanks a lot!