Extractor problem #54

Open
PeterDaveHello opened this issue Feb 19, 2021 · 5 comments

Comments

@PeterDaveHello
Contributor

Hi there,

First of all, fullyfeedly is a really great browser extension: it saves time and helps me focus on the content I'm interested in, with less switching between browser tabs!

I just noticed an issue: Mercury, the extractor recommended by default, isn't powerful enough to work on many websites, and it looks like it needs many custom extractors/parsers to deal with different websites. The non-default Boilerpipe is very powerful, but besides the limited quota issue mentioned in the README.md, I also found that requests from fullyfeedly to the Boilerpipe web service fail with CORS errors, which means it's not working right now. Taken together, this means fullyfeedly only works fully on a limited set of websites.

I'm not sure whether it's just a coincidence that the websites I frequently visit can't be parsed properly by Mercury, but I did compare its extracted results with Boilerpipe's: Boilerpipe works considerably better, whereas Mercury sometimes extracts nothing but meaningless HTML tags.

For the first part, I guess I can only write custom extractors and send pull requests to Mercury, but that could be really time-consuming and doesn't scale well.

For the second part: I've opened an issue at kohlschutter/boilerpipe#28. If anyone is also looking for a workaround, here it is: https://add0n.com/access-control.html (CORS Unblock).

I'm not sure whether there is anything we can do to help improve the situation. Would hosting a separate Boilerpipe web service be a viable option? Or would it be better to find some alternatives?

Thanks a lot!

@Muffo
Owner

Muffo commented Feb 21, 2021

Thanks for raising this issue. I was not aware of the problem with boilerpipe.
I suspect that if we deploy this code on a different cloud service, we will run into the same quota limits, unless we decide to pay for increased capacity.
On the other hand, I am happy to consider other free APIs, in case new solutions have been released since I created this extension.

@PeterDaveHello
Contributor Author

PeterDaveHello commented Feb 21, 2021

What about adding a built-in parser as another option? For example: https://github.com/ndaidong/article-parser, https://github.com/Tjatse/node-readability or https://github.com/mozilla/readability. An option to set a custom Mercury/Boilerpipe service URL, so that users can point the extension at a self-hosted service, might also help work around the quota issue. (I don't know yet whether it's easy to set one up.)
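
To make the custom service URL idea a bit more concrete, here is a minimal TypeScript sketch. Everything in it is an assumption for illustration: the option names (`extractor`, `customServiceUrl`), the placeholder endpoints, and the `url` query parameter are hypothetical, and are not fullyfeedly's actual settings or the real Mercury/Boilerpipe APIs.

```typescript
// Hypothetical option shape; the extension's real settings may differ.
interface ExtractorOptions {
  extractor: "mercury" | "boilerpipe";
  // Optional base URL of a self-hosted service, entered by the user (assumption).
  customServiceUrl?: string;
}

// Placeholder endpoints for illustration only, not the real services.
const DEFAULT_SERVICE_URLS: Record<"mercury" | "boilerpipe", string> = {
  mercury: "https://mercury.example.com/parser",
  boilerpipe: "https://boilerpipe.example.com/extract",
};

// Build the request URL, preferring the user's self-hosted endpoint
// when one is configured and falling back to the default otherwise.
function buildRequestUrl(options: ExtractorOptions, articleUrl: string): string {
  const custom = options.customServiceUrl ? options.customServiceUrl.trim() : "";
  const base = custom !== "" ? custom : DEFAULT_SERVICE_URLS[options.extractor];
  // Append with "&" if the base URL already carries a query string.
  const separator = base.indexOf("?") === -1 ? "?" : "&";
  // The "url" parameter name is assumed; a real service may expect another name.
  return base + separator + "url=" + encodeURIComponent(articleUrl);
}
```

With something like this, a blank `customServiceUrl` keeps the current behaviour, so existing users would not notice any change unless they fill in the field.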

@Muffo
Owner

Muffo commented Feb 22, 2021

That's another possibility, but this change would significantly affect the permissions of this extension. See my comment on #38.

If we really want to use a built-in parser, I would prefer to create a new extension and avoid disruption for existing users.

@PeterDaveHello
Contributor Author

It looks like the boilerpipe GitHub repo is no longer maintained: the issues and pull requests have gone stale, and the web version has been returning 500 errors for a while. I also saw a bunch of tweets asking about boilerpipe that got no response. Maybe we should just consider removing it for now?

@Muffo
Owner

Muffo commented Aug 21, 2021

Thanks for the input!
I haven't checked boilerpipe in a while, but I agree it could be removed if it is not working properly.

When we do that, we should make sure users are automatically/transparently moved to a different extractor.
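
A rough TypeScript sketch of what such a transparent move could look like when the saved options are loaded. The option shape, the `REMOVED_EXTRACTORS` list, and the choice of Mercury as the fallback are assumptions for illustration, not the extension's actual code.

```typescript
// Hypothetical shape of the saved options; other settings pass through untouched.
interface StoredOptions {
  extractor: string;
  [key: string]: unknown;
}

// Extractors that are no longer offered (assumed list).
const REMOVED_EXTRACTORS: string[] = ["boilerpipe"];
// Extractor that affected users are moved to (assumed choice).
const FALLBACK_EXTRACTOR = "mercury";

// Return options whose extractor is guaranteed to still exist.
// Users on a removed extractor are switched to the fallback; everyone
// else gets their options back unchanged.
function migrateOptions(stored: StoredOptions): StoredOptions {
  if (REMOVED_EXTRACTORS.indexOf(stored.extractor) !== -1) {
    // Copy instead of mutating, so the caller decides when to persist the change.
    return { ...stored, extractor: FALLBACK_EXTRACTOR };
  }
  return stored;
}
```

Running this once when the options are read, and saving the result back, would mean users never see a broken extractor selected.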
