Extractor problem #54

PeterDaveHello · 2021-02-19T15:09:04Z

Hi there,

With all due respect, fullyfeedly seems to be a really awesome browser extension: it helps save time and keeps the focus on the content of interest, with less switching between browser tabs!

I just noticed an issue: Mercury, the extractor recommended by default, isn't powerful enough to work on many websites, and it looks like it needs many custom extractors/parsers to handle different sites. The non-default Boilerpipe is very powerful, but besides the limited quota mentioned in the README.md, I also found that requests from fullyfeedly to the Boilerpipe web service fail with CORS errors, which means it isn't working right now. Taken together, fullyfeedly fully works on only a limited set of websites.

I'm not sure whether it's a coincidence that the websites I frequently visit can't be parsed properly by Mercury, but I did compare its extracted results with Boilerpipe's: Boilerpipe works much better, whereas Mercury sometimes extracts only meaningless HTML tags.

For the first part, I guess I can only write custom extractors and send pull requests to Mercury, but that could be really time-consuming and doesn't scale well.

For the second part, I've opened an issue at kohlschutter/boilerpipe#28. If anyone is also looking for a workaround, here it is: https://add0n.com/access-control.html (CORS Unblock).

I'm not sure whether there is anything we can do to help with this. Would hosting a separate Boilerpipe web service be a viable option? Or would it be better to find alternatives?

Thanks a lot!

Muffo · 2021-02-21T00:35:42Z

Thanks for raising this issue, I was not aware of the problem with boilerpipe.
I suspect that by deploying this code on a different cloud service we will run into the same quota limits, unless we decide to pay for increased capacity.
On the other hand, I am happy to consider other free APIs in case there are new solutions released after I created this extension.

PeterDaveHello · 2021-02-21T10:59:52Z

What about adding a built-in parser as another choice? For example: https://github.com/ndaidong/article-parser, https://github.com/Tjatse/node-readability & https://github.com/mozilla/readability. An option to set a custom Mercury/Boilerpipe service URL for a self-hosted service could also help work around the quota issue. (I don't know yet whether it's easy to set one up.)
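
To make the custom service URL idea concrete, here is a minimal sketch of how such an option might be resolved. Everything in it is an assumption for illustration: the `ExtractorSettings` shape, the `customServiceUrl` option, the placeholder default endpoints, and the `url` query parameter are hypothetical and do not reflect fullyfeedly's actual code or the real Mercury/Boilerpipe APIs.

```typescript
// Hypothetical settings shape; the extension's real option names may differ.
type ExtractorName = "mercury" | "boilerpipe";

interface ExtractorSettings {
  extractor: ExtractorName;
  // Optional base URL of a self-hosted service, entered by the user (assumed option).
  customServiceUrl?: string;
}

// Placeholder defaults for illustration only, not the real service endpoints.
const DEFAULT_SERVICE_URLS: Record<ExtractorName, string> = {
  mercury: "https://mercury.example.com/parser",
  boilerpipe: "https://boilerpipe.example.com/extract",
};

// Builds the request URL, preferring the user's self-hosted endpoint when set.
// The "url" query parameter name is an assumption about the service's interface.
function buildExtractionUrl(settings: ExtractorSettings, articleUrl: string): string {
  const custom = settings.customServiceUrl?.trim();
  const base = custom ? custom : DEFAULT_SERVICE_URLS[settings.extractor];
  const requestUrl = new URL(base);
  requestUrl.searchParams.set("url", articleUrl);
  return requestUrl.toString();
}
```

With a blank or missing `customServiceUrl`, the function falls back to the default endpoint for the selected extractor, so existing users would see no change unless they opt in.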

Muffo · 2021-02-22T17:25:30Z

That's another possibility, but this change would significantly affect the permissions of this extension. See my comment on #38.

If we really want to use a built-in parser, I would prefer to create a new extension and avoid disruption for existing users.

PeterDaveHello · 2021-08-20T08:42:58Z

It looks like the boilerpipe GitHub repo is no longer maintained: the issues and pull requests have gone stale, and the web version has been returning 500 errors for a while. I also saw a number of tweets asking about boilerpipe that got no response. Maybe we should just consider removing it for now?

Muffo · 2021-08-21T15:05:57Z

Thanks for the input!
I haven't checked boilerpipe in a while, but I agree it could be removed if it is not working properly.

When we do that, we should make sure users are automatically/transparently moved to a different extractor.
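
As a rough illustration of moving users transparently to a different extractor, the sketch below rewrites a stored preference that points at a removed extractor. The `StoredSettings` shape, the `extractor` key, and the choice of Mercury as the fallback are assumptions made for this example, not the extension's actual storage format.

```typescript
// Hypothetical shape of the saved options; other keys are passed through untouched.
interface StoredSettings {
  extractor: string;
  [key: string]: unknown;
}

// Extractors that are no longer available (assumed list for this example).
const REMOVED_EXTRACTORS = new Set<string>(["boilerpipe"]);

// Extractor that affected users are moved to (assumed choice).
const FALLBACK_EXTRACTOR = "mercury";

interface MigrationResult {
  settings: StoredSettings;
  migrated: boolean;
}

// Returns the settings unchanged unless they reference a removed extractor,
// in which case a copy pointing at the fallback extractor is returned.
function migrateSettings(stored: StoredSettings): MigrationResult {
  if (!REMOVED_EXTRACTORS.has(stored.extractor)) {
    return { settings: stored, migrated: false };
  }
  return {
    settings: { ...stored, extractor: FALLBACK_EXTRACTOR },
    migrated: true,
  };
}
```

Such a function could run once when the extension is updated, with the `migrated` flag deciding whether the new settings need to be written back to storage.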
