HTML of the content extracted #59

papriwalprateek · 2014-01-15T15:16:09Z

The main content of the article is extracted using .content. But how can the main content be extracted with its original HTML and CSS formatting intact?

cantino · 2014-01-15T23:48:28Z

So you want it to figure out the main content area, but not strip any HTML?

papriwalprateek · 2014-01-16T03:14:24Z

Actually, when I call .content, it works well but does not retain the images in the main content. How can I extract the main content along with the images?
For example, I wanted to extract the main content of http://algs4.cs.princeton.edu/22mergesort/. It extracts the text but does not keep the images. How can this be achieved?

papriwalprateek · 2014-01-16T03:16:41Z

In short, can I extract the main content while keeping most of the formatting and content as it is?

cantino · 2014-01-16T04:58:30Z

Have you tried passing a list of all the tags that you want to keep into :tags?

papriwalprateek · 2014-01-16T05:08:02Z

Hmm, yes I have. For example, to extract the main content of https://www.cs.auckland.ac.nz/~jmor159/PLDS210/qsort.html, I used

require 'open-uri'
require 'readability'
source = URI.parse('https://www.cs.auckland.ac.nz/~jmor159/PLDS210/qsort.html').read
y = Readability::Document.new(source, :tags => %w[p div pre img h1 h2 h3 h4 li ul tt em b a ol blockquote center br table td tr tbody font i dl dt dd], :attributes => %w[href rowspan border color src bgcolor width size align face]).content

but the result does not contain the image or the side-by-side code.

Is there a way to extract everything in the main content?

cantino · 2014-01-16T05:17:11Z

You could try calling prepare_candidates instead of content, then looking at the value of best_candidate. This is a port of the JavaScript readability library, so it was originally intended for cleaning out content and producing readable text. That said, it would be useful to make it easier to just return the primary content region in full.

papriwalprateek · 2014-01-17T11:39:27Z

I tried calling prepare_candidates, but it gave a single element. I don't follow what you are suggesting. Sometimes img elements or tables are missing from the extracted content. Can this be tailored?

cantino · 2014-01-17T19:08:46Z

Did prepare_candidates itself give a single element, or did best_candidate give one after you had called prepare_candidates?

papriwalprateek · 2014-01-21T18:40:40Z

Hi,

I applied .content to http://www.algolist.net/Algorithms/Sorting/Bubble_sort. I got fairly good content, but the images did not come through. Is there a way to get them?

cantino · 2014-01-21T23:59:28Z

Did you include img in :tags? There is also a call to get images. See https://github.com/cantino/ruby-readability#images

pagojo · 2014-01-22T10:37:27Z

Sorry to sound repetitive, but could this be due to #51?
