Retrieving main document image #52

Open
tommedema opened this issue Jun 29, 2015 · 2 comments

Comments

@tommedema

Libraries like boilerpipe allow one to extract an article's main image. Is it possible to achieve this with node-readability?

@kashifeqbal

node-unfluff works well for extracting the article's main image. You can feed it the HTML that node-readability has already fetched:

var read = require('node-readability');

var extractor = require('unfluff');

read('http://www.ainonline.com/aviation-news/aerospace/2016-01-14/boeing-and-unions-reconciled-sides-reach-tentative-deal', function (err, article, meta) {
  // Main Article 
  console.log(article.content);
  console.log("--------------------------------------------------");
  // Title 
  console.log(article.title);
  console.log("--------------------------------------------------");

  // HTML Source Code 
  console.log(article.html);
  console.log("--------------------------------------------------");

  // Response Object from Request Lib 
  console.log(meta);
  console.log("--------------------------------------------------");

  // Extract article main image
  var data = extractor(article.html);
  console.log(data.image);
  console.log("--------------------------------------------------");

  // Close article to clean up jsdom and prevent leaks 
  article.close();
});
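
If adding a second parser is not desirable, many news sites also declare a lead image in their page metadata. The sketch below is only an illustration under that assumption: `findMetaImage` is a hypothetical helper (not part of node-readability or unfluff) that scans a DOM document, such as `article.document`, for `og:image` or `twitter:image` meta tags and resolves the result against a base URL supplied by the caller.

```javascript
// Hypothetical helper (not part of node-readability or unfluff).
// Looks for a lead image declared in <meta> tags and returns an absolute URL,
// or null when no such tag is present.
function findMetaImage(document, baseUrl) {
  var wanted = ['og:image', 'twitter:image']; // checked in this order
  var metas = document.getElementsByTagName('meta');
  for (var w = 0; w < wanted.length; w++) {
    for (var i = 0; i < metas.length; i++) {
      var meta = metas[i];
      var key = meta.getAttribute('property') || meta.getAttribute('name');
      var value = meta.getAttribute('content');
      if (key === wanted[w] && value) {
        // new URL(input, base) leaves absolute URLs untouched
        // and resolves relative ones against base
        return new URL(value, baseUrl).href;
      }
    }
  }
  return null;
}
```

Inside the callback above it could be called as `findMetaImage(article.document, pageUrl)` before `article.close()`, where `pageUrl` is the address passed to `read`. Pages without such tags return `null`, so a fallback like unfluff would still be needed.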

@458muda

458muda commented Oct 6, 2016

I am using the following code to grab the main image of the article.
helpers.js

Grabbing main image of the article

var url = require('url');

// Return the URL of the first usable image in the document, or undefined
// if there is none. The chosen <img> is removed from the document.
var grabImage = module.exports.grabImage = function(document) {
    var images = document.getElementsByTagName('IMG');
    var MINIMUM_SURFACE = 100 * 100;
    var mainImageSrc;

    for (var i = 0; i < images.length; ++i) {
        var image = images[i];

        // Lazy-loaded images keep the real URL in a data attribute
        if (image.getAttribute('data-src')) {
            image.setAttribute('src', image.getAttribute('data-src'));
        }
        if (image.getAttribute('data-lazy-src')) {
            image.setAttribute('src', image.getAttribute('data-lazy-src'));
        }
        if (!image.getAttribute('src')) {
            continue;
        }

        // //Compute surface
        // var w = image.getAttribute('width') || 1;
        // var h = image.getAttribute('height') || 1;
        // image.surface = w * h;

        // //Filter by size
        // if ( image.surface > MINIMUM_SURFACE ) {
        mainImageSrc = image.getAttribute('src');

        // Resolve relative URL when the document's original URL is known
        if (!mainImageSrc.match(/^http/) && image.ownerDocument.originalURL) {
            mainImageSrc = url.resolve(image.ownerDocument.originalURL, mainImageSrc);
        }
        image.parentNode.removeChild(image);
        break;
        // }
    }
    return mainImageSrc;
};

and readability.js

var mainImageUrl = helpers.grabImage(this._document);
if (mainImageUrl) {
    var img = this._document.createElement("IMG");
    img.setAttribute('src', mainImageUrl);
    articleContent.insertBefore(img, articleContent.childNodes[0]);
}

I am adding the code above to this function:
Readability.prototype.getContent = function(notDeprecated) {
But it's not working. The whole content is being grabbed, but I'm getting this error:

> Cleaning Conditionally [object HTMLDivElement] (image width-494:)
Cleaning Conditionally [object HTMLDivElement] (:)
fixed link
C:\Users\SAI\reader-rest\routes\api.js:19
                var content = '<html><head><meta charset="utf-8"><title>'+article.title+'</title></head><body>' +article.content+'</body></html>';

 ^

TypeError: Cannot read property 'title' of undefined
    at C:\Users\SAI\reader-rest\routes\api.js:19:78
    at Object.jsdom.env.done (C:\Users\SAI\reader-rest\node_modules\node-readability\src\readability.js:292:18)
    at C:\Users\SAI\reader-rest\node_modules\node-readability\node_modules\jsdom\lib\jsdom.js:259:18
    at nextTickCallbackWith0Args (node.js:420:9)
    at process._tickCallback (node.js:349:13)

Any help would be appreciated.
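
The stack trace shows the `TypeError` being raised at `api.js` line 19, where `article.title` is read while `article` is undefined. One possible reading, offered as an assumption rather than a confirmed diagnosis, is that the callback was invoked with an error as its first argument and no article. A minimal sketch of a callback body that checks `err` before using `article` follows; `renderArticle` is a hypothetical name, and the HTML template is the one visible in the trace.

```javascript
// Hypothetical callback body for read(url, callback); renderArticle is an
// illustrative name, not part of node-readability.
function renderArticle(err, article) {
  // When extraction fails, err is set and article is undefined,
  // so bail out before touching article.title.
  if (err || !article) {
    return {
      ok: false,
      message: err ? String(err.message || err) : 'No article returned'
    };
  }
  var content = '<html><head><meta charset="utf-8"><title>' + article.title +
    '</title></head><body>' + article.content + '</body></html>';
  return { ok: true, html: content };
}
```

Logging `err` in the failure branch may reveal the underlying exception, which is otherwise hidden behind the `TypeError`.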
