cleanedArticleText() returns nothing #77

ghost · 2013-04-23T07:59:15Z

I am using the goose-2.1.22.jar. My Java code is:

package com.pasionat.test;

import com.gravity.goose.Article;
import com.gravity.goose.Configuration;
import com.gravity.goose.Goose;

public class Main {
    public static void main(String[] args) throws Exception {
        String url_string = "http://www.cnn.com/2010/POLITICS/08/13/democrats.social.security/index.html";
        Goose goose = new Goose(new Configuration());

        Article article = goose.extractContent(url_string);

        System.out.println("Goose text: " + article.cleanedArticleText());
        System.out.println("Goose rawHtml: " + article.rawHtml());
        System.out.println("Goose title: " + article.title());
    }
}

cleanedArticleText() returns an empty string, rawHtml() returns the full HTML of the page, and title() returns the page title. So the page is being fetched and parsed, but for some reason cleanedArticleText() is always empty. Is this a bug, or am I doing something wrong? Thanks.

pacificleo · 2013-06-13T18:33:57Z

I am hitting this too. Have you figured out a workaround?

pacificleo · 2013-06-13T21:56:10Z

I figured out the issue. The file src\main\resources\com\gravity\goose\text\stopwords-en.txt, as downloaded from GitHub, still has "\n" line endings. Since I am working on Windows, the line separator Scala uses is "\r\n", so the stopwords are split incorrectly. This breaks the algorithm in ContentExtractor.scala, which uses the number of stopwords in a paragraph as a clustering input.

Simply fixing the end-of-line characters in the stopwords file fixed this issue for me. I think git might be the cause, since it should translate the line endings automatically.
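
For readers who want to see the failure in isolation, here is a minimal Java sketch of the mismatch described above. It is not Goose code: the class name, the splitOn helper, and the three-word sample content are all made up for illustration. It only assumes that the file content uses "\n" while the split uses "\r\n".

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Goose.
class SeparatorMismatchDemo {

    // Splits content on a literal separator, similar in spirit to
    // split(sys.props("line.separator")) in the Scala source.
    static Set<String> splitOn(String content, String separator) {
        return new HashSet<>(Arrays.asList(content.split(Pattern.quote(separator))));
    }

    public static void main(String[] args) {
        // Made-up stand-in for a stopwords file that uses LF line endings.
        String lfContent = "a\nthe\nof\n";

        Set<String> splitWithLf = splitOn(lfContent, "\n");
        Set<String> splitWithCrLf = splitOn(lfContent, "\r\n");

        System.out.println(splitWithLf.contains("the"));   // true
        System.out.println(splitWithCrLf.contains("the")); // false
        System.out.println(splitWithCrLf.size());          // 1
    }
}
```

With the "\r\n" separator, the whole file ends up as a single set entry, so no individual word is ever recognized as a stopword. That would be consistent with the stopword count being zero for every paragraph.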

ghost · 2013-06-14T07:22:23Z

Nice find, and thanks for posting the solution! It worked for me too.

coding-idiot · 2014-03-28T10:29:23Z

Thanks a ton. I've been trying to figure this out since last week, and your solution worked perfectly.

shohamtalsignals · 2014-09-10T13:03:29Z

Or you can fix the splitter (I develop on Windows and deploy to Linux, so I need to use the same file):

Go to \src\main\scala\com\gravity\goose\text\StopWords.scala and change this line:
val STOP_WORDS = FileHelper.loadResourceFile("stopwords-en.txt", StopWords.getClass).split(sys.props("line.separator")).toSet

to this:
val STOP_WORDS = FileHelper.loadResourceFile("stopwords-en.txt", StopWords.getClass).split("\r?\n").toSet
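
The same idea can be checked outside Goose with a small Java sketch. The class name, the parseStopWords helper, and the sample content are hypothetical; only the \r?\n pattern comes from the change above. It shows that a split on an optional carriage return followed by a newline yields the same words for LF and CRLF input.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical demo class, not part of Goose.
class PortableSplitDemo {

    // Splits on an optional CR followed by LF, so both line-ending
    // styles produce the same entries.
    static Set<String> parseStopWords(String content) {
        return new LinkedHashSet<>(Arrays.asList(content.split("\\r?\\n")));
    }

    public static void main(String[] args) {
        // Made-up sample content in both line-ending styles.
        String lfContent = "a\nthe\nof\n";
        String crlfContent = "a\r\nthe\r\nof\r\n";

        Set<String> fromLf = parseStopWords(lfContent);
        Set<String> fromCrLf = parseStopWords(crlfContent);

        System.out.println(fromLf);                  // [a, the, of]
        System.out.println(fromLf.equals(fromCrLf)); // true
    }
}
```

Because the pattern does not depend on the platform's line.separator property, the result is the same wherever the code runs and however the file was checked out.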

dylanwatsonsoftware · 2015-06-08T04:59:31Z

Looks like this issue was introduced here: #58
