This repository has been archived by the owner on Oct 30, 2018. It is now read-only.

cleanedArticleText() returns nothing #77

Open
ghost opened this issue Apr 23, 2013 · 6 comments

ghost commented Apr 23, 2013

I am using the goose-2.1.22.jar. My Java code is:

package com.pasionat.test;

import com.gravity.goose.Article;
import com.gravity.goose.Configuration;
import com.gravity.goose.Goose;

public class Main {
    public static void main(String[] args) throws Exception {
        String url_string = "http://www.cnn.com/2010/POLITICS/08/13/democrats.social.security/index.html";
        Goose goose = new Goose(new Configuration());

        Article article = goose.extractContent(url_string);

        System.out.println("Goose text: " + article.cleanedArticleText());
        System.out.println("Goose rawHtml: " + article.rawHtml());
        System.out.println("Goose title: " + article.title());
    }
}

cleanedArticleText() returns an empty string, rawHtml() returns the full HTML of the page, and title() returns the page title. So the page is being fetched and parsed, but for some reason cleanedArticleText() is always empty. Is this a bug, or am I doing something wrong? Thanks.

@pacificleo

I am hitting this too. Have you figured out a workaround?

@pacificleo

I figured out the issue. The src\main\resources\com\gravity\goose\text\stopwords-en.txt file, when downloaded from GitHub, still had "\n" line endings. Since I am working on Windows, the line separator Scala picks up is "\r\n", so the stopword list is split incorrectly: "\r\n" never occurs in the file, and the whole file ends up as a single entry. This breaks the algorithm in ContentExtractor.scala, which uses the number of stopwords in a paragraph as a clustering input.

Simply fixing the end-of-line characters in the stopwords file fixed this issue for me. I think git might be the cause, since it should translate the line endings automatically.
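
To make the failure mode concrete, here is a minimal, self-contained Java sketch. It does not use Goose itself; the class and method names (StopWordSplitDemo, splitStopWords, countStopWords) and the three-word list are made up for illustration. It only shows what happens when LF-terminated content is split on the Windows separator.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo class, not part of Goose.
public class StopWordSplitDemo {

    // Splits the file content on the given separator (String.split treats it as a regex).
    static Set<String> splitStopWords(String content, String separator) {
        return new HashSet<>(Arrays.asList(content.split(separator)));
    }

    // Counts how many whitespace-separated words of the paragraph are in the stopword set.
    static long countStopWords(String paragraph, Set<String> stopWords) {
        return Arrays.stream(paragraph.toLowerCase().split("\\s+"))
                .filter(stopWords::contains)
                .count();
    }

    public static void main(String[] args) {
        // Stand-in for a stopwords file checked out with LF line endings.
        String fileContent = "the\nand\nof\n";
        String paragraph = "The history of the project and the team";

        // Splitting on the Windows separator: "\r\n" never matches,
        // so the whole file becomes one bogus "stopword".
        Set<String> splitAsOnWindows = splitStopWords(fileContent, "\r\n");
        // Splitting on "\n" gives the three real entries.
        Set<String> splitAsOnUnix = splitStopWords(fileContent, "\n");

        System.out.println("entries (\\r\\n split): " + splitAsOnWindows.size()); // 1
        System.out.println("entries (\\n split):   " + splitAsOnUnix.size());    // 3
        System.out.println("stopwords found (\\r\\n split): "
                + countStopWords(paragraph, splitAsOnWindows));                  // 0
        System.out.println("stopwords found (\\n split):   "
                + countStopWords(paragraph, splitAsOnUnix));                     // 5
    }
}
```

With zero stopwords counted in every paragraph, an extractor that scores paragraphs by stopword count has nothing to rank, which would be consistent with the empty cleanedArticleText() reported above.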

ghost commented Jun 14, 2013

Nice find, and thanks for posting the solution! It worked for me too.

@coding-idiot

Thanks a ton. I've been trying to figure this out since last week, and your solution worked perfectly.

@shohamtalsignals

Or you can fix the splitter instead (I develop on Windows and deploy to Linux, so I need the same file to work on both):

Go to \src\main\scala\com\gravity\goose\text\StopWords.scala and change this line:
val STOP_WORDS = FileHelper.loadResourceFile("stopwords-en.txt", StopWords.getClass).split(sys.props("line.separator")).toSet

to this:
val STOP_WORDS = FileHelper.loadResourceFile("stopwords-en.txt", StopWords.getClass).split("\r?\n").toSet
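
As a quick check of why the "\r?\n" pattern works for either checkout, here is a small Java sketch (Scala's split on a String is the same java.lang.String.split). The class name TolerantSplitDemo and the method parseStopWords are hypothetical, and the sample content is invented; this is not the Goose code.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical demo class, not part of Goose.
public class TolerantSplitDemo {

    // Splits on LF with an optional preceding CR, so both LF and CRLF files parse the same way.
    static Set<String> parseStopWords(String content) {
        return new LinkedHashSet<>(Arrays.asList(content.split("\r?\n")));
    }

    public static void main(String[] args) {
        String lfContent = "the\nand\nof\n";         // file with Unix line endings
        String crlfContent = "the\r\nand\r\nof\r\n"; // same file with Windows line endings

        Set<String> fromLf = parseStopWords(lfContent);
        Set<String> fromCrlf = parseStopWords(crlfContent);

        System.out.println(fromLf);                  // [the, and, of]
        System.out.println(fromCrlf);                // [the, and, of]
        System.out.println(fromLf.equals(fromCrlf)); // true
    }
}
```

Because the pattern no longer depends on the platform's line.separator, the result is the same on Windows and Linux regardless of how the resource file was checked out.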

@dylanwatsonsoftware

Looks like this issue was introduced here: #58
