cleanedArticleText() returns nothing #77
Comments
I am hitting this too. Have you figured out a workaround?
I figured out the issue. The src\main\resources\com\gravity\goose\text\stopwords-en.txt file, when downloaded from GitHub, still had "\n" line endings. Since I am working on Windows, the Scala line separator is "\r\n", so the stopwords are split incorrectly. This breaks the algorithm in ContentExtractor.scala, which uses the number of stopwords in a paragraph as a clustering input. Simply fixing the end-of-line characters in the stopwords file fixed the issue for me. I think git might be the cause, since it should translate the line endings automatically.
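To illustrate the failure mode described above, here is a minimal standalone sketch. It is not Goose's actual loader: the class name, method names, and the three sample words are all made up for the demonstration. It only shows how splitting a "\n"-terminated file on "\r\n" yields one giant token, and how a line-ending-tolerant split avoids that.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Goose.
public class StopWordSplitDemo {

    // Splits on one exact separator, the way a loader that relies on the
    // platform line separator would.
    static Set<String> splitOnSeparator(String contents, String separator) {
        return new HashSet<>(Arrays.asList(contents.split(Pattern.quote(separator))));
    }

    // Splits on either "\n" or "\r\n", so the file's line endings do not matter.
    static Set<String> splitOnAnyLineEnding(String contents) {
        return new HashSet<>(Arrays.asList(contents.split("\\r?\\n")));
    }

    public static void main(String[] args) {
        // Stand-in for a stopwords file that kept Unix line endings.
        String unixFile = "a\nthe\nof";

        // Simulates Windows, where the line separator is "\r\n".
        Set<String> broken = splitOnSeparator(unixFile, "\r\n");
        Set<String> tolerant = splitOnAnyLineEnding(unixFile);

        System.out.println(broken.size() + " " + broken.contains("the"));     // prints "1 false"
        System.out.println(tolerant.size() + " " + tolerant.contains("the")); // prints "3 true"
    }
}
```

With the exact-separator split, no individual stopword is ever found, so any stopword count built on that set would come out as zero for every paragraph.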
Nice find and thanks for posting the solution! It worked for me too.
Thanks a ton, I've been trying to figure this out since last week and your solution worked perfectly.
Or you can fix the splitter instead (I develop on Windows and deploy to Linux, so I need to use the same file): go to \src\main\scala\com\gravity\goose\text\StopWords.scala and change the splitting code to this:
Looks like this issue was introduced here: #58
I am using the goose-2.1.22.jar. My Java code is:
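package com.pasionat.test;

import com.gravity.goose.Article;
import com.gravity.goose.Configuration;
import com.gravity.goose.Goose;

public class Main {
    public static void main(String[] args) throws Exception {
        String url_string = "http://www.cnn.com/2010/POLITICS/08/13/democrats.social.security/index.html";
        Goose goose = new Goose(new Configuration());
        // Fetch and parse the page, then print the fields discussed below.
        Article article = goose.extractContent(url_string);
        System.out.println(article.title());
        System.out.println(article.cleanedArticleText());
    }
}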
cleanedArticleText() returns empty string, rawHTML returns the full HTML code of the page and title() returns the title of the page. So the website is being parsed and processed, but for some reason cleanedArticleText() is always empty. Is this a bug or am I doing something wrong? Thanks.