This repository has been archived by the owner on Sep 27, 2019. It is now read-only.

Banned_words

Félix Lehoux edited this page Mar 1, 2018 · 1 revision

We found that the crawler has a "bug" related to banned words. At first, we thought the crawler blocked every website containing a word from the list of banned words. In fact, it blocks an onion domain only if the title of the website contains a banned word; it does not look at the content of the page. As a result, unwanted links can still be saved when their titles are clean. So far, the infrastructure offers two lists: a whitelist and a blacklist. The whitelist contains all websites with no banned words in the title, and the blacklist contains all websites with banned words in the title. We could change the rule so that the content of the page is also scanned. However, this could trigger a lot of false positives. The banned list has terms like "child porn" or "child pornography", so if a website has "no child pornography" in its content, the domain would still be banned.
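
The difference between the two rules can be illustrated with a minimal Python sketch. This is not the crawler's actual code: the function names, the placeholder `BANNED_WORDS` list, and the case-insensitive substring matching are all assumptions made for illustration.

```python
# Hypothetical placeholder list; the crawler's real banned-words list is not reproduced here.
BANNED_WORDS = ["banned phrase", "forbidden term"]


def contains_banned(text, banned=BANNED_WORDS):
    """Return True if any banned term occurs in the text (assumed case-insensitive substring match)."""
    lowered = text.lower()
    return any(term in lowered for term in banned)


def classify_title_only(title, body):
    """Rule described above as the current behavior: only the title is checked."""
    return "blacklist" if contains_banned(title) else "whitelist"


def classify_title_and_body(title, body):
    """Possible stricter rule: the page content is scanned as well."""
    if contains_banned(title) or contains_banned(body):
        return "blacklist"
    return "whitelist"


# A page with a clean title but unwanted content slips through the title-only rule.
unwanted = ("Welcome", "This page offers a banned phrase for sale.")
# A page that only disclaims the banned content is a false positive under the stricter rule.
disclaimer = ("Marketplace rules", "No banned phrase is allowed on this site.")

for title, body in (unwanted, disclaimer):
    print(title, classify_title_only(title, body), classify_title_and_body(title, body))
```

In this sketch both pages are whitelisted by the title-only rule and both are blacklisted once the body is scanned, which shows the trade-off: scanning content catches the unwanted page but also bans the one that merely states a prohibition.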
