This repository has been archived by the owner on Sep 27, 2019. It is now read-only.
Félix Lehoux edited this page Mar 1, 2018 · 1 revision
We found that the crawler has a "bug" in how it handles banned words. At first, we thought that the crawler blocked every website containing a word from the banned list. Instead, it blocks an onion domain only if the title of the website contains a banned word; it does not look at the content of the page. As a result, unwanted links can still be saved.

So far, the infrastructure offers two lists: a whitelist and a blacklist. The whitelist contains all websites with no banned words in the title, and the blacklist contains all websites with banned words in the title.

We could change the rule so that the content of the page is also scanned. However, this could trigger a lot of false positives. The banned list includes terms like "child porn" and "child pornography", so with content scanning a website that states "no child pornography" in its content would still be banned.
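
The difference between the two rules can be illustrated with a minimal sketch. This is not the crawler's actual code: the function names, the `BANNED_TERMS` list, and the case-insensitive substring matching are all assumptions made for illustration, and only the two banned terms quoted above are taken from this page.

```python
# Hypothetical banned list; the two terms are the examples quoted on this page.
BANNED_TERMS = ["child porn", "child pornography"]


def contains_banned_term(text: str) -> bool:
    """Case-insensitive substring match (an assumed matching rule)."""
    lowered = text.lower()
    return any(term in lowered for term in BANNED_TERMS)


def classify_title_only(title: str, content: str) -> str:
    """Behavior described on this page: only the title is checked."""
    return "blacklist" if contains_banned_term(title) else "whitelist"


def classify_with_content(title: str, content: str) -> str:
    """Possible new rule: check both the title and the page content."""
    if contains_banned_term(title) or contains_banned_term(content):
        return "blacklist"
    return "whitelist"


# A page whose content explicitly rejects the banned material.
title = "Community rules"
content = "This forum allows no child pornography."

print(classify_title_only(title, content))    # whitelist
print(classify_with_content(title, content))  # blacklist (false positive)
```

Under these assumptions, the second call shows the false positive: the phrase "no child pornography" still contains a banned term, so plain substring matching on the content would ban the domain.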