Kangooroo is a Java utility for crawling malicious URLs. It is integrated with Assemblyline's URLDownloader service and can be used as a standalone commandline application.
We are using Java 11 with Gradle 7.0.2 for building the Kangooroo Jars. You can also use the gradle wrapper in the project to build the Jar.
To create a uber jar for Kangooroo to run as a standalone utility, run command gradle shadowJar
. You should be able to find the KangoorooStandalone.jar
in the build directory.
-
Have Java version 11 installed
-
Have
chromium-browser
and achromedriver
that matches the chromium browser version installed (chromedriver can be downloaded from https://chromedriver.chromium.org/downloads) -
Have
chromedriver
andKangoorooStandalone.jar
placed in the same folder -
if you want to configure simple logging (we have log4j included in the Jar), copy the file at log4j2.xml to your current directory and add this to your java command
java -Dlog4j2.configurationFile=./log4j2.xml
-
if you want to configure
output_folder
andtemporary_folder
location, copy over theconf.yml
file and modify the value foroutput_folder
andtemporary_folder
to your desired location.
The program can be run with command java -jar KangoorooStandalone.jar [ARGUMENTS]
Running the program with logging configured:
java -Dlog4j2.configurationFile=./log4j2.xml -jar KangoorooStandalone.jar [ARGUMENTS]
url
the url that you wish to crawl. It should be full url that starts with "http://" or "https://"url-type
EitherPHISHING
(default) orSMISHING
. ThePHISHING
options sets the window size to1280x720
and sets the user-agent to a desktop client. TheSMISHING
option sets the window size to be375x667
and sets the user-agent to a phone client.
Users can specify which directory for output and which directory for storing temporary files by modifying the conf.yml file and specify the new conf file with: -cf path/to/conf/file
.
The current default is ./tmp/
directory for temporary files and ./output/
for output result file. These two folder must exist on disk before running the program.
Users can also control the verbosity of the output with the followign arguments:
not-sanitize-session
,nss
: To not remove file content from the the HAR output stored in results.jsonnot-save-file
,nsf
: To not save save favicon.ico, website source, and screenshot filenot-save-har
,nsh
: To not save the full har file
Users can also turn off Chrome's default sandbox mode with argument no-sandbox
. NOTE: This is a security risk and is not recommended.
Users can specify some special modules with the mods
option with a comma separated list.
Current modules:
captcha
: Run the audio captcha solver when possiblefile_hash
: Include file hashes of each files fetched from the web crawl in results.jsonsummary
: Include summary of crawl result in results.json
All files of interest would be in the output_folder specified in conf.yml
. The result of the web crawl is stored in the directory {output_folder}/{MD5 HASH of URL}/
.
No more zip files in the output directory, instead we have:
- favicon.ico : favicon of website if exist
- screenshot.png : a screenshot of the website of interest
- session.har : the HAR file of the communication with the website
- source.html: the source html file of the website of interest
- result.json: json summary of the url fetch result