Incremental Maven Crawler

This application crawls the Maven Central Incremental Index Repository at a fixed interval. While running, it follows the repository and outputs the unique artifacts released on Maven Central. Currently, Maven Central releases a new (incremental) index every week.

Several outputs are supported, including standard output, Kafka, and HTTP (REST). A checkpointing mechanism provides persistence across restarts: the checkpointDir (set with --checkpoint_dir) stores an INDEX.index file, where INDEX is the next index to crawl. For example, when 800.index is stored, the crawler resumes at index 800 (inclusive).
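
The checkpoint convention described above can be sketched roughly as follows. This is a minimal illustration, not the crawler's actual code: the class and method names (CheckpointSketch, readCheckpoint, writeCheckpoint, resolveStartIndex) are hypothetical, and it assumes that a stored checkpoint takes precedence over --start_index and that older marker files are removed when a new one is written.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.OptionalInt;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper illustrating the INDEX.index convention.
final class CheckpointSketch {

    // Matches marker files such as "800.index".
    private static final String MARKER = "\\d{1,9}\\.index";

    // Returns the highest N for which "N.index" exists in dir, if any.
    static OptionalInt readCheckpoint(Path dir) throws IOException {
        if (!Files.isDirectory(dir)) {
            return OptionalInt.empty();
        }
        try (Stream<Path> files = Files.list(dir)) {
            return files
                    .map(p -> p.getFileName().toString())
                    .filter(name -> name.matches(MARKER))
                    .mapToInt(name -> Integer.parseInt(
                            name.substring(0, name.length() - ".index".length())))
                    .max();
        }
    }

    // Records nextIndex as the next index to crawl.
    // Assumption: older markers are deleted so only one remains.
    static void writeCheckpoint(Path dir, int nextIndex) throws IOException {
        Files.createDirectories(dir);
        List<Path> old;
        try (Stream<Path> files = Files.list(dir)) {
            old = files
                    .filter(p -> p.getFileName().toString().matches(MARKER))
                    .collect(Collectors.toList());
        }
        for (Path p : old) {
            Files.delete(p);
        }
        Files.createFile(dir.resolve(nextIndex + ".index"));
    }

    // Assumption: a stored checkpoint wins over the --start_index argument.
    static int resolveStartIndex(Path dir, int startIndexArg) throws IOException {
        return readCheckpoint(dir).orElse(startIndexArg);
    }
}
```

With an empty checkpoint directory, resolveStartIndex falls back to the value passed on the command line; after writeCheckpoint(dir, 800) it yields 800, matching the 800.index example.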

Usage

usage: IncrementalMavenCrawler
 -i,--interval <hours>            Time to wait between crawl attempts (in
                                  hours). Defaults to 1 hour.
 -o,--output <[std|kafka|rest]>   Output to send the crawled artifacts to.
                                  Defaults to std.
 -si,--start_index <index>        Index to start crawling from (inclusive).
                                  Required.
 -bs,--batch_size <amount>        Size of batches to send to output.
                                  Defaults to 50.
 -cd,--checkpoint_dir <dir>       Directory to checkpoint/store latest
                                  crawled index. Used for recovery on
                                  crash or restart. Optional.
 -kb,--kafka_brokers <brokers>    Kafka brokers to connect with. I.e.
                                  broker1:port,broker2:port,...
                                  Required for Kafka output.
 -kt,--kafka_topic <topic>        Kafka topic to produce to.
                                  Required for Kafka output.
 -re,--rest_endpoint <url>        HTTP endpoint to post crawled batches to.
                                  Required for Rest output.
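
To illustrate what --batch_size controls, here is a minimal sketch of splitting crawled artifacts into batches before they are handed to an output. The class and method names (BatchSketch, toBatches) are hypothetical, and sending a smaller final batch is an assumption rather than documented behaviour.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: groups items into batches of at most batchSize.
final class BatchSketch {

    static <T> List<List<T>> toBatches(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int from = 0; from < items.size(); from += batchSize) {
            // The last batch may be smaller than batchSize (assumption).
            int to = Math.min(from + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(from, to)));
        }
        return batches;
    }
}
```

With the default batch size of 50, a crawl that yields 120 artifacts would be split into batches of 50, 50 and 20 under this sketch.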

Outputs

An example JSON output message:

{
   "artifactId":"config",
   "groupId":"software.amazon.awssdk",
   "version":"2.15.58",
   "date":1609791717000,
   "artifactRepository":"https://repo.maven.apache.org/maven2/"
}
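
The message above could be modelled and serialized roughly as follows. This is a sketch under assumptions, not the project's actual class: the name ArtifactMessageSketch is hypothetical, the JSON is built by hand to keep the example dependency-free, and the date field is assumed to be a Unix timestamp in milliseconds.

```java
import java.time.Instant;

// Hypothetical value class mirroring the fields of the example message.
final class ArtifactMessageSketch {
    final String artifactId;
    final String groupId;
    final String version;
    final long date; // assumed: epoch milliseconds
    final String artifactRepository;

    ArtifactMessageSketch(String artifactId, String groupId, String version,
                          long date, String artifactRepository) {
        this.artifactId = artifactId;
        this.groupId = groupId;
        this.version = version;
        this.date = date;
        this.artifactRepository = artifactRepository;
    }

    // Serializes the fields in the same order as the example message.
    String toJson() {
        return "{"
                + "\"artifactId\":" + quote(artifactId) + ","
                + "\"groupId\":" + quote(groupId) + ","
                + "\"version\":" + quote(version) + ","
                + "\"date\":" + date + ","
                + "\"artifactRepository\":" + quote(artifactRepository)
                + "}";
    }

    // Interprets the date field as epoch milliseconds (assumption).
    Instant releasedAt() {
        return Instant.ofEpochMilli(date);
    }

    // Minimal JSON string escaping: quotes, backslashes, control characters.
    private static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            if (c == '"') {
                sb.append("\\\"");
            } else if (c == '\\') {
                sb.append("\\\\");
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.append('"').toString();
    }
}
```

A real implementation would more likely rely on a JSON library; the hand-written serializer is only meant to make the field layout explicit.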

StdOutput:
Outputs to the console using System.out.println in a JSON format.

KafkaOutput:
Outputs to a Kafka topic.
Requires the arguments:

--output kafka; switches the output mode to Kafka.
--kafka_topic TOPIC; the Kafka topic to send to.
--kafka_brokers BROKER1,BROKER2; the brokers to connect to.

Deployment

To build the image:

mvn clean package
docker build . -t crawler

Alternatively, pull the image from GitHub Packages. This requires Docker to be logged in to docker.pkg.github.com with your GitHub token.

docker pull docker.pkg.github.com/fasten-project/incremental-maven-crawler/crawler:latest
docker tag docker.pkg.github.com/fasten-project/incremental-maven-crawler/crawler:latest crawler

To run:

docker run crawler [arguments]

For example:

docker run crawler --start_index 682 --output std
