Skip to content

Latest commit

 

History

History
62 lines (40 loc) · 3.42 KB

README.md

File metadata and controls

62 lines (40 loc) · 3.42 KB

DGST

DGST (Distributed Generalized Suffix Tree) is an efficient and scalable generalized suffix tree construction algorithm on distributed data-parallel platforms. DGST supports indexing all suffixes for a variety of strings, ranging from a single long string to multiple strings of varied lengths. In addition, DGST is built on the widely-used distributed data-parallel platform Apache Spark.

News

DGST won the first place in the 3rd National University Cloud Computing Application Innovation Competition Big Data Skills Challenge held in 2016-2017, China.

Prerequisites

  • Apache Spark: As DGST is built on top of Spark, you need to get the Spark installed first. If you are not clear how to setup Spark, please refer to the guidelines here. Currently, DGST is developed on the APIs of Spark 1.6.3 version.
  • Apache HDFS: DGST uses HDFS as the distributed file system. If you are not clear how to setup HDFS, please refer to the guidelines here. The HDFS version is 2.6.5.
  • Java: The JDK version is 1.7.

Compile

DGST is built using Apache Maven. To build DGST, Run mvn scala:compile compile package in the root directory.

The compiled jar is available at target/DGST-1.0-SNAPSHOT.jar

Note: A precompiled jar is provided in the lib directory.

Run

The entry of DGST is defined in the scala class GST.Main.

Run DGST with the following command:

spark-submit --master [SPARK_MASTER_ADDRESS] \
--executor-memory 2G \
--driver-memory 20G \
--total-executor-cores [Total cores in the Spark cluster] \ 
--conf spark.memory.fraction=0.9 \
--conf spark.memory.storageFraction=0.1 \
--driver-class-path [Configuration directory] \
--class GST.Main \
DGST-1.0-SNAPSHOT.jar \
<Input path> <Output path> <Temporary path> \
<Range size> <Maximum sub-tree size> <Parallelism>

The run command contains the following parameters:

  • <Input path>: The input data path on HDFS or local file system.
  • <Output path>: The output data path on HDFS or local file system.
  • <Temporary path>: The tempoaray data path on HDFS or local file system.
  • <Range size>: The size of range in the LCP-Range structure
  • <Maximum sub-tree size>: The Maximum sub-tree size (i.e., the maximum S-prefix frequency)
  • <Parallelism>: The computation parallelism on Spark

Note: --driver-class-path is the directory where the configuration file is. The parameters configured in the run command have higher priority than those in the configuration file. To find more information about the configuration file, please refer to the Configuration Guide.

Note: We also port the state-of-the-art ERa algorithm to Spark. Please replace the main class with GST.EraMain to run the ERa algorithm. The performance comparison on Spark is here.

Demo

The dataset in demo directory contais two strings. Each file represents a string.

The generalized suffix tree for the two demo strings can be found here.

Licence

Please contact the authors for the licence info of the source code.