MapReduce on DBLP Dataset


Prerequisites

  • Install VMware based on your OS.
  • Install the Cloudera VM.
  • Java 1.8 needs to be installed on the system.
  • Update Java to 1.8 on Cloudera (follow these steps: URL1, URL2).
  • Install Gephi.

Setup using SBT

  1. Check out this project, or download the project archive and extract it.

  2. Open up IntelliJ. Navigate to File -> Open and select the directory of the project.

  3. Open the terminal tab in IntelliJ and type the following commands:

    Compile and run the unit tests: sbt clean compile test

    Compile and run the application: sbt clean compile assembly
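
The assembly command above relies on the sbt-assembly plugin being declared in the build. If you are recreating the setup from scratch, a minimal project/plugins.sbt would look like this sketch (the plugin version is illustrative, not necessarily the one this project uses):

```scala
// project/plugins.sbt (sketch): enables the `assembly` task used above.
// The plugin version shown here is illustrative.
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")
```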

Running

1. XML Parser for DBLP dataset

  • Open the workspace in IntelliJ

  • Open the file DBLPParser.java

  • Go to Edit Configurations and make the following changes:

    a. Set the main class to com.uic.mapreduce.xml.DBLPParser

    b. In VM options set -Xmx6G (increases the memory allocated to the JVM for this task)

    c. Provide absolute paths to the DBLP dataset and UIC_authors.txt (included in the project) as program arguments, e.g.: F:\UIC\441\mapreduce\dblp\dblp.xml .\src\main\resources\UIC_authors.txt

  • Run the file.

It will generate a file in the logs folder containing the comma-separated UIC authors for each article, inproceedings, proceedings, book, incollection, and phdthesis entry.

Sample Output:

Robert H. Sloan,Ugo A. Buy
Bhaskar DasGupta
Andrew E. Johnson
Luc Renambot,Andrew E. Johnson
Ajay D. Kshemkalyani
Luc Renambot,Andrew E. Johnson
Luc Renambot,Andrew E. Johnson
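
For orientation only, here is a minimal Scala sketch of such a streaming pass over dblp.xml. It is not the project's actual DBLPParser.java; the output file name is hypothetical, and it assumes dblp.dtd sits next to dblp.xml so the DBLP character entities resolve.

```scala
import java.io.{FileInputStream, PrintWriter}
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}
import scala.io.Source

// Sketch, not the project's DBLPParser: stream dblp.xml and, for every
// publication record, keep only the authors that appear in UIC_authors.txt.
object DblpParserSketch {
  private val pubTags =
    Set("article", "inproceedings", "proceedings", "book", "incollection", "phdthesis")

  def main(args: Array[String]): Unit = {
    val Array(dblpPath, uicAuthorsPath) = args
    val uicAuthors = Source.fromFile(uicAuthorsPath).getLines().map(_.trim).toSet

    // dblp.xml declares a DOCTYPE; dblp.dtd is expected to sit next to it.
    val reader = XMLInputFactory.newInstance()
      .createXMLStreamReader(new FileInputStream(dblpPath))
    val out = new PrintWriter("uic_authors.log") // hypothetical output file name

    var authors = List.empty[String]
    while (reader.hasNext) {
      val event = reader.next()
      if (event == XMLStreamConstants.START_ELEMENT && pubTags(reader.getLocalName)) {
        authors = Nil // a new publication record starts
      } else if (event == XMLStreamConstants.START_ELEMENT && reader.getLocalName == "author") {
        authors ::= reader.getElementText.trim // collect each author of the record
      } else if (event == XMLStreamConstants.END_ELEMENT && pubTags(reader.getLocalName)) {
        val uic = authors.reverse.filter(uicAuthors) // keep UIC authors only
        if (uic.nonEmpty) out.println(uic.mkString(","))
      }
    }
    out.close()
    reader.close()
  }
}
```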

2. Assemble Jar

Run sbt clean compile assembly, which creates a jar named author-map-dblp.jar under \target\scala-2.12.
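
The jar name and output path come from the build definition; if you need to reproduce them, a hedged build.sbt sketch is shown below (the Scala and Hadoop versions are illustrative, not necessarily the ones this project pins):

```scala
// build.sbt (sketch): settings that would yield target/scala-2.12/author-map-dblp.jar.
// Versions below are illustrative, not necessarily the project's.
name := "author-map-dblp"
scalaVersion := "2.12.8"

// Hadoop classes are provided by the Cloudera cluster at run time,
// so they are excluded from the assembled jar.
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.6.0" % "provided"

assemblyJarName in assembly := "author-map-dblp.jar"
```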

3. SFTP to Cloudera VM

  • Start the Cloudera VM instance in VMware.
  • Once the VM is up and running, get its IP address from the network settings of VMware.
  • Using WinSCP or another tool or command (based on your OS), transfer the files (the jar and the log file) to a location on the VM.
  • Use the default username and password provided by Cloudera.

4. Run Hadoop Mapreduce

  • Navigate to the directory where the above files are stored on the VM.
  • Create an input directory on HDFS: hadoop fs -mkdir input_dir
  • Put the log file into the input directory: hadoop fs -put <logfilename> input_dir
  • Run the jar from the directory where it is stored: hadoop jar author-map-dblp.jar AuthorMapping input_dir output_dir (a sketch of such a job follows this list)
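
The mapper and reducer that actually run come from author-map-dblp.jar; purely as a sketch of what an author-pair counting job over the parser's log lines could look like (class names, the pairing rule, and the self-pair handling are illustrative assumptions, not the project's AuthorMapping code):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

// One input line = comma-separated UIC authors of one publication.
// Emit every author pair on that line (including self-pairs) with a count of 1.
class PairMapper extends Mapper[Object, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val pair = new Text()

  override def map(key: Object, value: Text,
                   context: Mapper[Object, Text, Text, IntWritable]#Context): Unit = {
    val authors = value.toString.split(",").map(_.trim).filter(_.nonEmpty)
    for (a <- authors; b <- authors if a <= b) { // one ordering per pair
      pair.set(s"$a,$b,")
      context.write(pair, one)
    }
  }
}

// Sum the occurrences of each author pair across all publications.
class PairReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  private val total = new IntWritable()

  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var sum = 0
    val it = values.iterator()
    while (it.hasNext) sum += it.next().get()
    total.set(sum)
    context.write(key, total) // TextOutputFormat separates key and count with a tab
  }
}

object AuthorPairJob {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "uic author pairs")
    job.setJarByClass(this.getClass)
    job.setMapperClass(classOf[PairMapper])
    job.setCombinerClass(classOf[PairReducer])
    job.setReducerClass(classOf[PairReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path(args(0)))   // e.g. input_dir
    FileOutputFormat.setOutputPath(job, new Path(args(1))) // e.g. output_dir
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}
```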

5. Extract Output to System

  • Once the job has completed, copy the output from HDFS to the local VM directory: hadoop fs -get output_dir/part-r-00000 ./

Sample Output:

A. Prasad Sistla,A. Prasad Sistla,	102
A. Prasad Sistla,Bing Liu 0001,	1
A. Prasad Sistla,Isabel F. Cruz,	2
A. Prasad Sistla,Lenore D. Zuck,	6
A. Prasad Sistla,Robert H. Sloan,	1
A. Prasad Sistla,V. N. Venkatakrishnan,	8
Ajay D. Kshemkalyani,Ajay D. Kshemkalyani,	112
Ajay D. Kshemkalyani,Ugo Buy,	1

6. SFTP from Cloudera VM

  • Transfer the output file (part-r-00000) from the VM to a folder on the host system.

7. Run Gephi for Graph

  • Open Gephi and load the workspace provided in the logs folder of this project.
  • Import the CSV file (first convert the file copied from the Cloudera VM to a .csv file; a conversion sketch follows).
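
The conversion itself is left to you. As a sketch only, the following Scala snippet reshapes the Hadoop output lines shown in the sample above into an edge list with the Source, Target, and Weight columns that Gephi's spreadsheet importer recognizes (the edges.csv name is arbitrary):

```scala
import java.io.PrintWriter
import scala.io.Source

// Sketch only: reshape the Hadoop output (lines like "Author A,Author B,<TAB>count")
// into an edge list CSV for Gephi.
object ToGephiCsv {
  def main(args: Array[String]): Unit = {
    val out = new PrintWriter("edges.csv") // hypothetical output file name
    out.println("Source,Target,Weight")
    for (line <- Source.fromFile(args(0)).getLines() if line.contains("\t")) {
      val Array(pair, weight) = line.split("\t", 2)
      val Array(src, dst) = pair.stripSuffix(",").split(",", 2).map(_.trim)
      out.println(s""""$src","$dst",${weight.trim}""")
    }
    out.close()
  }
}
```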

Output Graph:

Built With


  • Scala - Scala combines object-oriented and functional programming in one concise, high-level language
  • SBT - sbt is a build tool for Scala & Java
  • Cloudera - Cloudera QuickStart VMs (single-node cluster)
  • Hadoop - framework that allows for the distributed processing of large data sets

Authors