MapReduce on DBLP Dataset


Prerequisites

  • Install VMware based on your OS.
  • Install the Cloudera VM.
  • Java 1.8 needs to be installed on the system.
  • Update Java to 1.8 on Cloudera (follow these steps: URL1, URL2).
  • Install Gephi.

Setup using SBT

  1. Check out this project, or download the project archive and extract it.

  2. Open up IntelliJ. Navigate to File -> Open and select the directory of the project.

  3. Open the terminal tab in IntelliJ and type the following commands:

    Compile and run the unit tests: sbt clean compile test

    Compile and run the application: sbt clean compile assembly
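
The assembly command above relies on the sbt-assembly plugin being declared in the build. If you are recreating the setup from scratch, a minimal project/plugins.sbt would look like this sketch (the plugin version is illustrative, not necessarily the one this project uses):

```scala
// project/plugins.sbt (sketch): enables the `assembly` task used above.
// The plugin version shown here is illustrative.
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")
```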

Running

1. XML Parser for DBLP dataset

  • Open the workspace in IntelliJ

  • Open the file DBLPParser.java

  • Go to Edit Configurations and make the following changes:

    a. Set the main class to com.uic.mapreduce.xml.DBLPParser

    b. In VM options set -Xmx6G (increases the memory allocated to the JVM for this task)

    c. Provide absolute paths to the DBLP dataset and UIC_authors.txt (included in the project) as program arguments, e.g.: F:\UIC\441\mapreduce\dblp\dblp.xml .\src\main\resources\UIC_authors.txt

  • Run the file.

It will generate a file in the logs folder containing the comma-separated UIC authors for each article, inproceedings, proceedings, book, incollection, and phdthesis entry.

Sample Output:

Robert H. Sloan,Ugo A. Buy
Bhaskar DasGupta
Andrew E. Johnson
Luc Renambot,Andrew E. Johnson
Ajay D. Kshemkalyani
Luc Renambot,Andrew E. Johnson
Luc Renambot,Andrew E. Johnson
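
For orientation only, here is a minimal Scala sketch of such a streaming pass over dblp.xml. It is not the project's actual DBLPParser.java; the output file name is hypothetical, and it assumes dblp.dtd sits next to dblp.xml so the DBLP character entities resolve.

```scala
import java.io.{FileInputStream, PrintWriter}
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}
import scala.io.Source

// Sketch, not the project's DBLPParser: stream dblp.xml and, for every
// publication record, keep only the authors that appear in UIC_authors.txt.
object DblpParserSketch {
  private val pubTags =
    Set("article", "inproceedings", "proceedings", "book", "incollection", "phdthesis")

  def main(args: Array[String]): Unit = {
    val Array(dblpPath, uicAuthorsPath) = args
    val uicAuthors = Source.fromFile(uicAuthorsPath).getLines().map(_.trim).toSet

    // dblp.xml declares a DOCTYPE; dblp.dtd is expected to sit next to it.
    val reader = XMLInputFactory.newInstance()
      .createXMLStreamReader(new FileInputStream(dblpPath))
    val out = new PrintWriter("uic_authors.log") // hypothetical output file name

    var authors = List.empty[String]
    while (reader.hasNext) {
      val event = reader.next()
      if (event == XMLStreamConstants.START_ELEMENT && pubTags(reader.getLocalName)) {
        authors = Nil // a new publication record starts
      } else if (event == XMLStreamConstants.START_ELEMENT && reader.getLocalName == "author") {
        authors ::= reader.getElementText.trim // collect each author of the record
      } else if (event == XMLStreamConstants.END_ELEMENT && pubTags(reader.getLocalName)) {
        val uic = authors.reverse.filter(uicAuthors) // keep UIC authors only
        if (uic.nonEmpty) out.println(uic.mkString(","))
      }
    }
    out.close()
    reader.close()
  }
}
```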

2. Assemble Jar

Run sbt clean compile assembly, which creates a jar named author-map-dblp.jar under \target\scala-2.12.
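
The jar name and output path come from the build definition; if you need to reproduce them, a hedged build.sbt sketch is shown below (the Scala and Hadoop versions are illustrative, not necessarily the ones this project pins):

```scala
// build.sbt (sketch): settings that would yield target/scala-2.12/author-map-dblp.jar.
// Versions below are illustrative, not necessarily the project's.
name := "author-map-dblp"
scalaVersion := "2.12.8"

// Hadoop classes are provided by the Cloudera cluster at run time,
// so they are excluded from the assembled jar.
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.6.0" % "provided"

assemblyJarName in assembly := "author-map-dblp.jar"
```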

3. SFTP to Cloudera VM

  • Start the Cloudera VM instance in VMware.
  • Once the VM is up and running, get its IP address from the network settings of VMware.
  • Using WinSCP or another tool or command (based on your OS), transfer the files (the jar and the log file) to a location on the VM.
  • Use the default username and password provided by Cloudera.

4. Run Hadoop Mapreduce

  • Navigate to the directory where the above files are stored on the VM.
  • Create an input directory on HDFS: hadoop fs -mkdir input_dir
  • Put the log file into the input directory: hadoop fs -put <logfilename> input_dir
  • Run the jar from the directory where it is stored: hadoop jar author-map-dblp.jar AuthorMapping input_dir output_dir (a sketch of such a job follows this list)
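
The mapper and reducer that actually run come from author-map-dblp.jar; purely as a sketch of what an author-pair counting job over the parser's log lines could look like (class names, the pairing rule, and the self-pair handling are illustrative assumptions, not the project's AuthorMapping code):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

// One input line = comma-separated UIC authors of one publication.
// Emit every author pair on that line (including self-pairs) with a count of 1.
class PairMapper extends Mapper[Object, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val pair = new Text()

  override def map(key: Object, value: Text,
                   context: Mapper[Object, Text, Text, IntWritable]#Context): Unit = {
    val authors = value.toString.split(",").map(_.trim).filter(_.nonEmpty)
    for (a <- authors; b <- authors if a <= b) { // one ordering per pair
      pair.set(s"$a,$b,")
      context.write(pair, one)
    }
  }
}

// Sum the occurrences of each author pair across all publications.
class PairReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  private val total = new IntWritable()

  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var sum = 0
    val it = values.iterator()
    while (it.hasNext) sum += it.next().get()
    total.set(sum)
    context.write(key, total) // TextOutputFormat separates key and count with a tab
  }
}

object AuthorPairJob {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "uic author pairs")
    job.setJarByClass(this.getClass)
    job.setMapperClass(classOf[PairMapper])
    job.setCombinerClass(classOf[PairReducer])
    job.setReducerClass(classOf[PairReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path(args(0)))   // e.g. input_dir
    FileOutputFormat.setOutputPath(job, new Path(args(1))) // e.g. output_dir
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}
```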

5. Extract Output to System

  • Once the job has completed, copy the output from HDFS to the local VM directory: hadoop fs -get output_dir/part-r-00000 ./

Sample Output:

A. Prasad Sistla,A. Prasad Sistla,	102
A. Prasad Sistla,Bing Liu 0001,	1
A. Prasad Sistla,Isabel F. Cruz,	2
A. Prasad Sistla,Lenore D. Zuck,	6
A. Prasad Sistla,Robert H. Sloan,	1
A. Prasad Sistla,V. N. Venkatakrishnan,	8
Ajay D. Kshemkalyani,Ajay D. Kshemkalyani,	112
Ajay D. Kshemkalyani,Ugo Buy,	1

6. SFTP from Cloudera VM

  • Transfer the output file (part-r-00000) from the VM to a folder on the host system.

7. Run Gephi for Graph

  • Open Gephi and load the workspace provided in the logs folder of this project.
  • Import the CSV file (first convert the file copied from the Cloudera VM to a .csv file; a conversion sketch follows).
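
The conversion itself is left to you. As a sketch only, the following Scala snippet reshapes the Hadoop output lines shown in the sample above into an edge list with the Source, Target, and Weight columns that Gephi's spreadsheet importer recognizes (the edges.csv name is arbitrary):

```scala
import java.io.PrintWriter
import scala.io.Source

// Sketch only: reshape the Hadoop output (lines like "Author A,Author B,<TAB>count")
// into an edge list CSV for Gephi.
object ToGephiCsv {
  def main(args: Array[String]): Unit = {
    val out = new PrintWriter("edges.csv") // hypothetical output file name
    out.println("Source,Target,Weight")
    for (line <- Source.fromFile(args(0)).getLines() if line.contains("\t")) {
      val Array(pair, weight) = line.split("\t", 2)
      val Array(src, dst) = pair.stripSuffix(",").split(",", 2).map(_.trim)
      out.println(s""""$src","$dst",${weight.trim}""")
    }
    out.close()
  }
}
```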

Output Graph:

Built With


  • Scala - Scala combines object-oriented and functional programming in one concise, high-level language
  • SBT - sbt is a build tool for Scala & Java
  • Cloudera - Cloudera QuickStart VMs (single-node cluster)
  • Hadoop - framework that allows for the distributed processing of large data sets

Authors