
Development Environment #2

Open · mflipe opened this issue May 6, 2021 · 1 comment
Labels: documentation, help wanted

mflipe commented May 6, 2021

Hi, my name is Marcos.

I found your project while searching for examples of Hadoop and would like to explore it further.
Would you describe how you set up the development environment for this project?

Thank you in advance!
Marcos

samujjwaal (Owner) commented May 6, 2021

Hi Marcos,

This project was done as part of the course "CS441: Engineering Distributed Objects for Cloud Computing" in my Master's program at the University of Illinois at Chicago.

The development environment is set up using IntelliJ and SBT, as described in the project's build file.
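
For reference, a minimal build.sbt for a setup like this might look like the sketch below. The versions are illustrative assumptions on my part, not the ones pinned in the repository's actual build file:

// Hypothetical build.sbt sketch -- versions are illustrative assumptions;
// see the repository's actual build file for the real values.
name := "dblp-mapreduce"
version := "0.1"
scalaVersion := "2.13.1"

libraryDependencies ++= Seq(
  // Hadoop client API, for writing and running mappers and reducers
  "org.apache.hadoop" % "hadoop-client" % "3.2.1",
  // Scala's XML support, shipped as a separate module since Scala 2.11
  "org.scala-lang.modules" %% "scala-xml" % "1.2.0"
)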

To start with, we were given baseline requirements for what the application should do. Here are some excerpts from the requirements handed out by the professor.


Preliminaries

You will install IntelliJ with your academic license, the JDK, the Scala runtime, the IntelliJ Scala plugin, and the Simple Build Tool (SBT), and make sure that you can create, compile, and run Java and Scala programs. Please also make sure that you can run the Java monitoring tools; if you want to use a more recent JDK, you may choose a newer JDK and its tools.

Overview

In this homework, you will create a distributed program for parallel processing of the publicly available DBLP dataset, which contains entries for publications at many different venues (e.g., conferences and journals). The raw XML-based DBLP dataset is publicly available, along with its schema and documentation.

Each entry in the dataset describes a publication and contains the list of authors, the title, the publication venue, and a few other attributes. The file is approximately 2.5 GB, which is not big by today's standards but large enough for this homework assignment. Each entry is independent of the others, in that it can be processed without synchronizing with the processing of any other entry.

Consider the following entry in the dataset.

<inproceedings mdate="2017-05-24" key="conf/icst/GrechanikHB13">
  <author>Mark Grechanik</author>
  <author>B. M. Mainul Hossain</author>
  <author>Ugo Buy</author>
  <title>Testing Database-Centric Applications for Causes of Database Deadlocks.</title>
  <pages>174-183</pages>
  <year>2013</year>
  <booktitle>ICST</booktitle>
  <ee>https://doi.org/10.1109/ICST.2013.19</ee>
  <ee>http://doi.ieeecomputersociety.org/10.1109/ICST.2013.19</ee>
  <crossref>conf/icst/2013</crossref>
  <url>db/conf/icst/icst2013.html#GrechanikHB13</url>
</inproceedings>

This entry describes a paper published in 2013 at the IEEE International Conference on Software Testing, Verification and Validation (ICST). Its authors are Prof. B. M. Mainul Hossain, my former Ph.D. student at UIC and now a tenured Associate Professor at the University of Dhaka; his advisor, Prof. Mark Grechanik, a co-author of this paper; and Prof. Ugo Buy, a faculty member in our CS department.

The presence of three co-authors in a single publication like this one increments a count variable that represents the number of publications with three co-authors. Your job is to determine the distribution of the number of authors across many different journals and conferences using the information extracted from this dataset. Partitioning the dataset into shards is easy, since it requires only preserving the well-formedness of the XML. Most likely, you will write a simple program to partition the dataset into shards of approximately equal size.
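
To make that sharding step concrete, here is a rough Scala sketch, not the actual program written for this repo. It assumes shards may simply be cut at entry boundaries; wrapping each shard in a <dblp> root element and copying the DTD reference are omitted for brevity:

import java.io.PrintWriter
import scala.io.Source

// Hypothetical sharding sketch: split dblp.xml into roughly equal-size
// shards, cutting only at entry boundaries so every shard holds a
// sequence of complete entries.
object Sharder {
  // Closing tags of the top-level DBLP entry types
  private val entryEnd =
    "</(article|inproceedings|proceedings|book|incollection|phdthesis|mastersthesis|www)>".r

  private def newShard(i: Int) = new PrintWriter(f"shard-$i%03d.xml")

  def main(args: Array[String]): Unit = {
    val targetBytes = 300L * 1024 * 1024 // ~300 MB per shard, illustrative
    val source = Source.fromFile("dblp.xml", "ISO-8859-1")
    var (shard, written) = (0, 0L)
    var out = newShard(shard)
    for (line <- source.getLines()) {
      out.println(line)
      written += line.length + 1
      // Rotate to a new shard only at an entry boundary, never mid-entry
      if (written >= targetBytes && entryEnd.findFirstIn(line).isDefined) {
        out.close(); shard += 1; written = 0L; out = newShard(shard)
      }
    }
    out.close(); source.close()
  }
}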

Functionality

Your homework assignment is to create a program for parallel distributed processing of the publication dataset. Your goal is to produce the following statistics about the authors and the venues at which they published their papers.

  • First, you will compute a spreadsheet or a CSV file that shows the top ten most published authors at each venue.
  • Second, you will compute the list of authors who published without interruption for N years, where N >= 10.
  • Then, for each venue, you will produce a list of the publications that have only one author.
  • Next, for each venue, you will produce the list of publications with the highest number of authors.
  • Finally, you will produce the list of the top 100 authors, in descending order, who publish with the most co-authors, and the list of 100 authors who publish without any co-authors.

Your job is to create the mapper and the reducer for each task, explain how they work, and then implement them and run them on the DBLP dataset. The output of your map/reduce is a spreadsheet or a CSV file with the required statistics.
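
For illustration, a mapper/reducer pair for the first statistic might look like the following Scala sketch. This is only an outline under the assumption that each input value carries one complete XML entry (e.g. via a custom InputFormat), with XML entities already resolved; the class names are hypothetical, not the ones used in this repo:

import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Mapper, Reducer}
import scala.xml.XML

// Hypothetical sketch: count publications per (venue, author) pair.
// Assumes each input value is one complete, entity-free XML entry.
class VenueAuthorMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one = new IntWritable(1)
  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit = {
    val entry = XML.loadString(value.toString)
    // Conference entries carry the venue in <booktitle>, journal entries in <journal>
    val venue = (entry \ "booktitle").text match {
      case ""    => (entry \ "journal").text
      case title => title
    }
    for (author <- (entry \ "author").map(_.text) if venue.nonEmpty)
      context.write(new Text(s"$venue\t$author"), one)
  }
}

class CountReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var sum = 0
    values.forEach(v => sum += v.get) // total publications for this (venue, author)
    context.write(key, new IntWritable(sum))
  }
}

A second pass (or an in-memory top-N inside a reducer keyed by venue) would then sort these counts per venue to pick out the top ten authors.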

You will create and run your software application using Apache Hadoop, a framework for distributed processing of large data sets across multiple computers (or even on a single node) using the map/reduce model. If your laptop/workstation is limited in RAM, you can use the Cloudera QuickStart VM, which requires a minimum of 4 GB of RAM. Even though you can install and configure Hadoop directly on your computers, I recommend that you use the Hortonworks Sandbox virtual machine (VM), a preconfigured Apache Hadoop installation with a comprehensive software stack.

You can complete this homework using Scala, and you will immensely enjoy the XML processing facilities embedded in the language. You will use the Simple Build Tool (SBT) for building the project and running automated tests.
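
As a taste of those facilities, here is a small self-contained snippet (an illustration only, not code from this repo) that applies Scala's XML projection operators to an abbreviated version of the sample entry above:

import scala.xml.XML

// Extract fields from a DBLP entry with scala.xml projections
val entry = XML.loadString(
  """<inproceedings mdate="2017-05-24" key="conf/icst/GrechanikHB13">
    |  <author>Mark Grechanik</author>
    |  <author>B. M. Mainul Hossain</author>
    |  <author>Ugo Buy</author>
    |  <title>Testing Database-Centric Applications for Causes of Database Deadlocks.</title>
    |  <year>2013</year>
    |  <booktitle>ICST</booktitle>
    |</inproceedings>""".stripMargin)

val authors = (entry \ "author").map(_.text) // Seq("Mark Grechanik", ...)
val venue   = (entry \ "booktitle").text     // "ICST"
val year    = (entry \ "year").text.toInt    // 2013
println(s"$venue $year: ${authors.size} authors -> ${authors.mkString(", ")}")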


Additionally, refer to Application Design to learn more about the mappers and reducers implemented for the MapReduce jobs.

@mflipe let me know if you need any more details. I am happy to help.
