Development Environment #2
Hi Marcos,

This project was done as part of the course "CS441: Engineering Distributed Objects for Cloud Computing" in my Master's program at the University of Illinois at Chicago. The development environment is set up using IntelliJ and SBT, as described in this build file (a sketch of such a build file is included at the end of this comment). For starters, we were given a baseline requirement of what our application should be like. I am sharing some excerpts from the requirements we were given by the Professor.

Preliminaries

You will install IntelliJ with your academic license, the JDK, the Scala runtime, the IntelliJ Scala plugin, and the Simple Build Toolkit (SBT), and make sure that you can create, compile, and run Java and Scala programs. Please make sure that you can run the Java monitoring tools; you can choose a newer JDK and tools if you want to use a more recent one.

Overview

In this homework, you will create a distributed program for parallel processing of the publicly available DBLP dataset, which contains entries for various publications at many different venues (e.g., conferences and journals). The raw XML-based DBLP dataset is also publicly available, along with its schema and documentation. Each entry in the dataset describes a publication and contains the list of authors, the title, the publication venue, and a few other attributes. The file is approximately 2.5 GB - not big by today's standards, but large enough for this homework assignment. Each entry is independent of the others in that it can be processed without synchronizing with the processing of other entries. Consider the following entry in the dataset.

<inproceedings mdate="2017-05-24" key="conf/icst/GrechanikHB13">
<author>Mark Grechanik</author>
<author>B. M. Mainul Hossain</author>
<author>Ugo Buy</author>
<title>Testing Database-Centric Applications for Causes of Database Deadlocks.</title>
<pages>174-183</pages>
<year>2013</year>
<booktitle>ICST</booktitle>
<ee>https://doi.org/10.1109/ICST.2013.19</ee>
<ee>http://doi.ieeecomputersociety.org/10.1109/ICST.2013.19</ee>
<crossref>conf/icst/2013</crossref>
<url>db/conf/icst/icst2013.html#GrechanikHB13</url>
</inproceedings>

This entry lists a paper at the IEEE International Conference on Software Testing, Verification and Validation (ICST) published in 2013. One of its authors is my former Ph.D. student at UIC, now a tenured Associate Professor at the University of Dhaka, Prof. Dr. B. M. Mainul Hossain, whose advisor, Prof. Mark Grechanik, is a co-author of this paper. The third co-author is Prof. Ugo Buy, a faculty member at our CS department. The presence of three co-authors in a single publication like this one increments a count variable that represents the number of publications with three co-authors. Your job is to determine the distribution of the number of authors across many different journals and conferences using the information extracted from this dataset. Partitioning this dataset into shards is easy, since it only requires preserving the well-formedness of the XML. Most likely, you will write a simple program to partition the dataset into shards of approximately equal size (a sketch of such a program is included at the end of this comment).

Functionality

Your homework assignment is to create a program for parallel distributed processing of the publication dataset. Your goal is to produce the following statistics about the authors and the venues they published their papers at.
Your job is to create the mapper and the reducer for each task, explain how they work, and then implement them and run them on the DBLP dataset. The output of your map/reduce jobs is a spreadsheet or a CSV file with the required statistics. You will create and run your software application using Apache Hadoop, a framework for distributed processing of large data sets across multiple computers (or even on a single node) using the map/reduce model. If your laptop/workstation is limited in its RAM, you can use the Cloudera QuickStart VM, which requires a minimum of 4 GB of RAM. Even though you can install and configure Hadoop on your own computers, I recommend that you use a virtual machine (VM) with the Hortonworks Sandbox, a preconfigured Apache Hadoop installation with a comprehensive software stack. You can complete this homework using Scala, and you will immensely enjoy the embedded XML processing facilities that come with it. You will use the Simple Build Tool (SBT) for building the project and running automated tests.

Additionally, refer to Application Design to learn more about the mappers and reducers implemented for the MapReduce jobs (an illustrative mapper/reducer sketch is included below). @mflipe let me know if you need any more details. I am happy to help.
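Since the build file itself isn't pasted here, the sketch below shows roughly what a minimal build.sbt for the setup described above could look like. The library versions and project name are illustrative assumptions, not the project's actual settings:

```scala
// build.sbt -- minimal sketch; names and versions are illustrative, not the project's actual settings
name := "dblp-mapreduce"
version := "0.1"
scalaVersion := "2.12.10"   // assumed; any recent 2.11/2.12 release works with the Hadoop client libs

libraryDependencies ++= Seq(
  "org.apache.hadoop"      %  "hadoop-client" % "2.7.3",        // MapReduce and HDFS client APIs
  "org.scala-lang.modules" %% "scala-xml"     % "1.2.0",        // Scala's XML parsing facilities
  "com.typesafe"           %  "config"        % "1.3.4",        // externalized job configuration (optional)
  "org.scalatest"          %% "scalatest"     % "3.0.8" % Test  // automated tests run with `sbt test`
)
```

With something like this in place, `sbt compile` builds the project from IntelliJ's terminal or the command line and `sbt test` runs the automated tests; packaging a runnable fat jar for `hadoop jar` is usually done with the sbt-assembly plugin.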
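The shard-splitting step mentioned in the excerpt can be done with a short standalone Scala program. The following is only a sketch under simple assumptions (input file name, shard size, and the list of top-level DBLP elements are illustrative); the key point is that each shard is re-wrapped in its own root element so it stays well-formed XML:

```scala
import java.io.{File, PrintWriter}
import scala.io.Source

// Sketch: split dblp.xml into shards of roughly `entriesPerShard` publications each.
// Each shard is wrapped in its own <dblp> root element so it remains well-formed XML.
object Sharder extends App {
  val entriesPerShard = 100000   // made-up number; tune it so the shards come out roughly equal in size

  // closing tags of the top-level publication elements in DBLP
  val endTags = Set("</article>", "</inproceedings>", "</proceedings>", "</book>",
                    "</incollection>", "</phdthesis>", "</mastersthesis>", "</www>")

  // lines belonging to the original prolog/root, which each shard replaces with its own
  def isHeaderOrRoot(l: String): Boolean =
    l.startsWith("<?xml") || l.startsWith("<!DOCTYPE") || l.startsWith("<dblp") || l.startsWith("</dblp")

  var shardIndex = 0
  var entryCount = 0
  var out: PrintWriter = null

  def openShard(): Unit = { out = new PrintWriter(new File(f"shard-$shardIndex%04d.xml")); out.println("<dblp>") }
  def closeShard(): Unit = { out.println("</dblp>"); out.close() }

  openShard()
  for (line <- Source.fromFile("dblp.xml", "ISO-8859-1").getLines() if !isHeaderOrRoot(line.trim)) {
    out.println(line)
    if (endTags.exists(t => line.trim.startsWith(t))) {   // one complete publication entry written
      entryCount += 1
      if (entryCount % entriesPerShard == 0) { closeShard(); shardIndex += 1; openShard() }
    }
  }
  closeShard()
}
```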
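The actual mapper and reducer classes, and the explanation of how they work, are in the repository's Application Design document. Purely as an illustration of the co-author statistic described in the excerpt (and of Scala's XML facilities), a Hadoop mapper/reducer pair for the "number of publications per co-author count" distribution could look like the sketch below. The class names are hypothetical, and it assumes an input format that hands each map call one complete publication element, for example one produced by the sharding step:

```scala
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.{Mapper, Reducer}
import scala.xml.XML

// Hypothetical sketch: each input value is one well-formed publication element
// (e.g. an <inproceedings>...</inproceedings> block). The mapper emits
// (numberOfAuthors, 1) and the reducer sums the ones, yielding the distribution
// of publications over co-author counts.
class AuthorCountMapper extends Mapper[Object, Text, IntWritable, IntWritable] {
  private val one = new IntWritable(1)
  private val authorCount = new IntWritable()

  override def map(key: Object, value: Text,
                   context: Mapper[Object, Text, IntWritable, IntWritable]#Context): Unit = {
    val entry = XML.loadString(value.toString)   // Scala's built-in XML parsing
    authorCount.set((entry \ "author").length)   // number of <author> children in this entry
    context.write(authorCount, one)
  }
}

class AuthorCountReducer extends Reducer[IntWritable, IntWritable, IntWritable, IntWritable] {
  private val total = new IntWritable()

  override def reduce(key: IntWritable, values: java.lang.Iterable[IntWritable],
                      context: Reducer[IntWritable, IntWritable, IntWritable, IntWritable]#Context): Unit = {
    var sum = 0
    val it = values.iterator()
    while (it.hasNext) sum += it.next().get()    // add up the 1s emitted by the mappers
    total.set(sum)
    context.write(key, total)                    // e.g. key = 3 -> number of publications with 3 co-authors
  }
}
```

Each line of the reducer output is then a (co-author count, number of publications) pair, which can be exported to the CSV file the assignment asks for.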
Hi, my name is Marcos.
I found your project when searching for examples of hadoop and want to explore it further. Would you describe how you set up the development environment for this project?
Thank you in advance!
Marcos