Packaged JARs that handle the MapReduce job(s) for aggregating skills:
- User Entered Skills
- Skills from Challenges successfully participated in
- Stack Overflow Skills
Local Hadoop setup tutorial: http://zhongyaonan.com/hadoop-tutorial/setting-up-hadoop-2-6-on-mac-osx-yosemite.html
Build the package (tests skipped):
mvn package -DskipTests=true
Run the aggregator locally:
hadoop jar target/ap-emr-skills-1.0-SNAPSHOT.jar com.appirio.mapreduce.skills.SkillsAggregator src/main/resources/data/tagsMap.txt src/test/resources/skills/input/userEnteredSkills.txt src/test/resources/skills/input/challengeSkills.txt src/test/resources/skills/input/stackOverflowSkills.txt /tmp/skills
Create a cluster from the command line (or from the AWS console):
aws emr create-cluster --name "SkillsTest3" --enable-debugging --log-uri s3://supply-emr/skills/logs/skillstest3 --release-label emr-4.0.0 --applications Name=Hive Name=Hadoop --use-default-roles --ec2-attributes KeyName=topcoder-dev-vpc-app --instance-type m3.xlarge --no-auto-terminate
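For scripting, the create-cluster command can be wrapped in a small helper (hypothetical, not part of this repo) that captures the new cluster id; the flag values match the command above, and `aws emr wait` can then block until the cluster is up.

```shell
# Hypothetical wrapper around the create-cluster command above.
# --query/--output extract just the new cluster id from the CLI response.
create_skills_cluster() {
  name="$1"; log_uri="$2"
  aws emr create-cluster \
    --name "$name" \
    --enable-debugging \
    --log-uri "$log_uri" \
    --release-label emr-4.0.0 \
    --applications Name=Hive Name=Hadoop \
    --use-default-roles \
    --ec2-attributes KeyName=topcoder-dev-vpc-app \
    --instance-type m3.xlarge \
    --no-auto-terminate \
    --query 'ClusterId' --output text
}

# Usage (uncomment to run against your AWS account):
# CLUSTER_ID=$(create_skills_cluster "SkillsTest3" "s3://supply-emr/skills/logs/skillstest3")
# aws emr wait cluster-running --cluster-id "$CLUSTER_ID"
```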
Enable SSH access to the master node.
Upload the JAR file:
aws emr put --cluster-id <Your EMR cluster Id> --key-pair-file "<Your Key Pair File>" --src "/<Your Path to>/ap-emr-skills/target/ap-emr-skills-1.0-SNAPSHOT.jar"
Execute the task:
aws emr ssh --cluster-id <Your EMR cluster Id> --key-pair-file "<Your Key Pair File>" --command "hadoop jar ap-emr-skills-1.0-SNAPSHOT.jar com.appirio.mapreduce.skills.SkillsAggregator hdfs:///<Your Path to>/tagsMap.txt hdfs:///<Your Path to>/userEnteredSkills.txt hdfs:///<Your Path to>/challengeSkills.txt hdfs:///<Your Path to>/stackOverflowSkills.txt hdfs:///<Your Path to>/aggregatedSkills/"
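The upload and execute steps can be combined into one helper (a sketch, not part of this repo; the single `hdfs_base` path for all four inputs and the output is an assumption, so adjust it to your actual HDFS layout):

```shell
# Hypothetical wrapper: ship the packaged JAR to the master node, then run
# the aggregator over inputs located under one HDFS base path.
run_skills_job() {
  cluster_id="$1"; key_file="$2"; jar_path="$3"; hdfs_base="$4"
  # Copy the packaged JAR to the master node's home directory
  aws emr put --cluster-id "$cluster_id" --key-pair-file "$key_file" --src "$jar_path"
  # Run the MapReduce job over SSH
  aws emr ssh --cluster-id "$cluster_id" --key-pair-file "$key_file" --command \
    "hadoop jar $(basename "$jar_path") com.appirio.mapreduce.skills.SkillsAggregator \
$hdfs_base/tagsMap.txt $hdfs_base/userEnteredSkills.txt $hdfs_base/challengeSkills.txt \
$hdfs_base/stackOverflowSkills.txt $hdfs_base/aggregatedSkills/"
}

# Usage (uncomment with your real cluster id and key pair file):
# run_skills_job j-XXXXXXXXXXXX ~/mykey.pem target/ap-emr-skills-1.0-SNAPSHOT.jar hdfs:///user/supply/skills
```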
The overall execution flow is defined in the resources/jobs/job-tasks.json file; the steps are:
- Install Sqoop
  - Copy Sqoop and other library files to HDFS
  - Create the input/output directories
- Import Tags
  - Create db_tags and tags_export, then export the tags data to hdfs:///user/supply/skills/input/tagsMap/
- Import Challenge Skills
  - Query challenge skills from Informix and save them to hdfs:///user/supply/skills/input/challenge/
- Import User Entered Skills
  - Query user-entered skills from DynamoDB and save them to hdfs:///user/supply/skills/input/userEntered/
- Import Stack Overflow Skills
  - Query Stack Overflow skills from DynamoDB and save them to hdfs:///user/supply/skills/input/stackOverflow/
- Aggregate Skills
  - This MapReduce program aggregates user skills from the HDFS input locations and writes the results to hdfs:///user/supply/skills/output/aggregatedSkills/
- Export Aggregated Skills
  - Read the aggregated skills from hdfs:///user/supply/skills/output/aggregatedSkills/ and save them to DynamoDB
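All of the steps above read and write under one HDFS base path. A small helper (hypothetical, for sanity checks only) that prints the expected layout, with each path taken from the step descriptions above:

```shell
# Prints the HDFS directories the job-tasks.json pipeline expects.
skills_hdfs_layout() {
  base="hdfs:///user/supply/skills"
  printf '%s\n' \
    "$base/input/tagsMap/" \
    "$base/input/challenge/" \
    "$base/input/userEntered/" \
    "$base/input/stackOverflow/" \
    "$base/output/aggregatedSkills/"
}
skills_hdfs_layout
```

On the cluster, each printed path can then be checked with, e.g., `hadoop fs -ls <path>` before or after a run.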