Course Description: The Introduction to Big Data course introduces students to Big Data on a conceptual level and gives them exposure to, and practice with, several skills and tools currently in use. These skills are taught at a manageable level and then broadened to help students grasp why analyzing substantial amounts of data is meaningful and popular. Students will learn the foundational concepts of Big Data and how to move from Big Data basics to more business-specific needs and requirements.
Quarter Credit Hours: 3
Course Length: 40 hours
Prerequisites: DS102, DS104, DS109
Proficiency Exam: No
Theory Hours: 20
Laboratory Hours: 20
Externship Hours: 0
Outside Hours: 10
Total Contact Hours: 40
Module | Lesson Number | Lesson Name |
---|---|---|
DS107 Big Data | 1 | Introduction to Big Data |
DS107 Big Data | 2 | Getting Started with Hadoop |
DS107 Big Data | 3 | MapReduce, Hive, and Sqoop |
DS107 Big Data | 4 | Pig and HBase |
DS107 Big Data | 5 | Spark 2.0 and Zeppelin |
DS107 Big Data | 6 | Working with Real-Time Data |
DS107 Big Data | 7 | Amazon Web Services |
DS107 Big Data | 8 | DASK |
DS107 Big Data | 9 | Source Control using Git |
DS107 Big Data | 10 | Final Project |
- Ground-based students are required to bring a late-model laptop computer (either PC or MacBook) to class every day.
- Online students are required to have a late-model laptop or desktop computer with internet access.
- Minimum: PC (Windows 10/11) or Mac (Big Sur or Monterey) laptop. 8 GB RAM, 512 GB HD, Intel Core i5, AMD Ryzen 5, or Apple Intel or M1 chipsets.
- Recommended: PC (Windows 10/11) or Mac (Big Sur or Monterey) laptop. 16 GB RAM, 1 TB SSD, Intel Core i7, AMD Ryzen 7, or Apple M1/M1 Pro chipsets.
- Professionals: PC (Windows 10/11) or Mac (Big Sur or Monterey). 32-64 GB RAM, 2-8 TB SSD, Intel Core i9, AMD Ryzen 9/Threadripper, or Apple M1 Max chipsets.
- You must be able to download programming resources to your laptop/desktop for this class. (This means you need a steady, high-bandwidth internet connection.)
- You are required to have a quiet place to study and to be able to focus on the material.
- You are required to have uninterrupted weekly 1:1 video meetings with your mentor.
- You are required to log into the Learning Management System (LMS) daily for at least 20 minutes.
- Please follow each lesson page by page and review the coding examples provided, as this will ensure you have a full understanding for your final hands-on assignments.
Upon successful completion of this course, students will be able to:
- Explain the background and evolution of Big Data
- Use the fundamentals of Hadoop
- Demonstrate hands-on experience with Hadoop
- Explain the fundamentals of MapReduce
- Demonstrate hands-on experience with MapReduce
- Scale their skills to large datasets using available tools
- Explain how Hadoop and MapReduce utilize multiple computing clusters
- Describe other Hadoop technologies
Course: Week 1
- Introduction: Introduction to Big Data, Python Review, Running Code in the Terminal for Windows Users, Running Code in the Terminal for Mac/Linux Users, Reading from Standard Input, Reading from Files (File IO)
- ETL & MapReduce: ETL & Map Reduce, Reading in Data, Reading in CSVs, Errors, MapReduce, Overall Goal, Create a Reduce File, Run the Map and Reduce Files Together, Counting the Types of Arrests, Key Terms
- Orchestration: Orchestrations, Streams, Crimes Data, Installing Packages for Windows, Installing Packages for Mac/Linux, Manager Set Up, Worker Set Up, Running the Files, Crime Analysis with MapReduce Using Orchestration, Monitor System Performance, Key Terms
Week 2
- Distribution: Partitioning Data, Multiple Workers, Distribution, Running the Files, Activity Monitoring, Process Scheduling, Load Balancing, Key Terms
- Amazon Web Service Set Up: Introduction to Spark, Introduction to Amazon Web Service, Accessing Your AWS Educate Starter Account, EC2 Setup, Connecting to EC2, Key Terms
- PySpark Set Up: PySpark Set Up, Installing Anaconda, Configuring Jupyter Notebook, Running Jupyter Notebook, Installing Additional Software, Installing Pip and Packages, Install Spark
- Using PySpark: Introduction, Windows: How to Reconnect to Your Instance, Mac/Linux: How to Reconnect to Your Instance, Using PySpark, Map(), Mapping a Dataset, ReduceByKey(), Filter(), SortBy(), Sample(), Distinct(), Union(), Key Terms
Week 3
- Hadoop: Introduction, What is Hadoop?, Key Terms
- Big Judgement: Failure Recovery, There’s Always An Exception
- Final Project
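To give a feel for the map and reduce steps named in the ETL & MapReduce lesson (reading a CSV, then counting the types of arrests), here is a minimal sketch in plain Python. It is not the course's own code: the function names `map_arrests` and `reduce_counts`, the `arrest_type` column, and the sample rows are all assumptions made for illustration.

```python
import csv
import io
from itertools import groupby

# Hypothetical sample data; the real course dataset and its columns may differ.
SAMPLE_CSV = """id,arrest_type
1,Theft
2,Assault
3,Theft
4,Fraud
5,Theft
"""

def map_arrests(lines):
    """Map step: emit an (arrest_type, 1) pair for every data row."""
    reader = csv.DictReader(lines)
    for row in reader:
        yield row["arrest_type"].strip(), 1

def reduce_counts(pairs):
    """Reduce step: sum the values for each key.

    The pairs are sorted first so that equal keys are adjacent,
    which is what groupby requires.
    """
    for key, group in groupby(sorted(pairs), key=lambda pair: pair[0]):
        yield key, sum(value for _, value in group)

counts = dict(reduce_counts(map_arrests(io.StringIO(SAMPLE_CSV))))
print(counts)
```

In the lessons the map and reduce steps live in separate files and are run together; here they are chained in one process only to keep the sketch short.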
Class: DSO107 | Topic presented |
---|---|
Week 1 Workshop #1 | What is Big Data? (L1) |
Week 1 Workshop #2 | MapReduce (L2) |
Week 2 Workshop #1 | Orchestration and Distribution (L3, L4) |
Week 2 Workshop #2 | PySpark (L6, L7) |
Week 3 Workshop #1 | Hadoop (L8) |
Week 3 Workshop #2 | Practice Project (L10) |
Assignment | Points | Topic |
---|---|---|
L1 Hands On | 45 points | Review Python fundamentals by searching for stop words and using the strip() function. |
L2 Hands On | 45 points | Utilize MapReduce functions in Python. |
L3 Practice Hands On | 0 points | Orchestrate the MapReduce function across multiple workers in Python. |
L4 Practice Hands On | 0 points | Distribute the MapReduce function across multiple workers in Python. |
L7 Hands On | 45 points | Utilize AWS to tap into PySpark and perform data transformations in Spark. |
L8 Hands On | 45 points | Assess the current big data job market. |
L9 Practice Hands On | 0 points | Comment the code and utilize try/except to improve functioning in Python. |
L10 Final Project | 200 points | Orchestrate, distribute, and perform MapReduce on data in Python. |
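As a rough illustration of the Python skills named in the L1 and L9 assignments (stop words, `strip()`, and try/except), a short sketch might look like the following. The stop-word list, the function names `read_lines` and `remove_stop_words`, and the sample text are hypothetical; the actual assignments may specify different inputs and requirements.

```python
# Hypothetical stop-word list; the assignment may supply its own.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in"}

def read_lines(path):
    """Return the lines of a text file, or an empty list if it cannot be read."""
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.readlines()
    except OSError as error:
        # Report the problem instead of crashing.
        print(f"Could not read {path}: {error}")
        return []

def remove_stop_words(lines):
    """Strip whitespace from each line and drop stop words."""
    kept = []
    for line in lines:
        for word in line.strip().split():
            # strip() with an argument trims those characters from both ends.
            cleaned = word.strip(".,!?;:").lower()
            if cleaned and cleaned not in STOP_WORDS:
                kept.append(cleaned)
    return kept

sample = ["  The quick brown fox \n", "jumps over the lazy dog. \n"]
print(remove_stop_words(sample))
```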
- Professionalism, Attendance and Class Participation: 20 points (5%)
- Assignments/Hands-On/Homework: L1-9 Hands On, 180 points total (45%)
- Projects/Competencies/Research: Final Project, 200 points (50%)
- Total points: 400 (100%)
With the data given, create one manager and two workers. Perform MapReduce to count the accidents for each vehicle and the occurrences of each action taken prior to an accident. Lastly, determine which action is most common.
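The overall shape of this task can be sketched as follows. This is only an illustration under stated assumptions: the records, the field layout, and the function names `manager` and `worker` are hypothetical, and two threads stand in for the two workers, whereas the course setup may run the manager and workers as separate programs.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical records: (vehicle, action_prior_to_accident).
RECORDS = [
    ("Sedan", "Going straight"),
    ("Truck", "Turning left"),
    ("Sedan", "Turning left"),
    ("SUV", "Going straight"),
    ("Sedan", "Going straight"),
    ("Truck", "Backing"),
]

def worker(partition):
    """Worker: count vehicles and prior actions within one partition."""
    vehicles = Counter(vehicle for vehicle, _ in partition)
    actions = Counter(action for _, action in partition)
    return vehicles, actions

def manager(records, n_workers=2):
    """Manager: partition the data, dispatch it, and merge the partial counts."""
    # Round-robin partitioning: worker i receives every n-th record.
    partitions = [records[i::n_workers] for i in range(n_workers)]
    vehicle_totals, action_totals = Counter(), Counter()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for vehicles, actions in pool.map(worker, partitions):
            vehicle_totals += vehicles
            action_totals += actions
    return vehicle_totals, action_totals

vehicle_totals, action_totals = manager(RECORDS)
most_common_action, occurrences = action_totals.most_common(1)[0]
print(dict(vehicle_totals), most_common_action, occurrences)
```

Each worker produces partial counts for its own partition, and the manager merges them, so the final totals do not depend on how the records were split.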
- Professionalism, Attendance and Class Participation* 5%
- Assignments/Hands-On/Homework 95%
- Total 100%