An open and introductory book for the Python API of Apache Spark (pyspark)

mateusamorim96/Introd-pyspark

Introd-pyspark

An open and introductory book for the Python API of Apache Spark. The book "Introduction to pyspark" provides a quick introduction to the pyspark Python package, the Python API of Apache Spark.

Read the book at: https://pedropark99.github.io/Introd-pyspark/.

You can buy a copy of the book through Amazon: https://www.amazon.com/dp/B0CRYMVWDN.

Publication page: https://pedro-faria.netlify.app/publications/book/introd-pyspark/en/.

With pyspark, you can use the Python language to write Spark applications and run them on a Spark cluster in a scalable and elegant way. This book focuses on teaching the fundamentals of pyspark and how to use it for big data analysis.

Some of the main subjects discussed in the book are:

  • How an Apache Spark application works.
  • What Spark DataFrames are.
  • How to transform and model your Spark DataFrame.
  • How to import data into, and export data out of Apache Spark.
  • How to work with SQL inside pyspark.
  • Tools for manipulating specific data types (e.g. strings, dates and datetimes).
  • How to use window functions.

About the author

Pedro Duarte Faria has a bachelor's degree in Economics from the Federal University of Ouro Preto, Brazil. Currently, he is a Data Platform Engineer at Blip, and an Associate Developer for Apache Spark 3.0 certified by Databricks.

The author has more than 5 years of experience in the data analysis market. He has developed data pipelines, reports and analyses for research institutions and some of the largest companies in the Brazilian financial sector, such as BMG Bank, Sodexo and Pan Bank, and has worked with databases that exceed a billion rows.

Furthermore, Pedro specializes in the R programming language, and has given several lectures and courses about it at graduate centers (such as PPEA-UFOP) and at federal and state organizations (such as FJP-MG). As a researcher, he has experience in the field of Science, Technology and Innovation Economics.

Personal Website: https://pedro-faria.netlify.app/

Twitter: @PedroPark9

Mastodon: @[email protected]

License

Copyright © 2024 Pedro Duarte Faria. This book is licensed under the Creative Commons Attribution 4.0 International Public License (CC BY 4.0).
