- 11.1. Activate Pre-commit
- 11.2. Environment Variables
- 11.3. GitHub Action Secrets
- 11.4. GCP Setup
- 11.5. Environment Setup
- 11.6. Prefect Setup
- 11.7. Setup Storage
- 11.8. Deployment and Testing
Ever wondered which type of photos are downloaded most often? Look no further. This project aims to deliver the answer to this question.
This is a Data Engineering project about collecting photo metadata and the actual photos from the Unsplash photo platform. Unsplash offers photos for download under a free licensing model.
The project is mostly written in Python, uses Prefect for data orchestration, and relies on Google Cloud Storage
and Google BigQuery
as its main storage technologies. The Jobs are scheduled either daily or in 10-60 minute intervals and are executed through Google Cloud Run Jobs.
Last update: October 17th, 2023
- 243,000 Photos (~712 GiB)
- 243,000 Photo metadata entries
  - including Views, Likes, Downloads, EXIF, Location, User
- Unsplash Platform Stats
  - including number of total photos/views/downloads on the platform, daily new photos, daily new photographers, etc.
Only the EL part of ELT is done in this project.
Google Cloud Storage is used as the initial, immutable raw data layer (E).
BigQuery is used for two purposes:
- to log which photos have already been requested. These logs are checked in each run, so the same image won't get requested twice (see the sketch below).
- to serve as intermediate storage (L). The actual data requires further processing and transformation (T) to be suitable for data analysis.
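The deduplication check could look roughly like the following sketch. This is an illustration only: the table name request_log and the column photo_id are placeholders, not the project's actual schema.

```python
# Sketch: filter out photo IDs that were already requested, assuming a
# hypothetical BigQuery log table `request_log` with a column `photo_id`.
from google.cloud import bigquery


def get_new_photo_ids(candidate_ids: list[str], project: str, dataset: str) -> set[str]:
    """Return only the photo IDs that have not been logged in BigQuery yet."""
    client = bigquery.Client(project=project)
    query = f"""
        SELECT DISTINCT photo_id
        FROM `{project}.{dataset}.request_log`
    """
    already_requested = {row["photo_id"] for row in client.query(query).result()}
    return set(candidate_ids) - already_requested
```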
The following diagram describes the flow of data from the different API endpoints to the different storage layers.
A scrollable/zoomable version can be found here (Lucid account required).
The most important table is photo-editorial-metadata-expanded, which contains photo metadata from the Editorial section of Unsplash.
Editorial photos can only be used in editorial projects such as news-related materials (newspaper and magazine articles, TV news), blogs, and educational materials.
Additional tables are not shown in this diagram: they are external tables that sync data from Google Cloud Storage to BigQuery and are not picked up by the ERM modeling software.
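As a rough illustration of what such an external table could look like, the sketch below defines one over newline-delimited JSON files in Cloud Storage. All project, dataset, table, and bucket names are placeholders, not the project's actual resources.

```python
# Sketch: define a BigQuery external table over newline-delimited JSON files
# stored in Google Cloud Storage. All names below are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

external_config = bigquery.ExternalConfig("NEWLINE_DELIMITED_JSON")
external_config.source_uris = ["gs://example-raw-bucket/photos/*.jsonl"]
external_config.autodetect = True  # let BigQuery infer the schema

table = bigquery.Table("example-project.example_dataset.photos_external")
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)
```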
The Data Pipeline consists of 5 different Flows. A Flow in the context of Prefect is comparable to an ETL Job.
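For readers unfamiliar with Prefect, here is a minimal, self-contained sketch of the Flow/Task concept. It is an illustration only, not one of the project's five Flows, and the task bodies are placeholders.

```python
# Minimal illustration of the Prefect Flow/Task concept (Prefect 2.x).
from prefect import flow, task


@task
def extract_photo_ids() -> list[str]:
    # Placeholder for a call against the Unsplash API
    return ["abc123", "def456"]


@task
def load_to_gcs(photo_ids: list[str]) -> None:
    # Placeholder for writing raw responses to Google Cloud Storage
    print(f"Would upload {len(photo_ids)} objects")


@flow
def example_photo_flow():
    photo_ids = extract_photo_ids()
    load_to_gcs(photo_ids)


if __name__ == "__main__":
    example_photo_flow()
```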
Storage
Compute
Orchestration
Languages
- Python
- Bash
- SQL
If you want to reproduce some of this code or copy and paste parts of it, feel free to do so.
If you want to build on this project, these are the prerequisites:
- A Google account (to create resources in GCP)
- A Prefect account (with access to the Prefect Cloud)
- The repository expects the following Git branches (see Git Branching Strategy):
- master
- develop
- test
These are the main folders (and their descriptions) of this GitHub repo. If some folders are not visible, they are not meant to be shared (see .gitignore).
hint: tree -L 2 -d -A
.
├── .github --> Github actions (e.g. CI)
├── data --> Data in different stages (raw, staged, final)
│ ├── 00_raw --> Immutable, raw data
│ ├── 01_staged --> Processed data
│ └── 02_final --> Data which can be served (ML, Analytics)
├── docs
│ └── images --> Images used in this Readme.md
├── images --> Docker Images (which are used across flows)
├── make --> Makefiles for setting up ressources and environment
├── output --> Deliverables in form of reports or models
│ ├── models
│ └── reports
├── references --> Data dictionaries, manuals, and all other explanatory materials
├── src --> Source code (Python)
│ ├── blocks --> Prefect Blocks
│ ├── etl --> Collection of common Extraction, Transformation and Loading functions used
│ ├── prefect --> Prefect Flows
│ └── scripts --> Python and Bash utility scripts
└── tests --> Unit tests
- prefect.yaml: Deployment steps and configuration
- .pre-commit-config.yaml: Pre-commit hooks which are run before each commit
- Makefile: Settings for the Makefiles (which are stored in folder make/*)
- pyproject.toml & poetry.lock: Python dependencies
Install the pre-commit hooks (for code formatting and import statement checks before committing):
pre-commit install
Define values in base.env (not part of this repository).
For reference, check base.env.example, which contains all major variables required for this project.
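One way the flows could read these values at runtime is sketched below; the use of python-dotenv and the variable name GCP_PROJECT_ID are assumptions for illustration, not the project's actual keys (those are listed in base.env.example).

```python
# Sketch: load variables from base.env at runtime.
# Assumes python-dotenv is installed; the key name is hypothetical.
import os

from dotenv import load_dotenv

load_dotenv("base.env")
gcp_project_id = os.environ["GCP_PROJECT_ID"]  # hypothetical key
```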
Add the following secrets as Action secrets to your GitHub repository:
PREFECT_API_KEY
PREFECT_API_URL
See https://docs.prefect.io/latest/api-ref/rest-api/#finding-your-prefect-cloud-details
Run make setup-gcp
to set up the Google Cloud project.
If this doesn't work, run the commands from 00_setup_gcp.mk
one by one in the following order:
make create-gcp-project
make set-default-gcp-project
make link-project-to-billing-account
make create-deployment-service-account
make create-deployment-service-account-key-file
make enable-gcp-services
make bind-iam-policies-to-deployment-service-account
make set-deployment-service-account-as-default
Necessary every time you start working on the project:
- make env-init to prepare the ENV environment variable
- make dev-init to set up the development environment
As mentioned above, this project requires a Prefect account and access to the Prefect Cloud:
make setup-prefect
Set up the storage infrastructure by running:
make create-gcs-buckets
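Under the hood, bucket creation with the official client library looks roughly like the sketch below; the bucket name and location are placeholders, and the Make target above is the supported path.

```python
# Sketch: create a raw-data bucket programmatically, roughly what
# `make create-gcs-buckets` is expected to do. Name and location are placeholders.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("example-unsplash-raw")
if not bucket.exists():
    client.create_bucket(bucket, location="europe-west3")
```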
Start on develop
- Write Tasks and Flows
- If necessary, write Unit Tests
- Run make run-unit-tests (a test sketch follows this list)
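A unit test for a Prefect task could look like the following sketch (assuming pytest is the test runner behind make run-unit-tests). The task extract_photo_ids is hypothetical and defined inline to keep the example self-contained.

```python
# Sketch of a pytest-style unit test for a Prefect task. The `.fn` attribute
# calls the undecorated function without a Prefect run context.
from prefect import task


@task
def extract_photo_ids() -> list[str]:
    # Hypothetical task; in the project this would live under src/
    return ["abc123", "def456"]


def test_extract_photo_ids_returns_ids():
    result = extract_photo_ids.fn()
    assert result == ["abc123", "def456"]
```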
Move on to test
- Merge with develop
- Deploy Flow by running make deploy-flow
- Run an integration test through a manual Flow run (see the sketch after this list)
- Sync with BigQuery (using an existing Flow if you want to sync GCS and BigQuery using a push pattern)
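One way to trigger such a manual run from Python is Prefect's run_deployment helper; the deployment name below is a placeholder, and triggering the run from the Prefect Cloud UI works just as well.

```python
# Sketch: trigger a manual run of a deployed flow for the integration test.
# The "flow-name/deployment-name" string is a placeholder.
from prefect.deployments import run_deployment

flow_run = run_deployment(name="example-photo-flow/test")
print(flow_run.state)
```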
Move on to prod
- Merge with test
- Deploy Flow by running make deploy-flow