Udacity Full Stack Developer Nanodegree program
Brendon Smith (br3ndonland)
Databases can contain trillions of rows and columns, something that spreadsheet programs like Google Sheets and Microsoft Excel simply can't handle. Structured Query Language (SQL) is used to query (ask) the computer to retrieve information from databases.
I completed this project as part of the Udacity Full Stack Developer Nanodegree program. Students were asked to write SQL queries to extract information from a database of news articles with over a million rows. The SQL queries contain advanced joins, selection, and calculation features. Another general-purpose computing language called Python was used to group and control the queries.
- logs.py - main program code
- logs-output.txt - example code output
- README.md - this file, a concise description of the project
- db/ - files that set up the PostgreSQL database. Data can be downloaded from the original source or my copy.
- img/ - images used in repository
- info/ -
- logs-methods.md - computational narrative explaining how I wrote the code for the project
- logs-review.md - Udacity code review
- logs-udacity.md - project description, installation instructions, and rubric from Udacity
- sql/ - database queries in SQL format
- docker-compose.yml and Dockerfile - files used by Docker to set up containers for the application.
- Pipfile and Pipfile.lock - files used by Pipenv to set up a Python virtual environment.
- Vagrantfile - file used to configure a Vagrant virtual environment.
The application can be run by setting up either a virtual environment or a virtual machine. Instructions for each option are provided below.
~
❯ cd /path/to/repo
~/path/to/repo
❯ git clone [email protected]:br3ndonland/udacity-fsnd-sql-logs.git
Install dependencies with a package manager of choice, such as Homebrew on macOS.
brew install python pipenv postgresql
brew cask install postico
Major dependencies:
- Python
- Pipenv
- PostgreSQL or the Postgres app
- Postico (optional): helpful GUI that can complement Postgres. Postico needs a Postgres server to connect to local databases.
-
Pipenv was used to manage the development virtual environment for this project.
-
Install Pipenv with a system package manager like Homebrew, or with the Python package manager
pip
. -
Use Pipenv to install the virtual environment from the Pipfile with
pipenv install --dev
.-
The
--dev
flag was used to accommodate the Black autoformatter, which is considered a pre-release package (see code style below). -
When generating the initial Pipfile containing the Black dev package, the
--pre
flag added a line to the Pipfile to allow pre-release packages. (TOML format):[pipenv] allow_prereleases = true
-
Further information can be found in the Pipenv docs.
-
-
-
VSCode can be configured to recognize the Pipenv virtual environment. See Using Python environments in VS Code.
- Command Palette -> Python: Select Interpreter. Select virtual environment.
- Command Palette -> Python: Create Terminal. Creates a terminal and automatically activates the virtual environment. If the virtual environment is set as the interpreter for the workspace, new terminal windows will automatically start in the virtual environment.
Postgres should be installed as described above. Postico is a helpful GUI for PostgreSQL that can complement (but not replace) Postgres. Postico needs a Postgres server to connect to local databases.
- Unzip data/newsdata.zip. If the file is not present, download from the original source or my copy and save in the data/ directory.
- Enter PostgreSQL with
psql
on the command line. - Create
vagrant
user:- newsdata.sql is configured for a PostgreSQL user
vagrant
. - Without Vagrant, you may need to run
createdb
before starting thepsql
CLI. - When using the Vagrant virtual machine, this step is specified in the Vagrantfile and performed during
vagrant up
.
- newsdata.sql is configured for a PostgreSQL user
- Create
news
database. - Grant user access to database.
- Load newsdata.sql into database.
~
❯ psql
psql (11.3)
Type "help" for help.
user=# CREATE USER vagrant;
user=# CREATE DATABASE news;
user=# GRANT ALL PRIVILEGES ON DATABASE "news" to vagrant;
user=# \q
~
❯ cd /path/to/repo
~/path/to/repo
❯ psql -d news -f db/data/newsdata.sql
The queries can be run by psql
, Postico, or Python.
~/path/to/repo
❯ psql -U vagrant -d news -a -f ./sql/1-most-popular-articles.sql
❯ psql -U vagrant -d news -a -f ./sql/2-most-popular-authors.sql
❯ psql -U vagrant -d news -a -f ./sql/3-http-request-error-rate.sql
~/path/to/repo
❯ pipenv shell
udacity-fsnd-sql-logs-hash ❯ python logs.py
-
Docker is a technology for running lightweight virtual machines called containers.
- An image is the executable set of files read by Docker.
- A container is a running image.
- The Dockerfile tells Docker how to build the container.
-
VSCode has built-in Docker features. See Working with Docker in VSCode and the VSCode tutorial on deploying Python with Docker.
-
To install Docker tools locally:
- Ubuntu Linux: follow the instructions for Ubuntu Linux, making sure to follow the postinstallation steps to activate the Docker daemon.
- macOS and Windows: install Docker Desktop (available via Homebrew with
brew cask install docker
).
-
To build a Docker image and run the container:
cd path/to/repo docker build . -t app:latest docker run -d -p 80:80 app:latest
-t
web tells Docker to name the imageapp
. Adding.
builds from the current directory.-d
runs the container in detached mode. Docker will display the container hash and return the terminal prompt.-p 80:80
maps the http port 80 from your local machine to port 1040 in the container.- A tag can be specified with
name:tag
, otherwise, the taglatest
will be used.
-
Expand this details element for more useful Docker commands.
# List images docker image ls # List containers docker container ls # Inspect a container (web in this example) and return the IP Address docker inspect web | grep IPAddress # Stop a container docker container stop # container hash # Remove a downloaded image docker image rm # image hash or name # Remove a container docker container rm # container hash # Prune stopped containers (wipes them and resets their state) docker container prune # Connect to running container (sort of like SSH) docker ps # get ID/hash of container you want to connect to docker exec -it [ID] /bin/bash # Or, connect as root: docker exec -u 0 -it [ID] /bin/bash # Copy file to/from container: docker cp [container_name]:/path/to/file destination.file
-
There are a few adjustments needed to enable Pipenv and Docker to work together.
-
Pipenv must first be installed with
pip
. -
Python dependencies are then installed from Pipfile.lock.
- Docker containers don't need virtual environments, so the
--system
flag is used to install packages into the container's global Python installation. Thus, it is not necessary to enter the virtual environment withpipenv shell
before starting the application. - The
--deploy
flag causes the build to fail if the Pipfile.lock is out of date. - The
--ignore-pipfile
flag tells Pipenv to use the Pipfile.lock for installation instead of the Pipfile.
# Pull an image: alpine images are tightly controlled and small in size FROM python:3.7-alpine LABEL app=news WORKDIR /app # Copy the directory to /app in the container COPY . /app # Install psycopg2 PostgreSQL dependencies RUN apk update; apk add build-base postgresql postgresql-dev libpq # Install Pipenv RUN python -m pip install pipenv # Install packages from Pipfile.lock, configured for Docker deployments RUN pipenv install --system --deploy --ignore-pipfile
- Docker containers don't need virtual environments, so the
TODO. A preliminary Docker Compose file has been added, but the application is not fully functional within Docker yet.
A virtual machine can be used to run the code from an operating system with a defined configuration. The virtual machine has all the dependencies needed to run the application.
-
HashiCorp Vagrant
- Software that provides an operating system in a defined configuration, allowing it to run identically across many personal computers.
-
Vagrant and VirtualBox can be installed with Homebrew Cask:
brew cask install vagrant
-
Vagrant creates a virtual environment using a provider.
- Oracle VirtualBox is the most common provider used. It is also available as a Homebrew Cask.
- Docker can be used as a Vagrant provider instead of VirtualBox.
- The provider normally installs the operating system using boxes. The Docker provider does not require a
config.vm.box
setting. Rather, it pulls an image from a container registry like Docker Hub.
-
Vagrant is configured with a Vagrantfile.
- This project was originally completed with the Udacity virtual machine configuration, a Git repository from Udacity that contains a Vagrantfile to configure Vagrant. If using the Udacity configuration, note that some of the necessary Python modules in the Udacity virtual machine configuration are only included for Python 2, and not Python 3. If needed, install the modules with
pip
. - A Vagrantfile was later added directly to this repo, with Docker as a provider.
- This project was originally completed with the Udacity virtual machine configuration, a Git repository from Udacity that contains a Vagrantfile to configure Vagrant. If using the Udacity configuration, note that some of the necessary Python modules in the Udacity virtual machine configuration are only included for Python 2, and not Python 3. If needed, install the modules with
-
Vagrant reads the Vagrantfile, and uses the provider to run the virtual environment.
-
Install the Udacity Vagrant virtual machine using the instructions in logs-udacity.md. Summary:
~ ❯ cd /path/to/repo ~/path/to/repo ❯ git clone [email protected]:udacity/fullstack-nanodegree-vm.git ~/path/to/repo ❯ cd fullstack-nanodegree-vm/vagrant ~/path/to/repo/fullstack-nanodegree-vm/vagrant ❯ git clone [email protected]:br3ndonland/udacity-fsnd-sql-logs.git ~/path/to/repo/fullstack-nanodegree-vm/vagrant ❯ vagrant up ~/path/to/repo/fullstack-nanodegree-vm/vagrant ❯ vagrant ssh ❯ vagrant@vagrant:~$ cd /vagrant/udacity-fsnd-sql-logs vagrant@vagrant:/vagrant/logs$ python3 logs.py
I implemented three SQL queries:
What are the most popular three articles of all time?
Returns a sorted list of the three most highly accessed articles, with the top article first.
select title, num from
(select substr(path, 10), count(*) as num from log
where path !='/' group by path)
as hits, articles where substr = slug order by num desc limit 3;
Terminal output:
title | views
----------------------------------+--------
Candidate is jerk, alleges rival | 338647
Bears love berries, alleges bear | 253801
Bad things gone, say good people | 170098
(3 rows)
Who are the most popular article authors of all time?
Returns a sorted list of the most popular article authors, with the most popular author at the top.
select name, sum(views) as total_views from
(select name, author, title, views from
(select substr(path, 10), count(*) as views from log
where path !='/' group by path)
as hits, articles, authors
where substr = slug and author = authors.id
order by views desc)
as threetables group by name order by total_views desc;
Terminal output:
name | total_views
------------------------+-------------
Ursula La Multa | 507594
Rudolf von Treppenwitz | 423457
Anonymous Contributor | 170098
Markoff Chaney | 84557
(4 rows)
On which days did more than 1% of requests lead to errors?
Returns a list of days on which >1% of HTTP requests resulted in HTTP error codes.
select errdate, http_requests, http_404,
100.0 * http_404 / http_requests as errpct from
(select date_trunc('day', time) as reqdate, count(*)
as http_requests from log group by reqdate)
as requests,
(select date_trunc('day', time) as errdate, count(*)
as http_404 from log where status = '404 NOT FOUND'
group by errdate)
as errors
where reqdate = errdate
and errors.http_404 > 0.01 * requests.http_requests
order by errdate desc;
Terminal output:
errdate | http_requests | http_404 | errpct
------------------------+---------------+----------+--------------------
2016-07-17 00:00:00+00 | 55907 | 1265 | 2.2626862468027260
(1 row)