import dask
dask.config.set(scheduler='threads', num_workers = 4)
@@ -761,7 +768,7 @@ 3. Databases
This material is drawn from the tutorial on Working with large datasets in SQL, R, and Python, though I won’t hold you responsible for all of the database/SQL material in that tutorial, only what appears here in this Unit.
Overview
-Basically, standard SQL databases are relational databases that are a collection of rectangular format datasets (tables, also called relations), with each table similar to R or Pandas data frames, in that a table is made up of columns, which are called fields or attributes, each containing a single type (numeric, character, date, currency, enumerated (i.e., categorical), …) and rows or records containing the observations for one entity. Some of the tables in a given database will generally have fields in common so it makes sense to merge (i.e., join) information from multiple tables. E.g., you might have a database with a table of student information, a table of teacher information and a table of school information, and you might join student information with information about the teacher(s) who taught the students. Databases are set up to allow for fast querying and merging (called joins in database terminology).
+Standard SQL databases are relational databases that are a collection of rectangular format datasets (tables, also called relations), with each table similar to R or Pandas data frames, in that a table is made up of columns, which are called fields or attributes, each containing a single type (numeric, character, date, currency, enumerated (i.e., categorical), …) and rows or records containing the observations for one entity. Some of the tables in a given database will generally have fields in common so it makes sense to merge (i.e., join) information from multiple tables. E.g., you might have a database with a table of student information, a table of teacher information and a table of school information, and you might join student information with information about the teacher(s) who taught the students. Databases are set up to allow for fast querying and merging (called joins in database terminology).
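To make the idea of joining tables on a shared field concrete, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The table names, column names, and rows are all hypothetical, and for simplicity it assumes each student has exactly one teacher, which a real school database might not.

```python
import sqlite3

# In-memory database; nothing is written to disk.
con = sqlite3.connect(":memory:")

# Two hypothetical tables that share the field teacher_id.
con.executescript("""
CREATE TABLE teachers (teacher_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE students (student_id INTEGER PRIMARY KEY, name TEXT,
                       teacher_id INTEGER);
INSERT INTO teachers VALUES (1, 'Ms. Lee'), (2, 'Mr. Ortiz');
INSERT INTO students VALUES (10, 'Ana', 1), (11, 'Ben', 2), (12, 'Cy', 1);
""")

# Join student records to teacher records on the common field.
joined = con.execute("""
SELECT s.name, t.name
FROM students AS s
JOIN teachers AS t ON s.teacher_id = t.teacher_id
ORDER BY s.student_id
""").fetchall()

for student, teacher in joined:
    print(student, "is taught by", teacher)

con.close()
```

Each row of the result combines one student record with the matching teacher record, which is the same operation as a merge of two data frames on a key column.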
Memory and disk use
Formally, databases are stored on disk, while Python and R store datasets in memory. This would suggest that databases will be slow to access their data but will be able to store more data than can be loaded into a Python or R session. However, databases can be quite fast due in part to disk caching by the operating system as well as careful implementation of good algorithms for database operations.
@@ -774,7 +781,8 @@ Interacting wi
Many DBMS have a client-server model. Clients connect to the server, with some authentication, and make requests (i.e., queries).
There are often multiple ways to interact with a DBMS, including directly using command line tools provided by the DBMS or via Python or R, among others.
We’ll concentrate on SQLite (because it is simple to use on a single machine). SQLite is quite nice in terms of being self-contained - there is no server-client model, just a single file on your hard drive that stores the database and to which you can connect using the SQLite shell, R, Python, etc. However, it does not have some useful functionality that other DBMS have. For example, you can’t use ALTER TABLE to modify column types, and SQLite versions before 3.35.0 can’t use it to drop columns either.
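One common workaround for the ALTER TABLE limitation is to build a new table with the desired column types, copy the data across, and rename it. The following is a minimal sketch using Python's built-in sqlite3 module. The table and column names are hypothetical, and a real table would also need its indexes and constraints recreated.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Hypothetical table whose score column was mistakenly stored as text.
cur.execute("CREATE TABLE scores (student TEXT, score TEXT)")
cur.executemany("INSERT INTO scores VALUES (?, ?)",
                [("ana", "91"), ("ben", "78")])

# Rebuild the table with the desired column type instead of altering it.
cur.execute("CREATE TABLE scores_new (student TEXT, score INTEGER)")
cur.execute("""
INSERT INTO scores_new
SELECT student, CAST(score AS INTEGER) FROM scores
""")
cur.execute("DROP TABLE scores")
cur.execute("ALTER TABLE scores_new RENAME TO scores")
con.commit()

# typeof() reports the storage class SQLite actually used for each value.
rows = cur.execute(
    "SELECT student, score, typeof(score) FROM scores ORDER BY student"
).fetchall()
print(rows)

con.close()
```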
-A good alternative to SQLite that I encourage you to consider is DuckDB. DuckDB stores data column-wise, which can lead to big speedups when doing queries operating on large portions of tables (so-called “online analytical processing” (OLAP)). Another nice feature of DuckDB is that it can interact with data on disk without always having to read all the data into memory. In fact, ideally we’d use it for this class, but I haven’t had time to create a DuckDB version of the StackOverflow database.
+A good alternative to SQLite that I encourage you to consider is DuckDB. DuckDB stores data column-wise, which can lead to big speedups when doing queries operating on large portions of tables (so-called “online analytical processing” (OLAP)). Another nice feature of DuckDB is that it can interact with data on disk without always having to read all the data into memory.
+In the demo code, we’ll have the option to use either SQLite or DuckDB.
Database schema and normalization
@@ -940,7 +948,7 @@ Stack Over