diff --git a/.nojekyll b/.nojekyll
index a410613..1454779 100644
--- a/.nojekyll
+++ b/.nojekyll
@@ -1 +1 @@
-1a8c3982
\ No newline at end of file
+627a00b4
\ No newline at end of file
diff --git a/labs/lab0-setup.pdf b/labs/lab0-setup.pdf
index 8828a38..71f39f8 100644
Binary files a/labs/lab0-setup.pdf and b/labs/lab0-setup.pdf differ
diff --git a/labs/lab1-submission.pdf b/labs/lab1-submission.pdf
index 7a01f52..2f0059a 100644
Binary files a/labs/lab1-submission.pdf and b/labs/lab1-submission.pdf differ
diff --git a/labs/lab2-testing.pdf b/labs/lab2-testing.pdf
index 0494366..9a316d0 100644
Binary files a/labs/lab2-testing.pdf and b/labs/lab2-testing.pdf differ
diff --git a/labs/lab3-debugging.pdf b/labs/lab3-debugging.pdf
index d568f55..5e5bbdc 100644
Binary files a/labs/lab3-debugging.pdf and b/labs/lab3-debugging.pdf differ
diff --git a/labs/lab4-reproducibility.pdf b/labs/lab4-reproducibility.pdf
index 4d3dcbf..45ca306 100644
Binary files a/labs/lab4-reproducibility.pdf and b/labs/lab4-reproducibility.pdf differ
diff --git a/labs/lab5-codereview.pdf b/labs/lab5-codereview.pdf
index 6db4775..bde2e46 100644
Binary files a/labs/lab5-codereview.pdf and b/labs/lab5-codereview.pdf differ
diff --git a/labs/lab6-scfparallel.html b/labs/lab6-scfparallel.html
index 332bce2..c7b8ae2 100644
--- a/labs/lab6-scfparallel.html
+++ b/labs/lab6-scfparallel.html
@@ -548,6 +548,7 @@

Other login nodes

  • radagast
  • shelob
  • +

    Note that all those machines are stand-alone login nodes. After logging in and launching a job, the job will be run on a cluster node. Usually, users cannot ssh into cluster nodes; jobs are launched via Slurm from one of the login machines. However, if one has a job running on a cluster node, that specific node becomes accessible via ssh. This can be helpful with troubleshooting.

    diff --git a/labs/lab6-scfparallel.pdf b/labs/lab6-scfparallel.pdf index 4ed0a3b..5fa72c1 100644 Binary files a/labs/lab6-scfparallel.pdf and b/labs/lab6-scfparallel.pdf differ diff --git a/ps/ps1.pdf b/ps/ps1.pdf index 1ad96ae..4562057 100644 Binary files a/ps/ps1.pdf and b/ps/ps1.pdf differ diff --git a/ps/ps2.pdf b/ps/ps2.pdf index 52662f2..a1d0c70 100644 Binary files a/ps/ps2.pdf and b/ps/ps2.pdf differ diff --git a/ps/ps3.pdf b/ps/ps3.pdf index 2980280..308bd09 100644 Binary files a/ps/ps3.pdf and b/ps/ps3.pdf differ diff --git a/ps/ps4.pdf b/ps/ps4.pdf index 3f4c8cf..4637168 100644 Binary files a/ps/ps4.pdf and b/ps/ps4.pdf differ diff --git a/search.json b/search.json index 9a7128d..29bf4e8 100644 --- a/search.json +++ b/search.json @@ -1784,7 +1784,7 @@ "href": "units/unit7-bigData.html#using-dask-for-big-data-processing", "title": "Big data and databases", "section": "Using Dask for big data processing", - "text": "Using Dask for big data processing\nUnit 6 on parallelization gives an overview of using Dask for flexible parallelization on different kinds of computational resources (in particular, parallelizing across multiple cores on one machine versus parallelizing across multiple cores across multiple machines/nodes).\nHere we’ll see the use of Dask to work with distributed datasets. Dask can process datasets (potentially very large ones) by parallelizing operations across subsets of the data using multiple cores on one or more machines.\nLike Spark, Dask automatically reads data from files in parallel and operates on chunks (also called partitions or shards) of the full dataset in parallel. There are two big advantages of this:\n\nYou can do calculations (including reading from disk) in parallel because each worker will work on a piece of the data.\nWhen the data is split across machines, you can use the memory of multiple machines to handle much larger datasets than would be possible in memory on one machine. That said, Dask processes the data in chunks, so one often doesn’t need a lot of memory, even just on one machine.\n\nWhile reading from disk in parallel is a good goal, if all the data are on one hard drive, there are limitations on the speed of reading the data from disk because of having multiple processes all trying to access the disk at once. Supercomputing systems will generally have parallel file systems that support truly parallel reading (and writing, i.e., parallel I/O). Hadoop/Spark deal with this by distributing across multiple disks, generally one disk per machine/node.\nBecause computations are done in external compiled code (e.g., via numpy) it’s effective to use the threads scheduler when operating on one node to avoid having to copy and move the data.\n\nDask dataframes (pandas)\nDask dataframes are Pandas-like dataframes where each dataframe is split into groups of rows, stored as smaller Pandas dataframes.\nOne can do a lot of the kinds of computations that you would do on a Pandas dataframe on a Dask dataframe, but many operations are not possible. See here.\nBy default dataframes are handled by the threads scheduler. (Recall we discussed Dask’s various schedulers in Unit 6.)\nHere’s an example of reading from a dataset of flight delays (about 11 GB data). 
You can get the data here.\n\nimport dask\ndask.config.set(scheduler='threads', num_workers = 4) \nimport dask.dataframe as ddf\npath = '/scratch/users/paciorek/243/AirlineData/csvs/'\nair = ddf.read_csv(path + '*.csv.bz2',\n compression = 'bz2',\n encoding = 'latin1', # (unexpected) latin1 value(s) in TailNum field in 2001\n dtype = {'Distance': 'float64', 'CRSElapsedTime': 'float64',\n 'TailNum': 'object', 'CancellationCode': 'object', 'DepDelay': 'float64'})\n# specify some dtypes so Pandas doesn't complain about column type heterogeneity\nair\n\nDask will reads the data in parallel from the various .csv.bz2 files (unzipping on the fly), but note the caveat in the previous section about the possibilities for truly parallel I/O.\nHowever, recall that Dask uses delayed (lazy) evaluation. In this case, the reading is delayed until compute() is called. For that matter, the various other calculations (max, groupby, mean) shown below are only done after compute() is called.\n\nimport time\n\nt0 = time.time()\nair.DepDelay.max().compute() # this takes a while\nprint(time.time() - t0)\n\nt0 = time.time()\nair.DepDelay.mean().compute() # this takes a while\nprint(time.time() - t0)\n\nair.DepDelay.median().compute() \n\nWe’ll discuss in class why Dask won’t do the median. Consider the discussion about moving data in the earlier section on MapReduce.\nNext let’s see a full split-apply-combine (aka MapReduce) type of analysis.\n\nsub = air[(air.UniqueCarrier == 'UA') & (air.Origin == 'SFO')]\nbyDest = sub.groupby('Dest').DepDelay.mean()\nresults = byDest.compute() # this takes a while too\nresults\n\nYou should see this:\n Dest \n ACV 26.200000 \n BFL 1.000000 \n BOI 12.855069 \n BOS 9.316795 \n CLE 4.000000\n ...\nNote: calling compute twice is a bad idea as Dask will read in the data twice - more on this in a bit.\n\n\n\n\n\n\nMemory use for results objects\n\n\n\nThink carefully about the size of the result from calling compute. The result will be returned as a standard Python object, not distributed across multiple workers (and possibly machines), and with the object entirely in memory. It’s easy to accidentally return an entire giant dataset.\n\n\n\n\nDask bags\nBags are like lists but there is no particular ordering, so it doesn’t make sense to ask for the i’th element.\nYou can think of operations on Dask bags as being like parallel map operations on lists in Python or R.\nBy default bags are handled via the processes scheduler.\nLet’s see some basic operations on a large dataset of Wikipedia log files. You can get a subset of the Wikipedia data here.\nHere we again read the data in (which Dask will do in parallel):\n\nimport dask.multiprocessing\ndask.config.set(scheduler='processes', num_workers = 4) \nimport dask.bag as db\n## This is the full data\n## path = '/scratch/users/paciorek/wikistats/dated_2017/'\n## For demo we'll just use a small subset\npath = '/scratch/users/paciorek/wikistats/dated_2017_small/dated/'\nwiki = db.read_text(path + 'part-0*gz')\n\nHere we’ll just count the number of records.\n\nimport time\nt0 = time.time()\nwiki.count().compute()\ntime.time() - t0 # 136 sec. 
for full data\n\nAnd here is a more realistic example of filtering (subsetting).\n\nimport re\ndef find(line, regex = 'Armenia'):\n vals = line.split(' ')\n if len(vals) < 6:\n return(False)\n tmp = re.search(regex, vals[3])\n if tmp is None:\n return(False)\n else:\n return(True)\n \n\nwiki.filter(find).count().compute()\narmenia = wiki.filter(find)\nsmp = armenia.take(100) ## grab a handful as proof of concept\nsmp[0:5]\n\nNote that it is quite inefficient to do the find() (and implicitly reading the data in) and then compute on top of that intermediate result in two separate calls to compute(). Rather, we should set up the code so that all the operations are set up before a single call to compute(). This is discussed in detail in the Dask/future tutorial.\nSince the data are just treated as raw strings, we might want to introduce structure by converting each line to a tuple and then converting to a data frame.\n\ndef make_tuple(line):\n return(tuple(line.split(' ')))\n\ndtypes = {'date': 'object', 'time': 'object', 'language': 'object',\n'webpage': 'object', 'hits': 'float64', 'size': 'float64'}\n\n## Let's create a Dask dataframe. \n## This will take a while if done on full data.\ndf = armenia.map(make_tuple).to_dataframe(dtypes)\ntype(df)\n\n## Now let's actually do the computation, returning a Pandas df\nresult = df.compute() \ntype(result)\nresult[0:5]\n\n\n\nDask arrays (numpy)\nDask arrays are numpy-like arrays where each array is split up by both rows and columns into smaller numpy arrays.\nOne can do a lot of the kinds of computations that you would do on a numpy array on a Dask array, but many operations are not possible. See here.\nBy default arrays are handled via the threads scheduler.\n\nNon-distributed arrays\nLet’s first see operations on a single node, using a single 13 GB two-dimensional array. 
Again, Dask uses lazy evaluation, so creation of the array doesn’t happen until an operation requiring output is done.\n\nimport dask\ndask.config.set(scheduler = 'threads', num_workers = 4) \nimport dask.array as da\nx = da.random.normal(0, 1, size=(40000,40000), chunks=(10000, 10000))\n# square 10k x 10k chunks\nmycalc = da.mean(x, axis = 1) # by row\nimport time\nt0 = time.time()\nrs = mycalc.compute()\ntime.time() - t0 # 41 sec.\n\nFor a row-based operation, we would presumably only want to chunk things up by row, but this doesn’t seem to actually make a difference, presumably because the mean calculation can be done in pieces and only a small number of summary statistics moved between workers.\n\nimport dask\ndask.config.set(scheduler='threads', num_workers = 4) \nimport dask.array as da\n# x = da.from_array(x, chunks=(2500, 40000)) # adjust chunk size of existing array\nx = da.random.normal(0, 1, size=(40000,40000), chunks=(2500, 40000))\nmycalc = da.mean(x, axis = 1) # row means\nimport time\nt0 = time.time()\nrs = mycalc.compute()\ntime.time() - t0 # 42 sec.\n\nOf course, given the lazy evaluation, this timing comparison is not just timing the actual row mean calculations.\nBut this doesn’t really clarify the story…\n\nimport dask\ndask.config.set(scheduler='threads', num_workers = 4) \nimport dask.array as da\nimport numpy as np\nimport time\nt0 = time.time()\nx = np.random.normal(0, 1, size=(40000,40000))\ntime.time() - t0 # 110 sec.\n# for some reason the from_array and da.mean calculations are not done lazily here\nt0 = time.time()\ndx = da.from_array(x, chunks=(2500, 40000))\ntime.time() - t0 # 27 sec.\nt0 = time.time()\nmycalc = da.mean(x, axis = 1) # what is this doing given .compute() also takes time?\ntime.time() - t0 # 28 sec.\nt0 = time.time()\nrs = mycalc.compute()\ntime.time() - t0 # 21 sec.\n\nDask will avoid storing all the chunks in memory. (It appears to just generate them on the fly.) Here we have an 80 GB array but we never use more than a few GB of memory (based on top or free -h).\n\nimport dask\ndask.config.set(scheduler='threads', num_workers = 4) \nimport dask.array as da\nx = da.random.normal(0, 1, size=(100000,100000), chunks=(10000, 10000))\nmycalc = da.mean(x, axis = 1) # row means\nimport time\nt0 = time.time()\nrs = mycalc.compute()\ntime.time() - t0 # 205 sec.\nrs[0:5]\n\n\n\nDistributed arrays\nUsing arrays distributed across multiple machines should be straightforward based on using Dask distributed. However, one would want to be careful about creating arrays by distributing the data from a single Python process as that would involve copying between machines.", + "text": "Using Dask for big data processing\nUnit 6 on parallelization gives an overview of using Dask for flexible parallelization on different kinds of computational resources (in particular, parallelizing across multiple cores on one machine versus parallelizing across multiple cores across multiple machines/nodes).\nHere we’ll see the use of Dask to work with distributed datasets. Dask can process datasets (potentially very large ones) by parallelizing operations across subsets of the data using multiple cores on one or more machines.\nLike Spark, Dask automatically reads data from files in parallel and operates on chunks (also called partitions or shards) of the full dataset in parallel. 
There are two big advantages of this:\n\nYou can do calculations (including reading from disk) in parallel because each worker will work on a piece of the data.\nWhen the data is split across machines, you can use the memory of multiple machines to handle much larger datasets than would be possible in memory on one machine. That said, Dask processes the data in chunks, so one often doesn’t need a lot of memory, even just on one machine.\n\nWhile reading from disk in parallel is a good goal, if all the data are on one hard drive, there are limitations on the speed of reading the data from disk because of having multiple processes all trying to access the disk at once. Supercomputing systems will generally have parallel file systems that support truly parallel reading (and writing, i.e., parallel I/O). Hadoop/Spark deal with this by distributing across multiple disks, generally one disk per machine/node.\nBecause computations are done in external compiled code (e.g., via numpy) it’s effective to use the threads scheduler when operating on one node to avoid having to copy and move the data.\n\nDask dataframes (pandas)\nDask dataframes are Pandas-like dataframes where each dataframe is split into groups of rows, stored as smaller Pandas dataframes.\nOne can do a lot of the kinds of computations that you would do on a Pandas dataframe on a Dask dataframe, but many operations are not possible. See here.\nBy default dataframes are handled by the threads scheduler. (Recall we discussed Dask’s various schedulers in Unit 6.)\nHere’s an example of reading from a dataset of flight delays (about 11 GB data). You can get the data here.\n\nimport dask\ndask.config.set(scheduler='threads', num_workers = 4) \nimport dask.dataframe as ddf\npath = '/scratch/users/paciorek/243/AirlineData/csvs/'\nair = ddf.read_csv(path + '*.csv.bz2',\n compression = 'bz2',\n encoding = 'latin1', # (unexpected) latin1 value(s) in TailNum field in 2001\n dtype = {'Distance': 'float64', 'CRSElapsedTime': 'float64',\n 'TailNum': 'object', 'CancellationCode': 'object', 'DepDelay': 'float64',\n 'ActualElapsedTime': 'float64', 'ArrDelay': 'float64', 'ArrTime': 'float64',\n 'DepTime': 'float64'})\n## Specify some dtypes so Pandas doesn't complain about heterogeneity of ints and\n## floats within certain columns.\nair\nair.npartitions\n\nDask will reads the data in parallel from the various .csv.bz2 files (unzipping on the fly), but note the caveat in the previous section about the possibilities for truly parallel I/O.\nHowever, recall that Dask uses delayed (lazy) evaluation. In this case, the reading is delayed until compute() is called. For that matter, the various other calculations (max, groupby, mean) shown below are only done after compute() is called.\n\nimport time\n\nt0 = time.time()\nair.DepDelay.max().compute() # This takes a while.\nprint(time.time() - t0)\n\nt0 = time.time()\nair.DepDelay.mean().compute() # This takes a while.\nprint(time.time() - t0)\n\nair.DepDelay.median().compute() \n\nWe’ll discuss in class why Dask won’t do the median. 
Consider the discussion about moving data in the earlier section on MapReduce.\nNext let’s see a full split-apply-combine (aka MapReduce) type of analysis.\n\nsub = air[(air.UniqueCarrier == 'UA') & (air.Origin == 'SFO')]\nbyDest = sub.groupby('Dest').DepDelay.mean()\nresults = byDest.compute() # This takes a while too.\nresults\n\nYou should see this:\n Dest \n ACV 26.200000 \n BFL 1.000000 \n BOI 12.855069 \n BOS 9.316795 \n CLE 4.000000\n ...\nNote: calling compute twice is a bad idea as Dask will read in the data twice - more on this in a bit.\n\n\n\n\n\n\nMemory use for results objects\n\n\n\nThink carefully about the size of the result from calling compute. The result will be returned as a standard Python object, not distributed across multiple workers (and possibly machines), and with the object entirely in memory. It’s easy to accidentally return an entire giant dataset.\n\n\n\n\nDask bags\nBags are like lists but there is no particular ordering, so it doesn’t make sense to ask for the i’th element.\nYou can think of operations on Dask bags as being like parallel map operations on lists in Python or R.\nBy default bags are handled via the processes scheduler.\nLet’s see some basic operations on a large dataset of Wikipedia log files. You can get a subset of the Wikipedia data here.\nHere we again read the data in (which Dask will do in parallel):\n\nimport dask.multiprocessing\ndask.config.set(scheduler='processes', num_workers = 4) \nimport dask.bag as db\n## This is the full data\n## path = '/scratch/users/paciorek/wikistats/dated_2017/'\n## For demo we'll just use a small subset.\npath = '/scratch/users/paciorek/wikistats/dated_2017_small/dated/'\nwiki = db.read_text(path + 'part-0*gz')\n\nHere we’ll just count the number of records.\n\nimport time\nt0 = time.time()\nwiki.count().compute()\ntime.time() - t0 # 136 sec. for full data\n\nAnd here is a more realistic example of filtering (subsetting).\n\nimport re\n\ndef find(line, regex = 'Armenia'):\n vals = line.split(' ')\n if len(vals) < 6:\n return(False)\n tmp = re.search(regex, vals[3])\n if tmp is None:\n return(False)\n else:\n return(True)\n \n\nwiki.filter(find).count().compute()\n\narmenia = wiki.filter(find)\nsmp = armenia.take(100) # Grab a handful as proof of concept.\nsmp[0:5]\n\nNote that it is quite inefficient to do the find() (and implicitly read the data in) and then compute on top of that intermediate result in two separate calls to compute(). Rather, we should set up the code so that all the operations are set up before a single call to compute(). This is discussed in detail in the Dask/future tutorial.\nSince the data are just treated as raw strings, we might want to introduce structure by converting each line to a tuple and then converting to a data frame.\n\ndef make_tuple(line):\n return(tuple(line.split(' ')))\n\ndtypes = {'date': 'object', 'time': 'object', 'language': 'object',\n'webpage': 'object', 'hits': 'float64', 'size': 'float64'}\n\n## Let's create a Dask dataframe. 
\n## This will take a while if done on full data.\ndf = armenia.map(make_tuple).to_dataframe(dtypes)\ntype(df)\n\n## Now let's actually do the computation,\n## returning a **Pandas** (not Dask) df to the main process.\nresult = df.compute() \ntype(result)\nresult[0:5]\n\n\n\nDask arrays (numpy)\nDask arrays are numpy-like arrays where each array is split up by both rows and columns into smaller numpy arrays.\nOne can do a lot of the kinds of computations that you would do on a numpy array on a Dask array, but many operations are not possible. See here.\nBy default arrays are handled via the threads scheduler.\n\nNon-distributed arrays\nLet’s first see operations on a single node, using a single 13 GB two-dimensional array. Again, Dask uses lazy evaluation, so creation of the array doesn’t happen until an operation requiring output is done.\n\nimport dask\ndask.config.set(scheduler = 'threads', num_workers = 4) \nimport dask.array as da\nx = da.random.normal(0, 1, size=(40000,40000), chunks=(10000, 10000))\n# square 10k x 10k chunks\nmycalc = da.mean(x, axis = 1) # by row\nimport time\nt0 = time.time()\nrs = mycalc.compute()\ntime.time() - t0 # 41 sec.\n\nFor a row-based operation, we would presumably only want to chunk things up by row, but this doesn’t seem to actually make a difference, presumably because the mean calculation can be done in pieces and only a small number of summary statistics moved between workers.\n\nimport dask\ndask.config.set(scheduler='threads', num_workers = 4) \nimport dask.array as da\n# x = da.from_array(x, chunks=(2500, 40000)) # How to adjust chunk size of existing array.\nx = da.random.normal(0, 1, size=(40000,40000), chunks=(2500, 40000))\nmycalc = da.mean(x, axis = 1) # row means\nimport time\nt0 = time.time()\nrs = mycalc.compute()\ntime.time() - t0 # 42 sec.\n\nOf course, given the lazy evaluation, this timing comparison is not just timing the actual row mean calculations.\nBut this doesn’t really clarify the story…\n\nimport dask\ndask.config.set(scheduler='threads', num_workers = 4) \nimport dask.array as da\nimport numpy as np\nimport time\nt0 = time.time()\nx = np.random.normal(0, 1, size=(40000,40000))\ntime.time() - t0 # 110 sec.\n# For some reason the `from_array` and `da.mean` calculations are not done lazily here.\nt0 = time.time()\ndx = da.from_array(x, chunks=(2500, 40000))\ntime.time() - t0 # 27 sec.\nt0 = time.time()\nmycalc = da.mean(x, axis = 1) # What is this doing given .compute() also takes time?\ntime.time() - t0 # 28 sec.\nt0 = time.time()\nrs = mycalc.compute()\ntime.time() - t0 # 21 sec.\n\nDask will avoid storing all the chunks in memory. (It appears to just generate them on the fly.) Here we have an 80 GB array, but we never use more than a few GB of memory (based on top or free -h).\n\nimport dask\ndask.config.set(scheduler='threads', num_workers = 4) \nimport dask.array as da\nx = da.random.normal(0, 1, size=(100000,100000), chunks=(10000, 10000))\nmycalc = da.mean(x, axis = 1) # row means\nimport time\nt0 = time.time()\nrs = mycalc.compute()\ntime.time() - t0 # 205 sec.\nrs[0:5]\n\n\n\nDistributed arrays\nUsing arrays distributed across multiple machines should be straightforward based on using Dask distributed. 
However, one would want to be careful about creating arrays by distributing the data from a single Python process as that would involve copying between machines.", "crumbs": [ "Units", "Unit 7 (Databases)" @@ -1795,7 +1795,7 @@ "href": "units/unit7-bigData.html#overview-1", "title": "Big data and databases", "section": "Overview", - "text": "Overview\nBasically, standard SQL databases are relational databases that are a collection of rectangular format datasets (tables, also called relations), with each table similar to R or Pandas data frames, in that a table is made up of columns, which are called fields or attributes, each containing a single type (numeric, character, date, currency, enumerated (i.e., categorical), …) and rows or records containing the observations for one entity. Some of the tables in a given database will generally have fields in common so it makes sense to merge (i.e., join) information from multiple tables. E.g., you might have a database with a table of student information, a table of teacher information and a table of school information, and you might join student information with information about the teacher(s) who taught the students. Databases are set up to allow for fast querying and merging (called joins in database terminology).\n\nMemory and disk use\nFormally, databases are stored on disk, while Python and R store datasets in memory. This would suggest that databases will be slow to access their data but will be able to store more data than can be loaded into an Python or R session. However, databases can be quite fast due in part to disk caching by the operating system as well as careful implementation of good algorithms for database operations.", + "text": "Overview\nStandard SQL databases are relational databases that are a collection of rectangular format datasets (tables, also called relations), with each table similar to R or Pandas data frames, in that a table is made up of columns, which are called fields or attributes, each containing a single type (numeric, character, date, currency, enumerated (i.e., categorical), …) and rows or records containing the observations for one entity. Some of the tables in a given database will generally have fields in common so it makes sense to merge (i.e., join) information from multiple tables. E.g., you might have a database with a table of student information, a table of teacher information and a table of school information, and you might join student information with information about the teacher(s) who taught the students. Databases are set up to allow for fast querying and merging (called joins in database terminology).\n\nMemory and disk use\nFormally, databases are stored on disk, while Python and R store datasets in memory. This would suggest that databases will be slow to access their data but will be able to store more data than can be loaded into an Python or R session. However, databases can be quite fast due in part to disk caching by the operating system as well as careful implementation of good algorithms for database operations.", "crumbs": [ "Units", "Unit 7 (Databases)" @@ -1806,7 +1806,7 @@ "href": "units/unit7-bigData.html#interacting-with-a-database", "title": "Big data and databases", "section": "Interacting with a database", - "text": "Interacting with a database\nYou can interact with databases in a variety of database systems (DBMS=database management system). Some popular systems are SQLite, DuckDB, MySQL, PostgreSQL, Oracle and Microsoft Access. 
We’ll concentrate on accessing data in a database rather than management of databases. SQL is the Structured Query Language and is a special-purpose high-level language for managing databases and making queries. Variations on SQL are used in many different DBMS.\nQueries are the way that the user gets information (often simply subsets of tables or information merged across tables). The result of an SQL query is in general another table, though in some cases it might have only one row and/or one column.\nMany DBMS have a client-server model. Clients connect to the server, with some authentication, and make requests (i.e., queries).\nThere are often multiple ways to interact with a DBMS, including directly using command line tools provided by the DBMS or via Python or R, among others.\nWe’ll concentrate on SQLite (because it is simple to use on a single machine). SQLite is quite nice in terms of being self-contained - there is no server-client model, just a single file on your hard drive that stores the database and to which you can connect to using the SQLite shell, R, Python, etc. However, it does not have some useful functionality that other DBMS have. For example, you can’t use ALTER TABLE to modify column types or drop columns.\nA good alternative to SQLite that I encourage you to consider is DuckDB. DuckDB stores data column-wise, which can lead to big speedups when doing queries operating on large portions of tables (so-called “online analytical processing” (OLAP)). Another nice feature of DuckDB is that it can interact with data on disk without always having to read all the data into memory. In fact, ideally we’d use it for this class, but I haven’t had time to create a DuckDB version of the StackOverflow database.", + "text": "Interacting with a database\nYou can interact with databases in a variety of database systems (DBMS=database management system). Some popular systems are SQLite, DuckDB, MySQL, PostgreSQL, Oracle and Microsoft Access. We’ll concentrate on accessing data in a database rather than management of databases. SQL is the Structured Query Language and is a special-purpose high-level language for managing databases and making queries. Variations on SQL are used in many different DBMS.\nQueries are the way that the user gets information (often simply subsets of tables or information merged across tables). The result of an SQL query is in general another table, though in some cases it might have only one row and/or one column.\nMany DBMS have a client-server model. Clients connect to the server, with some authentication, and make requests (i.e., queries).\nThere are often multiple ways to interact with a DBMS, including directly using command line tools provided by the DBMS or via Python or R, among others.\nWe’ll concentrate on SQLite (because it is simple to use on a single machine). SQLite is quite nice in terms of being self-contained - there is no server-client model, just a single file on your hard drive that stores the database and to which you can connect to using the SQLite shell, R, Python, etc. However, it does not have some useful functionality that other DBMS have. For example, you can’t use ALTER TABLE to modify column types or drop columns.\nA good alternative to SQLite that I encourage you to consider is DuckDB. DuckDB stores data column-wise, which can lead to big speedups when doing queries operating on large portions of tables (so-called “online analytical processing” (OLAP)). 
Another nice feature of DuckDB is that it can interact with data on disk without always having to read all the data into memory.\nIn the demo code, we’ll have the option to use either SQLite or DuckDB.", "crumbs": [ "Units", "Unit 7 (Databases)" @@ -1828,7 +1828,7 @@ "href": "units/unit7-bigData.html#stack-overflow-metadata-example", "title": "Big data and databases", "section": "Stack Overflow metadata example", - "text": "Stack Overflow metadata example\nI’ve obtained data from Stack Overflow, the popular website for asking coding questions, and placed it into a normalized database. The SQLite version has metadata (i.e., it lacks the actual text of the questions and answers) on all of the questions and answers posted in 2021.\nWe’ll explore SQL functionality using this example database.\nNow let’s consider the Stack Overflow data. Each question may have multiple answers and each question may have multiple (topic) tags.\nIf we tried to put this into a single table, the fields could look like this if we have one row per question:\n\nquestion ID\nID of user submitting question\nquestion title\ntag 1\ntag 2\n…\ntag n\nanswer 1 ID\nID of user submitting answer 1\nage of user submitting answer 1\nname of user submitting answer 1\nanswer 2 ID\nID of user submitting answer 2\nage of user submitting answer 2\nname of user submitting answer 2\n…\n\nor like this if we have one row per question-answer pair:\n\nquestion ID\nID of user submitting question\nquestion title\ntag 1\ntag 2\n…\ntag n\nanswer ID\nID of user submitting answer\nage of user submitting answer\nname of user submitting answer\n\nAs we’ve discussed neither of those schema is particularly desirable.\n\n\n\n\n\n\nChallenge\n\n\n\nHow would you devise a schema to normalize the data. I.e., what set of tables do you think we should create?\n\n\nYou can view one reasonable schema. The lines between tables indicate the relationship of foreign keys in one table to primary keys in another table. The schema in the actual database of Stack Overflow data we’ll use in the examples here is similar to but not identical to that.\nYou can download a copy of the SQLite version of the Stack Overflow 2021 database.", + "text": "Stack Overflow metadata example\nI’ve obtained data from Stack Overflow, the popular website for asking coding questions, and placed it into a normalized database. The SQLite version has metadata (i.e., it lacks the actual text of the questions and answers) on all of the questions and answers posted in 2021.\nWe’ll explore SQL functionality using this example database.\nNow let’s consider the Stack Overflow data. Each question may have multiple answers and each question may have multiple (topic) tags.\nIf we tried to put this into a single table, the fields could look like this if we have one row per question:\n\nquestion ID\nID of user submitting question\nquestion title\ntag 1\ntag 2\n…\ntag n\nanswer 1 ID\nID of user submitting answer 1\nage of user submitting answer 1\nname of user submitting answer 1\nanswer 2 ID\nID of user submitting answer 2\nage of user submitting answer 2\nname of user submitting answer 2\n…\n\nor like this if we have one row per question-answer pair:\n\nquestion ID\nID of user submitting question\nquestion title\ntag 1\ntag 2\n…\ntag n\nanswer ID\nID of user submitting answer\nage of user submitting answer\nname of user submitting answer\n\nAs we’ve discussed neither of those schema is particularly desirable.\n\n\n\n\n\n\nChallenge\n\n\n\nHow would you devise a schema to normalize the data. 
I.e., what set of tables do you think we should create?\n\n\nYou can view one reasonable schema. The lines between tables indicate the relationship of foreign keys in one table to primary keys in another table. The schema in the actual database of Stack Overflow data we’ll use in the examples here is similar to but not identical to that.\nYou can download a copy of the SQLite version of the Stack Overflow 2021 database or a copy of the DuckDB version of the Stack Overflow 2021 database.", "crumbs": [ "Units", "Unit 7 (Databases)" @@ -1839,7 +1839,7 @@ "href": "units/unit7-bigData.html#accessing-databases-in-python", "title": "Big data and databases", "section": "Accessing databases in Python", - "text": "Accessing databases in Python\nPython provides a variety of front-end packages for manipulating databases from a variety of DBMS (SQLite, DuckDB, MySQL, PostgreSQL, among others). Basically, you start with a bit of code that links to the actual database, and then you can easily query the database using SQL syntax regardless of the back-end. The Python function calls that wrap around the SQL syntax will also look the same regardless of the back-end (basically execute(\"SOME SQL STATEMENT\")).\nWith SQLite, Python processes make calls against the stand-alone SQLite database (.db) file, so there are no SQLite-specific processes. With a client-server DBMS like PostgreSQL, Python processes call out to separate Postgres processes; these are started from the overall Postgres background process\nYou can access and navigate an SQLite database from Python as follows.\n\nimport sqlite3 as sq\ndir_path = '../data' # Replace with the actual path\ndb_filename = 'stackoverflow-2021.db'\n## download from http://www.stat.berkeley.edu/share/paciorek/stackoverflow-2021.db\n\ncon = sq.connect(os.path.join(dir_path, db_filename))\ndb = con.cursor()\ndb.execute(\"select * from questions limit 5\") # simple query \n\n<sqlite3.Cursor object at 0x7f8202e37640>\n\ndb.fetchall() # retrieve results\n\n[(65534165.0, '2021-01-01 22:15:54', 0.0, 112.0, 2.0, 0.0, None, \"Can't update a value in sqlite3\", 13189393.0), (65535296.0, '2021-01-02 01:33:13', 2.0, 1109.0, 0.0, 0.0, None, 'Install and run ROS on Google Colab', 14924336.0), (65535910.0, '2021-01-02 04:01:34', -1.0, 110.0, 1.0, 8.0, 0.0, 'Operators on date/time fields', 651174.0), (65535916.0, '2021-01-02 04:03:20', 1.0, 35.0, 1.0, 0.0, None, 'Plotting values normalised', 14695007.0), (65536749.0, '2021-01-02 07:03:04', 0.0, 108.0, 1.0, 5.0, None, 'Export C# to word with template', 14899717.0)]\n\n\nAlternatively, we could use DuckDB. 
However, I don’t have a DuckDB version of the StackOverflow database, so one can’t actually run this code.\n\nimport duckdb as dd\ndir_path = '../data' # Replace with the actual path\ndb_filename = 'stackoverflow-2021.duckdb' # This doesn't exist.\n\ncon = dd.connect(os.path.join(dir_path, db_filename))\ndb = con.cursor()\ndb.execute(\"select * from questions limit 5\") # simple query \ndb.fetchall() # retrieve results\n\nWe can (fairly) easily see the tables (this is easier from R):\n\ndef db_list_tables(db):\n db.execute(\"SELECT name FROM sqlite_master WHERE type='table';\")\n tables = db.fetchall()\n return [table[0] for table in tables]\n\ndb_list_tables(db)\n\n['questions', 'answers', 'questions_tags', 'users']\n\n\nTo see the fields in the table, if you’ve just queried the table, you can look at description:\n\n[item[0] for item in db.description]\n\n['name']\n\n\ndef get_fields():\n return [item[0] for item in db.description]\n\nHere’s how to make a basic SQL query. One can either make the query and get the results in one go or make the query and separately fetch the results. Here we’ve selected the first five rows (and all columns, based on the * wildcard) and brought them into Python as list of tuples.\n\nresults = db.execute(\"select * from questions limit 5\").fetchall() # simple query \ntype(results)\n\n<class 'list'>\n\ntype(results[0])\n\n<class 'tuple'>\n\nquery = db.execute(\"select * from questions\") # simple query \nresults2 = query.fetchmany(5)\nresults == results2\n\nTrue\n\n\nTo disconnect from the database:\n\ndb.close()\n\nIt’s convenient to get a Pandas dataframe back as the result. To that we can execute queries like this:\n\nimport pandas as pd\nresults = pd.read_sql(\"select * from questions limit 5\", con)", + "text": "Accessing databases in Python\nPython provides a variety of front-end packages for manipulating databases from a variety of DBMS (SQLite, DuckDB, MySQL, PostgreSQL, among others). Basically, you start with a bit of code that links to the actual database, and then you can easily query the database using SQL syntax regardless of the back-end. The Python function calls that wrap around the SQL syntax will also look the same regardless of the back-end (basically execute(\"SOME SQL STATEMENT\")).\nWith SQLite, Python processes make calls against the stand-alone SQLite database (.db) file, so there are no SQLite-specific processes. 
With a client-server DBMS like PostgreSQL, Python processes call out to separate Postgres processes; these are started from the overall Postgres background process\nYou can access and navigate an SQLite database from Python as follows.\n\nimport sqlite3 as sq\ndir_path = '../data' # Replace with the actual path\ndb_filename = 'stackoverflow-2021.db'\n## Download from http://www.stat.berkeley.edu/share/paciorek/stackoverflow-2021.db.\n\ncon = sq.connect(os.path.join(dir_path, db_filename))\ndb = con.cursor()\ndb.execute(\"select * from questions limit 5\") # simple query \n\n<sqlite3.Cursor object at 0x7f1bc801f640>\n\ndb.fetchall() # retrieve results\n\n[(65534165.0, '2021-01-01 22:15:54', 0.0, 112.0, 2.0, 0.0, None, \"Can't update a value in sqlite3\", 13189393.0), (65535296.0, '2021-01-02 01:33:13', 2.0, 1109.0, 0.0, 0.0, None, 'Install and run ROS on Google Colab', 14924336.0), (65535910.0, '2021-01-02 04:01:34', -1.0, 110.0, 1.0, 8.0, 0.0, 'Operators on date/time fields', 651174.0), (65535916.0, '2021-01-02 04:03:20', 1.0, 35.0, 1.0, 0.0, None, 'Plotting values normalised', 14695007.0), (65536749.0, '2021-01-02 07:03:04', 0.0, 108.0, 1.0, 5.0, None, 'Export C# to word with template', 14899717.0)]\n\n\nAlternatively, we could use DuckDB.\n\nimport duckdb as dd\ndir_path = '../data' # Replace with the actual path\ndb_filename = 'stackoverflow-2021.duckdb' \n## Download from http://www.stat.berkeley.edu/share/paciorek/stackoverflow-2021.duckdb.\n\ncon = dd.connect(os.path.join(dir_path, db_filename))\ndb = con.cursor()\ndb.execute(\"select * from questions limit 5\") # simple query \ndb.fetchall() # retrieve results\n\nWe can (fairly) easily see the tables (this is easier from R):\n\ndef db_list_tables(db):\n db.execute(\"SELECT name FROM sqlite_master WHERE type='table';\")\n tables = db.fetchall()\n return [table[0] for table in tables]\n\ndb_list_tables(db)\n\n['questions', 'answers', 'questions_tags', 'users']\n\n\nTo see the fields in the table, if you’ve just queried the table, you can look at description:\n\n[item[0] for item in db.description]\n\n['name']\n\n\ndef get_fields():\n return [item[0] for item in db.description]\n\nHere’s how to make a basic SQL query. One can either make the query and get the results in one go or make the query and separately fetch the results. Here we’ve selected the first five rows (and all columns, based on the * wildcard) and brought them into Python as list of tuples.\n\nresults = db.execute(\"select * from questions limit 5\").fetchall() # simple query \ntype(results)\n\n<class 'list'>\n\ntype(results[0])\n\n<class 'tuple'>\n\nquery = db.execute(\"select * from questions\") # simple query \nresults2 = query.fetchmany(5)\nresults == results2\n\nTrue\n\n\nTo disconnect from the database:\n\ndb.close()\n\nIt’s convenient to get a Pandas dataframe back as the result. To that we can execute queries like this:\n\nimport pandas as pd\nresults = pd.read_sql(\"select * from questions limit 5\", con)", "crumbs": [ "Units", "Unit 7 (Databases)" @@ -1949,7 +1949,7 @@ "href": "units/unit7-bigData.html#creating-database-tables", "title": "Big data and databases", "section": "Creating database tables", - "text": "Creating database tables\nOne can create tables from within the ‘sqlite’ command line interfaces (discussed in the tutorial), but often one would do this from Python or R. 
Here’s the syntax from Python, creating the table from a Pandas dataframe.\n\n## create data frame 'student_data' in some fashion\ncon = sq.connect(db_path)\nstudent_data.to_sql('student', con, if_exists='replace', index=False)", + "text": "Creating database tables\nOne can create tables from within the ‘sqlite’ command line interfaces (discussed in the tutorial), but often one would do this from Python or R. Here’s the syntax from Python, creating the table from a Pandas dataframe.\n\n## Create data frame 'student_data' in some fashion.\ncon = sq.connect(db_path)\nstudent_data.to_sql('student', con, if_exists='replace', index=False)", "crumbs": [ "Units", "Unit 7 (Databases)" @@ -2058,7 +2058,7 @@ "href": "labs/lab6-scfparallel.html#logging-in", "title": "Lab 6: SCF and Parallel Computing", "section": "Logging in", - "text": "Logging in\nThe SCF has a number of login nodes which you can access via ssh.\n\n\n\n\n\n\nNote\n\n\n\nFor info on using ssh (including on Windows), see here.\n\n\nFor example, I’ll use ssh to connect to the gandalf node:\njames@pop-os:~$ ssh <scf-username>@gandalf.berkeley.edu\nThe authenticity of host 'gandalf.berkeley.edu (128.32.135.47)' can't be established.\nED25519 key fingerprint is SHA256:io5uUQGbCZie78mF+UUZ5guDK29JXQGQ6LVB129UoUo.\nThis key is not known by any other names\nAre you sure you want to continue connecting (yes/no/[fingerprint])? yes\nWarning: Permanently added 'gandalf.berkeley.edu' (ED25519) to the list of known hosts.\nNotice that upon first connecting to a server you haven’t visited there is a warning that the “authenticity of the host … can’t be established”. So long as you have typed in the hostname correctly (gandalf.berkeley.edu, in this case), and trust the host (we trust the SCF!) then you can type yes to add the host to your known hosts file (found on your local machine at ~/.ssh/known_hosts).\nYou’ll then be asked to enter your password for the SCF cluster. For privacy, you won’t see anything happen in your terminal when you type it in, so type carefully (you can use Backspace if you make a mistake) and press Enter when you’re done. If you were successful, you should see a welcome message and your shell prompt, like:\n<scf-username>@gandalf:~$\nTo get your bearings, you can type pwd to see where your home directory is located on the SCF cluster filesystem:\n<scf-username>@gandalf:~$ pwd\n/accounts/grad/<scf-username>\nYour home directory is likely also in the /accounts/grad/ directory, as mine is.\n\nOther login nodes\n\n\n\n\n\n\nImportant\n\n\n\nDon’t run computationally intensive tasks on the login nodes!\nThey are shared by all the SCF users, and should only be used for non-intensive interactive work such as job submission and monitoring, basic compilation, managing your disk space, and transferring data to/from the server.\n\n\nIf for some reason gandalf is not working for you, the SCF has a number of nodes which can be accessed from your local machine with commands of the form ssh <scf-username>@<hostname>.berkeley.edu. 
Currently, these are:\n\naragorn\narwen\ndorothy\ngandalf\ngollum\nhermione\nquidditch\nradagast\nshelob", + "text": "Logging in\nThe SCF has a number of login nodes which you can access via ssh.\n\n\n\n\n\n\nNote\n\n\n\nFor info on using ssh (including on Windows), see here.\n\n\nFor example, I’ll use ssh to connect to the gandalf node:\njames@pop-os:~$ ssh <scf-username>@gandalf.berkeley.edu\nThe authenticity of host 'gandalf.berkeley.edu (128.32.135.47)' can't be established.\nED25519 key fingerprint is SHA256:io5uUQGbCZie78mF+UUZ5guDK29JXQGQ6LVB129UoUo.\nThis key is not known by any other names\nAre you sure you want to continue connecting (yes/no/[fingerprint])? yes\nWarning: Permanently added 'gandalf.berkeley.edu' (ED25519) to the list of known hosts.\nNotice that upon first connecting to a server you haven’t visited there is a warning that the “authenticity of the host … can’t be established”. So long as you have typed in the hostname correctly (gandalf.berkeley.edu, in this case), and trust the host (we trust the SCF!) then you can type yes to add the host to your known hosts file (found on your local machine at ~/.ssh/known_hosts).\nYou’ll then be asked to enter your password for the SCF cluster. For privacy, you won’t see anything happen in your terminal when you type it in, so type carefully (you can use Backspace if you make a mistake) and press Enter when you’re done. If you were successful, you should see a welcome message and your shell prompt, like:\n<scf-username>@gandalf:~$\nTo get your bearings, you can type pwd to see where your home directory is located on the SCF cluster filesystem:\n<scf-username>@gandalf:~$ pwd\n/accounts/grad/<scf-username>\nYour home directory is likely also in the /accounts/grad/ directory, as mine is.\n\nOther login nodes\n\n\n\n\n\n\nImportant\n\n\n\nDon’t run computationally intensive tasks on the login nodes!\nThey are shared by all the SCF users, and should only be used for non-intensive interactive work such as job submission and monitoring, basic compilation, managing your disk space, and transferring data to/from the server.\n\n\nIf for some reason gandalf is not working for you, the SCF has a number of nodes which can be accessed from your local machine with commands of the form ssh <scf-username>@<hostname>.berkeley.edu. Currently, these are:\n\naragorn\narwen\ndorothy\ngandalf\ngollum\nhermione\nquidditch\nradagast\nshelob\n\nNote that all those machines are stand-alone login nodes. After logging in and launching a job, the job will be run on a cluster node. Usually, users cannot ssh into cluster nodes and the process of launching a job is via slurm from one of the login machines. However, if one has a job running on a cluster node, that specific node becomes accessible via ssh. 
This can be helpful with troubleshooting.", "crumbs": [ "Labs", "Lab 6 (SCF and parallelization)" diff --git a/sitemap.xml b/sitemap.xml index b05c934..27abc68 100644 --- a/sitemap.xml +++ b/sitemap.xml @@ -90,7 +90,7 @@ https://stat243.berkeley.edu/fall-2024/units/unit7-bigData.html - 2024-10-10T17:03:27.902Z + 2024-10-10T17:40:09.896Z https://stat243.berkeley.edu/fall-2024/license.html @@ -106,7 +106,7 @@ https://stat243.berkeley.edu/fall-2024/labs/lab6-scfparallel.html - 2024-10-08T16:50:13.029Z + 2024-10-10T17:44:19.520Z https://stat243.berkeley.edu/fall-2024/office_hours.html diff --git a/units/regex.pdf b/units/regex.pdf index f433000..ec9582d 100644 Binary files a/units/regex.pdf and b/units/regex.pdf differ diff --git a/units/unit1-intro.pdf b/units/unit1-intro.pdf index f5c9d5f..a32e897 100644 Binary files a/units/unit1-intro.pdf and b/units/unit1-intro.pdf differ diff --git a/units/unit2-dataTech.pdf b/units/unit2-dataTech.pdf index ec1b064..37a95e7 100644 Binary files a/units/unit2-dataTech.pdf and b/units/unit2-dataTech.pdf differ diff --git a/units/unit3-bash.pdf b/units/unit3-bash.pdf index a7b56a3..2fa54c4 100644 Binary files a/units/unit3-bash.pdf and b/units/unit3-bash.pdf differ diff --git a/units/unit4-goodPractices.pdf b/units/unit4-goodPractices.pdf index f7ea5b6..52f14fc 100644 Binary files a/units/unit4-goodPractices.pdf and b/units/unit4-goodPractices.pdf differ diff --git a/units/unit5-programming.pdf b/units/unit5-programming.pdf index 76f31b0..2fd0634 100644 Binary files a/units/unit5-programming.pdf and b/units/unit5-programming.pdf differ diff --git a/units/unit6-parallel.pdf b/units/unit6-parallel.pdf index 13c13b0..c0cfd18 100644 Binary files a/units/unit6-parallel.pdf and b/units/unit6-parallel.pdf differ diff --git a/units/unit7-bigData.html b/units/unit7-bigData.html index a69899c..b354100 100644 --- a/units/unit7-bigData.html +++ b/units/unit7-bigData.html @@ -567,9 +567,13 @@

    Dask dataframes (pa
    compression = 'bz2',
    encoding = 'latin1', # (unexpected) latin1 value(s) in TailNum field in 2001
    dtype = {'Distance': 'float64', 'CRSElapsedTime': 'float64',
    -            'TailNum': 'object', 'CancellationCode': 'object', 'DepDelay': 'float64'})
    -# specify some dtypes so Pandas doesn't complain about column type heterogeneity
    -air
    +            'TailNum': 'object', 'CancellationCode': 'object', 'DepDelay': 'float64',
    +            'ActualElapsedTime': 'float64', 'ArrDelay': 'float64', 'ArrTime': 'float64',
    +            'DepTime': 'float64'})
    +## Specify some dtypes so Pandas doesn't complain about heterogeneity of ints and
    +## floats within certain columns.
    +air
    +air.npartitions

    Dask will read the data in parallel from the various .csv.bz2 files (unzipping on the fly), but note the caveat in the previous section about the possibilities for truly parallel I/O.

    However, recall that Dask uses delayed (lazy) evaluation. In this case, the reading is delayed until compute() is called. For that matter, the various other calculations (max, groupby, mean) shown below are only done after compute() is called.

    @@ -577,11 +581,11 @@

    Dask dataframes (pa
    import time
     
     t0 = time.time()
    -air.DepDelay.max().compute()   # this takes a while
    +air.DepDelay.max().compute()   # This takes a while.
     print(time.time() - t0)
     
     t0 = time.time()
    -air.DepDelay.mean().compute()   # this takes a while
    +air.DepDelay.mean().compute()   # This takes a while.
     print(time.time() - t0)
     
     air.DepDelay.median().compute() 
    @@ -591,7 +595,7 @@

    Dask dataframes (pa
    sub = air[(air.UniqueCarrier == 'UA') & (air.Origin == 'SFO')]
     byDest = sub.groupby('Dest').DepDelay.mean()
    -results = byDest.compute()            # this takes a while too
    +results = byDest.compute()            # This takes a while too.
     results

    You should see this:

    @@ -630,7 +634,7 @@

    Dask bags

    import dask.bag as db
    ## This is the full data
    ## path = '/scratch/users/paciorek/wikistats/dated_2017/'
    -## For demo we'll just use a small subset
    +## For demo we'll just use a small subset.
    path = '/scratch/users/paciorek/wikistats/dated_2017_small/dated/'
    wiki = db.read_text(path + 'part-0*gz')
    @@ -644,23 +648,25 @@

    Dask bags

    And here is a more realistic example of filtering (subsetting).

    import re
    -def find(line, regex = 'Armenia'):
    -    vals = line.split(' ')
    -    if len(vals) < 6:
    -        return(False)
    -    tmp = re.search(regex, vals[3])
    -    if tmp is None:
    -        return(False)
    -    else:
    -        return(True)
    -    
    -
    -wiki.filter(find).count().compute()
    -armenia = wiki.filter(find)
    -smp = armenia.take(100) ## grab a handful as proof of concept
    -smp[0:5]
    -
    -

    Note that it is quite inefficient to do the find() (and implicitly reading the data in) and then compute on top of that intermediate result in two separate calls to compute(). Rather, we should set up the code so that all the operations are set up before a single call to compute(). This is discussed in detail in the Dask/future tutorial.

    +
    +def find(line, regex = 'Armenia'):
    +    vals = line.split(' ')
    +    if len(vals) < 6:
    +        return(False)
    +    tmp = re.search(regex, vals[3])
    +    if tmp is None:
    +        return(False)
    +    else:
    +        return(True)
    +
    +
    +wiki.filter(find).count().compute()
    +
    +armenia = wiki.filter(find)
    +smp = armenia.take(100) # Grab a handful as proof of concept.
    +smp[0:5]
    +
    +

    Note that it is quite inefficient to do the find() (and implicitly read the data in) and then compute on top of that intermediate result in two separate calls to compute(). Rather, we should set up the code so that all the operations are set up before a single call to compute(). This is discussed in detail in the Dask/future tutorial.

    Since the data are just treated as raw strings, we might want to introduce structure by converting each line to a tuple and then converting to a data frame.

    def make_tuple(line):
    @@ -674,10 +680,11 @@ 

    Dask bags

    df = armenia.map(make_tuple).to_dataframe(dtypes)
    type(df)
    -## Now let's actually do the computation, returning a Pandas df
    -result = df.compute() 
    -type(result)
    -result[0:5]
    +## Now let's actually do the computation,
    +## returning a **Pandas** (not Dask) df to the main process.
    +result = df.compute() 
    +type(result)
    +result[0:5]
    @@ -705,7 +712,7 @@

    Non-distributed arr
    import dask
     dask.config.set(scheduler='threads', num_workers = 4)  
     import dask.array as da
    -# x = da.from_array(x, chunks=(2500, 40000))  # adjust chunk size of existing array
    +# x = da.from_array(x, chunks=(2500, 40000))  # How to adjust chunk size of existing array.
     x = da.random.normal(0, 1, size=(40000,40000), chunks=(2500, 40000))
     mycalc = da.mean(x, axis = 1)  # row means
     import time
    @@ -724,18 +731,18 @@ 

    Non-distributed arr
    t0 = time.time()
    x = np.random.normal(0, 1, size=(40000,40000))
    time.time() - t0 # 110 sec.
    -# for some reason the from_array and da.mean calculations are not done lazily here
    +# For some reason the `from_array` and `da.mean` calculations are not done lazily here.
    t0 = time.time()
    dx = da.from_array(x, chunks=(2500, 40000))
    time.time() - t0 # 27 sec.
    t0 = time.time()
    -mycalc = da.mean(x, axis = 1) # what is this doing given .compute() also takes time?
    +mycalc = da.mean(x, axis = 1) # What is this doing given .compute() also takes time?
    time.time() - t0 # 28 sec.
    t0 = time.time()
    rs = mycalc.compute()
    time.time() - t0 # 21 sec.

    -

    Dask will avoid storing all the chunks in memory. (It appears to just generate them on the fly.) Here we have an 80 GB array but we never use more than a few GB of memory (based on top or free -h).

    +

    Dask will avoid storing all the chunks in memory. (It appears to just generate them on the fly.) Here we have an 80 GB array, but we never use more than a few GB of memory (based on top or free -h).

    import dask
     dask.config.set(scheduler='threads', num_workers = 4)  
    @@ -761,7 +768,7 @@ 

    3. Databases

    This material is drawn from the tutorial on Working with large datasets in SQL, R, and Python, though I won’t hold you responsible for all of the database/SQL material in that tutorial, only what appears here in this Unit.

    Overview

    -

    Basically, standard SQL databases are relational databases that are a collection of rectangular format datasets (tables, also called relations), with each table similar to R or Pandas data frames, in that a table is made up of columns, which are called fields or attributes, each containing a single type (numeric, character, date, currency, enumerated (i.e., categorical), …) and rows or records containing the observations for one entity. Some of the tables in a given database will generally have fields in common so it makes sense to merge (i.e., join) information from multiple tables. E.g., you might have a database with a table of student information, a table of teacher information and a table of school information, and you might join student information with information about the teacher(s) who taught the students. Databases are set up to allow for fast querying and merging (called joins in database terminology).

    +

    Standard SQL databases are relational databases that are a collection of rectangular format datasets (tables, also called relations), with each table similar to R or Pandas data frames, in that a table is made up of columns, which are called fields or attributes, each containing a single type (numeric, character, date, currency, enumerated (i.e., categorical), …) and rows or records containing the observations for one entity. Some of the tables in a given database will generally have fields in common so it makes sense to merge (i.e., join) information from multiple tables. E.g., you might have a database with a table of student information, a table of teacher information and a table of school information, and you might join student information with information about the teacher(s) who taught the students. Databases are set up to allow for fast querying and merging (called joins in database terminology).

    Memory and disk use

    Formally, databases are stored on disk, while Python and R store datasets in memory. This would suggest that databases will be slow to access their data but will be able to store more data than can be loaded into a Python or R session. However, databases can be quite fast due in part to disk caching by the operating system as well as careful implementation of good algorithms for database operations.

    @@ -774,7 +781,8 @@

    Interacting wi

    Many DBMS have a client-server model. Clients connect to the server, with some authentication, and make requests (i.e., queries).

    There are often multiple ways to interact with a DBMS, including directly using command line tools provided by the DBMS or via Python or R, among others.

    We’ll concentrate on SQLite (because it is simple to use on a single machine). SQLite is quite nice in terms of being self-contained - there is no server-client model, just a single file on your hard drive that stores the database and to which you can connect using the SQLite shell, R, Python, etc. However, it does not have some useful functionality that other DBMS have. For example, you can’t use ALTER TABLE to modify column types or drop columns.

    -

    A good alternative to SQLite that I encourage you to consider is DuckDB. DuckDB stores data column-wise, which can lead to big speedups when doing queries operating on large portions of tables (so-called “online analytical processing” (OLAP)). Another nice feature of DuckDB is that it can interact with data on disk without always having to read all the data into memory. In fact, ideally we’d use it for this class, but I haven’t had time to create a DuckDB version of the StackOverflow database.

    +

    A good alternative to SQLite that I encourage you to consider is DuckDB. DuckDB stores data column-wise, which can lead to big speedups when doing queries operating on large portions of tables (so-called “online analytical processing” (OLAP)). Another nice feature of DuckDB is that it can interact with data on disk without always having to read all the data into memory.

    +

    In the demo code, we’ll have the option to use either SQLite or DuckDB.

    Database schema and normalization

    @@ -940,7 +948,7 @@

    Stack Over

    You can view one reasonable schema. The lines between tables indicate the relationship of foreign keys in one table to primary keys in another table. The schema in the actual database of Stack Overflow data we’ll use in the examples here is similar to but not identical to that.

    -

    You can download a copy of the SQLite version of the Stack Overflow 2021 database.

    +

    You can download a copy of the SQLite version of the Stack Overflow 2021 database or a copy of the DuckDB version of the Stack Overflow 2021 database.

    Accessing databases in Python

    @@ -951,29 +959,30 @@

    Accessing da
    import sqlite3 as sq
     dir_path = '../data'  # Replace with the actual path
     db_filename = 'stackoverflow-2021.db'
    -## download from http://www.stat.berkeley.edu/share/paciorek/stackoverflow-2021.db
    +## Download from http://www.stat.berkeley.edu/share/paciorek/stackoverflow-2021.db.
     
     con = sq.connect(os.path.join(dir_path, db_filename))
     db = con.cursor()
     db.execute("select * from questions limit 5")  # simple query 
    -
    <sqlite3.Cursor object at 0x7f8202e37640>
    +
    <sqlite3.Cursor object at 0x7f1bc801f640>
    db.fetchall() # retrieve results
    [(65534165.0, '2021-01-01 22:15:54', 0.0, 112.0, 2.0, 0.0, None, "Can't update a value in sqlite3", 13189393.0), (65535296.0, '2021-01-02 01:33:13', 2.0, 1109.0, 0.0, 0.0, None, 'Install and run ROS on Google Colab', 14924336.0), (65535910.0, '2021-01-02 04:01:34', -1.0, 110.0, 1.0, 8.0, 0.0, 'Operators on date/time fields', 651174.0), (65535916.0, '2021-01-02 04:03:20', 1.0, 35.0, 1.0, 0.0, None, 'Plotting values normalised', 14695007.0), (65536749.0, '2021-01-02 07:03:04', 0.0, 108.0, 1.0, 5.0, None, 'Export C# to word with template', 14899717.0)]
    -

    Alternatively, we could use DuckDB. However, I don’t have a DuckDB version of the StackOverflow database, so one can’t actually run this code.

    +

    Alternatively, we could use DuckDB.

    import duckdb as dd
     dir_path = '../data'  # Replace with the actual path
    -db_filename = 'stackoverflow-2021.duckdb'  # This doesn't exist.
    -
    -con = dd.connect(os.path.join(dir_path, db_filename))
    -db = con.cursor()
    -db.execute("select * from questions limit 5")  # simple query 
    -db.fetchall() # retrieve results
    +db_filename = 'stackoverflow-2021.duckdb'
    +## Download from http://www.stat.berkeley.edu/share/paciorek/stackoverflow-2021.duckdb.
    +
    +con = dd.connect(os.path.join(dir_path, db_filename))
    +db = con.cursor()
    +db.execute("select * from questions limit 5")  # simple query 
    +db.fetchall() # retrieve results

    We can (fairly) easily see the tables (this is easier from R):

    @@ -1400,7 +1409,7 @@

    Subqueri

    Creating database tables

    One can create tables from within the ‘sqlite’ command line interfaces (discussed in the tutorial), but often one would do this from Python or R. Here’s the syntax from Python, creating the table from a Pandas dataframe.

    -
    ## create data frame 'student_data' in some fashion
    +
    ## Create data frame 'student_data' in some fashion.
     con = sq.connect(db_path)
     student_data.to_sql('student', con, if_exists='replace', index=False)
diff --git a/units/unit7-bigData.pdf b/units/unit7-bigData.pdf
index 867aa9e..ee6a9d4 100644
Binary files a/units/unit7-bigData.pdf and b/units/unit7-bigData.pdf differ