Skip to content

sdam-elte/data-exp-vis-2024

Repository files navigation

Course material for Data Exploration and Visualisation

2024 spring semester

This course is part of the Scientific Data Analytics and Modelling Programme. The aim of the course is that students gain practical skills in accessing large databases/datasets, handling data stored in different formats, exploring/cleaning these data and presenting the gathered information. During the course students will come across databases used in various scientific fields. Completing of the several projects allows students to gain useful experiences that will give a firm foundation for later courses on theoretical datamining and advanced computing laboratories. In this course we intend to introduce state of the art tools and methods for data exploration and visualization. A library in a given programming language that we use today might become obsolete in a year or become the standard for a longer time. This field evolves rapidly and therefore the emphasis is not on the exact tool for solving a task, but the notions and algorithms.

There is a useful tutorial into python and some data transformation/analysis methods http://bokae.web.elte.hu/numerical_methods.html), which gives a wide background knowledge, that comes handy. Also for those who prefer the language R there are several tutorials.

During the course every sample code will be shown in jupyter notebooks, which can be accessed on the K8plex Edu platform and in this Github repository.

The course is held in the North Building in computer lab 5.56 every Tuesday between 12h-14h.

Course info

Neptun code: dsexplorf20vm
Instructor: Dávid Visontai
Semester: spring
Type: Lecture + Practice
Credit points: 6
Prerequisites: programming in either python, R - using Jupyter notebooks or RMarkdown

Each occasion starts with approximately a twenty minute of introduction into the current topic. After that everyone can work on the assignments and the consultants will be available for help with the any related problems and questions. For those who lack experience in either programming or data analysis it is highly recommended to attend the lecture and practice sessions. There are many tricks and tips that can be learned much quicker from personal discussions.

The schedule of the course

1, 13.02.2024. Introduction to K8plex Edu, Jupyter Notebooks, Requirements for the course, Timeseries analysis and USGS water discharge statistics
2, 20.02.2024. Maps, shapes, coordinates
3, 27.02.2024. Review the basics, Python, Pandas, etc.
4, 12.03.2024. Image exploration
5, 19.03.2024. Network exploration
6, 26.03.2024. SQL queries
7, 09.04.2024. Natural Language Processing
8, 16.04.2024. Datacleaning, and Filtering, Reproducibility
9, 23.04.2024. Interactive Visualization
10, 30.04.2024. REST services
11, 07.04.2024. Working with large datasets
12, 14.05.2024. ** Away for conference, no lecture**

Covered topics

  • Datatypes, images, tables, graphs, textual data
  • Standards of file- and dataformats (csv, hdf5, netcdf)
  • Raw and processed data, metadata, data cleaning
  • Access data locally and through the web (APIs, HTTP protocol)
  • Access of scientific databases, Usage of relational databases (SQL)
  • Transforming data, sorting, combining pandas
  • Basics of timeseries analysis
  • Handling datasets with geolocation (shapely, folium, geoviews)
  • Basics of image processing (opencv)
  • Dimension reduction, clustering
  • Processing textual data, logs (natural language processing)
  • Infographics, visualisation (html, css, javascript libraries)
  • Interactive dataexplorative tools (ipywidgets, bokeh, holoviews)
  • Creating applications with Streamlit
  • Developing open source softwares, reproducible research (OSF)

Assignments, the expected output format and the name of the corrector

  1. Following John Snow- HTML - Ágnes Becsei

  2. USGS water discharge statistics - HTML - Dávid Visontai

  3. SQL queries

  4. Interactive Visualization

  5. Application

    • a) REST API - (REST service/API) - Dávid Visontai
    • **b) **
  6. Network exploration - HTML - Dávid Visontai

  7. Data extraction from images - HTML - József Stéger

  8. Natural Language Processing on tweets - HTML - !!! Ex 7. and 8. is optional. - Eszter Bokányi

You have to submit 8 assignment (choose one from each topic) All assigments should be converted into HTML unless stated otherwise!

The submitted results should look like a report, which can be generated from the notebooks. This will be explained on the lecture. All figures should have labels, title, each exercise should end with a descriptive conclusion and explanatory comments should be inserted into the code. These are must have features of a work that is intended for presentation.

(Final) Submission deadline

The assignments/projects will be released in the first couple of weeks, which you have to submit after the end of the course 24th June 1PM (this is the last deadline). Each submitted assignment will be corrected and a feedback will be given. After that you may resubmit your assignment once more but before the final deadline.

Where to work on the assignments?

https://k8plex-edu.elte.hu/hub is where the notebooks will be handed out. It is available for all students with a valid Neptun or caesar account. Once you launch your environment you will find a folder with the course material. The notebooks will be available in this Github repository as well. We will explain how to use the portal on the first lecture. In case there is any problem with the portal you can run a notebook server locally on any other computer and upload your work later.

Grading

Each assignment will be corrected after submission and a maximum of 20 points will be given for it. 10 for all the completed tasks, 10 for the quality of the submitted report (look, clarity and comments). The final grades will be given according to the following point system:
0 - 50: failed
51 - 70: 2
71 - 94: 3
95 - 115: 4
116 - : 5

Folder structure of the working environment

  • Assignments will be visible in /v/courses/dataexp2024.assignments
  • Example Datasets will be in the /v/courses/dataexp2024.public/Datasets directory you will find datasets, that you can work with.
  • Other lecture materials will be in /v/courses/dataexp2024.public/<lecture_name> and on this github page

Learning materials

Recommended readings

  • Wes McKinney: Python for Data Analysis, (O’Reilly 2013)
  • Joel Grus: Data Science from Scratch (O’Reilly 2015)

Simulation and data visualizations

List of tutors

  • Ágnes Becsei
  • Anikó Mentes
  • Krisztián Papp
  • József Stéger
  • Dávid Visontai

About

Data exploration and visualization 2024

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published