
implementing a database interface #24

Open
teauxfu opened this issue Jun 8, 2021 · 2 comments · May be fixed by #25
Labels
enhancement (New feature or request), refactor (Something could be handled better)

Comments

@teauxfu
Member

teauxfu commented Jun 8, 2021

the issue

ScaleWiz currently manages data storage by looking for JSON files.

It occasionally performs bulk data mutation and serialization.
When updates to the data do occur, the entire data file must be written out and read back in again. This is inefficient.

Further, the lack of a centralized store means data can be duplicated 2-3 times in memory (TestHandler, ProjectEditor, EvaluationWindow), with each copy liable to be mutated by some UI widget. Problematic operations like this are currently disabled at the GUI layer.

Competing edits could result in data loss, so it would be better to just use a real database.
The gain in data safety should be worth the overhead of having to make database retrievals/insertions.

possible solutions

The two best options appear to be MongoDB and TinyDB.

MongoDB would be the more comprehensive, robust option. However, it requires talking to an actual instance of MongoDB, either locally or remotely. This makes ScaleWiz harder to install/set up initially, in exchange for some potential long-term benefits.

TinyDB is a less comprehensive option, but is user-friendly and easily extensible. It is pure Python and doesn't require talking to an external process/server instance, so ScaleWiz could still simply be pip installed.

Both options work by serializing data in a very dict-compatible format: MongoDB uses BSON and TinyDB uses JSON (or YAML, etc.). ScaleWiz models are already stored as JSON, so migrating to either should be pretty easy. Migrating from TinyDB to something like MongoDB later should be similarly easy if the need arises.

the work

Resolving this issue has a couple of implementation goals:

  1. structuring the database
  2. hooking up to the database
  3. porting the existing data to the new database

structure

One of the first and most important implementation details to sort out will be how to actually find the documents once they are in the database. Right now we rely on the user navigating the filesystem to show us where the relevant chunk of JSON is. This file's path is effectively its 'universal identifier'.

A possible solution is to give each Project a uuid, which could be stored as a hex string such as

1d849bddb8d14f4da6743770033e463f

We can generate these on the fly relatively cheaply, and they are almost guaranteed to be collision-free.
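
Generating one of these with the standard library is trivial; just a sketch of the idea:

import uuid

# a random 128-bit identifier; .hex drops the dashes so it can be
# stored as a plain 32-character string on the Project
project_uuid = uuid.uuid4().hex
print(project_uuid)  # e.g. '1d849bddb8d14f4da6743770033e463f'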

A TinyDB instance could be defined in the package's __init__.py for easy access. All modifications to it should be passed through the main tkinter thread, though (much like the GUI updates).
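
A rough sketch of what that might look like, assuming TinyDB is the choice; the attribute name database and the storage path here are hypothetical:

# scalewiz/__init__.py (hypothetical sketch)
from pathlib import Path

from tinydb import TinyDB

# single shared database instance for the whole package;
# all writes should still be funneled through the main tkinter thread
DB_PATH = Path.home() / ".scalewiz" / "db.json"
DB_PATH.parent.mkdir(parents=True, exist_ok=True)
database = TinyDB(DB_PATH)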

Getting started will be kind of awkward. Right now we rely on the user to navigate the filesystem to point us to the data; instead, we will need to be able to query for it reliably. We also need to be able to relay a visual representation of the available data to the user, probably with something like a treeview.
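
Querying by uuid instead of by file path could look something like this (using the hypothetical database instance from the sketch above):

from tinydb import Query

Project = Query()
# returns the matching document, or None if nothing has that uuid
doc = database.get(Project.uuid == "1d849bddb8d14f4da6743770033e463f")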

also see the difference between these possible schemas

https://github.com/teauxfu/scalewiz/blob/5523105262126654d62f7c37ed70d81190302bf6/sampledb.json#L1-L42

https://github.com/teauxfu/scalewiz/blob/5523105262126654d62f7c37ed70d81190302bf6/sampledb.yaml#L1-L33

JSON vs YAML as a format

hooking up

The current data cycle looks like this:

JSON file -> Python dict -> Python object (tkinter variable model) -> Python dict -> JSON

This isn't very elegant, but the tkinter variables are easy to add traces to (fire a callback when the variable is updated; see the sketch below).
This affords some nice functionality:

  • makes it easier to keep certain properties up to date (e.g. Test.name and Project.name)
  • it is easy to update chunks of the underlying data based on UI inputs (e.g. updating scores when adding blanks to a report)

related #21
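
For reference, the trace mechanism in question is tkinter's trace_add; a minimal sketch of firing a callback when a variable is updated:

import tkinter as tk

root = tk.Tk()
name = tk.StringVar()

def on_name_change(*args):
    # keep derived state (or, eventually, a database record) in sync here
    print("name is now", name.get())

# 'write' fires the callback every time the variable is set
name.trace_add("write", on_name_change)
name.set("Example Project")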

A possibly better two-way flow might look like

database (manages its own file, we don't care what kind) <-> Python object (tkinter vars)

A pain point in the UI previously has been keeping multiple copies of the same Project in memory, each with its own view, and having to do lots of file reading/writing to rebuild the UI:

  • the TestHandler, which must read experiment params from the Project
    • it occasionally appends Tests to the Project and can create new Projects, but never independently mutates the Project's metadata or its existing Tests
  • the ProjectEditor can make new Projects and read/write all of a Project's metadata, but cannot mutate existing Tests
  • the EvaluationWindow can read/write the list of Tests in a Project, and read/write a couple of Project metadata fields (plot and path, iirc)

The new database implementation should:

  1. avoid having to duplicate the Project object in this way
  2. read from the database to build the GUI
  3. update the database when appropriate

Updates could be done immediately, or in batches by keeping a memo of update operations and sending them all off when the user clicks Save, etc. Often it makes sense to let the user preview changes before persisting them.
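
A loose sketch of the batched approach, again assuming TinyDB; the names pending_updates, queue_update, and save_all are purely illustrative, and database is the shared instance sketched earlier:

from tinydb import Query

pending_updates = []  # list of (project uuid, dict of changed fields)

def queue_update(project_uuid, changes):
    # record what changed; nothing touches the database yet
    pending_updates.append((project_uuid, changes))

def save_all():
    # called when the user clicks Save; flush everything in one pass
    for project_uuid, changes in pending_updates:
        database.update(changes, Query().uuid == project_uuid)
    pending_updates.clear()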

porting

Whatever the implementation ends up being, it must provide a path to migrate older data. There seem to be two options here:

  1. migrate all the data up front, using a custom GUI/CLI tool to dynamically build a database from old JSON files. Kind of a pain but not really hard to do; more work up front, but overall a cleaner solution (see the sketch after this list).
  2. try to patch over pieces of JSON on the fly somehow, maybe adding an 'Import' style button/feature to ScaleWiz. This would require less work up front, but would certainly be tacky and add bloat.
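
A bare-bones sketch of option 1, assuming TinyDB and the uuid field described above; the function name and paths are placeholders:

import json
import uuid
from pathlib import Path

from tinydb import TinyDB

def migrate(json_dir, db_path):
    # build a database from a folder of old project JSON files
    db = TinyDB(db_path)
    for path in Path(json_dir).glob("*.json"):
        with path.open() as file:
            data = json.load(file)
        # tag each old project with a uuid so it can be found without its file path
        data.setdefault("uuid", uuid.uuid4().hex)
        db.insert(data)

# migrate("old_projects", "scalewiz_db.json")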

The data as it is currently stored doesn't use a specific JSON schema. At the Python/tkinter layer we just hope the right fields are there, using the dict's get method to try to find the right key while offering fallbacks, e.g.

with open('my_data.json') as file:
    data = json.load(file)  # json.load expects a file object, not a path
self.property.set(data.get('my_key', data.get('myKey')))

This is somewhat useful for kind-of correcting old JSON files on the fly, but it's also cumbersome, especially as each undocumented change to the non-existent schema has to be handled with extra Python logic.

@balacla
Collaborator

balacla commented Jun 8, 2021

After some reading, I noticed that TinyDB is restricted to single-threaded operations. While that is not in and of itself a problem, it might be worthwhile to look into an alternative such as SQLite, which could be useful in the long run if multithreading becomes necessary. SQLite would be slightly more future-proof while (hopefully) not requiring as much time to implement as MongoDB. SQLite also touts itself as being better equipped to handle tables/graphs, though it is unclear what those additional benefits are. If SQLite were shown to be the inferior choice after further investigation, implementing TinyDB sounds like a worthwhile addition!

@teauxfu
Member Author

teauxfu commented Jun 8, 2021

SQLite seems somewhat easier than setting up MongoDB and somewhat harder than working with TinyDB.
SQLite would probably be more robust and future-proof than TinyDB, but I'm not yet sure how much / if it's worth the awkwardness of using it. Properly using table relationships with SQLite would probably be really powerful -- something TinyDB can't do.

Since the UI components representing/updating the data will need to be changed anyway, having support for multi-threaded access is a nice option. Not sure if it will be necessary or appropriate to use, but nice to have.

At any rate, learning some SQL is probably a good thing, just on principle.

Still undecided, will do some more pondering

Related reading
sqlite access in vba
python docs for sqlite3
sqlite vs tinydb
sqlite json1

pandas.json_normalize should help with flattening dicts

import json
import sqlite3

# store dicts/lists as JSON text on the way in, and convert them
# back into Python objects when reading rows back out
sqlite3.register_converter('json1', json.loads)
sqlite3.register_adapter(dict, json.dumps)
sqlite3.register_adapter(list, json.dumps)

# PARSE_DECLTYPES makes sqlite3 apply the converter registered for a
# column's declared type ('json1' below)
db = sqlite3.connect(
    ':memory:', detect_types=sqlite3.PARSE_DECLTYPES
)

with db:
    db.execute(
        'create table docs (i integer primary key, d json1)'
    )

# the dict is adapted to a JSON string on insert...
doc1 = {'a': 1, 'b': ['x', 'y', 'z'], 'c': None}
db.execute('insert into docs (d) values (?)', (doc1,))

# ...and converted back into an equal dict on select
doc2 = db.execute('select d from docs').fetchone()[0]
assert doc1 == doc2
