Usability: Allow pure folder-based access and storage of AiiDA profiles #22

Open
giovannipizzi opened this issue May 23, 2024 · 6 comments
Labels
roadmap/proposed A roadmap item that has been proposed but not yet processed

Comments

@giovannipizzi
Member

Motivation

Even advanced users might not fully understand where data is stored (DB, repository, configuration file, ...). This is currently also virtual-environment based. If I want to know all profiles on my computer, it's hard to keep track of them if I'm not very organised.
A user might just want to know that everything about a profile is inside a folder (as it happens for a git repo, everything is inside a .git folder), and if I move the folder, I'm moving everything; if I delete a folder, everything is gone; etc.

Desired Outcome

It should be possible to define a profile that is fully confined to a folder (including all data). This should be easy for the SQLite DB at least (notes in a comment below for PSQL). It should then be easy to use that profile simply by navigating into the folder.
In the future, similar to a git checkout, one could have a way to mirror part of the data in the folder, at least in read-only mode, so people can just use usual file browsers, grep commands etc, to understand what the folder is about. Syncing to this folder does not have to be realtime, but can happen when a command is invoked, similar to git pull, e.g. verdi sync .... This could e.g. be in the form of an extended version of the verdi process dump command, that instead dumps all nodes inside a given group. And, when a verdi group sync --all command is run, it refreshes the dump files to ensure they are up to date with the AiiDA DB.

Impact

I think many users have a hard time understanding the concept of profiles, where data is stored, how to delete a profile and back it up (even if we provide commands), what to do when disk storage is running low, etc. A folder-based approach can help a lot when starting with AiiDA, while giving users the feeling of having full control of their data.

Complexity

  • we need to find a way to make verdi commands use the correct folder, ideally based on the folder in which we are (similarly to git, so it significantly reduces typing as you don't have to specify a specific profile at every verdi command - and it's familiar to most users who e.g. understand git), and make sure this is not incompatible with the way AiiDA is currently used
  • Once this is in place, it is quite likely that someone will try to access the same folder-based profile from different virtual envs. We therefore need to make sure, for example, that any profile created with a given major AiiDA version can be accessed without problems from any other AiiDA version with the same major version. Things to consider, for example:
    • Do we still need DB migrations (at least within a major version)? My suggestion is no: in 2.x we only have two, which drop the cache and are not crucial. We should also probably start using DB migrations only for schema changes, and not for data migrations (that we shouldn't do anymore?). DB migrations can also be reduced, if we really need to change the schema, by defining a new storage backend with an improved schema. The migration is then replaced by a "transition" script to move from one DB profile to another.
    • ensure that the config file is not automatically migrated by AiiDA, and is compatible within a major version. And for new major versions, either it can still run using the old config file (without migration), or will just show a message that you need to use version X of AiiDA.
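The "same major version" rule in the last two points could be checked along these lines. This is only a minimal sketch: the function names are hypothetical, nothing here is existing AiiDA API, and it assumes plain dotted version strings such as "2.5.1" are stored with the profile.

```python
from typing import Optional


def major_version(version: str) -> int:
    """Return the major component of a dotted version string such as '2.5.1'."""
    return int(version.split(".")[0])


def check_profile_compatibility(profile_version: str, running_version: str) -> Optional[str]:
    """Return None if both versions share a major version, else a message for the user.

    Hypothetical helper: no migration is attempted, the user is only told
    which AiiDA series is needed to open the profile.
    """
    required = major_version(profile_version)
    if required == major_version(running_version):
        return None
    return (
        f"This profile was created with AiiDA {profile_version}; "
        f"please use an AiiDA {required}.x installation to open it."
    )
```

With such a check, a profile is never modified on open: either it is used as-is, or the command stops with a message.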

Progress

Being discussed/brainstorming.

@giovannipizzi giovannipizzi added the roadmap/proposed A roadmap item that has been proposed but not yet processed label May 23, 2024
@giovannipizzi
Member Author

Here are some steps to create a minimal running PSQL in user space, confined in a folder. The idea is that probably we could consider this, at least for the folder-based approach?

The idea of this message here is just to show that it is actually possible to have a folder-based approach even with PSQL, with some caveats.

Here are some steps to create and use a new PSQL DB locally, as a standard user, without ports but just Unix sockets.

STEP 1: create an empty scaffolding folder for PSQL.

As a folder I use pwd for simplicity. It should of course be in a place like ./.aiida/psql_db/

  • Note: all of this can be run in user space!
  • Note: for this test, I'm assuming there are no white spaces in the path; otherwise you need to escape PWD, especially for the postgresql.conf file
  • Note: in reality it might be good to explicitly specify at least some of these flags to the init command (check man initdb):
    --encoding= --locale= --pwfile= --username=postgres
pg_ctl init -D `pwd`

STEP 2: Minimal configuration of the new PSQL instance

I create a sockets dir inside the same folder, and make sure that only sockets are used and that the sockets go in the folder just created

mkdir sockets_dir
echo "listen_addresses = ''" >> postgresql.conf
echo "unix_socket_directories = '`pwd`/sockets_dir'" >> postgresql.conf

STEP 3: Use this user-space, folder-based PSQL

I can start, check the status, and stop the PSQL server with these commands.

pg_ctl start -l logfile -D `pwd`
pg_ctl -D `pwd` status
pg_ctl -D `pwd` stop

Further notes:

  • A typical socket file will look like: sockets_dir/.s.PGSQL.5432
  • You can connect via psql, specifying the socket dir as the host:
    psql -h /Users/pizzi/tmp/test-local-psql/sockets_dir template1
    
    (you can e.g. check DBs, create a new one, try out things etc.)
  • AiiDA could take care of having a command to start/stop psql, or even do it semiautomatically (e.g. verdi status could check and suggest which command to run, e.g. verdi storage startdb, that does nothing for e.g. a sqlite profile, but runs this command for a PSQL)
  • One needs to be careful to inform users of possible caveats, e.g.: do not move the folder in general (paths are hardcoded, also in AiiDA, which can use sockets IIRC), do not move the folder if you have the daemon running, etc.
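If AiiDA were to drive these steps itself, the commands and settings above could be built roughly as follows. This is a sketch only, with hypothetical helper names: it builds argument lists (so no shell escaping of the folder is needed when passing them to subprocess.run) and quotes the socket path for postgresql.conf by doubling single quotes. Directory names containing commas or unusual whitespace may need additional double quotes inside the value, which is not handled here.

```python
from pathlib import Path


def quote_conf_value(value: str) -> str:
    """Quote a string value for postgresql.conf; embedded single quotes are doubled."""
    return "'" + value.replace("'", "''") + "'"


def socket_only_settings(data_dir: Path) -> list[str]:
    """Lines from STEP 2: disable TCP listening and keep the socket inside the folder."""
    sockets_dir = data_dir / "sockets_dir"
    return [
        "listen_addresses = ''",
        f"unix_socket_directories = {quote_conf_value(str(sockets_dir))}",
    ]


def pg_ctl_command(action: str, data_dir: Path) -> list[str]:
    """Build the pg_ctl argument list for 'init', 'start', 'status' or 'stop'."""
    command = ["pg_ctl", action, "-D", str(data_dir)]
    if action == "start":
        # Same log file as in STEP 3, but with an explicit path inside the data folder.
        command += ["-l", str(data_dir / "logfile")]
    return command
```

A hypothetical start/stop command in verdi could then call, e.g., subprocess.run(pg_ctl_command("start", data_dir), check=True) for a PSQL profile and do nothing for a SQLite one.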

@giovannipizzi
Member Author

pinging @mbercx since we discussed this today, @sphuber since we discussed this in the past, and also others like @unkcpz @khsrali @GeigerJ2 @agoscinski

@GeigerJ2

Thanks, @giovannipizzi, for the detailed write-up! Some preliminary notes:

  • Regarding having a profile localized, we could add an optional --local flag to the verdi profile setup command, at least for the psql_dos storage (as with SQLite it's already localized), which takes care of the necessary steps you outlined in the background (similar to how verdi quicksetup sets up PSQL)? I'm not familiar with pg_ctl so not sure how feasible/easy this would be.
  • I recall @mbercx's objections to the folder-based discovery, and a version implemented by @sphuber that would check recursively for the .aiida folder, similar to git's discovery mechanism (both in the verdi init PR discussion). I still think this would be a nice feature, if we can make it work such that AIIDA_PATH still takes precedence if it is set. Making verdi commands use the correct folder should be doable with the discovery in place, by checking Path.cwd(); making "sure this is not incompatible with the way AiiDA is currently used" will require some thought.
  • For the verdi sync command, I looked a bit into this. Mirroring processes to disk is straightforward with the new verdi process dump command. We should probably check and do it only for finished, sealed processes as verdi process dump currently doesn't have an option for incremental dumping (I'm actually not sure how the command behaves for running processes... I'd guess it just dumps the files that are there, and running again to update would require --overwrite). For other entities, we should define a schema for the resulting directory structure (I remember the idea of allowing users to specify this schema, e.g., via a YAML file). That is, how are groups handled, other entities that might be of interest (such as StructureData -> dump those to disk in a structures directory?), and further logic that determines the output directory structure. I guess these things will become clearer, as I'm working on this feature and syncing some profile data that contains certain elements of organization, e.g. groups.
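To make the "only finished, sealed processes" idea a bit more concrete, here is a rough sketch of how a sync could decide what still needs dumping. Everything in it is an assumption for illustration: ProcessRecord is a stand-in rather than an AiiDA class, and the process-<pk> directory naming is made up, not the layout verdi process dump uses.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ProcessRecord:
    """Hypothetical stand-in for a process node: an identifier and a sealed flag."""

    pk: int
    sealed: bool


def plan_sync(processes: list[ProcessRecord], mirror_root: Path) -> list[Path]:
    """Return the dump directories a sync would still need to create.

    Unsealed (possibly still running) processes are skipped, and so are
    processes whose directory already exists, since a sealed process no
    longer changes and does not need to be dumped again.
    """
    todo = []
    for process in processes:
        target = mirror_root / f"process-{process.pk}"  # assumed naming scheme
        if process.sealed and not target.exists():
            todo.append(target)
    return todo
```

This sidesteps incremental dumping of a single process: each sync only adds directories for newly sealed processes.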

@giovannipizzi
Member Author

Thanks for the comments! Just a follow-up comment on my own comments. What I wrote was just some thoughts and ideas. I'm happy to discuss whether, for psql, it's really safe to put everything in a folder. Maybe it creates more problems if people start to move the folder while the DB is running etc. To be discussed.

@mbercx
Member

mbercx commented Jun 5, 2024

Thanks @giovannipizzi! Just for context, I'm putting the verdi init PR here, which implemented a git-like folder discovery for the .aiida folder:

aiidateam/aiida-core#6315

As well as my rather extensive objections to this approach.

Just writing down my thoughts quickly, on the train and only have 5 mins. ^^

  1. Is there any reason why we would prefer .aiida-folder discovery over profile-via-folder discovery?

  2. One way I envision this to work is to give the user the option (perhaps literally via an option, but perhaps also as a different storage backend) to create a "localized" or "contained" profile in a folder. This would write a specific file to the top level of that folder (e.g. .aiida_profile) that we could then use to implement git-like directory-based profile discovery. I.e. the precedence would be:

    a. Profile specified in command via -p option.
    b. Folder-based discovery of the .aiida_profile file.
    c. Default configured profile in the .aiida directory.
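The precedence a > b > c above could look roughly like this. It is a sketch under assumptions: the function names are hypothetical, the marker is taken to be a regular file named .aiida_profile as suggested in point 2, and a local profile is identified simply by the folder containing the marker.

```python
from pathlib import Path
from typing import Optional

MARKER = ".aiida_profile"  # file name suggested in point 2 above


def find_marker(start: Path) -> Optional[Path]:
    """Walk from `start` up to the filesystem root; return the first marker file found."""
    start = start.resolve()
    for folder in (start, *start.parents):
        candidate = folder / MARKER
        if candidate.is_file():
            return candidate
    return None


def resolve_profile(cli_profile: Optional[str], cwd: Path, default_profile: str) -> str:
    """Pick a profile: explicit -p value, then folder-based discovery, then the default."""
    if cli_profile is not None:
        return cli_profile  # a. profile given on the command line
    marker = find_marker(cwd)
    if marker is not None:
        # b. assumption: the local profile is identified by the folder holding the marker
        return str(marker.parent)
    return default_profile  # c. default configured profile
```

Because the search walks up through parent folders, running a command from any subfolder of a contained profile would select that profile, as git does with .git.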

@giovannipizzi
Member Author

Hi,
I have no objection to using a different file/folder name for this (instead of .aiida), but I'd need to rediscuss why this is a problem (or read your objections again, if it's explained there).
But for 1, I think it's just more intuitive for people who want to work in parallel with multiple profiles. I think that at the moment, most people just use 1 profile because switching is not trivial (it takes a lot of time to set up correctly, and you need to use different terminals for each). With folder-based, we mirror git: no need to open a new terminal, just change folder (even a subfolder) and everything will apply to the new repository. Very simple, intuitive, and people are used to this with git and other tools, so it shouldn't be a surprising behaviour. And no need for complex setup.

It comes of course with implications for supporting work on various profiles, even if created with different AiiDA versions, without automated profile file changes etc. But I think it's OK: we can probably even commit to not making any migration within major versions, as well as having backward-compatible profile files within major versions (and in any case avoiding automatic migration of those, instead warning users that they are using a profile from an old, or too new, AiiDA version, with suggestions on how to proceed).
