Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add function to modify path-like environment variable in a context #4681

Open
wants to merge 8 commits into
base: 5.0.x
Choose a base branch
from

Conversation

Crivella
Copy link
Contributor

The proposed function wrap_path_env would have the utility to allow running commands or functions with modified values for PATH/LD_LIBRARY_PATH/...

The function allows to both append and prepend to a path-like environment variables, using by default the default system separator.
A custom separator can be specified for each variable by passing a dictionary of strings with the same keys as
prepend and append.
If a key is not present in the sep dictionary, os.pathsep will be used unless strict is True, then an error
will be raised.

Comparison to env= parameter in run_shell_cmd for 5.0.x:

  • The env parameter on it's own can be cumbersome to use as it requires redefining the entire environment, which would entail copying the environment and modifying the environment variables to have the same effect as the proposed function.
  • The env parameter would not work for other python function calls (eg testing a module that independently calls commands)

Comparison with currently available framework.environment functions:

  • I do not believe the functionality introduced could be easily replicated using the currently available functions (without rewriting the same logic). It is possible to call reset_changes to reset all variables modified by setvar but it is possible that the user has already set some variables and only want to temporarily modify only some of them or different ones, without resetting the originally set variables.

Possible enhancements/discussions:

  • Adding a set/override parameter to override the value instead of just appending/prepending
    • The utility i would see in this is to make sure the appended path is correct, (eg an incorrect path is appended but the dependency is resolved with the original path leading to unexpected behaviors)
    • Might need checking if the key is defined in (prepend | append) at the same time as (set) and either log a warning or raise an error
  • The way the function is built now, only environment variable modified by the context manager are reset to the original value. If a variable is modified within the context manager it will be reset to the original value only if it was originally modified by the context manager.
    • One possible solution would be to store the entire environment before entering the context, but I am not sure if this should be implemented or if it is possible that leaving non-managed variables to the discretion of the user should be a wanted/useful behavior
  • While i think it would also be useful to have the capabilities to override non path-like variables, (could be implemented together with the set/override paramenter) i am not sure if this should be implemented in a separate function.

@Crivella Crivella changed the base branch from develop to 5.0.x October 23, 2024 15:50
@boegel boegel added this to the 5.0 milestone Nov 6, 2024
Copy link
Contributor

@Micket Micket left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The env parameter on it's own can be cumbersome to use as it requires redefining the entire environment

oh darn, i thought it only replaced the specified environment variables, but indeed it's not the case (and i guess all our docs do indicate that). Makes this pretty much useless for EB apart from maybe a few extremely specific cases.

The env parameter would not work for other python function calls (eg testing a module that independently calls commands)

though they really shouldn't do that

I think we absolutely need this

easybuild/tools/environment.py Show resolved Hide resolved


@contextlib.contextmanager
def wrap_path_env(prepend=None, append=None, sep=os.pathsep, strict=False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to use this also (perhaps mostly) for non-path environment variables, so just replace them
A few real cases that has just come up in the past days have been the need to set "BUILD_VERSION" and "HOME" (which of course should be isolated).

Would be worth covering the case of

with env.wrap_path_env(override={'BUILD_VERSION': self.version}):
    ...

or do we prefer to

with env.wrap_path_env():
    setenv(...)

though, in either case, maybe one ought to downplay the _path part?

Also, while rare, if we ever need to, say, set TCLLIBPATH (space separator) and a a custom libpath as the same time, we are in trouble. So, sep as s separate argument worries me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just thinking about this some more. I guess one can always nest in the rare case that one wants to set the separator.

with wrap_env(prepend=xxx), wrap_env(prepend=yyy, sep=' ')

though lines get long, the situation is extremely rare.

I would like to see something like this used for EB when it installs extensions/components in bundles (pseudo-code):

for extension in extensions:
    with wrap_env():  # prevent extensions from leaking into global env
        process(extension)

and they could, in turn of course just do this themselves when they need further isolation in different steps.

But this also got me thinking how this would possible interact with parallel installation of extensions (not sure if that is using multiprocessing or async or so? Haven't looked at that code yet, so i don't know how/if it would impact the environment)

Sorry for the unstructured brain dump, weekend is calling.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way the function is set up right now it is possible to pass a dict to sep with a separator for every variable.
It will default to os.sep if strict = False or raise if there is no separator found for a variable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this also got me thinking how this would possible interact with parallel installation of extensions (not sure if that is using multiprocessing or async or so? Haven't looked at that code yet, so i don't know how/if it would impact the environment)

Concerning parallel extension installation, it seems that ThreadPoolExecutor is being used to spawn threads

thread_pool = ThreadPoolExecutor(max_workers=self.cfg['parallel'])
that than use subprocess.Popen through run_shell_cmd
https://github.com/easybuilders/easybuild-easyblocks/blob/b9b242a854332a7f0fbde7c8ec112b8f17638567/easybuild/easyblocks/generic/rpackage.py#L322

While the environment of each subprocess should be isolated, the one of threads is not, so unless the code is modified to make sure each run_shell_cmd is invoked only after the previous one in the queue has started it might be difficult to avoid race conditions

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the override argument, beside the threading discussion, I guess one design choice that I am unsure of is whether we should reset only variables that are being explicitly managed by the context or the entire environment,

eg using with wrap_env(override={'TEST1': '1'}): if TEST2 is modified inside the context manager, it will not be reset after the context exits (atleast how the function is setup right now)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, yes, the issues i've hit recently has been about wrapping the entire environment and restoring, since so many easyblocks, hooks, etc. might be touching all kinds of environment variables which indirectly impacts other code.
I don't know of any legitimate cases where we want code to leak into os.environ for all the remaining runtime of eb. Such thing would seem extremely fragile hack to me, but I maybe there are cases where it's truly needed?
If an easyblock needs some variable to be set for config + build + install step, then they can/should set again each time.

If one goes with just wrapping everything, i think it makes sense to just keep it dead simple, and just move the append/prepend helpers as separate functions.

import easybuild.tools.environment as env

def xxx(...):
    with wrap_env():
        env.setvar('BUILD_VERSION', '1.2.3')
        env.append_path('PYTHONPATH', './lib/python')
        env.prepend_path('TCLLIBPATH, './lib', sep=' ')
        run_cmd(...)

but this design would allow for separating the context from the modifications:

    def install_extensions(self, ...):
        for extension in self.extension_list:
            with wrap_env():
                extension.install()  # Might be setting environment variables?

However, the parallel extensions leaves me scratching my chin again here, maybe env isn't so bad?

Parallel extensions

Now this throws a whole wrench into thing. Any python code that runs during these steps must not ever touch any global state; whether it is os.chdir or os.environ, and context can't help here.
The only possibility here is to allow the use of run_shell_cmd's env parameter, despite it being cumbersome.

Maybe that is just best solved with some added convenience methods instead?

def copy_env(override, prepend, append, sep):
    env_copy = os.environ,copy()
    for key, value in override.items()
        env_copy[key] = val
        ...
    return env_copy

or someting like a

class CustomEnv:
    def __init__(self):
        self.environ = os.environ.copy()

    def set(self, key, val):
        self.environ[key] = val
        return self

    def append(self, key, val, sep=':'):
        self.environ[key] = ...
        return self

   def __iter__(self):
       return iter(self.environ)

or maybe an extra override_env for run_shell_cmd directly.

    run_shell_cmd('install_torchvision', override_env={'BUILD_VERSION', '1.2.3'})
    run_shell_cmd('install_torchvision', override_env=append_path('PYTHONPATH', './lib/python'))  # using a helper function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The class approach could(should?) still be a context manager.

  • We could either have something like

    class CustomEnv:
        ...
    
    env = CustomEnv()
    env.append_path(...)
    env.prepend_path(...)
    with env.apply():
        ...
  • or something like

    class CustomEnv:
        ...
    
    with CustomEnv() as env:
        env.append_path(...)
        env.prepend_path(...)
        ...

Regarding run_shell_cmd probably adding an override_env argument might be the easiest solution.
I was thinking if there might be some way to intercept all function calls inside a context and modify them, but that might be very hackish and likely fragile.

Copy link
Contributor

@Micket Micket Nov 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could(should?) still be a context manager.

Maybe, the issue is what anything that might run in parallel threads or async must never modify os.environ, their only option is to pass the env to subprocess (so my idea was that it would be passed to run_shell_cmd(..., env=CustomEnv(xxxx)) using some custom dict-like class just to possible make things more convenient for the typical usecases.

Currently not a lot of parallel code in EB, but if we do any major rework here might be good to design in a way that avoids those problematic globals, just like with cwd which faces the same problem (and easyblocks should be using run_shell_cmd(.., workdir=relative_path) instead).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants