
Allow to merge sections with the same name from different sources #11

Open
hakkeroid opened this issue Oct 23, 2016 · 17 comments

@hakkeroid
Contributor

I started a similar project to yours a while ago but ditched it in favor of yours, as I really like it. However, there is one use case I can't quite achieve.

I want to have a configuration setting that can be extended, as opposed to overridden, by subsequent configuration sources. The following example uses list types, but the idea could be applied to other data types, too.

Consider a tool that loads plugins from a list of paths. I want to allow a user of the tool to extend this list with additional paths coming from a configuration file in the current directory, the user's home directory, and so on. Currently, when any of the sources provides the same name for the paths key, the previous settings are overridden (or, from the perspective of the code, ignored).

~/.config/tool/config.yml

paths:
  - /some/path
  - /some/other/path

./config.yml

paths:
  - ./local/path

Now in python:

import os
from layeredconfig import LayeredConfig, YAMLFile, Defaults

_DEFAULTS = {
  'paths': []
}

# .. path setup

sources = [Defaults(_DEFAULTS)]

if os.path.exists(homedir):
  sources.append(YAMLFile(homedir))

if os.path.exists(current_dir):
  sources.append(YAMLFile(current_dir))

config = LayeredConfig(*sources, merge=['paths'])

Finally, it would be great to do a simple config.paths and get a list of multiple paths, either as a list of lists or flattened into one list. I prefer the nested list, as it lets the user deal with equal values.
I patched layeredconfig locally and can open a PR if you like. However, I wasn't really able to get the tests running due to interpreter and invocation errors, so the changes don't include additional tests.

@staffanm
Owner

I think I understand the use case, and I've had a similar case myself (also concerning paths -- maybe this shows that path configuration often has this need) that I never got around to implementing in LayeredConfig, primarily because I couldn't come up with a simple API. As I understand your suggestion, you have a new "merge" argument to the LayeredConfig constructor that specifies which config keys should be merged instead of replaced. As you write, merging could be done in different ways (either create a flattened list out of several lists, or just tie them together in a nested list). And there could be other methods for consolidating multiple values, such as adding integer values or concatenating string values.

I think that specifying such a "consolidation strategy" for a particular config key is similar to specifying its type, and wonder if the following API would make sense. Assuming the following:

~/.config/tool/config.yml

paths:
  - /some/path
  - /some/other/path
pathstring: /some/path;/some/other/path
times: 1

./config.yml

paths:
  - ./local/path
pathstring: /local/path
times: 2

code:

import operator

_DEFAULTS = {'paths': [],
             'pathstring': '',
             'times': 0
}

_CONSOLIDATORS = {'times': operator.add,
                  'paths': lambda a, b: a + [b],
                  'pathstring': lambda x, y: x + ";" + y
                 }

sources = [Defaults(_DEFAULTS, consolidators=_CONSOLIDATORS)]

if os.path.exists(homedir):
    sources.append(YAMLFile(homedir))

if os.path.exists(current_dir):
    sources.append(YAMLFile(current_dir))

config = LayeredConfig(*sources)                 

I.e. the specified consolidators would be functions that take two versions of a config value (the first being the version with lower precedence) and return a consolidated version.

The default consolidator would be lambda a, b: b, i.e. keep the high-precedence version. For list values, if you want them flattened instead of nested, use lambda a, b: a + b (note that list.append and list.extend return None, so they can't be used directly as consolidators). With the above values and consolidators, the results would be:

>>> config.paths
[["/some/path", "/some/other/path"], ["/local/path"]]

>>> config.pathstring
"/some/path;/some/other/path;/local/path"

>>> config.times
3
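Spelled out, this amounts to folding each key's per-source values with its consolidator. A minimal sketch of that fold (the names layered_values and resolve are illustrative, not actual LayeredConfig internals):

```python
import operator

# Per-key values collected from each source, lowest precedence first
# (Defaults, then ~/.config/tool/config.yml, then ./config.yml).
layered_values = {
    "times": [0, 1, 2],
    "pathstring": ["", "/some/path;/some/other/path", "/local/path"],
}

consolidators = {
    "times": operator.add,
    # Skip the empty default value so the result has no leading ";".
    "pathstring": lambda a, b: a + ";" + b if a else b,
}

def resolve(key):
    # Fold the per-source values with the key's consolidator; the default
    # consolidator keeps only the highest-precedence value.
    consolidate = consolidators.get(key, lambda a, b: b)
    values = layered_values[key]
    result = values[0]
    for value in values[1:]:
        result = consolidate(result, value)
    return result
```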

I think that the above API is very flexible with regards to how a merge/consolidation should be performed. I worry that it might be overly complex though (An API requiring functions as arguments, and giving examples using lambdas is not exactly simple). What do you think?

@hakkeroid
Contributor Author

hakkeroid commented Oct 24, 2016

I also thought about other value types but didn't come up with anything. Your suggestion is definitely a pretty neat solution and, in combination with the default consolidator, a simple one to use in my opinion. Way better than dictating a merge strategy. Honestly, I wouldn't simplify anything here. If that is still needed, though, I could imagine providing some default strategies similar to what you did with 'times': operator.add. Let's say:

def find_latest_timestamp(previous_source, next_source):
    ...  # datetime comparison

_CONSOLIDATORS = {
  'int_values': 'add',
  'paths': 'merge',
  'simplelist': 'flatten',
  'pathstring': lambda x, y: x + ";" + y,  # simple special case
  'timestamps': find_latest_timestamp,  # more complex special case
}

# ...

cfg = LayeredConfig(*sources, consolidators=_CONSOLIDATORS)

Internally you differentiate between strings and callables and map the strings to default functions. That way a user does not need to fully understand how to work with consecutive sources, at least in simple cases, which might be enough most of the time.
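The string-or-callable dispatch could be sketched like this (hypothetical names; the actual set of default strategies is up for discussion):

```python
# Hypothetical mapping of strategy names to default functions,
# while still accepting plain two-argument callables.
_DEFAULT_STRATEGIES = {
    "add": lambda a, b: a + b,
    "merge": lambda a, b: a + [b],   # nest: append the next source's value
    "flatten": lambda a, b: a + b,   # flatten: concatenate the lists
}

def resolve_consolidator(spec):
    """Accept either a known strategy name or a two-argument callable."""
    if callable(spec):
        return spec
    try:
        return _DEFAULT_STRATEGIES[spec]
    except KeyError:
        raise ValueError("unknown strategy: %r" % (spec,))
```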

I think a more difficult issue is handling changes to these values. In my local implementation I hadn't implemented something like that just yet. Some ideas that come to my mind are

  • Clearly communicate that, as a consequence of merging values, they can't be writable; that would be the simplest solution.
  • Another solution I can think of is letting the user specify a target source for saving the changes. But that's definitely a difficult one: on the one hand, not all sources are writable, and on the other hand, how does a user provide the target source? You could hand the instantiated source to the write method, but that requires keeping the source objects around, or layeredconfig providing easy access to individual sources. For example:
# setup other sources
sources.append(YAMLFile(.., name='somename'))
cfg = LayeredConfig(*sources)

# make your changes
cfg.paths = cfg.paths + ['/even/more/paths']

# access individual sources
target = cfg.sources.somename

# and save them
LayeredConfig.write(cfg, target_source=target)

Although that feels a bit clunky it might open up other possibilities like saving values to specific targets similar to git's --global or --local flags.

  • I had another one but can't remember it.. :-/
  • Finally you could still rely on the current best-guess approach to find writable sources.

Except for the first one, all of these require a type check when saving, as the types of the merged value and the original might differ.

[edit]
Oh.. sorry, I completely missed the fact that you can specify identifiers already. That makes things straightforward, and most of my concerns are irrelevant. Although I couldn't figure out how to save changed values when you use LayeredConfig.set, except by directly using cfg._sources. According to the note in the docstring, this function is meant for internal trickery only. The only other thing left is a simple type check on whether the value fits the original value's type.

@staffanm
Owner

Your suggestion to have a set of default consolidators makes a lot of sense. Maybe they could be provided as functions by the main module, so that you could handle them like constants, something like:

from layeredconfig.consolidators import add, merge, flatten

_CONSOLIDATORS= {
  'int_values': add,
  'paths': merge,
  'simplelist': flatten, ... 
}

This makes it clear to the intermediate user that these "constants" are just callables that accept two arguments and return one.

Regarding what to do when a consolidated value is changed, I think it would be least surprising if the current highest-precedence-writable-source approach is used. The issue of changing a value to a different type than its original already exists (i.e. if you specify that a value should be a str, then change it to a list in an initialized config object), and it's not clear to me that it should be handled any differently than now.

Is "consolidated" the best term for what we're discussing here? Maybe "merged", which you suggested in the title of this issue, is just as descriptive and easier to understand?

@hakkeroid
Contributor Author

hakkeroid commented Oct 25, 2016

I also thought about importing the callables from a submodule. However, if we use predefined strategies anyway, we could simplify their usage even more by just using strings. The advantage of importing callables, though, is that one can inspect them with something like ipdb. In the end I am absolutely fine with both ways.

In terms of handling changes you are right. There is no need to change anything if that functionality is already in place. I wasn't sure about the current state. So let's forget about my concerns.

Now for the naming part. I would not call it "merged", as that is an adjective. Maybe "mergers", although I wouldn't go for "merge" in general, as it implies that the values are combined somehow; this contradicts the case of simply shadowing them, which is the default behavior. Other than that, I thought about the context we are in: configuration settings, files, applications and development in general. Words like "strategies" or "processors" come to mind. They sound a bit generic, but in the end that is exactly what we are doing: processing values with different strategies. So personally I would go for "strategies", which clearly communicates that we are doing the same thing (handling values) in different ways. It requires users to know what the strategies are for, but that should be the case if they want to use them in the first place. In the end I really have no strong feelings one way or the other. I am absolutely happy if this feature makes it into layeredconfig. :-)

from layeredconfig import LayeredConfig, strategy

_STRATEGIES= {
  'int_values': strategy.add,
  'paths': strategy.merge,
  'simplelist': strategy.flatten, ... 
}
config = LayeredConfig(*sources, strategies=_STRATEGIES)

[edit]
One word regarding the default strategy: I am not sure it is really required. If we are shadowing underlying sources anyway, we do not need to traverse the complete chain of sources. So in reality we do not need a "default strategy", as the default simply means stopping as soon as we find the value; there is no need to process it against consecutive values.
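That distinction can be sketched as a lookup that only walks the whole chain when a strategy is registered, and otherwise short-circuits at the first (highest-precedence) hit. A rough illustration with plain dicts as stand-ins for sources:

```python
def lookup(key, sources, strategies):
    """Sketch of key lookup; sources are ordered lowest to highest precedence."""
    if key not in strategies:
        # Default behavior: shadowing. Walk from the highest-precedence
        # source and stop at the first one that has the key.
        for source in reversed(sources):
            if key in source:
                return source[key]
        raise KeyError(key)
    # A registered strategy folds over every source that has the key,
    # lowest precedence first.
    consolidate = strategies[key]
    values = [source[key] for source in sources if key in source]
    if not values:
        raise KeyError(key)
    result = values[0]
    for value in values[1:]:
        result = consolidate(result, value)
    return result
```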

@staffanm
Owner

This is starting to look like a solid extension of the API -- thank you very much for the productive discussion! I'll try to write some tests and see if it can be implemented, and if so, release it as version 0.4.0. Due to time constraints, I won't be able to do it within the next few weeks, but hopefully before year's end. PRs are welcome in the meantime, of course... :-)

@hakkeroid
Contributor Author

You are most welcome. I enjoyed the discussion and am eager to hop into the code. Let's see if I can get something up and running that is more robust than my current modifications.

@hakkeroid
Contributor Author

hakkeroid commented Mar 9, 2017

I had a look into it and tried to prepare some tests. While doing that, I found pytest to be easier to read and use for debugging purposes. However, it does not work without small adjustments to the test suite. Out of curiosity, what do you think about pytest? Personally I am a huge fan of pytest, as it reduces much of the boilerplate needed with unittest. Is it an option for you?

@staffanm
Owner

I would prefer to use only what's included in the stdlib, as every dependency creates some risk (in particular, that it won't support all platforms and versions that layeredconfig itself supports). But since pytest would be a developer-only dependency and is very well supported on all platforms (I guess?), I could consider using it. What kind of adjustments were needed?

@hakkeroid
Contributor Author

hakkeroid commented Mar 10, 2017

I never had any issues with pytest together with any python version or os platform. According to its tox file it also looks very compatible.

If you are not familiar with pytest: it basically searches for standalone functions or methods starting with "test". Additionally, for methods, the containing class itself has to start with "Test". Subclassing unittest.TestCase is not required, though; pytest supports, but is not restricted to, unittest-style test classes.
Now in the test suite there are two classes (TestLayeredConfigHelper and TestConfigSourceHelper) which are not meant to be executed directly but work as mixins. However, due to their names, pytest collects and executes them, too. Specifically, TestConfigSourceHelper is an issue because its methods start with "test" and fail because of the missing self.simple and self.complex variables. So I renamed both helper classes to LayeredConfigHelperTests and ConfigSourceHelperTests.

With unittest this does not make a difference as none of the mixins subclass unittest.TestCase anyway.
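The renaming can be illustrated in miniature (class and attribute names here are examples, not the actual test-suite classes):

```python
import unittest

# pytest collects classes whose names start with "Test", so a shared
# helper/mixin must avoid that prefix to not be run on its own.
class ConfigSourceHelperTests(object):      # not collected by pytest
    def test_simple(self):
        assert self.simple == 1             # attribute set up by the concrete class

# The concrete class mixes the helper into a real TestCase and provides
# the fixtures the helper's tests rely on.
class TestYAMLFile(ConfigSourceHelperTests, unittest.TestCase):
    def setUp(self):
        self.simple = 1
```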

I pushed a branch to make communication a bit easier. :-)

@staffanm
Owner

staffanm commented Mar 10, 2017 via email

@hakkeroid
Contributor Author

I stumbled across a couple of things related to internal behavior. Do you want to discuss them here, or do you prefer another channel to keep the comment section clean and readable for others?

Implementation details aside, I wondered how to best approach strategies on nested keys. We could either define that a key in the strategy map applies globally (on any nesting level), or use something like dots to denote the nesting level where the rule applies. I.e.:

_STRATEGIES = {
  'int_values': strategy.add,
  'paths.homedir': strategy.merge,
  'mymodule.nested.simplelist': strategy.flatten, ... 
}
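One possible lookup rule for such a map -- dotted entries pin a strategy to one exact location, bare keys apply at any nesting level -- could look like this (a sketch only; the rules themselves are what's up for discussion):

```python
def strategy_for(path, strategies):
    """Return the strategy registered for a nested key, or None.

    path is the chain of section names down to the key, e.g.
    ("mymodule", "nested", "simplelist"). A dotted entry in the map
    takes precedence over a bare key.
    """
    dotted = ".".join(path)
    if dotted in strategies:
        return strategies[dotted]
    # Fall back to a bare key, which matches at any nesting level.
    return strategies.get(path[-1])
```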

@hakkeroid
Contributor Author

hakkeroid commented Apr 4, 2017

Hi staffanm, sorry for the lengthy comment. However I want to be as clear as possible about my reasoning.

tl;dr
Because implementing the strategy feature made the code complex and, in my opinion, reduced its maintainability, I rebuilt the sources and the config object from scratch to change how they work internally. Now I want to ask your opinion on that solution and whether you think it is worth the effort to merge the changes back or not.

[edit] I created a repository with the conceptual code.

Long version
For the last couple of weeks I looked into the issue and tried to build a solution for it. I tried multiple approaches, but as mentioned, I was never really satisfied with the complexity and maintainability of the result. I also tried to solve the issue on blank paper. After I kept dropping all the implementations, I tried to think more outside the box: I forgot about the layering concept entirely and restarted from the source objects themselves. As a result I got a self-contained source object that can be used without requiring a layering object.

from layeredconfig import YamlSource

cfg = YamlSource('/path/to/cfg.yml')

assert cfg.mymodule.extra == True

After the foundation was laid out, the next step was to reintroduce layering. Because both concepts -- handling a source and layering multiple sources -- were now separated from each other, adding the final solution for folding over the values (merge strategies) was easy.

from layeredconfig import LayeredConfig, YamlSource, strategies

cfg = LayeredConfig(
    INISource('/path/to/cfg.ini'),
    YamlSource('/path/to/cfg.yml'),
    strategies={'paths': strategies.collect}
)

assert cfg.paths == [['/path/from/yaml1', '/path/from/yaml2'], ['/path/from/ini']]

It was also fairly easy to introduce other features, for example a custom type-conversion map, which I would otherwise definitely have opened another issue for.

from pathlib import Path
from datetime import datetime

from layeredconfig import YamlSource

cfg = YamlSource('/path/to/cfg.yml',
    type_map={
        'homedir': lambda p: Path(p),
        'date': lambda d: datetime.strptime(d, '%Y-%m-%d')
   }
)

assert cfg.homedir == Path('/path/from/yaml1')
assert cfg.extra.date == datetime(2017, 4, 4)

Meanwhile I could also fix (at least in my opinion) a couple of smaller inconveniences, for example the dump feature, which requires the config object as a parameter although it can be called from the config itself.

from layeredconfig import YamlSource

cfg = YamlSource('/path/to/cfg.yml')

# currently
assert cfg.dump(cfg.mymodule) == {'extra': True}
# also, the following statement does not do what one might expect
assert cfg.mymodule.dump(cfg) == {'extra': True}  # fails, as it returns {'mymodule': {'extra': True}}

# in this version
assert cfg.mymodule.dump() == {'extra': True}
assert cfg.dump() == {'mymodule': {'extra': True}}

I also added dictionary-like access without the need for dumping the config beforehand. That should simplify accessing values programmatically.

from layeredconfig import YamlSource

cfg = YamlSource('/path/to/cfg.yml')

assert cfg.mymodule['extra'] == True
assert cfg['mymodule'].extra == True

Additionally, keys are resolved lazily, so changes to source files are visible the moment one asks for a key. This could easily be extended with caching. Sources now only need to implement a single method, _read, and optionally a second, _write, if they are writable.

# source implementation for a yaml file
import yaml

class YamlFile(Source):
    def __init__(self, source, **kwargs):
        super(YamlFile, self).__init__(**kwargs)
        self._source = source

    def _read(self):
        with open(self._source) as fh:
            return yaml.safe_load(fh)

    def _write(self, data):
        with open(self._source, 'w') as fh:
            yaml.dump(data, fh)

For the test suite I created a mocked etcd storage, so there is no need to install etcd on localhost (at least for quick checkouts). The yaml and etcd (requests) dependencies are optional, too; the code checks at runtime whether they are missing. For a personal use case I find that convenient, as I am deploying to my Raspbian box, where installing pyyaml is a hurdle.
(excerpt from tests)

from layeredconfig import EtcdStore

def test_etcd():
    cfg = EtcdStore('/bogus/url')
    cfg._connector = FakeConnector()

    assert cfg.mymodule['extra'] == True
    assert cfg['mymodule'].extra == True

Test coverage is almost 100%.

That's it for now. I hope I did not overwhelm you. I did not provide implementations for all sources, as I wanted to get your feedback first. As everything was written from scratch, I can't provide a simple "branch" to look at; I might create a repository to push the code to. Maybe you already have some strong feelings one way or the other, though. If you think that a merge is not applicable, or if you simply do not want to go that route, that is fine, too. Just let me know what you think.

@staffanm
Owner

Hi hakkeroid,

Apologies for the late answer. I'm very impressed with the amount of effort you've poured into this, and would be happy to merge your changes. I do have some questions though:

  1. You write that it's now possible to call .dump() on a config instance instead of using the static LayeredConfig.dump() method. The main reason I used the static-method style (which is also used e.g. in LayeredConfig.save()) was that I wanted to avoid reserving keywords on the config object. With your approach, is it still possible to have a config key named "dump"?

  2. The type_map feature seems like a flexible way of allowing arbitrary type conversions. How would it work when using a source that has full typing support (e.g. DefaultSource)? Would the "old" way of type conversions still be supported?

  3. Could a single source instance (not wrapped up in a LayeredConfig instance) be used in all places that a regular LayeredConfig instance can be used, i.e. could you pass it to LayeredConfig.save() or .dump() and have it work exactly the same way?

  4. Optional/runtime dependencies on pyyaml are ok, but I'd really like it if a standard "pip install layeredconfig" by default installed everything needed for YamlSource, including pyyaml. Maybe the fancy dependency specification mechanism could be used to specify whether or not you want YAML support, like "pip install layeredconfig[noyaml]"? I've never written a package with optional dependencies, but it seems like it could be possible.

  5. Dictionary-like access to the config object would definitely be useful (I've been forced to use getattr on a LayeredConfig instance plenty of times in my code), but I'm concerned that there would be two different but very similar ways of getting config information from an instance. Couldn't the static method LayeredConfig.get() be used for programmatic access to fields instead?

Mocking etcd-storage in the unit tests is fine -- we'll still need to keep tests that run against a live etcd instance, but that's really more of an integration test, not a unit test.

Again, thank you so much for all the work you've put into this! I really like how it's shaping up!

@hakkeroid
Contributor Author

hakkeroid commented Apr 12, 2017

Hi staffanm,

welcome back and I am very happy to hear that you like the approach. So let me clarify your questions.

  1. I am glad you asked. For some reason I forgot to mention this particular case. Basically yes, it is possible: you can simply use the dictionary-like access style. I tried to stick as close as possible to the way dictionaries work, as this is very well-known behavior. The question applies similarly to the convenience methods keys(), items(), get(), setdefault() and update().
assert cfg.mydata['dump'].other_key.dump() == {'subkey': 'value'}

For private methods I also considered using double underscores to automatically prepend the class name and further reduce any naming conflicts, so that for example __write() becomes _Source__write(). Although they are generally discouraged, I think this might be a valid use case.
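For illustration, Python's name mangling behaves like this (a minimal demo with a made-up Source class, not LayeredConfig code):

```python
class Source(object):
    def __write(self, data):
        # Stored on the class as _Source__write, so a config key named
        # "write" or "_write" cannot collide with it.
        return data

    def save(self, data):
        # Inside the class body the name is mangled automatically.
        return self.__write(data)

s = Source()
```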

  2. Sorry, I made a mistake when talking about the type_map. First, regarding your question: the type_map is an optional map that gets applied on top of the key returned by any source. That means that no matter what the underlying source returns, the user can additionally specify a conversion. I hope that answers your question regarding the "old" way. Now for the part I messed up: the type_map does not actually map a key to a simple function, because then it would be impossible to convert the value back when writing to a source. Instead you have to specify a CustomType() with a customizing and a reset function:
# consider we want to turn a list into a set while we work with it
source_data = {'simple_list': [1, 2, 3]}

# on read: customize it to set
# on write: reset it to list
type_map = {'simple_list': CustomType(customize=set, reset=list)}  

cfg = DefaultDict(source_data, type_map=type_map)

assert isinstance(cfg.simple_list, set)
cfg.simple_list.add(3)  # does not change it
cfg.simple_list.add(4)
assert cfg.simple_list == set([1, 2, 3, 4])
# source_data was written while adding "4"
assert source_data == {'simple_list': [1, 2, 3, 4]}

Regarding the naming of customize and reset: maybe more descriptive names would be to_type and to_source.

  3. In theory, yes. Both the source and the layered configs share the same dictionary-like interface, so for code using a config object it is irrelevant whether it holds a simple source or a complex layered one. However, regarding dump() and save(), there is no "compatibility layer" that allows you to pass config objects to them in the current implementation, mainly because this is no longer required. I will get back to that below.

  4. You are right, I forgot about that feature. It is based on setuptools' extras_require, and from quickly skimming the documentation, it does not seem to work in an "opt-out" style. So consider the following setup.py.

setup(
    ...
    extras_require={
        'yaml': ['pyyaml'],
        'etcd': ['requests'],
    }
)

Now you can install the whole suite with the following command.

pip install "layeredconfig[yaml,etcd]"

So from the looks of it there does not seem to be a way in extras_require to remove requirements from the full list.

  5. Why are you concerned about there being two ways? You can still use get() to retrieve any key, especially when you want to provide a default value in case it cannot be found. So this is pretty much the same as how dict works.
cfg.nested.get('my_key', False)
cfg.nested['my_key']
cfg.nested.setdefault('my_key', []).append(True)
  6. You are absolutely right about the integration tests. One of the reasons I "mocked" it was that I used detox to test all python environments at once, which fails if you only have one real etcd store.

Now regarding the "compatibility layer": the current implementation is not fully compatible with the way layeredconfig works right now, so I think it would definitely be a major version bump if we follow semver.org. We should, of course, create a small facade that provides the current behavior and possibly deprecates some features, like calling the static methods, promoting the non-static counterparts instead. This facade would be released as a minor version, so that people can convert their code at their own speed.

I really don't want to bloat this comment section, but I think it is important to clarify some things. Since you were asking about the save method, let me explain how that works: basically, there is no save method at all. By default, changes are immediately written to the underlying source through _write(data). Similarly, keys are always read through _read() the moment they are accessed on the config object. That makes the required interface for inherited sources extremely simple and also implements real "lazy access": if you load the configuration and the source changes in the meantime, you actually get the latest data. That might be important for etcd stores and continuously running webapps. However, I also added a cache mixin, which adds the method write_cache() to the sources and can be activated through a parameter. Consider this example:

cfg = EtcdStore('/some/host')
# this calls the etcd store three times (two reads, one write)
cfg.sum = cfg.num_a + cfg.num_b

cfg = EtcdStore('/some/host', cached=True)
# this calls it only once
cfg.sum = cfg.num_a + cfg.num_b
# and now a second time
cfg.write_cache()

In reality, the etcd store uses caching by default. The naming of the method could be changed, but I find it helpful when the name precisely reflects what it actually does. When providing a facade, we could still add a save method and activate caching by default for all sources.

There is more to it but it is already way too much text. Anyway you are welcome and I really enjoyed wrapping my head around this stuff. :-)

[edit] I think I will use the README in my repository to add a couple of examples to show how it might work right now. Or maybe you can have a look into the test cases for the sources.

[edit2] One last thing; generally the dump method became really unimportant now that the configs behave almost exactly like dictionaries.

[edit3] I updated the type_map question (2) as I made a mistake.

@hakkeroid
Contributor Author

Hey staffanm,

I kept thinking about how to best approach merging both code bases. On the one hand, I would like to use the strategy part in another project as soon as possible; on the other hand, both code bases are very different from each other, and it requires some effort to merge them. For that, we should take small steps to prevent any breakage.
I think the best approach is to publish my code base as a separate project for now and then work towards a merge. That solves the following issues:

  • I can already "officially" make use of my code base
  • we can take small (refactoring) steps to merge both projects down the road
  • it takes away the time pressure from both of us :-)

@staffanm
Owner

Hi hakkeroid,

I would be totally ok with you publishing your code base as a separate project. As you've noticed, I am unable to devote much time to Layeredconfig right now. As I understand it, the core of your code base is a clean-room (no code shared) implementation of the API?

@hakkeroid
Contributor Author

Now I was busy.. darn!

Yes, you are right about that. What is shared is mostly the API, so basically how everything works from the user's perspective (although I have to rename the tool for now, of course). Significant differences code-wise are:

  • The sources are first-class citizens, responsible for traversing a single source(-tree) vertically, which is why they can be used standalone. As a result, they are not thin layers with getters, setters and subsection methods, but fully functional dict-like containers.
  • On the opposite side, the layering class becomes an addition on top that only handles traversing the source list horizontally. It has very limited knowledge of subsections and as such became thinner and simpler to extend.
  • Finally, nested (subsection) keys do not return layered objects wrapping stripped-down source objects created from the sources' subsections, but the same original source objects plus a key chain that specifies which level of the source tree to look at next.

To make it clearer, I am talking about this part, specifically creating, storing and returning stripped-down subsources. In comparison, I separated the vertical source-tree traversal from the horizontal source-list traversal, which simplified applying the strategy and custom-type parts.
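The key-chain idea can be sketched in a few lines (View and the dict-backed sources are illustrative stand-ins, not the actual implementation):

```python
# A subsection view keeps the original source objects plus the path into
# them, instead of building stripped-down copies per subsection.
class View(object):
    def __init__(self, sources, chain=()):
        self._sources = sources      # the untouched source objects
        self._chain = chain          # where in the nested tree we are

    def __getattr__(self, key):
        chain = self._chain + (key,)
        for source in reversed(self._sources):   # highest precedence first
            node = source
            try:
                for part in chain:               # walk the key chain
                    node = node[part]
            except KeyError:
                continue
            if isinstance(node, dict):
                # A subsection: same sources, longer chain.
                return View(self._sources, chain)
            return node
        raise AttributeError(key)

cfg = View([{"mymodule": {"extra": False}}, {"mymodule": {"extra": True}}])
```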

I am preparing some documentation with extended examples right now and will add a link to the released version to my comment afterwards. Then you can have a look at it if you like. :-)
