Consolidate loaders #305

Open · wants to merge 5 commits into main
Conversation

@sneakers-the-rat (Contributor) commented Mar 13, 2024

Merging in functionality from main linkml loaders (see: linkml/linkml#1967 )


codecov bot commented Mar 13, 2024

Codecov Report

Attention: Patch coverage is 52.30769%, with 31 lines in your changes missing coverage. Please review.

Project coverage is 62.59%. Comparing base (ed36311) to head (0bd751f).
Report is 1 commit behind head on main.

Files Patch % Lines
linkml_runtime/loaders/delimited_file_loader.py 36.66% 18 Missing and 1 partial ⚠️
linkml_runtime/loaders/yaml_loader.py 28.57% 4 Missing and 1 partial ⚠️
linkml_runtime/loaders/json_loader.py 33.33% 4 Missing ⚠️
linkml_runtime/loaders/passthrough_loader.py 75.00% 2 Missing ⚠️
linkml_runtime/loaders/loader_root.py 87.50% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #305      +/-   ##
==========================================
- Coverage   62.70%   62.59%   -0.12%     
==========================================
  Files          63       64       +1     
  Lines        8580     8637      +57     
  Branches     2444     2455      +11     
==========================================
+ Hits         5380     5406      +26     
- Misses       2583     2612      +29     
- Partials      617      619       +2     


@cmungall (Member)

Hi @sneakers-the-rat - it looks like this makes the loaders stateful. Currently they are stateless (you may ask why they are objects at all, given that they are used just as if they were functions, ...)

This isn't necessarily wrong, but I was wondering if there was a specific justification, vs. for example making an iterable class...

@sneakers-the-rat (Contributor, Author)

I was just trying to preserve existing behavior as much as possible! Literally copy-pasting methods from the main linkml repo to here :) So they are stateful for the things the main linkml repo was using them for, and not stateful for the ways linkml-runtime was using them.

No strong preference from me. Having state would be useful in some contexts where one would like to refer to the source schema in addition to the target class, etc., where the class doesn't have all the info of the schema; but that can also be accomplished in other ways. Happy to modify to make it one or the other.

@pkalita-lbl (Contributor)

Ah, I'm the source of the confusion here! Quoting from your other PR:

Noticed that these seem to be duplicated and older, and most of the work on the loaders is over on linkml-runtime

It's actually the other way around. The ones in the linkml_runtime.loaders package are the older ones (so old, in fact, they predate my time on this project!), and the ones in the linkml.validator package are newer.

An attempt at a brief historical recap: The linkml_runtime.loaders classes were originally built around deserializing data into a schema-based target class (the load/loads/load_any methods). They're also somewhat opinionated (for instance the linkml_runtime.loaders.loader_root.Loader.json_clean method, or the use of DupCheckYamlLoader in linkml_runtime.loaders.yaml_loader.YAMLLoader). Or maybe a better way to say it is that they mix concerns about deserializing, repairing, and validating. They were used in the old linkml-validate code, but that led to all sorts of undesirable behavior where the loader would raise an exception because of its own validation work before the main validation code even got a chance to run (linkml/linkml#891).

This led to the introduction of the load_as_dict methods, which at least removed some of the inadvertent validation of the load_any method by returning plain dicts instead of schema-backed objects. But as I developed the new linkml-validate (linkml/linkml#1494), I wanted something 1) still less opinionated and 2) iterator-based. That's why I introduced the linkml.validator.loaders package.

I guess I had always intended that the linkml.validator.loaders classes would be semi-private to the linkml.validator code -- not general purpose classes. I also might have caused less confusion if I had named them something other than "loaders", given that was a term that already existed in our ecosystem. Someone should have handed me a thesaurus!

I think what gives me the most pause about these changes is that the resulting linkml_runtime.loaders classes become a bit Frankenstein-ish, and not just in the stateful vs stateless aspects. Like, the DelimitedFileLoader classes would have entirely different parsing logic depending on whether you called iter_instances vs load_as_dict.

I guess I'd suggest two possible paths forward:

  1. Work to design truly unified loader classes in linkml_runtime that behave consistently across methods that return iterators vs dicts/lists vs schema-based classes and give control over the more opinionated features. Not a small task!
  2. Keep the linkml.validator classes where they are and consider renaming them or otherwise making it clear that they are not intended as general purpose replacements of linkml_runtime.loaders.

Hopefully that all makes sense. @sneakers-the-rat, I've really appreciated your enthusiasm for contributing lately, so please let me know if you have other thoughts on paths forward.

@sneakers-the-rat (Contributor, Author)

OK, catching up on these open PRs; thanks for the history @pkalita-lbl !!!

Trying to figure out what we want to do here, because I think it would be really good to clean up and unify the loaders/dumpers.

I guess I had always intended that the linkml.validator.loaders classes would be semi-private to the linkml.validator code -- not general purpose classes.

This is totally fair, but I think it sort of points towards a need for a refactor! If the loaders can't handle loading and iterating, to the point where it's easier to fork them, then let's clean them up!

I think I'm interested in these because they seem to be the main place where the 'rubber hits the road' of working with data, and it would be very cool to be able to just point it at the data and press play, particularly as they relate to the transformers/mappers sitting in between the loaders and dumpers.

This'll be especially critical for translating existing data formats like NWB, which will require some special loading logic to pack them into Python classes (like how HDMF does), so we might as well do some tidying now before it's time for that :)

I think what gives me the most pause about these changes is that the resulting linkml_runtime.loaders classes becomes a bit Frankenstein-ish, and not just the stateful vs stateless aspects

Also completely agree. I was mostly just consolidating them into one place, callable identically to how they're being used now, so that we can improve on them iteratively.

So far we have...

Stateful/stateless

Currently they are stateless (you may ask why they are objects at all, given that they are used just as if they were functions, ...)

I actually like treating classes as collections of functions; that makes perfect sense to me. I think we can do both here.

It makes sense to want to treat them as just functions: pass arguments, receive models/data. It also makes sense to have a bit of state in them. There are a decent number of parameters to be passed around, and the anonymity of *args and **kwargs can make it hard to understand the internals and loses type information. Also, for things like wanting to load many datasets from the same schema, or dump the same data to many formats, or use a bunch of parameters, it would be nice to instantiate a loader and only vary it along the one parameter that changes, caching the rest. E.g., thinking about linkml.utils.converter, where a set of Python models is generated for a loaded schema: could that be made part of the loaders so there's more of a seamless bridge between data and schema?

Would a combination make sense: @classmethods or __call__ methods that keep the function-like behavior, but internally instantiate a loader/dumper with defaults and return the result? The pattern Loader().load_*(schema) sort of implies that something could happen with the instance, so that could be signaled clearly by making the functional form something like Loader.load_*(schema) and the instantiated form Loader(schema).load_*().

We could also (and probably should) have pure functional forms that are just def load_yaml() and the like as top-level functions. Whatever works!
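A rough sketch of what that combination could look like (the YAMLLoader here is hypothetical; names and signatures are illustrative, not existing API):

from typing import Any, Optional
import yaml

class YAMLLoader:
    def __init__(self, source: Optional[str] = None):
        # cached state, so repeated loads only vary the parameters that change
        self.source = source

    def load(self, source: Optional[str] = None) -> Any:
        # instance form: YAMLLoader(path).load()
        path = source if source is not None else self.source
        with open(path) as f:
            return yaml.safe_load(f)

    @classmethod
    def load_once(cls, source: str) -> Any:
        # functional form: YAMLLoader.load_once(path) internally instantiates
        # with defaults and returns the result
        return cls(source).load()

Either form could then back a pure top-level def load_yaml(path) wrapper.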

Logic Unification

Like the DelimitedFileLoader classes would have entirely different parsing logic depending on if you called iter_instances vs load_as_dict.

ya that sounds super bad lol.

It seems like the linkml-runtime loaders are for casting the data into classes from a schema, and the validator loaders are for just iterating over the contents of the CSV. It might make sense to make this a general pattern across loaders:

Basing them around generators/iterators makes sense for dealing with data that one might not want to load entirely into memory, especially for things like CSVs where you can easily iterate over rows. So:

  • an __iter__ method for each just iterates over the raw data from load
  • an iter_into method for casting raw data into provided models: this would also make it easier to handle cases where, e.g., a dataset has heterogeneous types with a type indicator

and then:

  • the basic load method just returns the "unmodified" form from __iter__
  • a load_into method is for casting data into a supplied class

load_as_dict is clear. The base load method returning a model isn't exactly what I'd expect at first, but it's not bad. I'm really not sure what load_any means, though; it seems like that's the plural version of load?

An _into() form potentially gives us clearer semantics for specifying common operations as strings without needing to have the class loaded. E.g., utils.schemaview:load_schema_wrap is a thin wrapper around YAMLLoader into a SchemaDefinition, but that could also be YAMLLoader.load_into(path, 'schema') (or we could even infer something as a schema by conformance to the schema). That's sort of an important one, since at the moment there isn't really a clear way to load a linkml schema; I just end up using schemaview for everything, but that's not terribly obvious right now. It should probably be a top-level thing like linkml.load_yaml() or .load_schema().
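A minimal sketch of that iterator/eager split, assuming a hypothetical Loader base class and a CSV child (none of these names are existing API):

from abc import ABC, abstractmethod
from typing import Any, Dict, Iterator, List, Type, TypeVar
import csv

T = TypeVar("T")

class Loader(ABC):
    @abstractmethod
    def __iter__(self) -> Iterator[Dict[str, Any]]:
        """Iterate over raw records from the source."""

    def iter_into(self, target_class: Type[T]) -> Iterator[T]:
        # cast each raw record into the supplied model as it streams
        for record in self:
            yield target_class(**record)

    def load(self) -> List[Dict[str, Any]]:
        # eager counterpart of __iter__: the "unmodified" form
        return list(self)

    def load_into(self, target_class: Type[T]) -> List[T]:
        # eager counterpart of iter_into
        return list(self.iter_into(target_class))

class CSVLoader(Loader):
    def __init__(self, path: str):
        self.path = path

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        # rows stream from disk, so large files never load fully into memory
        with open(self.path, newline="") as f:
            yield from csv.DictReader(f)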

Broader Unification

There's no getting away from it: linkml is a yaml-driven package! So having a single means of loading yaml would, I think, bring clarity to some other parts of the package. I'm not saying "let's put all yaml things into the dumpers/loaders," but I am saying we could put all yaml I/O things into them.

I already mentioned the converter example, but there are also:

  • linkml.utils.datautils, which does some loading logic for inferring the class and getting the right dumper/loader,
  • linkml.utils.rawloader, which is another wrapper around loading into a SchemaDefinition,
  • linkml.utils.schemaloader: from what I gather this is on its way out in favor of schemaview, but here is yet another loader,
  • linkml_runtime.utils.yamlutils, which also has its own loader that diverges a bit further, with the DupCheckYamlLoader, number handling, etc., that would probably be super useful to have in a generalized loader.

This is particularly important given the perhaps-unexpected extended universe of yaml forms, for people who aren't all that familiar with the format: all these weird directives and tags and whatnot might be bewildering, and offering a single load_yaml entrypoint is a bit friendlier than needing to understand yaml.load(path, Loader) or yaml.safe_load (what? you mean it could be unsafe!?!?).
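For instance, a single entrypoint could hide the Loader-class choice entirely. A sketch, assuming the hypothetical load_yaml name and check_duplicates flag (DupCheckYamlLoader itself is the existing class in linkml_runtime.utils.yamlutils):

import yaml
from linkml_runtime.utils.yamlutils import DupCheckYamlLoader

def load_yaml(path: str, check_duplicates: bool = True):
    # callers never pick between yaml.load/yaml.safe_load or a Loader class
    loader_cls = DupCheckYamlLoader if check_duplicates else yaml.SafeLoader
    with open(path) as f:
        return yaml.load(f, Loader=loader_cls)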


So, just sketching some ideas and paths forward, but yes, @pkalita-lbl, I would pick option (1). Maybe not all in one PR, but this gets us started in that direction, and then we can roadmap out what else we want here. I'd also be down to do a bit of docs work, since this was the part where I got a little lost my first time through (OK, I have a schema, now what?).

lmk what ya think, sorry for the long comment <3

@cmungall (Member)

Briefly: incremental PRs that don't change existing client-facing signatures or introduce new, potentially unstable public methods are most welcome! Also, it seems we may need a bit more depth in the core developers' guide explaining some of this. schemaloader and rawloader are indeed part of the older-style generators and don't concern us here (but I totally appreciate how their presence and naming confuse things).

There are a few bigger things I'd like to coalesce on before diving into a full refactor:

  1. strategy for schema introspection across generators
  2. strategy for database endpoints

These are both touched on in other issues, but for 1, pydantic's built-in field introspection is sufficient for json/yaml loading and dumping, while for, say, rdf loading/dumping there is insufficient metadata in the Python itself. Now, we are making great progress towards the "great convergence" where pydanticgen is on a par with the "good bits" of pythongen, but I'd be a bit more comfortable finishing some of that first, e.g. making the curies/uris of classes and slots introspectable. It may be the case that we can make future versions of rdf loading and dumping standalone, with no need for the client to orchestrate creating a schemaview object, which would be nice.

For 2, a design question is whether we want some kind of unified interface for loading/dumping from sqldbs, duckdb, mongo, etc., or whether to keep this as a separate concern (if we do this then we'd obviously want to keep linkml-runtime light and have the implementations for some of these backends be plugins... we are already addressing this to some extent with linkml-arrays and hdf5...). Either way, linkml-sqldb feels a little isolated at the moment and wants to be a bit more unified with the way we handle other files.

@sneakers-the-rat (Contributor, Author)

You know about a billion times more of the lay of the land than I do, obviously, so thanks for the perspective.

Re: 1) that shouldn't be too hard, actually; basically it will be

field.py.jinja2 (pseudocode, pretend the rest of the field generator stays intact)

Field(json_schema_extra={"linkml_meta": {
    {% for key, val in meta.items() %}
    "{{ key }}": {{ val }},
    {% endfor %}
}})

class.py.jinja2 (same thing)

class {{ name }}({{ bases }}):
    linkml_meta: ClassVar[LinkMLMeta] = LinkMLMeta(
        {% for key, val in meta.items() %}
        {{ key }}={{ val }},
        {% endfor %}
    )

and then since we have domain models that tell us which fields are consumed by the template and which don't have direct representation, we would just do something like this:

from typing import Any, Dict, Optional
from pydantic import BaseModel
from linkml_runtime.utils.formatutils import remove_empty_items

class TemplateModel(BaseModel):
    meta: Optional[Dict[str, Any]] = None

    def fill_meta(self, definition):
        # "def" is a reserved word in Python, so the Definition is taken under
        # another name; keep only keys the template doesn't render directly
        items = remove_empty_items(definition)
        items = {k: v for k, v in items.items() if k not in self.model_fields}
        self.meta = items

And that's pretty much it. We don't even need to worry about formatting in the template now that we format it all with black: dump whatever we want in there, and we can make introspection methods in a parent class to make field-level metadata easier to access.
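Such a parent-class introspection helper might look roughly like this (hypothetical names; assumes pydantic v2, where json_schema_extra may also be a callable rather than a dict):

from typing import Any, Dict
from pydantic import BaseModel

class MetaBaseModel(BaseModel):
    @classmethod
    def field_meta(cls, field_name: str) -> Dict[str, Any]:
        # pull the generated linkml_meta dict back out of json_schema_extra
        extra = cls.model_fields[field_name].json_schema_extra
        if not isinstance(extra, dict):
            return {}
        return extra.get("linkml_meta", {})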

Re: 2, I don't really know! I think we should clean up these classes so that we can make them hookable. If we expose a clean interface, someone could write support for their favorite DB, and I feel like that'll be more powerful than us trying to implement them all in the main repo (ultimately that is what I am trying to do with the pydantic generators).

I feel like there are a few different concerns that converge towards a consolidation of loading and dumping behavior in its different forms:

  • dealing with schema yaml
  • schematized data
  • interop with other formats like sql/rdf stores
  • developing with linkml

And so yes, it seems like we just need a plan :) I feel like getting started by refactoring the behavior we already have is low-hanging fruit en route to grander visions.

def _construct_target_class(self,
@abstractmethod
def iter_instances(self) -> Iterator[Any]:
"""Lazily load data instances from the source
(Member)

[minor: can be iterated on in new PR]

Can we clarify what a data instance would be?

It seems that this is canonically a dict, never an instance of a class (whether dataclass or pydantic)? Or would it be (e.g. pkl serialization)? Would rdflib_loader eventually implement this with a Triple/Quad object, or a 3-or-4-tuple?

I'm tending towards a more predictable signature (iterates over dicts) with some guarantees

@sneakers-the-rat (Contributor, Author)

Totally. So the type annotation here would be refined in the child loaders; keeping it Any here is just to say "there will be some iterator." (It could actually just be Iterator, and then we would do Iterator[dict[str, JsonObj | dict | list]] or whatever in the child objects.) We could make this type a union of all the child types, but it wouldn't really give us much, because the child implementations should override it.
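A toy illustration of that narrowing (JSONLinesLoader is made up; overriding with a narrower return type is type-safe because Iterator is covariant in its element type):

import json
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterator

class Loader(ABC):
    @abstractmethod
    def iter_instances(self) -> Iterator[Any]:
        """Lazily load data instances from the source."""

class JSONLinesLoader(Loader):
    def __init__(self, path: str):
        self.path = path

    def iter_instances(self) -> Iterator[Dict[str, Any]]:
        # the child override refines Any to its concrete record type
        with open(self.path) as f:
            for line in f:
                yield json.loads(line)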
