Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for retrieving original XMLAnnotation value as string #273

Open
jmuhlich opened this issue Jan 3, 2025 · 5 comments
Open

Support for retrieving original XMLAnnotation value as string #273

jmuhlich opened this issue Jan 3, 2025 · 5 comments

Comments

@jmuhlich
Copy link
Collaborator

jmuhlich commented Jan 3, 2025

For XMLAnnotations, we know from the schema that "The contents of this is not processed as OME XML but should still be well-formed XML". Inside the OMERO code and the Postgres database, the value is indeed just a plain string and no XML processing at all is done. In fact I've noticed that Glencoe's PathViewer app even stores raw JSON in XMLAnnotations (no wrapping root element or anything)!

I'm trying to add support for XMLAnnotations to omero-cli-transfer, but the way ome-types represents the XMLAnnotation Value as a parsed DOM-like AnyElement structure is making it hard to ship XMLAnnotation content faithfully from OMERO A -> ome-xml -> OMERO B. In order to recreate an annotation in the receiving OMERO instance, I need to provide the Value as a string, but without the opening and closing <Value> tag. I tried calling to_xml() on the annotation or the Value, then doing some DOM munging and re-serializing, but since Value can contain mixed XML content it's actually a little tricky and I'm not confident I can reliably obtain the exact original string. I'm currently leaning toward constructing ome_types XMLAnnotation.Value objects explicitly as Value(any_elements=[value_string_from_omero]) so any actual XML content ends up getting xml-escaped and not parsed, but this seems hacky. Do you have any other thoughts? Could ome-types be modified to retain the original string value in addition to the parsed AnyElement thing? Or maybe the AnyElement representation could be produced on-demand from the stored string rather than serve as the internal representation.

@tlambert03
Copy link
Owner

Could ome-types be modified to retain the original string value in addition to the parsed AnyElement thing? Or maybe the AnyElement representation could be produced on-demand from the stored string rather than serve as the internal representation.

both of these could perhaps work...
I know i should probably be able to glean it from your issue here, but do you happen to have a quick example of an XMLAnnotation (perhaps already in the tests?) and a small test or bit of code demonstrating what you wish you could achieve? that would make it easier to implement

@jmuhlich
Copy link
Collaborator Author

jmuhlich commented Jan 3, 2025

Sure! The test is to encode the 3 strings below as XMLAnnotation Values, write them to an ome-xml document (inside a larger OME structure of course), then read that ome-xml document back in and recover the original strings exactly. The problem arises as soon as the XMLAnnotations are constructed, before the xml round-trip:

from ome_types.model import OME, XMLAnnotation

values = [
    '{"foo":1,"bar:"other"}',
    '  <outer xmlns="http://example.com/">  <inner>ABC</inner>  </outer>  ',
    'hello <mis> </matched> world',
]

ome = OME()
for v in values:
    ome.structured_annotations.append(XMLAnnotation(value=v))

def get_xmlannotation_value(a):
    # What goes here?
    ...

values2 = [get_xmlannotation_value(a) for a in ome.structured_annotations]
for v, v2 in zip(values, values2):
     print("before:", repr(v))
     print("after: ", repr(v2))
     print()
assert values2 == values, "value mismatch"

I tried a number of different things without success. For example this still leaves a bunch of extra namespace declarations in values 2 and 3, and it seems like a lot of work go back and forth between text and DOM multiple times:

import lxml.etree
def get_xmlannotation_value(a):
     e = lxml.etree.fromstring(a.value.to_xml())
     return e.text + ''.join(lxml.etree.tostring(c, encoding=str) for c in e)

Tangentially, I learned that passing certain invalid XML breaks our to_xml() without raising an actual exception:

>>> print(XMLAnnotation(value='before</mismatched>after').to_xml())
<XMLAnnotation xmlns="http://www.openmicroscopy.org/Schemas/OME/2016-06" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openmicroscopy.org/Schemas/OME/2016-06 http://www.openmicroscopy.org/Schemas/OME/2016-06/ome.xsd" ID="Annotation:38">
  <Value>before</Value>after</XMLAnnotation>

Note that the text "after" is outside <Value></Value> which violates the ome-xml schema.

Maybe we could handle this like the _quantity fields -- .value would contain the raw text and .value_element (or some other name) would contain a parsed/structured representation. In the constructor or when setting .value, perhaps the caller would need to pass an actual DOM node (e.g. lxml.etree.Element) if they want actual XML content. If they pass a plain string then it would be taken as a literal string and we wouldn't implicitly parse it. My omero-cli-transfer code would then need to attempt to parse the values I get from OMERO to decide which path to take, but I think that's OK.

@jmuhlich
Copy link
Collaborator Author

jmuhlich commented Jan 3, 2025

I guess there is also the small issue that actual XML content won't have whitespace preserved (like in my second example value) but if it's semantically equivalent that should be fine I guess. Here are 2 examples of actual XML content in XMLAnnotations I found on our OMERO server, for reference:

 <SegmentDefinition>
           <Name>CCND1-high</Name>
           <CollectionOrder>1</CollectionOrder>
           <DisplayColor>#83e8f5</DisplayColor>
           <BlueSelection>3</BlueSelection>
           <GreenSelection>3</GreenSelection>
           <YellowSelection>3</YellowSelection>
           <RedSelection>1</RedSelection>
           <Erode>1</Erode>
           <Dilate>2</Dilate>
           <HoleSize>160</HoleSize>
           <ParticleSize>50</ParticleSize>
         </SegmentDefinition>
 <ChannelIntensity>
           <Name>Blue</Name>
           <Minintensity>10</Minintensity>
           <Maxintensity>3594</Maxintensity>
         </ChannelIntensity>

@joshmoore
Copy link
Collaborator

A quick but big 👍🏽 from my side for having strong support of XMLAnnotations in ome-types. This might be something that needs to be bubbled upstream as well so that we can achieve well-defined semantics. That's obviously a bigger project, but since the XMLAnnotation element says:

The contents of this is not processed as OME XML but should still be well-formed XML.

I could see wrapping any text like JSON with a minimal XML block. (We could go so far as to define a well-known element like <EmbeddedJSON/>...) This would then make the decoding of symbols like & which are tricky in XML less guess work.

cc: @chris-allan @sbesson @erickmartins

@jmuhlich
Copy link
Collaborator Author

jmuhlich commented Jan 3, 2025

This is off-topic but since the OME team has been summoned 😁 , why does PathViewer store raw JSON in XMLAnnotations in the first place rather than something like CommentAnnotation? Was it just to hide it from end users (since the interactive clients don't expose XMLAnnotations)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants