Memory leak in parse_string? (or misleading documentation?) #45

samoconnor · 2016-03-13T22:20:37Z

I received a report of memory leaks from an XMLDict user JuliaCloud/XMLDict.jl#1.

I'm somewhat surprised. The LightXML.jl doc says, under the heading "Create an XML Document", that "When you create XML documents and elements directly you need to take care not to leak memory". However, under the heading "Read an XML file" there is no mention of memory management. XMLDict.jl only calls LightXML.parse_string, so it does not "create XML documents and elements directly".

The documentation seems to say that if you are just "read[ing] an xml file", as opposed to "create[ing] XML documents and elements directly", then you do not need to worry about memory management.

Is parse_string broken? or is the documentation broken?

This seems related: #19

The text was updated successfully, but these errors were encountered:

tkelman · 2016-03-13T22:27:39Z

https://github.com/JuliaLang/LightXML.jl/blob/d31f1f43dc04fd1546ec5e4ad039c6f1981892fd/src/document.jl#L103
XMLDocument(p) is creating an XML document. So the word "directly" should probably be adjusted. If you know you're never adding children or other references to the same underlying pointer you can manually add a finalizer to any XMLDocuments you create. I don't think it's safe to do that in general without bookkeeping reference counts in all of the Julia types.

JuliaIO/LightXML.jl#45 (comment) #1

samoconnor · 2016-03-13T22:50:33Z

I suggest at a minimum:

Add a finalizer in the parse_* methods
Describe in the documentation: under what circumstances users of parse_* could still get into trouble with memory management.
Provide an example of "this is ok", and an example of "this is not ok", and "here is how to handle it".

tkelman · 2016-03-13T22:56:53Z

My concern is if you do parse_string to create an initial XMLDocument, then add some children to it, if it has a finalizer and the Julia XMLDocument object goes out of scope and gets garbage collected but the children are still alive, you could cause a segfault since the Julia GC doesn't know about the C-level reference between the two objects.

samoconnor · 2016-03-13T23:04:55Z

So, per my suggestion above, please describe that in the documentation.
i.e. say that everything is fine if...

you don't manually add children.
when you do add children, you make sure to hold a reference to the parent doc until after you're done with the children.

And, please provide examples of correct usage in the documentation.

(FWIW, a segfault is far preferable to a memory leak. A segfault gets found in testing. A memory leak causes stuff to break months later in production and costs way more to fix.)

yuyichao · 2016-03-13T23:07:32Z

A segfault caused by a GC will not necessarily be found in tests.

yuyichao · 2016-03-13T23:10:56Z

I'm not familiar with libxml2 API but is the memory own'ed by a root object and is it always possible to find know what the owner object is when the library returns a pointer (either from the argument or from some libxml2 api)?

tkelman · 2016-03-13T23:24:25Z

I don't know exactly how to answer your question @yuyichao. libxml2 is a C library and until someone implements a full reference count system here (which is what I believe the R wrapper and maybe also the Python wrapper do), you're responsible for C-like manual memory management when using it through this wrapper.

yuyichao · 2016-03-14T00:13:46Z

What I mean is that we don't necessarily need to keep track of reference counting and we can instead track the reference directly. This may not be simpler in general (although also won't be much more complicated since for RC you also need to lookup the julia object corresponding to a libxml2 pointer) but in the case where each object has a clear owner this would be much simpler.

My question was basically if this is true for libxml2. For example, can a xml node belong to multiple xml document simultaneously? And if this is reasonable to keep track of given the API. (i.e. if it is clear from the API when and how the ownership is transfered.)

tkelman · 2016-03-14T00:20:15Z

can a xml node belong to multiple xml document simultaneously

Looks like.

julia> xdoc = XMLDocument()
<?xml version="1.0" encoding="utf-8"?>
julia> xroot = create_root(xdoc, "States")
<States/>
julia> xs1 = new_child(xroot, "State")
<State/>
julia> xdoc2 = XMLDocument()
<?xml version="1.0" encoding="utf-8"?>
julia> xdoc
<?xml version="1.0" encoding="utf-8"?>
<States>
  <State/>
</States>
julia> xroot2 = create_root(xdoc2, "States2")
<States2/>
julia> add_child(xroot2, xs1)
true
julia> xroot2
<States2>
  <State/>
</States2>
julia> xs2 = new_child(xs1, "StateChild2")
<StateChild2/>
julia> xdoc
<?xml version="1.0" encoding="utf-8"?>
<States>
  <State>
    <StateChild2/>
  </State>
</States>
julia> xdoc2
<?xml version="1.0" encoding="utf-8"?>
<States2>
  <State>
    <StateChild2/>
  </State>
</States2>

yuyichao · 2016-03-14T00:40:23Z

Actually it looks like this is not allowed since freeing both xdoc and xdoc2 here causes a segfault.

julia> using LightXML

julia> xdoc = XMLDocument()
<?xml version="1.0" encoding="utf-8"?>


julia> xroot = create_root(xdoc, "States")
<States/>

julia> xs1 = new_child(xroot, "State")
<State/>

julia> xdoc2 = XMLDocument()
<?xml version="1.0" encoding="utf-8"?>


julia> xroot2 = create_root(xdoc2, "States2")
<States2/>

julia> add_child(xroot2, xs1)
true

julia> xroot2
<States2>
  <State/>
</States2>

julia> xroot
<States>
  <State/>
</States>

julia> fre
free  frexp
julia> free
free (generic function with 4 methods)

julia> free(xdoc)
Ptr{Void} @0x0000000000000000

julia> free(xdoc2)

signal (11): 段错误
while loading no file, in expression starting on line 0
xmlFreeNodeList at /usr/bin/../lib/libxml2.so (unknown line)
xmlFreeNodeList at /usr/bin/../lib/libxml2.so (unknown line)
xmlFreeDoc at /usr/bin/../lib/libxml2.so (unknown line)
free at /home/yuyichao/projects/contrib/LightXML.jl/src/document.jl:71
unknown function (ip: 0x7fa6af99c045)
[inline] at /home/yuyichao/projects/arch-pkg/pkg/all/julia-git/src/julia/src/julia_internal.h:69
jl_call_method_internal at /home/yuyichao/projects/arch-pkg/pkg/all/julia-git/src/julia/src/gf.c:1879
do_call at /home/yuyichao/projects/arch-pkg/pkg/all/julia-git/src/julia/src/interpreter.c:66
eval at /home/yuyichao/projects/arch-pkg/pkg/all/julia-git/src/julia/src/interpreter.c:185
jl_toplevel_eval_flex at /home/yuyichao/projects/arch-pkg/pkg/all/julia-git/src/julia/src/toplevel.c:561
[inline] at /home/yuyichao/projects/arch-pkg/pkg/all/julia-git/src/julia/src/julia.h:1379
jl_eh_restore_state at /home/yuyichao/projects/arch-pkg/pkg/all/julia-git/src/julia/src/builtins.c:548
eval at ./boot.jl:267
[inline] at /home/yuyichao/projects/arch-pkg/pkg/all/julia-git/src/julia/src/julia_internal.h:69
jl_call_method_internal at /home/yuyichao/projects/arch-pkg/pkg/all/julia-git/src/julia/src/gf.c:1879
[inline] at ./REPL.jl:3
eval_user_input at ./REPL.jl:62
unknown function (ip: 0x7fa6afbca5b7)
[inline] at /home/yuyichao/projects/arch-pkg/pkg/all/julia-git/src/julia/src/julia_internal.h:69
jl_call_method_internal at /home/yuyichao/projects/arch-pkg/pkg/all/julia-git/src/julia/src/gf.c:1879
[inline] at ./REPL.jl:92
#1 at ./task.jl:59
unknown function (ip: 0x7fa8b80c1560)
[inline] at /home/yuyichao/projects/arch-pkg/pkg/all/julia-git/src/julia/src/julia_internal.h:69
jl_call_method_internal at /home/yuyichao/projects/arch-pkg/pkg/all/julia-git/src/julia/src/gf.c:1903
start_task at /home/yuyichao/projects/arch-pkg/pkg/all/julia-git/src/julia/src/task.c:245
unknown function (ip: (nil))
Allocations: 4859550 (Pool: 4857857; Big: 1693); GC: 10

tkelman · 2016-03-14T00:42:32Z

That's expected, xmlFreeDoc frees all children http://xmlsoft.org/html/libxml-tree.html#xmlFreeDoc

yuyichao · 2016-03-14T00:45:14Z

Then is there a way to free all the memory allocated without segfault after creating this?

yuyichao · 2016-03-14T00:48:07Z

xmlNode for example also have a parent field so I think it is not supposed to be added to two trees simultaneously.

tkelman · 2016-03-14T00:52:20Z

It might be one of those things that the library lets you do but you're not supposed to do it? If we want to do something clever and automatic here, I think we'd really have to carefully study how some other proven, well-tested wrapper of ~~libgit2~~ libxml2 for some garbage collected language handles these things.

yuyichao · 2016-03-14T05:21:19Z

IIUC lxml.etree uses libxml2 and when you try to add a node to another tree, it is deleted from the original one.

In [1]: from lxml import etree

In [2]: root = etree.Element("root")

In [3]: root2 = etree.Element("root2")

In [4]: c1 = etree.Element("child1")

In [5]: root.append(c1)

In [6]: print(etree.tostring(root).decode())
<root><child1/></root>

In [7]: print(etree.tostring(root2).decode())
<root2/>

In [8]: root2.append(c1)

In [9]: print(etree.tostring(root).decode())
<root/>

In [10]: print(etree.tostring(root2).decode())
<root2><child1/></root2>

nalimilan · 2016-03-14T10:25:56Z

FWIW, here's a detailed description of how this works in the R wrapper for libxml2:
https://github.com/omegahat/XML/blob/master/Docs/MemoryManagement.pdf

samoconnor · 2016-03-15T10:10:06Z

In XMLDict, I've added finalizer(doc, LightXML.free) after parse_string, and finalize(doc) after doc has been converted to a Dict. JuliaCloud/XMLDict.jl@146af78

The user who reported JuliaCloud/XMLDict.jl#1 says that they can now process large (6 gb) files with constant memory usage.

I'd like to reiterate my request to the maintainers of LightXML.jl: Please add, ASAP, a description and and example to the documentation for the simple read-only case that says "if you do x, y and z you'll be fine". I understand that it might take time to figure out a robust way to handle all the corner cases, but in the meantime, the documentation should at least tell the user how to use the library safely in its current form.

tkelman · 2016-03-15T10:20:57Z

Pull requests encouraged.

samoconnor · 2016-04-18T10:36:01Z

FYI I made a further tweak to the XMLDict.jl work-around: JuliaCloud/XMLDict.jl@d7b63d0
-- "keep back refernce to top level XMLDocument object to avoid premature finalisation".

samoconnor added a commit to JuliaCloud/XMLDict.jl that referenced this issue Mar 13, 2016

Set up finalizer for the result of LightXML.parse_string()

146af78

JuliaIO/LightXML.jl#45 (comment) #1

samoconnor mentioned this issue Mar 13, 2016

Free xml doc JuliaCloud/XMLDict.jl#1

Closed

samoconnor mentioned this issue Mar 14, 2016

Tag XMLDict v0.0.7 JuliaLang/METADATA.jl#4773

Closed

samoconnor mentioned this issue Feb 15, 2018

free usage should be in the first example #75

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Memory leak in parse_string? (or misleading documentation?) #45

Memory leak in parse_string? (or misleading documentation?) #45

samoconnor commented Mar 13, 2016

tkelman commented Mar 13, 2016

samoconnor commented Mar 13, 2016

tkelman commented Mar 13, 2016

samoconnor commented Mar 13, 2016

yuyichao commented Mar 13, 2016

yuyichao commented Mar 13, 2016

tkelman commented Mar 13, 2016

yuyichao commented Mar 14, 2016

tkelman commented Mar 14, 2016

yuyichao commented Mar 14, 2016

tkelman commented Mar 14, 2016

yuyichao commented Mar 14, 2016

yuyichao commented Mar 14, 2016

tkelman commented Mar 14, 2016

yuyichao commented Mar 14, 2016

nalimilan commented Mar 14, 2016

samoconnor commented Mar 15, 2016

tkelman commented Mar 15, 2016

samoconnor commented Apr 18, 2016

Memory leak in parse_string? (or misleading documentation?) #45

Memory leak in parse_string? (or misleading documentation?) #45

Comments

samoconnor commented Mar 13, 2016

tkelman commented Mar 13, 2016

samoconnor commented Mar 13, 2016

tkelman commented Mar 13, 2016

samoconnor commented Mar 13, 2016

yuyichao commented Mar 13, 2016

yuyichao commented Mar 13, 2016

tkelman commented Mar 13, 2016

yuyichao commented Mar 14, 2016

tkelman commented Mar 14, 2016

yuyichao commented Mar 14, 2016

tkelman commented Mar 14, 2016

yuyichao commented Mar 14, 2016

yuyichao commented Mar 14, 2016

tkelman commented Mar 14, 2016

yuyichao commented Mar 14, 2016

nalimilan commented Mar 14, 2016

samoconnor commented Mar 15, 2016

tkelman commented Mar 15, 2016

samoconnor commented Apr 18, 2016