
memory leak: session/observers are not cleared on ungracefully closed websocket #1769

Closed
Tarang opened this issue Jan 19, 2014 · 16 comments

@Tarang
Contributor

Tarang commented Jan 19, 2014

Cc: @tmeasday @n1mmy

After monitoring atmosphere over the past week (sorry about the downtime during this), I think I've found a potential issue when sockets are closed ungracefully: Meteor keeps these sessions in memory and they build up.

It's quite difficult to test this because it's hard to emulate an ungraceful exit. I could see lots of sessions in memory on atmosphere that were over 14 hours old, so this is what I started to watch for.

Additionally, until last week meteorite used process.exit() to exit instead of disconnecting the active socket. I'm not sure whether this could have played a role in all this.

There were about 430 active sessions in memory, roughly 30% of them from meteorite, and most over 14 hours old (310 sessions). I say this because those 30% didn't subscribe to the typical publications one would on atmosphere. The meteorite process.exit() issue was fixed last week: oortcloud/meteorite@6d89351

I made an app with a modified livedata package that keeps track of all the active sockets at any given time. (Sorry about the JS errors; I don't think the linker plays well with custom-modified core packages, so it doesn't work without --debug): https://github.com/Tarangp/Meteor-memory-leak-bug

To give you an idea, this is what the memory on atmosphere looks like at the moment, so this may be what is causing it.

[screenshot: atmosphere memory usage, 2014-01-15 1:48 pm]

How to reproduce the issue

  1. Deploy the app from https://github.com/Tarangp/Meteor-memory-leak-bug (with --debug). It's also hosted at https://testsessionbug.meteor.com/
  2. Visit the app in a web browser to monitor activity
  3. Visit the app on an iPhone (it should work on Android etc., but this is what I tried with). Ensure websockets are used and there isn't a carrier proxy like Bytemobile in the way (or just use https).
  4. When the iPhone's session is visible in the browser, put the iPhone in airplane mode or switch off WiFi (it must have no 3G connectivity).
  5. The iPhone won't be cleared from the list, since the socket's close handler never fires:

     ```js
     socket.on('close', function () {
       // session teardown happens here, but it never runs on an ungraceful disconnect
     });
     ```

This might be a bug in sockjs, or there might be a different way we should be handling ungracefully closed connections?

Amplification

While this bug may seem small, when a session is left in memory this way all of its observers and publish handlers stay active. This is probably why the memory usage builds up so quickly.

I tested this in a separate project, similar to the above, just monitoring whether onStop was called. If the socket is closed ungracefully, onStop isn't called and all the observers remain active.
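To make the amplification concrete, here's a minimal sketch of the pattern (the collection and field names are made up for illustration; this is not atmosphere's actual publish function). The onStop callback is the only place the observer handle is released, so if the session is never torn down, the observer runs forever:

```js
// Hypothetical publication illustrating the leak: the observer handle is
// only released in onStop, which Meteor fires when the subscription stops
// or the session closes. On an ungraceful disconnect neither happens,
// so the observer keeps computing diffs for a client that's gone.
Meteor.publish('packages', function () {
  var self = this;
  var handle = Packages.find().observeChanges({
    added: function (id, fields) { self.added('packages', id, fields); },
    changed: function (id, fields) { self.changed('packages', id, fields); },
    removed: function (id) { self.removed('packages', id); }
  });
  self.ready();
  self.onStop(function () {
    handle.stop(); // never reached if the socket dies ungracefully
  });
});
```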

This is probably why Nick mentioned there might be a CPU leak too, likely from calculating the diffs for the observers set up by the publish functions. Whenever a document changed, that work would have been done wastefully for all of these unclosed sessions.

There are also >800 packages on atmosphere, which are subscribed to in their entirety for each of these sessions. Meteor also keeps track of the documents in each user's subscriptions (to identify what to remove when the subscription is stopped, and what to send as `changed` messages to a particular client), so this adds up for each of these sessions.
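As a rough mental model of that bookkeeping (an illustration only, not Meteor's actual internal data structures), every leaked session drags along a per-subscription copy of every document it was sent:

```js
// Rough mental model (not Meteor's real data structures): the server
// keeps, per session, the documents each subscription has sent, so it
// can compute removals and 'changed' messages. With >800 packages
// subscribed per session, leaked sessions multiply this bookkeeping.
var sessions = {
  'session-abc': {
    subscriptions: {
      'sub-1': {
        packages: { 'pkg-id-1': { name: 'iron-router', downloads: 1200 } }
        // ... one entry per document sent to this client
      }
    }
  }
  // ... one entry per (possibly leaked) session
};
```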

The issues on atmosphere started when the custom download counter was added, and it's likely that it contributed to using more memory and CPU per session.

Alongside higher-than-usual traffic, this is what might have taken atmosphere down, especially as the sessions build up over a week's time. I've been redeploying atmosphere every 2 days for the moment to try and prevent this.


@mikowals

That does sound difficult to reproduce and track down. I looked at your packages and saw that you were not using the new Facts package in your reproduction. It counts observers, subscriptions, and other related things and might be useful to confirm what you are seeing. Just mentioning since the package is relatively new and undocumented.

@tmeasday
Contributor

@Tarangp and anyone interested may find the server info package similarly useful. It gives you an instantaneous snapshot of what subs, LRSes (live result sets), etc. Meteor has open.

@Tarang
Contributor Author

Tarang commented Jan 20, 2014

I started off using livedata because I wanted to know why the sessions stayed in memory. Originally I thought perhaps people weren't closing their tabs, or the browser's thumbnail preview was doing this.

I managed to get the same thing to happen with the Facts package. It can be reproduced locally, which may be better, but another computer/iPhone is still needed.

With a bare bones app with the facts package

  1. Host https://github.com/Tarangp/Test-Observers locally
  2. Visit it from a web browser over WiFi (I used mobile Safari on an iPhone), and from another browser locally on the hosting computer.
  3. Have the WiFi disconnect and reconnect (only for the phone) in a loop every 15 seconds or so. Alternatively, use the drawer to switch WiFi on and off repeatedly. This tries to emulate an ungraceful disconnection. This is what I got after a while:

[screenshot: Facts counts after repeated ungraceful disconnects, 2014-01-20 10:35 am]

With locally hosted Atmosphere and the packages dump:

  1. Repeated the same with atmosphere and a dump of all 881 packages (removed the force-ssl package and added the facts package). The sessions were allowed to build up to 210. The memory usage very nearly matches that on live atmosphere with ~200 sessions.
  2. Memory on locally hosted atmosphere after just over an hour, adding and ungracefully removing ~2.9 sessions per minute on average:

[screenshot: memory usage of locally hosted atmosphere, 2014-01-20 12:42 pm]

So it's quite likely this is the main memory leak on atmosphere, as only one session was actually active at the end of this.

Possible Fixes

  • Maybe we could watch for missing heartbeats in sockjs and consider the socket closed if no heartbeat arrives for 1.5× the heartbeat interval? (A sketch of this idea follows this list.)
  • When there's a reconnection with the same connection id, clear the old session from memory.
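A minimal sketch of the first idea, assuming the socket exposes a 'heartbeat' event and a 45-second interval (both are assumptions; sockjs's real events and timing may differ):

```js
// Hypothetical watchdog: tear down the session if no heartbeat arrives
// within 1.5x the expected interval. Assumes the socket emits a
// 'heartbeat' event, which is not sockjs's actual API.
var HEARTBEAT_INTERVAL_MS = 45 * 1000;

function watchHeartbeats(socket, onDead) {
  var timer = null;

  function resetTimer() {
    if (timer) clearTimeout(timer);
    timer = setTimeout(function () {
      onDead(); // no heartbeat for 1.5x the interval: assume ungraceful close
    }, HEARTBEAT_INTERVAL_MS * 1.5);
  }

  socket.on('heartbeat', resetTimer);
  socket.on('close', function () { clearTimeout(timer); });
  resetTimer();
}
```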

@cscott

cscott commented Jan 20, 2014

We did see a handle leak in our Mystery Hunt meteor app over the course of a weekend as well, FWIW. We went from ~5k open handles to ~7k over the course of 24 hours, then reset to around ~5k after a restart. Of course, our conditions weren't very reproducible, so kudos to @Tarangp for his substantial work above.

@n1mmy
Contributor

n1mmy commented Jan 22, 2014

Thanks for the great writeup and replication, @Tarangp. I've replicated this and we're looking into it.

@n1mmy
Contributor

n1mmy commented Jan 22, 2014

We've narrowed this down to the http proxy. The symptoms go away if you use port 3001 instead of 3000.

The real fix is probably the same as #1543: DDP-level heartbeats. However, we can hopefully fix the proxy to pass errors along better and avoid the leak in this case.

glasser added a commit that referenced this issue Jan 22, 2014
Previously, certain errors on the client/proxy socket would result in
the proxy keeping the websocket to the server open forever rather than
closing it.  For example, disconnecting the client from the internet
without a graceful shutdown. This could easily be reproduced with any
app using `facts`, eg https://github.com/tarangp/test-observers.
Run the app, connect locally with one browser, connect from a second
computer, observe sessions going to 2, disconnect the second computer.
With this commit, sessions will go back to 1 in about a minute. Without
it, it never will.

This particular fix is not very compelling, since it uses undocumented
features of the stream interface. I will file an issue with the
node-http-proxy project tonight asking how we're supposed to do this.

This addresses #1769.  I'll close that once we have a more compelling
fix, and once the similar bug is fixed in the proxies used in the
`meteor deploy` server and in Galaxy.
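Conceptually, the proxy-side fix amounts to something like the following sketch (this is not the actual commit, and the wiring is an assumption): when the client-side socket errors or closes, destroy the upstream socket to the app server so its close handler fires and the session gets torn down:

```js
// Conceptual sketch of the proxy fix (not the actual commit):
// propagate client socket failures to the proxied server connection so
// the app server sees a close and livedata can clean up the session.
function wireProxySockets(clientSocket, serverSocket) {
  function teardown() {
    // Destroying the upstream socket makes the app server's 'close'
    // handler fire, releasing the session and its observers.
    serverSocket.destroy();
  }
  clientSocket.on('error', teardown);
  clientSocket.on('close', teardown);
}
```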
@glasser
Contributor

glasser commented Jan 22, 2014

@n1mmy meant to say "ddp level heartbeats". (Specifically, heartbeats with responses that are actually used by both client and server to terminate the connection when they don't arrive.) DDP level heartbeats will be more consistent and faster than TCP/websocket level detection. But that doesn't excuse not also getting the TCP-level error handling right.

@n1mmy
Contributor

n1mmy commented Jan 22, 2014

Oops, yeah. I meant heartbeats. Brain-o =). Comment updated to avoid confusing future readers.

@Tarang
Contributor Author

Tarang commented Jan 23, 2014

Wouldn't it be better to hook into sockjs' heartbeats?

It makes sense with server to server DDP because there's no sockjs layer underneath.

It would be a bit redundant in the ordinary browser-client to meteor-server scenario, because sockjs already sends an 'h' heartbeat frame every 30s or 60s (not too sure which).

@glasser
Contributor

glasser commented Jan 24, 2014

Well, there are two things that DDP-level heartbeats could provide that would be improvements over the sockjs heartbeats:

(a) They could be bidirectional, so that clients could detect vanishing servers (like in #1543).

(b) SockJS heartbeats don't have a reply. As far as I understand, their main purpose is to ensure that something goes over the wire every so often. This allows polling transports to finish intermediate requests, and lets websocket transports push something over the network so that TCP will eventually notice that transmissions are timing out. But many apps will want something stronger: to be able to clean up resources associated with clients (or servers) that aren't actively responding to pings right now.
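A minimal sketch of what such a request/response heartbeat could look like, assuming hypothetical sendPing/onPong hooks (not the message names DDP eventually adopted):

```js
// Hypothetical DDP-style ping/pong: unlike sockjs's one-way 'h' frames,
// each ping expects a reply, so either side can detect a dead peer and
// release its resources. sendPing/onPong are assumed hooks, not real API.
var PING_INTERVAL_MS = 30 * 1000;
var PONG_TIMEOUT_MS = 15 * 1000;

function startHeartbeat(connection, onTimeout) {
  var pongTimer = null;

  var pingTimer = setInterval(function () {
    connection.sendPing();
    pongTimer = setTimeout(onTimeout, PONG_TIMEOUT_MS); // no reply: peer is dead
  }, PING_INTERVAL_MS);

  connection.onPong(function () {
    clearTimeout(pongTimer); // peer replied, cancel the deadline
  });

  return function stop() {
    clearInterval(pingTimer);
    clearTimeout(pongTimer);
  };
}
```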

@n1mmy
Contributor

n1mmy commented Jan 24, 2014

Also, (c): they could happen on raw websockets when there is no sockjs framing (eg on server-to-server DDP connections). We could in theory use websocket ping frames in this case, but that may not work on future transport layers (eg raw TCP, etc).

@n1mmy
Contributor

n1mmy commented Jan 29, 2014

Status update for clarity: this is fixed in meteor run on devel with @glasser's PR in [email protected], but not yet fixed in meteor deploy.

@Tarang
Contributor Author

Tarang commented Mar 15, 2014

@n1mmy Which version of meteor was this fix applied to?

There might be another way it still occurs, at least on the old atmosphere, which runs 0.7.1.2. It's not as bad as before, so there may be another way a session stays in memory.

@mizzao
Contributor

mizzao commented Mar 16, 2014

I am observing this issue on my test app for https://github.com/mizzao/meteor-user-status as well, except the mystery connections are from a normal Google Chrome desktop client. http://user-status.meteor.com/ makes it really easy to see the ghost sessions (there are some right now, easily spotted as dupes from the same IP, and they will probably stay until the server is restarted).

This issue was hypothesized in tmeasday/meteor-presence#17 and I'm glad it was discovered and replicated.

@n1mmy
Contributor

n1mmy commented Mar 17, 2014

This is fixed in the development-mode proxy in Meteor 0.7.1+. It is not totally fixed on meteor deploy servers: when running deployed to meteor's servers, there may still be some connection leakage. However, the meteor proxy servers restart about once a day, so at least the leaks should be relatively bounded.

The real solution here is DDP heartbeats. See #1865.

@glasser
Contributor

glasser commented Jan 31, 2015

We're using the new http-proxy on meteor deploy these days, and we have DDP heartbeats, so I think this is done.

@glasser glasser closed this as completed Jan 31, 2015