
memory leak: session/observers are not cleared on ungracefully closed websocket #1769

Closed
Tarang opened this issue Jan 19, 2014 · 16 comments

@Tarang
Contributor

Tarang commented Jan 19, 2014

Cc: @tmeasday @n1mmy

After monitoring atmosphere over the past week (sorry about the downtime during this), I think I've found a potential issue when sockets are closed ungracefully: Meteor keeps these sessions in memory and they build up.

It's quite difficult to test this because it's hard to emulate an ungraceful exit. I could see lots of sessions in memory on atmosphere that were over 14 hours old, so this is what I started to watch for.

Additionally, until last week meteorite used process.exit() to exit instead of disconnecting the active socket. I'm not sure whether this could have played a role in all this.

There were about 430 active sessions in memory, roughly 30% of them from meteorite, and most over 14 hours old (310 sessions). I say this because those 30% didn't subscribe to the typical publications one would on atmosphere. The meteorite process.exit() issue was fixed last week: oortcloud/meteorite@6d89351

I made an app with a modified livedata package that keeps track of all the active sockets at any given time. (Sorry about the JS errors; I don't think the linker plays well with custom-modified core packages, so it doesn't work without --debug): https://github.com/Tarangp/Meteor-memory-leak-bug

To give you an idea, this is what the memory on atmosphere looks like at the moment, so this may be what is causing it.

[screenshot: atmosphere memory usage, 2014-01-15 1:48 pm]

How to reproduce the issue

  1. Deploy the app from https://github.com/Tarangp/Meteor-memory-leak-bug (with --debug). It's also hosted at https://testsessionbug.meteor.com/
  2. Visit the app in a web browser to monitor activity
  3. Visit the app on an iPhone (it should work on Android etc., but this is what I tried with). Ensure websockets are used and there isn't a carrier proxy like Bytemobile in the way (or just use https).
  4. When the iPhone's session is visible in the browser, put the iPhone in airplane mode or switch off WiFi (it must have no 3G connectivity).
  5. The iPhone won't be cleared from the list, since the socket's close handler never fires:

     ```js
     socket.on('close', function () {
       // session teardown happens here, but it never runs on an ungraceful disconnect
     });
     ```

This might be a bug in sockjs, or there might be a different way we should be handling ungracefully closed connections?

Amplification

While this bug may seem small, when a session is left in memory this way all of its observers and publish handlers stay active. This is probably why the memory usage builds up so quickly.

I tested this in a separate project, similar to the above, just monitoring whether onStop was called. If the socket is closed ungracefully, onStop isn't called and all the observers remain active.
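To make the amplification concrete, here's a minimal sketch of the pattern (the collection and field names are made up for illustration; this is not atmosphere's actual publish function). The onStop callback is the only place the observer handle is released, so if the session is never torn down, the observer runs forever:

```js
// Hypothetical publication illustrating the leak: the observer handle is
// only released in onStop, which Meteor fires when the subscription stops
// or the session closes. On an ungraceful disconnect neither happens,
// so the observer keeps computing diffs for a client that's gone.
Meteor.publish('packages', function () {
  var self = this;
  var handle = Packages.find().observeChanges({
    added: function (id, fields) { self.added('packages', id, fields); },
    changed: function (id, fields) { self.changed('packages', id, fields); },
    removed: function (id) { self.removed('packages', id); }
  });
  self.ready();
  self.onStop(function () {
    handle.stop(); // never reached if the socket dies ungracefully
  });
});
```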

This is probably why Nick mentioned there might be a CPU leak too, likely from calculating the diffs for the observers set up by the publish functions. Whenever a document changed, that work would have been done wastefully for all of these unclosed sessions.

There are also >800 packages on atmosphere, which are subscribed to in their entirety for each of these sessions. Meteor also keeps track of the documents in each user's subscriptions (to identify what to remove when the subscription is stopped, and what to send as `changed` messages to a particular client), so this adds up for each of these sessions.
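As a rough mental model of that bookkeeping (an illustration only, not Meteor's actual internal data structures), every leaked session drags along a per-subscription copy of every document it was sent:

```js
// Rough mental model (not Meteor's real data structures): the server
// keeps, per session, the documents each subscription has sent, so it
// can compute removals and 'changed' messages. With >800 packages
// subscribed per session, leaked sessions multiply this bookkeeping.
var sessions = {
  'session-abc': {
    subscriptions: {
      'sub-1': {
        packages: { 'pkg-id-1': { name: 'iron-router', downloads: 1200 } }
        // ... one entry per document sent to this client
      }
    }
  }
  // ... one entry per (possibly leaked) session
};
```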

The issues on atmosphere started when the custom download counter was added, and it's likely that it contributed to using more memory and CPU per session.

Alongside higher-than-usual traffic, this is what might have taken atmosphere down, especially as the sessions build up over a week's time. I've been redeploying atmosphere every 2 days for the moment to try and prevent this.


@mikowals

That does sound difficult to reproduce and track down. I looked at your packages and saw that you were not using the new Facts package in your reproduction. It counts observers, subscriptions, and other related things and might be useful to confirm what you are seeing. Just mentioning since the package is relatively new and undocumented.

@tmeasday
Contributor

@Tarangp and anyone interested may find the server info package similarly useful. It gives you an instantaneous snapshot of what subs, LRSes (live result sets), etc. Meteor has open.

@Tarang
Contributor Author

Tarang commented Jan 20, 2014

I started off using livedata because I wanted to know why the sessions stayed in memory. Originally I thought perhaps people weren't closing their tabs, or the browser's thumbnail preview was doing this.

I managed to get the same thing to happen with the Facts package. It can be reproduced locally, which may be better, but another computer/iPhone is still needed.

With a bare bones app with the facts package

  1. Host https://github.com/Tarangp/Test-Observers locally
  2. Visit it from a web browser over WiFi (I used mobile Safari on an iPhone), and from another browser locally on the hosting computer.
  3. Have the WiFi disconnect and reconnect (only for the phone) in a loop every 15 seconds or so. Alternatively, use the drawer to switch WiFi on and off repeatedly. This tries to emulate an ungraceful disconnection. This is what I got after a while:

[screenshot: Facts counts after repeated ungraceful disconnects, 2014-01-20 10:35 am]

With locally hosted Atmosphere and the packages dump:

  1. Repeated the same with atmosphere and a dump of all 881 packages (removed the force-ssl package and added the facts package). The sessions were allowed to build up to 210. The memory usage very nearly matches that on live atmosphere with ~200 sessions.
  2. Memory on locally hosted atmosphere after just over an hour, adding and ungracefully removing ~2.9 sessions per minute on average:

[screenshot: memory usage of locally hosted atmosphere, 2014-01-20 12:42 pm]

So it's quite likely this is the main memory leak on atmosphere, as only one session was actually active at the end of this.

Possible Fixes

  • Maybe we could watch for missing heartbeats in sockjs and consider the socket closed if no heartbeat arrives for 1.5× the heartbeat interval? (A sketch of this idea follows this list.)
  • When there's a reconnection with the same connection id, clear the old session from memory.
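A minimal sketch of the first idea, assuming the socket exposes a 'heartbeat' event and a 45-second interval (both are assumptions; sockjs's real events and timing may differ):

```js
// Hypothetical watchdog: tear down the session if no heartbeat arrives
// within 1.5x the expected interval. Assumes the socket emits a
// 'heartbeat' event, which is not sockjs's actual API.
var HEARTBEAT_INTERVAL_MS = 45 * 1000;

function watchHeartbeats(socket, onDead) {
  var timer = null;

  function resetTimer() {
    if (timer) clearTimeout(timer);
    timer = setTimeout(function () {
      onDead(); // no heartbeat for 1.5x the interval: assume ungraceful close
    }, HEARTBEAT_INTERVAL_MS * 1.5);
  }

  socket.on('heartbeat', resetTimer);
  socket.on('close', function () { clearTimeout(timer); });
  resetTimer();
}
```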

@cscott

cscott commented Jan 20, 2014

We did see a handle leak in our Mystery Hunt meteor app over the course of a weekend as well, FWIW. We went from ~5k open handles to ~7k over the course of 24 hours, then reset to around ~5k after a restart. Of course, our conditions weren't very reproducible, so kudos to @Tarangp for his substantial work above.

@n1mmy
Contributor

n1mmy commented Jan 22, 2014

Thanks for the great writeup and replication, @Tarangp. I've replicated this and we're looking into it.

@n1mmy
Contributor

n1mmy commented Jan 22, 2014

We've narrowed this down to the http proxy. The symptoms go away if you use port 3001 instead of 3000.

The real fix is probably the same as #1543: DDP-level heartbeats. However, we can hopefully fix the proxy to pass errors along better and avoid the leak in this case.

glasser added a commit that referenced this issue Jan 22, 2014
Previously, certain errors on the client/proxy socket would result in
the proxy keeping the websocket to the server open forever rather than
closing it.  For example, disconnecting the client from the internet
without a graceful shutdown. This could easily be reproduced with any
app using `facts`, eg https://github.com/tarangp/test-observers.
Run the app, connect locally with one browser, connect from a second
computer, observe sessions going to 2, disconnect the second computer.
With this commit, sessions will go back to 1 in about a minute. Without
it, it never will.

This particular fix is not very compelling, since it uses undocumented
features of the stream interface. I will file an issue with the
node-http-proxy project tonight asking how we're supposed to do this.

This addresses #1769.  I'll close that once we have a more compelling
fix, and once the similar bug is fixed in the proxies used in the
`meteor deploy` server and in Galaxy.
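Conceptually, the proxy-side fix amounts to something like the following sketch (this is not the actual commit, and the wiring is an assumption): when the client-side socket errors or closes, destroy the upstream socket to the app server so its close handler fires and the session gets torn down:

```js
// Conceptual sketch of the proxy fix (not the actual commit):
// propagate client socket failures to the proxied server connection so
// the app server sees a close and livedata can clean up the session.
function wireProxySockets(clientSocket, serverSocket) {
  function teardown() {
    // Destroying the upstream socket makes the app server's 'close'
    // handler fire, releasing the session and its observers.
    serverSocket.destroy();
  }
  clientSocket.on('error', teardown);
  clientSocket.on('close', teardown);
}
```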
@glasser
Contributor

glasser commented Jan 22, 2014

@n1mmy meant to say "ddp level heartbeats". (Specifically, heartbeats with responses that are actually used by both client and server to terminate the connection when they don't arrive.) DDP level heartbeats will be more consistent and faster than TCP/websocket level detection. But that doesn't excuse not also getting the TCP-level error handling right.

@n1mmy
Contributor

n1mmy commented Jan 22, 2014

Oops, yeah. I meant heartbeats. Brain-o =). Comment updated to avoid confusing future readers.

@Tarang
Contributor Author

Tarang commented Jan 23, 2014

Wouldn't it be better to hook into sockjs' heartbeats?

It makes sense with server to server DDP because there's no sockjs layer underneath.

It would be a bit redundant in the ordinary browser-client to meteor-server scenario, because sockjs already sends an 'h' heartbeat frame every 30s or 60s (not too sure which).

@glasser
Contributor

glasser commented Jan 24, 2014

Well, there are two things that DDP-level heartbeats could provide that would be improvements over the sockjs heartbeats:

(a) They could be bidirectional, so that clients could detect vanishing servers (like in #1543).

(b) SockJS heartbeats don't have a reply. As far as I understand, their main purpose is to ensure that something goes over the wire every so often. This allows polling transports to finish intermediate requests, and lets websocket transports push something over the network so that TCP will eventually notice that transmissions are timing out. But many apps will want something stronger: to be able to clean up resources associated with clients (or servers) that aren't actively responding to pings right now.
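A minimal sketch of what such a request/response heartbeat could look like, assuming hypothetical sendPing/onPong hooks (not the message names DDP eventually adopted):

```js
// Hypothetical DDP-style ping/pong: unlike sockjs's one-way 'h' frames,
// each ping expects a reply, so either side can detect a dead peer and
// release its resources. sendPing/onPong are assumed hooks, not real API.
var PING_INTERVAL_MS = 30 * 1000;
var PONG_TIMEOUT_MS = 15 * 1000;

function startHeartbeat(connection, onTimeout) {
  var pongTimer = null;

  var pingTimer = setInterval(function () {
    connection.sendPing();
    pongTimer = setTimeout(onTimeout, PONG_TIMEOUT_MS); // no reply: peer is dead
  }, PING_INTERVAL_MS);

  connection.onPong(function () {
    clearTimeout(pongTimer); // peer replied, cancel the deadline
  });

  return function stop() {
    clearInterval(pingTimer);
    clearTimeout(pongTimer);
  };
}
```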

@n1mmy
Contributor

n1mmy commented Jan 24, 2014

Also, (c): they could happen on raw websockets when there is no sockjs framing (eg on server-to-server DDP connections). We could in theory use websocket ping frames in this case, but that may not work on future transport layers (eg raw TCP, etc).

@n1mmy
Contributor

n1mmy commented Jan 29, 2014

Status update for clarity: this is fixed in meteor run on devel with @glasser's PR in [email protected], but not yet fixed in meteor deploy.

@Tarang
Contributor Author

Tarang commented Mar 15, 2014

@n1mmy Which version of meteor was this fix applied to?

There might be another way it still occurs, at least on the old atmosphere, which runs 0.7.1.2. It's not as bad as before, so there may be another way a session stays in memory.

@mizzao
Contributor

mizzao commented Mar 16, 2014

I am observing this issue on my test app for https://github.com/mizzao/meteor-user-status as well, except the mystery connections are from a normal Google Chrome desktop client. http://user-status.meteor.com/ makes it really easy to see the ghost sessions (there are some right now, easily spotted as dupes from the same IP, and they will probably stay until the server is restarted).

This issue was hypothesized in tmeasday/meteor-presence#17 and I'm glad it was discovered and replicated.

@n1mmy
Contributor

n1mmy commented Mar 17, 2014

This is fixed in the development-mode proxy in Meteor 0.7.1+. It is not totally fixed on meteor deploy servers: when running deployed to meteor's servers, there may still be some connection leakage. However, the meteor proxy servers restart about once a day, so at least the leaks should be relatively bounded.

The real solution here is DDP heartbeats. See #1865.

@glasser
Contributor

glasser commented Jan 31, 2015

We're using the new http-proxy on meteor deploy these days, and we have DDP heartbeats, so I think this is done.

@glasser glasser closed this as completed Jan 31, 2015