Article: How to consume RSS safely

June 12, 2003
tags: html, rss, security

(Original source: http://diveintomark.org/archives/2003/06/12/how_to_consume_rss_safely)

First of all, I apologize to those of you who subscribe to my RSS feed and use web-based or browser-based news aggregators. If you checked your news page in the last 12 hours, you no doubt saw my little prank: an entire screen full of platypuses. (Please, let’s not turn this into a discussion of proper pluralization. Try to stay with me.) They’re gone from my feed now, although depending on your software you may need to delete the post in question from your local news page as well.

Now that the contrition is out of the way, let’s face facts: if this prank affected you, your software is dangerously broken. It accepts arbitrary HTML from potentially hundreds of sources and blindly republishes it all on a single page on your own web server (or desktop web server). This is fundamentally dangerous.

Now, the current situation is not entirely your software’s fault. RSS, by design, is difficult to consume safely. The RSS specification allows for description elements to contain arbitrary entity-encoded HTML. While this is great for RSS publishers (who can just throw stuff together and make an RSS feed), it makes writing a safe and effective RSS consumer application exceedingly difficult. And now that RSS is moving into the mainstream, the design decisions that got it there are becoming more and more of a problem.
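To make the entity-encoding point concrete, here is a minimal sketch in Python. The feed fragment and the function name `item_descriptions` are made up for illustration. It shows that once an XML parser has decoded a description element, the consumer is holding live HTML markup, not inert text.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed fragment: the description carries entity-encoded HTML.
FEED = """<rss version="2.0"><channel><item>
<title>Example</title>
<description>&lt;p style="position:absolute"&gt;Hello&lt;/p&gt;</description>
</item></channel></rss>"""


def item_descriptions(feed_xml):
    """Return the decoded text of every description element in the feed."""
    root = ET.fromstring(feed_xml)
    return [d.text or "" for d in root.iter("description")]


# The XML parser undoes the entity encoding, so this list now holds a real
# <p style="..."> tag that a browser would render if it were echoed to a page.
descriptions = item_descriptions(FEED)
```

Anything an aggregator does with these strings after this point is an HTML-handling problem, not an XML one.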

HTML is nasty. Arbitrary HTML can carry nasty payloads: scripts, ActiveX objects, remote image web bugs, and arbitrary CSS styles that (as you saw with my platypus prank) can take over the entire screen. Browsers protect against the worst of these payloads by having different rules for different zones. For example, pages in the general Internet are marked untrusted and may not have privileges to run ActiveX objects, but pages on your own machine or within your own intranet can. Unfortunately, the practice of republishing remote HTML locally eliminates even this minimal safeguard.

Still, dealing with arbitrary HTML is not impossible. Web-based mail systems like Hotmail and Yahoo allow users to send and receive HTML mail, and they take great pains to display it safely. It’s a lot of work, and there have been several high-profile failures over the years, but they’re coping.

Let me be clear: by design, RSS forces every single consumer application to cope with this problem.

So, to anyone who wants to write a safe RSS aggregator (or who has already written an unsafe one), I offer this advice:

Strip script tags. This almost goes without saying. Want to see the prank I didn’t pull? More seriously, script tags can be used by unscrupulous publishers to insert pop-up ads onto your news page. Think it won’t happen? Some larger commercial publishers are already inserting text ads and banner ads into their feeds.
Strip embed tags.
Strip object tags.
Strip frameset tags.
Strip frame tags.
Strip iframe tags.
Strip meta tags, which can be used to hijack a page and redirect it to a remote URL.
Strip link tags, which can be used to import additional style definitions.
Strip style tags, for the same reason.
Strip style attributes from every single remaining tag. My platypus prank was based entirely on a single rogue style attribute.
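
The list above could be sketched roughly as follows, using Python's standard `html.parser` module. The names `TagStripper` and `strip_dangerous` are invented for this example, and the split between tags dropped with their content and tags dropped alone is my assumption. As the comments below point out, a parser that does not match the browser's own parser can still miss things, so treat this as an illustration, not a complete defense.

```python
from html import escape
from html.parser import HTMLParser

# Tags from the list above. Those in DROP_WITH_CONTENT lose their content too.
DROP_WITH_CONTENT = {"script", "style", "object", "frameset", "iframe"}
DROP_TAG_ONLY = {"embed", "frame", "meta", "link"}


def render_start(tag, attrs):
    """Re-serialize a start tag from parsed (name, value) pairs."""
    parts = [tag]
    for name, value in attrs:
        if value is None:
            parts.append(name)
        else:
            parts.append(f'{name}="{escape(value, quote=True)}"')
    return "<" + " ".join(parts) + ">"


class TagStripper(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a tag dropped with its content

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag in DROP_TAG_ONLY:
            return
        # Remove the style attribute from every remaining tag.
        kept = [(name, value) for name, value in attrs if name != "style"]
        self.out.append(render_start(tag, kept))

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: emit the start tag only.
        if tag not in DROP_WITH_CONTENT:
            self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag in DROP_TAG_ONLY:
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def strip_dangerous(markup):
    stripper = TagStripper()
    stripper.feed(markup)
    stripper.close()
    return "".join(stripper.out)
```

Comments, doctype declarations and processing instructions are dropped because the parser's default handlers ignore them. An unclosed script tag causes everything after it to be dropped, which errs on the side of losing content.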

Alternatively, you can simply strip all but a known subset of tags. Many comment systems work this way. You’ll still need to strip style attributes though, even from the known good tags.
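
A minimal sketch of the known-subset approach, again with Python's `html.parser`. The `ALLOWED` table is an arbitrary example, not a recommendation, and `allow_known_tags` is a made-up name. Because each tag lists the attributes it may keep, style attributes (and anything else not listed) fall away without special handling.

```python
from html import escape
from html.parser import HTMLParser

# Example allowlist: tag name -> attribute names that tag may keep.
ALLOWED = {
    "a": {"href", "title"},
    "img": {"src", "alt"},
    "p": set(), "br": set(), "b": set(), "i": set(),
    "em": set(), "strong": set(), "blockquote": set(),
    "ul": set(), "ol": set(), "li": set(),
}
VOID_TAGS = {"br", "img"}  # allowed tags that never take a closing tag


class AllowlistFilter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        allowed_attrs = ALLOWED.get(tag)
        if allowed_attrs is None:
            return  # unknown tag: drop the tag itself, keep its text
        parts = [tag]
        for name, value in attrs:
            if name in allowed_attrs and value is not None:
                parts.append(f'{name}="{escape(value, quote=True)}"')
        self.out.append("<" + " ".join(parts) + ">")

    def handle_endtag(self, tag):
        if tag in ALLOWED and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text inside dropped tags survives, but only as escaped, inert text.
        self.out.append(escape(data, quote=False))


def allow_known_tags(markup):
    f = AllowlistFilter()
    f.feed(markup)
    f.close()
    return "".join(f.out)
```

Note that this sketch does not look at the values of href or src at all, so URL checking would still have to be done separately.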

Selected comments from the source page:

You forgot two important ones:

If you strip style attributes, you want to strip event handlers too.

Otherwise:

… onLoad="location.href='http://www.playboy.com'" …

Plus, there are the layout-breaking tags, like a closing DIV or closing TABLE.
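
The event-handler point can be sketched as a small attribute filter in Python. The function name `drop_active_attributes` is made up, and the input is assumed to be a list of (name, value) pairs such as an HTML parser would hand over. Treating every attribute whose name starts with "on" as an event handler is deliberately broad.

```python
def drop_active_attributes(attrs):
    """Remove style and on* event-handler attributes from (name, value) pairs."""
    kept = []
    for name, value in attrs:
        lowered = name.lower()
        # onload, onclick, onmouseover, ... all share the "on" prefix.
        if lowered == "style" or lowered.startswith("on"):
            continue
        kept.append((name, value))
    return kept


attrs = [
    ("href", "http://example.com/"),
    ("onLoad", "location.href='http://example.com/'"),
    ("STYLE", "position:absolute"),
]
cleaned = drop_active_attributes(attrs)
```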

.

Emmanuel: the issue has been raised many times in many forums. See, for instance:

http://www.intertwingly.net/blog/940.html
http://feeds.archive.org/validator/docs/warning/ContainsScript.html
http://webservices.xml.com/pub/a/ws/2002/11/19/rssfeedquality.html?page=2
http://www.securiteam.com/unixfocus/6L00H205PY.html
http://project.antville.org/stories/200348/
http://www.peerfear.org/rss/permalink/1028943207.shtml
http://diveintomark.org/archives/2002/10/10/more_on_evolvable_formats.html
http://philringnalda.com/blog/2002/04/thinking_about_rss.php
http://groups.yahoo.com/group/radio-userland/message/9965
http://radio.weblogs.com/0100887/categories/rss/2002/05/23.html#a265
http://vyom.org/cat_internet/rss_security_vulnerabilities.php

A Google search for “rss strip html tags” will turn up dozens more.

.

And you should use regular expressions to remove them:

/<(script|noscript|object|embed|style|frameset|frame|iframe)[>\s\S]*<\/\1>/i
/<\/?!?(param|link|meta|doctype|div|font)[^>]*>/i
/(class|style|id)="[^"]*"/i

.

One more tag to be wary of: <body>. When IE encounters a <body onload> inside the main <body>-section, it will execute that script as if it was on the outer-<body>.

.

I’d be surprised if this problem can be solved properly using regular expressions - for example, the example regexps pasted above would miss tags that don’t have a closing tag, as well as unquoted attributes. I know from experience (http://simon.incutio.com/archive/2003/02/23/safeHtmlChecker) that there are a huge number of HTML “tricks” for causing problems, especially if your browser is IE (which is renowned for accepting pretty much any garbage markup).

To be truly safe, you need to use a proper HTML parser to pre-process the markup. Even worse, the parser can’t just be a standard HTML parser - it will need to closely match the parser of the eventual consuming browser (generally IE) as otherwise it could miss stuff that IE will still process.

It’s a very nasty problem.
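
To illustrate the unquoted-attribute gap, here is a small Python comparison. The regular expression is the attribute pattern from the earlier comment. The class `AttrCollector` and the helper `has_style` are invented for this sketch. Python's `html.parser` stands in for "a proper HTML parser" here, and it is not the same as any browser's parser, which is exactly the caveat raised above.

```python
import re
from html.parser import HTMLParser

# The attribute pattern from the earlier comment: double-quoted values only.
QUOTED_ATTR = re.compile(r'(class|style|id)="[^"]*"', re.IGNORECASE)

samples = [
    '<p style="position:absolute">',  # double-quoted
    "<p style='position:absolute'>",  # single-quoted
    "<p style=position:absolute>",    # unquoted
]

regex_hits = [bool(QUOTED_ATTR.search(s)) for s in samples]


class AttrCollector(HTMLParser):
    """Record the name of every attribute seen on any start tag."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        self.names.extend(name for name, _ in attrs)


def has_style(markup):
    collector = AttrCollector()
    collector.feed(markup)
    collector.close()
    return "style" in collector.names


parser_hits = [has_style(s) for s in samples]
```

The regular expression only catches the first sample, while the parser reports a style attribute in all three.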

.

There are a lot of feeds out there that blindly copy out the bad html that has been entered by someone else in a comments box. Either that or they include all the formatting used on the blog itself. To avoid the item being too long they then chop after N characters and add “…”. The end result of this is <description> containing not malicious but annoying tags like <font and <table and because this isn’t cleaned up before chopping these are often unmatched. I think all this is much more of a problem than the rare occasions where someone deliberately tries an exploit, dangerous though that might be.

So I can strip_tags() selectively, and use some simple regex to get rid of the worst of the tag attributes. But I’ve still got to build an HTML tidy to catch the unmatched tags.
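
The tidy step could look something like this rough Python sketch, which closes tags left open by truncation and drops stray closing tags. `TagBalancer`, `balance_tags` and the `VOID_TAGS` set are names and choices made up for this example. It only balances tags and does no sanitizing of its own, so it would run alongside the stripping steps, not replace them.

```python
from html import escape
from html.parser import HTMLParser

# Tags that never take a closing tag (a partial list, for illustration).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TagBalancer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_tags = []  # stack of tags still waiting for a closing tag

    def handle_starttag(self, tag, attrs):
        # Pass the start tag through unchanged; this sketch only balances.
        self.out.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag such as a lone </div>: drop it
        # Close anything opened after the matching tag, innermost first.
        while self.open_tags:
            open_tag = self.open_tags.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def balance_tags(fragment):
    balancer = TagBalancer()
    balancer.feed(fragment)
    balancer.close()
    # Whatever is still open was cut off by truncation: close it now.
    while balancer.open_tags:
        balancer.out.append(f"</{balancer.open_tags.pop()}>")
    return "".join(balancer.out)
```

Running the balancer after chopping the text, not before, is what keeps the chopped description from leaking an open font or table tag into the rest of the news page.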

It’s enough to make me want to exclude everything except <a href, <img and I’m not too sure about those either.

.

Also, be sure to restrict the URLs of images, links, etc. For Mozilla, you must disallow links to javascript: and data: URLs. For IE and NS4, I think there are a few synonyms for javascript: you also have to disallow.
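
One hedged way to handle the URL point is to allow a short list of schemes instead of trying to enumerate every dangerous one, so that synonyms nobody thought of are rejected by default. This Python sketch uses `urllib.parse.urlsplit`. The function name `is_safe_url` and the choice of allowed schemes are assumptions for illustration, and relative URLs are rejected here for simplicity.

```python
import re
from urllib.parse import urlsplit

# Assumed allowlist; anything else (javascript:, data:, vbscript:, ...) fails.
ALLOWED_SCHEMES = {"http", "https", "ftp"}

# Whitespace and control characters that can be used to disguise a scheme.
_CONTROL_OR_SPACE = re.compile(r"[\x00-\x20\x7f]")


def is_safe_url(url):
    """Return True only for absolute URLs whose scheme is on the allowlist."""
    cleaned = _CONTROL_OR_SPACE.sub("", url)
    try:
        scheme = urlsplit(cleaned).scheme.lower()
    except ValueError:
        return False  # malformed URL: reject it
    return scheme in ALLOWED_SCHEMES
```

A filter like this would be applied to every href and src value that survives tag and attribute stripping.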
