-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Decode the HTML before loading static assets #60
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious if you see a big downside to doing a full decode, something like this:
diff --git a/Cargo.toml b/Cargo.toml
index dd779cf..7f371c8 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -13,6 +13,7 @@ license = "Apache-2.0"
[dependencies]
goose = { version = "0.17", default-features = false }
+html-escape = "0.2"
http = "0.2"
log = "0.4"
rand = "0.8"
diff --git a/src/lib.rs b/src/lib.rs
index de87860..5d4c2f6 100644
--- a/src/lib.rs
+++ b/src/lib.rs
@@ -983,11 +983,6 @@ fn valid_local_uri(user: &mut GooseUser, uri: &str) -> bool {
}
}
-/// Decodes the HTML. Currently it just decodes the encoded ampersand character.
-fn decode_html(html: &str) -> String {
- html.replace("&", "&")
-}
-
/// Extract all local static elements defined with a `src=` tag from the the provided html.
///
/// While you can invoke this function directly, it's generally preferred to invoke
@@ -997,10 +992,9 @@ pub async fn get_src_elements(user: &mut GooseUser, html: &str) -> Vec<String> {
// <foo> is the URL to local image and js assets.
// @TODO: parse HTML5 srcset= also
- let html = decode_html(html);
let src_elements = Regex::new(r#"(?i)src="(.*?)""#).unwrap();
let mut elements: Vec<String> = Vec::new();
- for url in src_elements.captures_iter(html.as_str()) {
+ for url in src_elements.captures_iter(html_escape::decode_html_entities(html).as_ref()) {
if valid_local_uri(user, &url[1]) {
elements.push(url[1].to_string());
}
@@ -1015,10 +1009,9 @@ pub async fn get_src_elements(user: &mut GooseUser, html: &str) -> Vec<String> {
pub async fn get_css_elements(user: &mut GooseUser, html: &str) -> Vec<String> {
// Use a case-insensitive regular expression to find all href=<foo> in the html, where
// <foo> is the URL to local css assets.
- let html = decode_html(html);
let css = Regex::new(r#"(?i)href="(.*?\.css.*?)""#).unwrap();
let mut elements: Vec<String> = Vec::new();
- for url in css.captures_iter(html.as_str()) {
+ for url in css.captures_iter(html_escape::decode_html_entities(html).as_ref()) {
if valid_local_uri(user, &url[1]) {
elements.push(url[1].to_string());
}
Either way, the test needs to be moved into its own file.
tests/validate.rs
Outdated
@@ -212,3 +215,48 @@ async fn test_invalid_header_value() { | |||
} | |||
assert!(goose_metrics.errors.len() == 1); | |||
} | |||
|
|||
#[tokio::test] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be a in a new file. tests/validate
is intended to provide test coverage for the Validate
struct. (Likely a comment should be added to the top of tests/validate
to make this more clear, as I see that's non-obvious.)
src/lib.rs
Outdated
@@ -983,6 +983,11 @@ fn valid_local_uri(user: &mut GooseUser, uri: &str) -> bool { | |||
} | |||
} | |||
|
|||
/// Decodes the HTML. Currently it just decodes the encoded ampersand character. | |||
fn decode_html(html: &str) -> String { | |||
html.replace("&", "&") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be tempted to add a dependency here and to offload this logic, and to be sure we properly decode everything. What about instead using this?
https://docs.rs/html-escape
ed8a57f
to
d6be995
Compare
d6be995
to
5026074
Compare
Fantastic, thanks for the improvement, and quickly reworking based on feedback. |
With this PR, goose-eggs will decode the HTML before trying to detect and load static assets from it.
Right now, it just supports the
&
->&
decoding.Without this PR, goose might making requests with encoded ampersand symbols, which could cause servers to return a
4xx
response. At least that's what happened in one of my loadtests; I was testing a Drupal 10.1 site that aggregates assets which then get included into the page with many query parameters.You can apply the following diff if you wish to see the new failing test: