Third Party 2021 Queries #2355

Merged 27 commits on Oct 21, 2021
Commits
9263fb3
Conversion of 2021 queries
tunetheweb Sep 30, 2021
b3e3556
Merge branch 'main' into third-parties-2021-queries
tunetheweb Oct 3, 2021
300703a
Add Markdown file back
tunetheweb Oct 3, 2021
ce31c24
Add client
tunetheweb Oct 11, 2021
512f9c9
Merge branch 'main' into third-parties-2021-queries
tunetheweb Oct 12, 2021
db1fd1a
Fix queries
tunetheweb Oct 12, 2021
148b26d
Linting fixes
tunetheweb Oct 12, 2021
8b64d95
More query fixes
tunetheweb Oct 13, 2021
41bc062
Add ranking query
tunetheweb Oct 13, 2021
832a4a7
Merge branch 'main' into third-parties-2021-queries
tunetheweb Oct 17, 2021
e074b8d
Add final queries
tunetheweb Oct 17, 2021
f82c8e7
Add documentation
tunetheweb Oct 17, 2021
3bd6c5e
Linting
tunetheweb Oct 17, 2021
5f1c49f
Merge branch 'main' into third-parties-2021-queries
tunetheweb Oct 20, 2021
09dfff4
Update sql/2021/third-parties/percent_of_third_party_cache.sql
tunetheweb Oct 20, 2021
bccdb8d
Merge branch 'third-parties-2021-queries' of github.com:HTTPArchive/a…
tunetheweb Oct 20, 2021
acb9b3d
Update sql/2021/third-parties/percent_of_third_party_loaded_before_DO…
tunetheweb Oct 20, 2021
d22750c
Merge branch 'third-parties-2021-queries' of github.com:HTTPArchive/a…
tunetheweb Oct 20, 2021
7d6e299
Add client
tunetheweb Oct 20, 2021
bc1e7d2
Update sql/2021/third-parties/percent_of_websites_with_third_party.sql
tunetheweb Oct 20, 2021
c34685a
Update sql/2021/third-parties/percent_of_websites_with_third_party_by…
tunetheweb Oct 20, 2021
3849c51
Update sql/2021/third-parties/tao_by_third_party.sql
tunetheweb Oct 20, 2021
f5dfb63
Update sql/2021/third-parties/tao_by_third_party.sql
tunetheweb Oct 20, 2021
dc93211
Linting fixes
tunetheweb Oct 20, 2021
fc1f9bf
Change NET.HOST and removing hosting category
tunetheweb Oct 21, 2021
1eac213
Linting fixes
tunetheweb Oct 21, 2021
8d58136
Forgot to hit save
tunetheweb Oct 21, 2021
18 changes: 18 additions & 0 deletions sql/2021/third-parties/README.md
@@ -0,0 +1,18 @@
# 2021 Third-Party queries

<!--
This directory contains all of the 2021 Third-Party chapter queries.

Each query should have a corresponding `metric_name.sql` file.
Note that readers are linked to this directory, so try to make the SQL file names descriptive for easy browsing.

Analysts: if helpful, you can use this README to give additional info about the queries.
-->

Resources:

- [Chapter issue](https://github.com/HTTPArchive/almanac.httparchive.org/issues/2145)
- [Planning doc](https://docs.google.com/document/d/164HhV76iVT2qVfFY2kzyr44eIcGAw3fWqNTbuoRfRvE/edit?usp=sharing)
- [Results sheet](https://docs.google.com/spreadsheets/d/1tf4RMF8SYr6he9tbqt61yuFJ_QK-F-i7XPxaPkpKSDI/edit?usp=sharing)
- [2019 chapter](https://almanac.httparchive.org/en/2019/third-parties)
- [2020 chapter](https://almanac.httparchive.org/en/2020/third-parties)
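
Several of the distribution queries in this directory appear to share one shape: a `requests` CTE, a `third_party` CTE, a join on `NET.HOST`, and percentiles read from `APPROX_QUANTILES`. The following is a minimal sketch of that shape, not one of the chapter queries. The inline rows, the `example.com` and `example.org` hosts, and the body sizes are hypothetical stand-ins for the real tables, chosen so the sketch does not need to scan HTTP Archive data.

```sql
#standardSQL
# Sketch only: the inline rows below are hypothetical stand-ins for
# `httparchive.summary_requests.2021_07_01_*` and `httparchive.almanac.third_parties`.

WITH requests AS (
  # 1000 hypothetical third-party requests with increasing body sizes
  SELECT
    'mobile' AS client,
    CONCAT('https://cdn.example.com/asset-', CAST(n AS STRING), '.js') AS url,
    n * 100 AS body_size
  FROM
    UNNEST(GENERATE_ARRAY(1, 1000)) AS n
  UNION ALL
  # One hypothetical first-party request, which the join below drops
  SELECT
    'mobile' AS client,
    'https://www.example.org/' AS url,
    5000 AS body_size
),

third_party AS (
  SELECT
    'cdn.example.com' AS domain
),

base AS (
  SELECT
    client,
    body_size
  FROM
    requests
  INNER JOIN
    third_party
  ON
    NET.HOST(requests.url) = NET.HOST(third_party.domain)
)

SELECT
  client,
  percentile,
  APPROX_QUANTILES(body_size, 1000)[OFFSET(percentile * 10)] AS approx_body_size
FROM
  base,
  UNNEST([10, 25, 50, 75, 90]) AS percentile
GROUP BY
  client,
  percentile
ORDER BY
  client,
  percentile
```

Asking `APPROX_QUANTILES` for 1000 quantiles returns 1001 boundaries, so `OFFSET(percentile * 10)` selects the boundary for a whole-number percentile. Because the join is an `INNER JOIN`, only requests whose host matches a listed third-party domain contribute to the result.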
53 changes: 53 additions & 0 deletions sql/2021/third-parties/distribution_of_3XX_response_body_size.sql
@@ -0,0 +1,53 @@
#standardSQL
# Distribution of response body size by redirected third parties
# HTTP status codes documentation: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status

WITH requests AS (
SELECT
_TABLE_SUFFIX AS client,
url,
status,
respBodySize AS body_size
FROM
`httparchive.summary_requests.2021_07_01_*`
),

third_party AS (
SELECT
domain
FROM
`httparchive.almanac.third_parties`
WHERE
date = '2021-07-01' AND
category != 'hosting'
),

base AS (
SELECT
client,
domain,
IF(status BETWEEN 300 AND 399, 1, 0) AS redirected,
body_size
FROM
requests
INNER JOIN
third_party
ON
NET.HOST(requests.url) = NET.HOST(third_party.domain)
)

SELECT
client,
percentile,
APPROX_QUANTILES(body_size, 1000)[OFFSET(percentile * 10)] AS approx_redirect_body_size
FROM
base,
UNNEST(GENERATE_ARRAY(1, 100)) AS percentile
WHERE
redirected = 1
GROUP BY
client,
percentile
ORDER BY
client,
percentile
@@ -0,0 +1,55 @@
#standardSQL
# Distribution of third-party request size and time by category

WITH requests AS (
SELECT
_TABLE_SUFFIX AS client,
url,
respBodySize AS body_size,
time
FROM
`httparchive.summary_requests.2021_07_01_*`
),

third_party AS (
SELECT
category,
domain
FROM
`httparchive.almanac.third_parties`
WHERE
date = '2021-07-01' AND
category != 'hosting'
),

base AS (
SELECT
client,
category,
body_size,
time
FROM
requests
INNER JOIN
third_party
ON
NET.HOST(requests.url) = NET.HOST(third_party.domain)
)

SELECT
client,
category,
percentile,
APPROX_QUANTILES(body_size, 1000)[OFFSET(percentile * 10)] AS body_size,
APPROX_QUANTILES(time, 1000)[OFFSET(percentile * 10)] AS time
FROM
base,
UNNEST(GENERATE_ARRAY(1, 100)) AS percentile
GROUP BY
client,
category,
percentile
ORDER BY
client,
category,
percentile
@@ -0,0 +1,54 @@
#standardSQL
# Distribution of third parties by number of websites

WITH requests AS (
SELECT
_TABLE_SUFFIX AS client,
pageid AS page,
url
FROM
`httparchive.summary_requests.2021_07_01_*`
),

third_party AS (
SELECT
domain,
canonicalDomain
FROM
`httparchive.almanac.third_parties`
WHERE
date = '2021-07-01' AND
category != 'hosting'
),

base AS (
SELECT
client,
canonicalDomain,
COUNT(DISTINCT page) AS pages_per_third_party
FROM
requests
LEFT JOIN
third_party
ON
NET.HOST(requests.url) = NET.HOST(third_party.domain)
WHERE
canonicalDomain IS NOT NULL
GROUP BY
client,
canonicalDomain
)

SELECT
client,
percentile,
APPROX_QUANTILES(pages_per_third_party, 1000)[OFFSET(percentile * 10)] AS approx_pages_per_third_party
FROM
base,
UNNEST([10, 25, 50, 75, 90]) AS percentile
GROUP BY
client,
percentile
ORDER BY
client,
percentile
@@ -0,0 +1,51 @@
#standardSQL
# Distribution of websites by number of third parties

WITH requests AS (
SELECT
_TABLE_SUFFIX AS client,
pageid AS page,
url
FROM
`httparchive.summary_requests.2021_07_01_*`
),

third_party AS (
SELECT
domain
FROM
`httparchive.almanac.third_parties`
WHERE
date = '2021-07-01' AND
category != 'hosting'
),

base AS (
SELECT
client,
page,
COUNT(DISTINCT domain) AS third_parties_per_page
FROM
requests
LEFT JOIN
third_party
ON
NET.HOST(requests.url) = NET.HOST(third_party.domain)
GROUP BY
client,
page
)

SELECT
client,
percentile,
APPROX_QUANTILES(third_parties_per_page, 1000)[OFFSET(percentile * 10)] AS approx_third_parties_per_page
FROM
base,
UNNEST([10, 25, 50, 75, 90]) AS percentile
GROUP BY
client,
percentile
ORDER BY
client,
percentile
@@ -0,0 +1,42 @@
#standardSQL
# Percent of third party requests by content type.

WITH requests AS (
SELECT
_TABLE_SUFFIX AS client,
url,
type AS contentType
FROM
`httparchive.summary_requests.2021_07_01_*`
),

third_party AS (
SELECT
domain
FROM
`httparchive.almanac.third_parties`
WHERE
date = '2021-07-01' AND
category != 'hosting'
)

SELECT
client,
contentType,
COUNT(0) AS requests,
SUM(COUNT(0)) OVER (PARTITION BY client) AS total_requests,
COUNT(0) / SUM(COUNT(0)) OVER (PARTITION BY client) AS pct_requests
FROM
requests
LEFT JOIN
third_party
ON
NET.HOST(requests.url) = NET.HOST(third_party.domain)
WHERE
domain IS NOT NULL
GROUP BY
client,
contentType
ORDER BY
client,
contentType
63 changes: 63 additions & 0 deletions sql/2021/third-parties/percent_of_third_party_cache.sql
@@ -0,0 +1,63 @@
#standardSQL
# Percent of third party requests cached
# Cache-Control documentation: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Cache-Control#Directives

WITH requests AS (
SELECT
_TABLE_SUFFIX AS client,
resp_cache_control,
status,
respOtherHeaders,
reqOtherHeaders,
type,
url
FROM
`httparchive.summary_requests.2021_07_01_*`
),

third_party AS (
SELECT
domain
FROM
`httparchive.almanac.third_parties`
WHERE
date = '2021-07-01' AND
category != 'hosting'
),

base AS (
SELECT
client,
type,
IF(
(
status IN (301, 302, 307, 308, 410) AND
NOT REGEXP_CONTAINS(resp_cache_control, r'(?i)private|no-store') AND
NOT REGEXP_CONTAINS(reqOtherHeaders, r'Authorization')
) OR
(
status IN (301, 302, 307, 308, 410) OR
REGEXP_CONTAINS(resp_cache_control, r'(?i)public|max-age|s-maxage') OR
REGEXP_CONTAINS(respOtherHeaders, r'Expires')
), 1, 0) AS cached
FROM
requests
LEFT JOIN
third_party
ON
NET.HOST(requests.url) = NET.HOST(third_party.domain)
WHERE
domain IS NOT NULL
)

SELECT
client,
type,
SUM(cached) AS cached_requests,
COUNT(0) AS total_requests,
SUM(cached) / COUNT(0) AS pct_cached_requests
FROM
base
GROUP BY
client,
type