CSS 2020 queries #1281

Merged

rviscomi merged 17 commits into main from css-sql-2020 on Sep 30, 2020

Conversation

rviscomi
Member

@rviscomi rviscomi commented Sep 8, 2020

Progress on #898

Usage

  • How big is it?
  • How is it included?
  • Is it preprocessed?

Custom Properties

  • Names
  • Values
  • Usage (properties, functions)
  • Constants
  • Custom properties in JS

Selectors

  • Specificity
  • Pseudo-classes

Values & Units

  • calc()
  • Length units and unitless 0
  • Global keywords

Color

  • Popularity of different ways to declare colors

Images

  • Gradients

Layout

  • Sizing model
  • Are floats still used?
  • Flexbox vs grid
  • grid-template-areas
  • Named lines

Transitions & Animations

  • Transitions & Animations

Responsive Design

  • Media query features
  • Most common properties used inside media queries

Browser support

  • Vendor prefixes
  • @supports
  • CSS hacks

Internationalization

  • direction property
  • Logical vs physical properties

CSS & JS

  • Houdini
  • CSS-in-JS

Meta

  • Shorthands vs longhands
  • Which properties are most commonly used together?
  • Invalid syntax
  • Sass stylesheets
  • !important

@rviscomi rviscomi added the analysis Querying the dataset label Sep 8, 2020
@rviscomi rviscomi added this to the 2020 Analysis milestone Sep 8, 2020
@rviscomi
Member Author

rviscomi commented Sep 8, 2020

The parsed_css table has been generated from the contents of external stylesheets and inline style blocks for all 2020 pages. @LeaVerou has started on some JS-based utilities to help with parsing the Rework JSON and analyzing selectors. When these scripts are stable, we should add them to the HTTP Archive GCS bucket so they can be included in the queries. AFAICT all of the chapter queries are blocked on these utils.
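For illustration, a minimal query against this table might look like the following sketch (assuming the columns used later in this thread: date, client, page, and the stringified Rework AST in css):

#standardSQL
SELECT
  client,
  page,
  LENGTH(css) AS ast_bytes  -- size of the stringified Rework AST
FROM
  `httparchive.almanac.parsed_css`
WHERE
  date = '2020-08-01'
LIMIT 10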

@LeaVerou
Member

LeaVerou commented Sep 8, 2020

Thank you for the summary Rick. It would be good to get some further input on these helpers from you or other analysts, since APIs designed in a vacuum tend to not be very good APIs. I've spent some time today documenting them (just JSDoc with default settings since time is of the essence).

Playground: https://projects.verou.me/rework-utils/
Docs: https://projects.verou.me/rework-utils/docs/
And I also wrote this selector parser & specificity calculator for the selector-related queries: https://projects.verou.me/parsel

Please try using these utils to write JS for the issues tagged Needs JS. Commit your code in the js directory with the issue number in the filename, linked via the commit message (https://github.com/LeaVerou/css-almanac/tree/master/js), and change the issue labels & project column to "Has JS" accordingly. You can also test these against any CSS snippet or CSS file via the Rework Utils playground above, using the testQuery() function. You can do so locally as well if you clone both repos.

Currently I'm writing these without imports, since it sounds like the BigQuery JS doesn't really understand ESM and we'll need to bundle them up. Is my assumption correct?

If while using them you realize the utils API needs improvement please open an issue or PR in the corresponding repo (rework-utils or parsel).
Please self-assign or comment on an issue you're working on, to prevent effort duplication.

AFAICT all of the chapter queries are blocked on these utils.

Except the ones that depend on custom metrics.

@rviscomi
Member Author

rviscomi commented Sep 8, 2020

Currently I'm writing these without imports, since it sounds like the BigQuery JS doesn't really understand ESM and we'll need to bundle them up. Is my assumption correct?

+1, we should bundle the JS and access the globals via SQL. After implementing a simple metric in LeaVerou/css-almanac#47, my only question is how the SQL that uses the scripts would look. Do you see us doing much work in the BigQuery UDF, with the body of the compute function essentially copied into SQL, or will the extent of the JS-in-SQL be a one-liner like return getSizingModel();, calling single-purpose functions implemented in the bundle along with the utils?
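For illustration, the one-liner style might look like this sketch, where computeSizingModel is a hypothetical single-purpose function shipped in the bundle:

CREATE TEMPORARY FUNCTION getSizingModel(css STRING) RETURNS STRING LANGUAGE js AS '''
  // computeSizingModel is hypothetical: a single-purpose metric function in the bundle
  return computeSizingModel(JSON.parse(css));
'''
OPTIONS (library="gs://httparchive/lib/css-almanac-utils.js");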

@LeaVerou
Member

LeaVerou commented Sep 9, 2020

@rviscomi Excellent point. I was expecting the former, but the latter works too. My experience with BigQuery query authoring is minimal, so I'll defer to you for what is preferable. If we are to follow the latter approach, we should start giving the functions meaningful names instead of just compute(), then we can just write a utility to concat everything and remove the export and default keywords (since Rollup doesn't quite work like that). In terms of the Rework Utils playground, the name of the function doesn't matter, so long as it's the default export.

@rviscomi
Member Author

Yeah, I think having more of the metric logic in the SQL would better separate the concerns of the util and the analysis. So for example:

CREATE TEMPORARY FUNCTION countBorderBoxDeclarations(css STRING) RETURNS NUMERIC LANGUAGE js AS '''
try {
  const ast = JSON.parse(css);
  return countDeclarationsByProperty(ast.stylesheet.rules, {properties: 'box-sizing', values: 'border-box'});
} catch (e) {
  return null;
}
'''
OPTIONS (library="gs://httparchive/lib/css-almanac-utils.js");

SELECT
  client,
  APPROX_QUANTILES(declarations, 1000)[OFFSET(500)] AS median_declarations
FROM (
  SELECT
    client,
    countBorderBoxDeclarations(css) AS declarations
  FROM
    `httparchive.almanac.parsed_css`
  WHERE
    date = '2020-08-01')
GROUP BY
  client

@LeaVerou
Member

Sure, though keep in mind not all are this small. E.g. the one about color formats is at least 89 loc.

@rviscomi
Member Author

The size of the code shouldn't be an issue for BigQuery. For maintainability, if there are any functions that can be reused in other queries, we should bake those into the utils file.

@rviscomi
Member Author

@LeaVerou could you provide me with a JS file containing all of the available JS utils we would need in the queries? We can continue iterating on it but it'd be good to test out what we have so far. I'll start with sizing-model.js.

@LeaVerou
Member

Sure, do you have any ideas on how to produce it? Rollup doesn't really do that; the functions would end up under a namespace. I'm thinking probably a custom gulp utility that uses gulp-concat to concatenate the files and remove export default and import statements. Can you think of (or PR!) anything simpler?

I'll start with sizing-model.js.

Wait, I thought we decided the computation for each stat would be in the SQL unless it's reusable?

@rviscomi
Member Author

I forked the repo and manually extracted each function into dist/rework-utils.js. It's uploaded to GCS at gs://httparchive/lib/rework-utils.js.

One issue is the bleeding-edge optional chaining syntax (?.): SyntaxError: Unexpected token . at gs://httparchive/lib/rework-utils.js line 57, columns 28-29. We can either use a transpiler or rewrite the code. It's only used 5 times, so I rewrote the a?.b expressions to a && a.b by hand. It may be a good idea to do this upstream in src to make it easier to merge incremental changes, or we can make changes directly to sql/lib/rework-utils.js (added in this PR), which I'll keep synced with the copy on GCS.

Here's a proof of concept query:

#standardSQL
# - Distribution of the number of occurrences of box-sizing:border-box per page.
# - Percent of pages with that style.
CREATE TEMPORARY FUNCTION countBorderBoxDeclarations(css STRING) RETURNS NUMERIC LANGUAGE js AS '''
try {
  const ast = JSON.parse(css);
  return countDeclarations(ast.stylesheet.rules, {properties: /^(-(o|moz|webkit|ms)-)?box-sizing$/, values: 'border-box'});
} catch (e) {
  return null;
}
'''
OPTIONS (library="gs://httparchive/lib/rework-utils.js");

SELECT
  percentile,
  client,
  COUNT(DISTINCT IF(declarations > 0, page, NULL)) AS pages,
  COUNT(DISTINCT page) AS total,
  COUNT(DISTINCT IF(declarations > 0, page, NULL)) / COUNT(DISTINCT page) AS pct_pages,
  APPROX_QUANTILES(declarations, 1000 IGNORE NULLS)[OFFSET(percentile * 10)] AS declarations_per_page
FROM (
  SELECT
    client,
    page,
    SUM(countBorderBoxDeclarations(css)) AS declarations
  FROM
    `httparchive.almanac.parsed_css`
  WHERE
    date = '2020-08-01'
  GROUP BY
    client,
    page),
  UNNEST([10, 25, 50, 75, 90]) AS percentile
GROUP BY
  percentile,
  client
ORDER BY
  percentile,
  client
percentile  client   pages      total      pct_pages  declarations_per_page
10          desktop  4,566,301  5,449,988  83.79%     0
10          mobile   5,353,932  6,197,401  86.39%     0
25          desktop  4,566,301  5,449,988  83.79%     3
25          mobile   5,353,932  6,197,401  86.39%     4
50          desktop  4,566,301  5,449,988  83.79%     14
50          mobile   5,353,932  6,197,401  86.39%     17
75          desktop  4,566,301  5,449,988  83.79%     35
75          mobile   5,353,932  6,197,401  86.39%     46
90          desktop  4,566,301  5,449,988  83.79%     85
90          mobile   5,353,932  6,197,401  86.39%     96

Each one of these queries processes 9.7 TB, which incurs ~$50 😬. So if you want to write queries, it's best to use the smaller parsed_css_1k table instead, and I can run the "big queries" to export the full results to the spreadsheet.

One process suggestion: rather than write the metric JS in css-almanac/js/* and then copy it into SQL, we could use this PR to review JS-in-SQL changes instead. That would also ensure there's one source of truth; for example, if any changes need to be made, we won't have to copy them across repos. Does that work for you? There may be testing advantages to individual JS files that outweigh this. Curious to get your thoughts.

@LeaVerou
Member

I forked the repo and manually extracted each function into dist/rework-utils.js. It's uploaded to GCS at gs://httparchive/lib/rework-utils.js.

Thank you! That's great for a proof of concept, but is already out of date, so we need to write a script to do it. :)

One issue is the bleeding-edge optional chaining syntax (?.): SyntaxError: Unexpected token . at gs://httparchive/lib/rework-utils.js line 57, columns 28-29. We can either use a transpiler or rewrite the code. It's only used 5 times, so I rewrote the a?.b expressions to a && a.b by hand. It may be a good idea to do this upstream in src to make it easier to merge incremental changes, or we can make changes directly to sql/lib/rework-utils.js (added in this PR), which I'll keep synced with the copy on GCS.

Optional chaining is not exactly bleeding edge; it's been supported since February. I assumed BQ's JS engine was similar to a recent Chromium. If not, do we know which version of Chrome/V8 it's running? I can just stop using optional chaining (which I've used a lot in queries too, it's not just the utils), but it would be good to know what else might not be supported.

One process suggestion: rather than write the metric JS in css-almanac/js/* and then copy it into SQL, we could use this PR to review JS-in-SQL changes instead. That would also ensure there's one source of truth; for example, if any changes need to be made, we won't have to copy them across repos. Does that work for you? There may be testing advantages to individual JS files that outweigh this. Curious to get your thoughts.

Indeed there are. With the current setup, we can test out queries on any CSS (via URL or direct input) via the Rework Utils playground (testQuery(filenameWithoutJS)), and iterate just by hitting the up arrow and then Enter, since each run loads the module anew. Also, I think having essentially one thread to review 40+ queries can become quite messy. But I definitely see your point re: one source of truth. I wonder if this means it would be better to use the approach with named one-liner functions instead of having the JS in the query?

@rviscomi
Member Author

That's great for a proof of concept, but is already out of date, so we need to write a script to do it. :)

Could you submit a PR with any changes against sql/lib/rework-utils.js? I'll sync that file with GCS for testing on BigQuery.

I can just stop using optional chaining (which I've used a lot in queries too, it's not just the utils), but it would be good to know what else might not be supported.

Nothing specific on this in the BigQuery docs AFAICT.

I wonder if this means it would be better to use the approach with named one liner functions instead of having the JS in the query?

Maybe. This PR will contain 40+ queries regardless of where the JS logic lives, so I think it'd be easier to review if everything is in one place. That also helps Almanac readers down the line if they want to look at an SQL file to scrutinize how a metric was calculated.

Iterative testing is also possible in BigQuery, even if a bit forced:

#standardSQL
CREATE TEMP FUNCTION parseCSS(stylesheet STRING)
RETURNS STRING LANGUAGE js AS '''
  try {
    var css = parse(stylesheet);
    return JSON.stringify(css);
  } catch (e) {
    return '';
  }
'''
OPTIONS (library="gs://httparchive/lib/parse-css.js");

CREATE TEMPORARY FUNCTION countBorderBoxDeclarations(css STRING) RETURNS NUMERIC LANGUAGE js AS '''
try {
  const ast = JSON.parse(css);
  return countDeclarations(ast.stylesheet.rules, {properties: /^(-(o|moz|webkit|ms)-)?box-sizing$/, values: 'border-box'});
} catch (e) {
  return null;
}
'''
OPTIONS (library="gs://httparchive/lib/rework-utils.js");

SELECT
  countBorderBoxDeclarations(parseCSS('''
#foo {
  color: red;
  box-sizing: border-box;
}

.bar:first-child {
  color: blue;
  box-sizing: border-box;
}
''')) AS declarations

Results:

declarations
2

@rviscomi
Member Author

rviscomi commented Sep 15, 2020

@LeaVerou interesting discrepancy in custom property adoption between the 2019 approach and your css-variables custom metric.

https://github.com/HTTPArchive/almanac.httparchive.org/blob/main/sql/2019/02_CSS/02_01.sql measures the % of websites with custom properties. I've rerun the query with 2020 data and it's producing ~15% for desktop and ~20% for mobile, which is really interesting growth in itself, up from 5% in 2019.

However I've also been analyzing custom property usage with your css-variables data and it's producing adoption numbers more like ~27% for both desktop and mobile.

Queries: https://gist.github.com/rviscomi/71328c6b395f377e7d7f6c7be5ab6da7

Which approach would you want to use for the 2020 chapter: the one with a comparable methodology as last year, or the one you specially built to study custom properties?

Aside: the most popular custom property name is --wp-color--primary, used by ~12% of pages, so my guess is WordPress was a significant driver of adoption this year.
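For reference, a hedged sketch of what an AST-based adoption query could look like (this is not the 2019 SQL or the gist above, and it assumes countDeclarations accepts a properties-only regex filter):

#standardSQL
CREATE TEMPORARY FUNCTION hasCustomProps(css STRING) RETURNS BOOL LANGUAGE js AS '''
try {
  const ast = JSON.parse(css);
  // any declaration whose property name starts with "--" counts as adoption
  return countDeclarations(ast.stylesheet.rules, {properties: /^--/}) > 0;
} catch (e) {
  return false;
}
'''
OPTIONS (library="gs://httparchive/lib/rework-utils.js");

SELECT
  client,
  COUNT(DISTINCT IF(hasCustomProps(css), page, NULL)) / COUNT(DISTINCT page) AS pct_pages
FROM
  `httparchive.almanac.parsed_css`
WHERE
  date = '2020-08-01'
GROUP BY
  client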

@LeaVerou
Member

Hi @rviscomi,

Could you submit a PR with any changes against sql/lib/rework-utils.js? I'll sync that file with GCS for testing on BigQuery.

Given that I'm still actively iterating on them, I don't think doing this manually is a good use of our time. One of us should write a script to do it. I could, but naturally, this means I'd have less time for actual querying code. Your call.

Nothing specific on this in the BigQuery docs AFAICT.

It turns out it's equivalent to Chrome 75. Ok, I can work with that!

Maybe. This PR will contain 40+ queries regardless of where the JS logic lives, so I think it'd be easier to review if everything is in one place. That also helps Almanac readers down the line if they want to look at an SQL file to scrutinize how a metric was calculated.
Iterative testing is also possible in BigQuery, even if a bit forced:

You asked me in your previous message if there's a testing benefit to having the js separately in the css-almanac repo, and I explained in detail how this helps iterating on them. So, I'm very surprised you would then go ahead to suggest having everything in one big PR anyway, almost as if I hadn't responded to your question at all. Yes, we can sort of iterate with BigQuery, but it's much more clunky, and since there's better testing infrastructure in place, I'm not sure why we would do it that way.

I can see how large PRs with all queries may work for other, smaller chapters, but I believe it will end up being a mess here. Furthermore, having the js separate in the css-almanac repo means that:

  • We can test and iterate more easily
  • We can cross-link to the corresponding issues more easily
  • We can easily build a playground where Almanac readers can run the queries on their own CSS, which I think can be quite interesting

I can see how you'd like a single source of truth in the almanac repo, but since the queries are written after the JS is finalized, it's unlikely that the SQL will get out of sync with the JS. But there are ways around this if DRYness is a concern: from build tool includes to JS functions that we just call in the queries. Also, you have full commit permissions in the css-almanac repo, so you don't actually need to send PRs for stats you don't need reviewed, you can just commit directly, as Dmitry already did.

Lastly, the expected time commitment for analysts is 12 hours total. I've easily donated 2-3x that already, and there's still plenty of work to do. I'm not complaining, as I'm really into this project and I'm enjoying this work, but it would be good if I could contribute without jumping through too many hoops.

@LeaVerou
Member

@LeaVerou interesting discrepancy in custom property adoption between the 2019 approach and your css-variables custom metric.

https://github.com/HTTPArchive/almanac.httparchive.org/blob/main/sql/2019/02_CSS/02_01.sql measures the % of websites with custom properties. I've rerun the query with 2020 data and it's producing ~15% for desktop and ~20% for mobile, which is really interesting growth in itself, up from 5% in 2019.

However I've also been analyzing custom property usage with your css-variables data and it's producing adoption numbers more like ~27% for both desktop and mobile.

That's really interesting!!
I could explain the discrepancy more easily if it was the other way around: if my custom metric reported lower adoption, since it went off document.styleSheets, which is blocked on cross-origin stylesheets.
My theory now is that it's because my custom metric also takes var references into account, so if a stylesheet is using var(--foo) but does not declare a --foo, it will still be counted, whereas with the old query it wouldn't be. But it still seems strange if ~7-12% of websites only use CSS variables by reference. I hope it's not indicative of a bug!

Queries: gist.github.com/rviscomi/71328c6b395f377e7d7f6c7be5ab6da7

Which approach would you want to use for the 2020 chapter: the one with a comparable methodology as last year, or the one you specially built to study custom properties?

We definitely need to use the custom metric, since there's a lot more to report and not all can be determined via the CSS AST, but we should clarify in the text that the methodology is different and how, so that people don't compare it with last year's 5% (and we should also report how that number increased, so they have something to compare).

Aside: the most popular custom property name is --wp-color--primary, used by ~12% of pages, so my guess is WordPress was a significant driver of adoption this year.

Fascinating. What were the others?

@LeaVerou
Member

@rviscomi A few questions:

  • Since querying parsed_css is expensive, is it better to consolidate queries? E.g. I've realized the query for LeaVerou/css-almanac#31 can easily also calculate LeaVerou/css-almanac#6. Is that better, or is it better to have a 1-1 of queries to metrics?
  • In calculating LeaVerou/css-almanac#21, I wondered if it would be better to calculate the average, median etc of durations in the JS and return that, or to return an object/array of numbers and calculate the stats in SQL. Similar questions have come up in other queries too. What do you think?

@rviscomi
Member Author

Given that I'm still actively iterating on them, I don't think doing this manually is a good use of our time. One of us should write a script to do it. I could, but naturally, this means I'd have less time for actual querying code. Your call.

You hit the nail on the head: we've found ourselves in a position where we need to sacrifice time we could be spending on the analysis itself to bridge the infrastructure gaps we've created. Code for this chapter's analysis is now spread across three different repositories: almanac.httparchive.org for SQL, css-almanac for UDF scripts, and rework-utils for UDF script helpers. I think we need to reevaluate whether this complexity is helping or hurting our productivity. There are real benefits to it, like the unit testing advantages you outlined, and I'm sorry for not acknowledging that earlier. But I think those benefits may be outweighed by the liabilities, and shifting to a scrappier workflow may be the tradeoff we need to get this over the finish line on time. So what might a scrappy workflow look like?

  • I can open this PR up for review so we can get the initial round of metrics checked in. We can divide the queries into smaller PRs that are easier/faster to review. And we can continue to use the issue tracking system in css-almanac to manage the remaining tasks.
  • Changes to the Rework utils can be done directly in rework-utils.js. The file wouldn't rely on modules or other features that aren't supported by BigQuery so that we wouldn't need to maintain dev and prod versions.
  • The Rework Utils web tool can continue to be used for prototyping UDF JS as needed, loading the rework-utils.js file from this repo as the source of truth. Local development on the utils themselves could be done in that tool's console and/or BigQuery.
  • Metrics' SQL and UDF JS can be developed concurrently by the same analyst in this repo. Functional tests can happen quickly by querying the parsed_css_1k sample data, as in the sketch after this list.
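For example, the proof-of-concept query above could be iterated on cheaply like this (a sketch assuming parsed_css_1k mirrors the parsed_css schema; run with the same countBorderBoxDeclarations temporary function definition):

SELECT
  client,
  SUM(countBorderBoxDeclarations(css)) AS declarations
FROM
  `httparchive.almanac.parsed_css_1k`  -- 1k-page sample; swap in parsed_css for the full run
WHERE
  date = '2020-08-01'
GROUP BY
  client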

We do lose unit testing among other things, but I think the agility we gain makes up for that loss. Does that resonate with you? If not, do you feel like the current technical benefits of having query code split up this way are worth the process overhead and possible delays? From the project management perspective, this is one of the largest chapters and it has already started to slip past its analysis milestones, so the delays are becoming a greater threat to getting this chapter released on time. If that happens, the contingency plan would be to release it post-launch, which I'm afraid would cause it to lose out on a lot of early readership.


We definitely need to use the custom metric, since there's a lot more to report and not all can be determined via the CSS AST, but we should clarify in the text that the methodology is different and how, so that people don't compare it with last year's 5% (and we should also report how that number increased, so they have something to compare).

Fascinating. What were the others?

+1 SGTM. All of the results from queries I've written so far are available in the chapter sheet.

Since querying parsed_css is expensive, is it better to consolidate queries? E.g. I've realized the query for LeaVerou/css-almanac#31 can easily also calculate LeaVerou/css-almanac#6. Is that better, or is it better to have a 1-1 of queries to metrics?

My ideal is to have 1-1 query-to-metric, optimizing for the readers' experience of wanting to scrutinize how the stats were calculated or remix the query themselves. That said, it's ok to have a query with multiple similar/related metrics. If this approach is too expensive for your BigQuery quota, you can develop the queries using parsed_css_1k and I'd be happy to run it against the full dataset.

In calculating LeaVerou/css-almanac#21, I wondered if it would be better to calculate the average, median etc of durations in the JS and return that, or to return an object/array of numbers and calculate the stats in SQL. Similar questions have come up in other queries too. What do you think?

The UDF JS operates on one CSS payload at a time, so one pattern we could use here would be to extract an array of durations in the JS, then UNNEST each duration in SQL, and aggregate the distribution of all durations across all stylesheets/pages. The summary stats we've typically used for distributions are the 10, 25, 50, 75, and 90th percentiles, as opposed to average/mode. Do those work for you?
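A hedged sketch of that pattern (the UDF body is a placeholder; a real metric would pull transition/animation durations out of the AST with the Rework utils):

CREATE TEMPORARY FUNCTION getDurations(css STRING) RETURNS ARRAY<FLOAT64> LANGUAGE js AS '''
try {
  const ast = JSON.parse(css);
  // placeholder: extract duration values (in ms) from the AST here
  return [];
} catch (e) {
  return [];
}
'''
OPTIONS (library="gs://httparchive/lib/rework-utils.js");

SELECT
  percentile,
  client,
  APPROX_QUANTILES(duration, 1000)[OFFSET(percentile * 10)] AS duration_ms
FROM
  `httparchive.almanac.parsed_css`,
  UNNEST(getDurations(css)) AS duration,
  UNNEST([10, 25, 50, 75, 90]) AS percentile
WHERE
  date = '2020-08-01'
GROUP BY
  percentile,
  client
ORDER BY
  percentile,
  client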

@LeaVerou
Member

Hi @rviscomi,

Thank you for the thoughtful response. In general, I agree with your points.

However, I'm not sure the reason for any deadline slippage is infrastructure. Out of this chapter's three analysts:

  • One dropped out
  • Another had their laptop out of order for TWO (!) weeks, and only got limited time on their spouse's laptop when said spouse was not working.
  • And the third analyst is also an analyst in a number of other chapters and project manages the whole project, so has very limited time to contribute to this one.

Furthermore, the analysis is not split across three repos, only two: this one and css-almanac. The fact that my work in the Almanac inspired me to write two libraries (Parsel and Rework Utils) isn't a fragmentation of the analysis effort any more than Rework being in its own repo is a fragmentation of effort.

The remaining two repos mainly reflect the lack of consensus about where the analysis should happen, not an insufficiently scrappy workflow: I have so far exclusively worked in css-almanac, and you have almost exclusively worked in this one.

So what's the best way to go forwards? I propose adopting your proposal, with a few slight tweaks:

  • Responding to your point, I just wrote a small Node script to build rework-utils.js in the rework-utils repo. So, all we need to do to update it in BigQuery is literally cp, which hardly slows us down. If you're sure BigQuery does not support modules (I assumed it did since Chrome 75 does), I can do something similar with Parsel as well.
  • We can iterate on JS in the css-almanac repo, and once the JS is ready, we can iterate on the SQL here, which gives us the best of both worlds. This can happen concurrently, as the JS is pending for many metrics, whereas others are ready for SQL. You can see the different states of each one here. I don't think this is any less agile, since we can't write JS without running it to see if it works!

If that happens, the contingency plan would be to release it post-launch, which I'm afraid would cause it to lose out on a lot of early readership.

Yes, we should definitely avoid that at any cost. What is the next hard deadline? I see the roadmap lists September as analyzing data, so we're still within that and I'm confident we can finish the analysis and results review by the end of the month. I do see we're past the Sep 7 and Sep 14 sub-deadlines, but since I'm also the main author, and I review stats as we go, that's a much tighter, and therefore faster, feedback cycle. I can reach out to the other authors as we go to review stats for their own sections.


My ideal is to have 1-1 query-to-metric, optimizing for the readers' experience of wanting to scrutinize how the stats were calculated or remix the query themselves. That said, it's ok to have a query with multiple similar/related metrics.

Thanks, I will keep that in mind. I think the specific example I asked about definitely falls in the category of "similar/related metrics", so I guess we're good.

The UDF JS operates on one CSS payload at a time, so one pattern we could use here would be to extract an array of durations in the JS, then UNNEST each duration in SQL, and aggregate the distribution of all durations across all stylesheets/pages.

If I'm reading this right, wouldn't a CSS file with e.g. 100 durations be weighted 100 times higher than another with only one? Is this desirable? (not a rhetorical question, I could argue this both ways :) )

The summary stats we've typically used for distributions are the 10, 25, 50, 75, and 90th percentiles, as opposed to average/mode. Do those work for you?

Sure, that's better!

@rviscomi
Member Author

I'm happy to try your suggested compromise and adapt if needed.

The only hard deadline is the launch date in mid-November. Reviewing during analysis sounds like a good way to keep things moving.

If I'm reading this right, wouldn't a CSS file with e.g. 100 durations be weighted 100 times higher than another with only one? Is this desirable?

We could deduplicate durations at the page level before aggregating if that would give you the data you're looking for.
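Sketching that option (reusing the hypothetical getDurations UDF from the earlier sketch), the deduplication is just a GROUP BY on distinct durations per page before the final aggregation:

SELECT
  client,
  page,
  duration  -- the GROUP BY collapses repeated durations within a page to one row
FROM
  `httparchive.almanac.parsed_css`,
  UNNEST(getDurations(css)) AS duration
WHERE
  date = '2020-08-01'
GROUP BY
  client,
  page,
  duration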

@LeaVerou
Member

We could deduplicate durations at the page level before aggregating if that would give you the data you're looking for.

I'm just not sure whether a website should be weighted higher than others in this aggregate based on how it uses CSS animations.

A few more questions about things that have come up:

  • We had originally planned a bunch of font-related metrics; however, it looks like there is huge overlap with the Fonts chapter. I'm not sure exactly how much overlap, but definitely >70%. Should we perhaps not have a Fonts section? I looked at the 2019 Almanac and it did have a Fonts section, with stats clearly distinct from the Fonts chapter, but this year it looks like the Fonts chapter is doing a lot of CSS-based stats.
  • In many stats, the output data structure is an object literal whose keys are values we are measuring and whose values are frequencies, i.e. how many times each appeared in the CSS. We should come up with a plan for aggregating these over the website corpus to find the most common values; one possible pattern is sketched below. Again, it's the same question as for the durations above: is it a good idea for a given website to be weighted higher than others based on its CSS usage?
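One possible pattern, sketched here with property-name frequencies as a stand-in for whatever is being measured: the UDF flattens each stylesheet's frequency map into (value, count) pairs, and the SQL can then weight either by page or by raw usage.

CREATE TEMPORARY FUNCTION getPropertyFreqs(css STRING)
RETURNS ARRAY<STRUCT<property STRING, freq FLOAT64>> LANGUAGE js AS '''
try {
  const ast = JSON.parse(css);
  const freqs = {};
  for (const rule of ast.stylesheet.rules) {
    // sketch only: nested at-rules (e.g. @media) are ignored for brevity
    for (const d of (rule.declarations || [])) {
      if (!d.property) continue;  // skip comment nodes
      freqs[d.property] = (freqs[d.property] || 0) + 1;
    }
  }
  return Object.entries(freqs).map(([property, freq]) => ({property, freq}));
} catch (e) {
  return [];
}
''';

SELECT
  pair.property,
  COUNT(DISTINCT page) AS pages,        -- per-page weighting: each page counts once
  SUM(pair.freq) AS total_occurrences   -- usage weighting: CSS-heavy pages count more
FROM
  `httparchive.almanac.parsed_css`,
  UNNEST(getPropertyFreqs(css)) AS pair
WHERE
  date = '2020-08-01'
GROUP BY
  pair.property
ORDER BY
  pages DESC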

@rviscomi
Member Author

Should we perhaps not have a Fonts section? I looked at the 2019 Almanac and it did have a Fonts section, with stats clearly distinct from the Fonts chapter, but this year it looks like the Fonts chapter is doing a lot of CSS-based stats.

My advice would be to have the Fonts section with a few of the most relevant/interesting stats, and include a note at the end like "see the Fonts chapter for more data on fonts usage". It would be good to coordinate with the Fonts chapter lead to ensure that the results are all harmonious.

Is it a good idea for a given website to be weighted higher than others based on its CSS usage?

It's appropriate for some metrics, depending on the question being answered. I think readers have an easier time grokking stats that are presented in terms of the number of pages rather than the number of values. For example, "2% of pages include duration values longer than 1 second" or "among pages that set a duration, the median page sets 7 different values", as opposed to value-aggregated stats like "the most common value is 75ms" or "the median value is 1020ms".

@LeaVerou
Member

My advice would be to have the Fonts section with a few of the most relevant/interesting stats, and include a note at the end like "see the Fonts chapter for more data on fonts usage". It would be good to coordinate with the Fonts chapter lead to ensure that the results are all harmonious.

Ok, so these are our fonts-related issues:

I see these options:

  1. Have a Fonts section that primarily discusses things like units, font names, or shorthands, and pointing to the Fonts chapter for everything else
  2. Not have a Fonts section and discuss things like units, font names, or shorthands in other sections, e.g. Values & Units
  3. Go ahead and measure whatever we'd normally measure if I hadn't noticed the overlap and yolo it 😅

Actually, looking at the Fonts queries I wonder if the overlap is less than I thought, which makes me more inclined to go with 3. Btw they may find the Rework Utils useful since they're doing similar things.

Thoughts?

@rviscomi
Member Author

I'd go with option 1, in which the focus is more on the CSS than the fonts themselves.

The utils may be useful, so it's worth offering, but my hunch is that it'd be easier for them to reuse 2019 SQL.

@rviscomi rviscomi marked this pull request as ready for review September 20, 2020 21:59
@rviscomi rviscomi requested a review from a team September 20, 2020 21:59
@LeaVerou
Member

I'd go with option 1, in which the focus is more on the CSS than the fonts themselves.

@rviscomi in which of the issues I linked did you feel the focus was on the fonts and not on the CSS? It seems to me that all of them are about the CSS, yet some overlap anyway.

The utils may be useful, so it's worth offering, but my hunch is that it'd be easier for them to reuse 2019 SQL.

Ah ok then. I didn't realize they were using last year's SQL.

@rviscomi
Member Author

LeaVerou/css-almanac#2 and LeaVerou/css-almanac#15 feel more Font-y than CSS-y to me for some of the metrics. For example "How many websites use variable fonts?" and popular font families/stacks.

Contributor

@Tiggerito Tiggerito left a comment


All a bit beyond me, I learnt a few things.

Nitpicking: some of the queries are missing a newline at the end of the file.

If you find some queries are slow, I found pre-extracting the custom metric in the SQL worked a lot faster, e.g.:

getCssInJS(JSON_EXTRACT_SCALAR(payload, '$._css')) AS cssInJs
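In context, that tip might look like the following hedged sketch (getCssInJS and its detection logic are hypothetical; the pages table name follows the HTTP Archive crawl-date convention):

CREATE TEMPORARY FUNCTION getCssInJS(cssMetric STRING) RETURNS BOOL LANGUAGE js AS '''
try {
  // hypothetical: inspect the _css custom metric payload for CSS-in-JS signals
  return JSON.parse(cssMetric) !== null;
} catch (e) {
  return false;
}
''';

SELECT
  COUNTIF(cssInJs) / COUNT(0) AS pct_pages
FROM (
  SELECT
    -- the custom metric is extracted once in SQL, so the UDF only receives
    -- the (much smaller) metric string instead of the full payload
    getCssInJS(JSON_EXTRACT_SCALAR(payload, '$._css')) AS cssInJs
  FROM
    `httparchive.pages.2020_08_01_mobile`)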

@rviscomi
Member Author

Thanks for reviewing! This is helpful to make sure we're iterating on the queries and keeping the reviews small. There will be more PRs for this chapter! 😅

Nitpicking: some of the queries are missing a newline at the end of the file.

This is ok with me, personally.

If you find some queries are slow, I found pre-extracting the custom metric in the SQL worked a lot faster.

Thanks! These can definitely get slow so I'll keep that tip in mind.

@rviscomi rviscomi merged commit 7bae531 into main Sep 30, 2020
@rviscomi rviscomi deleted the css-sql-2020 branch September 30, 2020 04:45