Commit

replace outdated BQ lang table with GH Archive PR lang extract
The table [bigquery-public-data:github_repos.languages] was last updated in Nov 2022. This is a significant issue: without further updates, we can only count events for the repositories in that outdated snapshot. Hence, we need a new method to obtain a large enough sample of primary-language metadata for repositories. Fortunately, we can extract the language directly from PullRequest events, because their payload provides a language field. So whenever a repo we want to include in our ranking has a PullRequest event, we can determine its language. These amount to many millions. The drawback is that we cannot include repositories that had no pull request in the current quarter. I think this is a fair trade-off for now, until there is a better solution.
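To make the extraction step concrete, here is a minimal sketch of the same lookup done in plain JavaScript instead of in the query. It assumes an event row shaped like the ones the query reads, with `payload` stored as a JSON string; the helper name `extractRepoLanguage` and the sample rows are hypothetical and are not part of scripts/query.js.

```javascript
// Hypothetical helper, not part of scripts/query.js.
// Mirrors the JSON path "$.pull_request.base.repo.language" used in the query.
function extractRepoLanguage(event) {
  let payload;
  try {
    // Assumption: payload arrives as a JSON string; accept objects too.
    payload = typeof event.payload === 'string'
      ? JSON.parse(event.payload)
      : event.payload;
  } catch (err) {
    return null; // malformed payload: treat as "no language"
  }
  const lang = payload?.pull_request?.base?.repo?.language;
  // Events without a pull_request, or repos without a detected language,
  // yield null, which the query filters out with IS NOT NULL.
  return typeof lang === 'string' ? lang : null;
}

// Hand-made sample rows for illustration only (not real GH Archive data).
const sampleEvents = [
  {
    repo: { name: 'example/alpha' },
    payload: JSON.stringify({
      pull_request: { base: { repo: { language: 'Go' } } },
    }),
  },
  {
    repo: { name: 'example/beta' },
    payload: JSON.stringify({ ref: 'refs/heads/main' }), // not a pull request
  },
];

const extracted = sampleEvents.map((e) => ({
  repo_name: e.repo.name,
  lang: extractRepoLanguage(e),
}));
```

Only the first sample row yields a language; the second has no `pull_request` object, which corresponds to the rows the `IS NOT NULL` filter drops.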
madnight committed Mar 30, 2024
1 parent 8ab2725 commit f8adb52
Showing 1 changed file with 7 additions and 3 deletions.
10 changes: 7 additions & 3 deletions scripts/query.js
@@ -62,9 +62,13 @@ const queryBuilder = (tables) => {
 FROM ${tables} WHERE NOT LOWER(actor.login) LIKE "%bot%") a
 JOIN ( SELECT repo_name as name, lang FROM ( SELECT * FROM (
 SELECT *, ROW_NUMBER() OVER (PARTITION BY repo_name ORDER BY lang) as num FROM (
-SELECT repo_name, FIRST_VALUE(language.name) OVER (
-partition by repo_name order by language.bytes DESC) AS lang
-FROM [bigquery-public-data:github_repos.languages]))
+SELECT
+JSON_EXTRACT_SCALAR(payload, "$.pull_request.base.repo.language") as lang,
+repo.name as repo_name
+FROM ${tables}
+WHERE
+JSON_EXTRACT_SCALAR(payload, "$.pull_request.base.repo.language") IS NOT NULL
+))
 WHERE num = 1 order by repo_name)
 WHERE lang != 'null') b ON a.name = b.name)
 GROUP by type, language, year, quarter, actor.login
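As read from the hunk above, the subquery keeps one language per repository: rows are numbered per `repo_name` in `lang` order and only `num = 1` survives, so a repository that reported several languages ends up with the first one in sort order. The sketch below is a rough in-memory analogue of that step; the function name `pickOneLanguagePerRepo` and the input shape are hypothetical, and JavaScript string comparison only approximates BigQuery's ordering for non-ASCII names.

```javascript
// Hypothetical in-memory analogue of:
//   ROW_NUMBER() OVER (PARTITION BY repo_name ORDER BY lang) ... WHERE num = 1
// combined with the lang IS NOT NULL and lang != 'null' filters.
function pickOneLanguagePerRepo(rows) {
  const byRepo = new Map();
  for (const { repo_name, lang } of rows) {
    // Skip missing languages and the literal string 'null'.
    if (lang == null || lang === 'null') continue;
    const current = byRepo.get(repo_name);
    // Keep the smallest language name seen so far for this repository.
    if (current === undefined || lang < current) {
      byRepo.set(repo_name, lang);
    }
  }
  return byRepo; // Map of repo_name to a single language
}
```

A call such as `pickOneLanguagePerRepo(rows).get('owner/repo')` would then return the single language that the join against the event counts uses for that repository.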