About generating <podcast:guid>: how to deal with URL encoding and punycode? #440

JonOfUs · 2023-02-18T16:38:00Z

Hi all,

the section on <podcast:guid> specifies that the protocol scheme and trailing slashes should be stripped off before generating the UUID. However, it says nothing about further normalization, such as percent-encoding (representing characters using only ASCII characters) or punycode (e.g. German umlauts in domain names).

Is there a definition somewhere of how URLs containing umlauts or percent-encoding (or domain names with uppercase letters) should be handled before the UUID is generated? Or is this officially undefined and theoretically there could just be multiple semantically equivalent GUIDs for a podcast?
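For reference, the stripping step mentioned above can be sketched as follows. This is a minimal sketch, not the spec's reference implementation: the helper names `strip_feed_url` and `podcast_guid` are made up for illustration, and the namespace UUID is quoted from the podcast namespace documentation as I recall it, so please verify it against the spec before relying on it.

```python
import re
import uuid

# Assumption: namespace UUID for podcast GUIDs as documented in the
# podcast namespace spec. Verify against the spec before relying on it.
PODCAST_NAMESPACE = uuid.UUID("ead4c236-bf58-58c6-a2c6-a6b28d128cb6")


def strip_feed_url(feed_url: str) -> str:
    """Remove the protocol scheme and any trailing slashes (hypothetical helper)."""
    without_scheme = re.sub(r"^[a-zA-Z][a-zA-Z0-9+.-]*://", "", feed_url.strip())
    return without_scheme.rstrip("/")


def podcast_guid(feed_url: str) -> uuid.UUID:
    """UUIDv5 of the stripped feed URL. Nothing else is normalized here."""
    return uuid.uuid5(PODCAST_NAMESPACE, strip_feed_url(feed_url))


if __name__ == "__main__":
    print(strip_feed_url("https://podnews.net/rss/"))
    print(podcast_guid("https://podnews.net/rss/"))
```

Note that only the scheme and trailing slashes are handled; case, percent-encoding and punycode pass through untouched, which is exactly the gap the question is about.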

Answered by daveajones

saerdnaer · 2023-02-25T08:43:12Z

Actually it does not really matter as the podcast:guid is constant and should be part of the feed itself – the described mechanism is only a fallback if the feed does not specify one. Especially when the feed url changes, the podcast:guid stays the same.

2 replies

keunes Feb 26, 2023

I would say it does matter. I don't have the numbers, but I bet there are many podcasts out there that don't define a GUID. Especially for decentralised systems like AntennaPod, in the context of sync, common agreement on how to produce GUIDs for those podcasts that don't have one is important: it lowers the chance of misalignment.

keunes Feb 26, 2023

Let me give an example in which agreement on this could help:

AntennaPod (A) is fully local, not connected to any server. It calculates GUIDs based on the URLs it has for podcasts that don't define one in the feed. It does so for Podcast x (P) and encodes characters (Pa).
An installation of Kasts (K) is fully local as well. Like A, K calculates GUIDs for podcasts that don't specify one. It does so for the same podcast P, but it does not encode characters, and thus has a different GUID (Pb).
The user then sets up an Open Podcast API server (S). A, K and S are not connected to external systems, and thus cannot retrieve the GUIDs of the feeds from an authoritative source like Podcast Index.
The user then hooks up both A and K to S. As S receives two different podcast GUIDs (Pa & Pb), it'll have two podcasts in its database and sync them both separately, leading to duplicates both in A & K.

This is not a great example and mitigations could be put in place (e.g. the server could check not only GUIDs but also the RSS URLs to find matches). But if there was agreement on how to deal with encoding when calculating GUIDs, this situation could be avoided.
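The Pa/Pb divergence in this scenario is easy to reproduce. The sketch below is illustrative only: the feed URLs are invented, the `guid` helper is hypothetical, and the namespace UUID is quoted from the podcast namespace documentation as I recall it. It feeds several spellings of what a human would call "the same" address into UUIDv5.

```python
import uuid

# Assumption: namespace UUID from the podcast namespace documentation.
PODCAST_NAMESPACE = uuid.UUID("ead4c236-bf58-58c6-a2c6-a6b28d128cb6")


def guid(stripped_url: str) -> uuid.UUID:
    """UUIDv5 of an already scheme-stripped feed URL (hypothetical helper)."""
    return uuid.uuid5(PODCAST_NAMESPACE, stripped_url)


# Invented feed addresses; each pair is semantically equivalent,
# but the strings (and therefore the bytes) differ.
variants = {
    "raw path": "example.com/café.xml",
    "percent-encoded path": "example.com/caf%C3%A9.xml",
    "unicode host": "müller.example/feed",
    "punycode host": "xn--mller-kva.example/feed",
    "lowercase host": "example.com/feed",
    "uppercase host": "EXAMPLE.com/feed",
}

guids = {label: guid(url) for label, url in variants.items()}

if __name__ == "__main__":
    for label, value in guids.items():
        print(f"{label:22} {value}")
```

Every variant yields a different GUID, so two clients that disagree on any of these details would report two podcasts to the server.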

daveajones (Maintainer) · 2023-02-26T13:41:27Z

I’m not sure why anything in the feed URL would be re-encoded. Why would the punycode or percent-encoding need to be modified before calculating the GUID? The feed URL needs to be in its native form as the input to the UUIDv5 algorithm so that we all get the same thing as output. In a sense it’s just the seed value. Since UUIDv5 uses SHA-1, the input doesn’t need to be in any particular form. It will just be treated as byte data in 512-bit chunks.

From RFC 4122: “Convert the name to a canonical sequence of octets”

Forgive me if I’ve misunderstood the issue.
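To make the "just byte data" point concrete, here is a small sketch that rebuilds UUIDv5 by hand and compares it with Python's `uuid.uuid5`. The function name `uuid5_by_hand` is made up, and the namespace UUID is the podcast namespace value as I recall it from the documentation. The choice of UTF-8 for turning the name into octets is what Python's standard library does; another implementation could in principle pick a different byte representation, which is where the "canonical sequence of octets" wording matters.

```python
import hashlib
import uuid

# Assumption: namespace UUID from the podcast namespace documentation.
PODCAST_NAMESPACE = uuid.UUID("ead4c236-bf58-58c6-a2c6-a6b28d128cb6")


def uuid5_by_hand(namespace: uuid.UUID, name: str) -> uuid.UUID:
    """SHA-1 over namespace bytes + name bytes, truncated to 16 bytes."""
    digest = hashlib.sha1(namespace.bytes + name.encode("utf-8")).digest()
    # version=5 makes the constructor set the version and variant bits.
    return uuid.UUID(bytes=digest[:16], version=5)


if __name__ == "__main__":
    name = "podnews.net/rss"
    print(uuid5_by_hand(PODCAST_NAMESPACE, name))
    print(uuid.uuid5(PODCAST_NAMESPACE, name))
```

The hash has no notion of URLs at all, so any difference in the input bytes, however cosmetic, changes the result.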

3 replies

JonOfUs Feb 26, 2023
Author

What I meant is that it maybe isn't always known whether the feed url is in its native form.
There might be several reasons why a feed url could be encoded differently, for example if a podcast feed contains url-encoded characters and podcast clients deal with them differently (one resolves encoding, the other not).
And when the guid is generated in different places simultaneously (e.g. on different synchronization servers), it would be preferable if the generated guids were the same.
Especially, since some normalization is already done with the removal of the protocol and the trailing slash.

jamescridland Feb 27, 2023

The canonical feed address would normally be in the RSS feed itself as a link rel="self".

<link rel="self" href="https://podnews.net/rss" xmlns="http://www.w3.org/2005/Atom" />

(I'm not sure if the xmlns is required here).

It's not a bad shout to suggest that this is always present in an RSS feed.
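A client could read that self link along these lines. This is a rough sketch using only the Python standard library; `find_self_link` and the sample feed are invented for illustration. On the xmlns question: a namespace-aware parser only recognizes the element as an Atom link if the Atom namespace is declared somewhere (on the element itself, as above, or via a prefix on the root); without it, the element would read as the plain RSS `link`.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"

# Invented sample feed; real feeds usually declare the atom prefix on <rss>.
SAMPLE_FEED = """<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Example</title>
    <link>https://podnews.net</link>
    <atom:link rel="self" href="https://podnews.net/rss" type="application/rss+xml" />
  </channel>
</rss>"""


def find_self_link(feed_xml: str):
    """Return the href of the channel's Atom rel="self" link, or None."""
    root = ET.fromstring(feed_xml)
    channel = root.find("channel")
    if channel is None:
        return None
    for link in channel.findall(f"{{{ATOM_NS}}}link"):
        if link.get("rel") == "self":
            return link.get("href")
    return None


if __name__ == "__main__":
    print(find_self_link(SAMPLE_FEED))
```

The returned address could then serve as the input to the GUID calculation, so that every client starts from the spelling the publisher chose.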

JonOfUs Feb 27, 2023
Author

Thanks! I didn't know about that, that sounds very useful for a normalized uuid generation.
