Log reason for unavailability for unavailable TikTok posts #467

stijn-uva opened this issue Nov 19, 2024 · 2 comments
Comments

@stijn-uva
Member

The TikTok URLs data source now logs skipped posts as 'not found, may have been removed, skipping', but the reason a post could not be found (404, private, etc.) can be interesting as well. We have a dataset of such unavailable posts, which we could use to detect the reason and log it more accurately.

(Thought: is the log the best place to store that information? Probably not, but there is no other obvious place to put it except the dataset itself, and that should include only existing posts.)

@dale-wahl
Member

I think there is an argument for including them in the dataset itself. After all, the URLs to those posts were specifically provided by the user in order to get a response (in a crawled dataset, e.g. Telegram, it may be more ambiguous). We could add a column (status/error/something like that) that describes whether the post was collected or, if not, what we know about why. That could then easily be used as a filter if, for example, you are only interested in available videos.

The knock-on effects may need to be handled, as some processors are not prepared for this. I do not think anything would fail, but results could mislead: if you are, say, counting likes, it would not be clear that certain posts do not have 0 likes but rather that their likes are unavailable. That could be handled by using MissingMappedField.
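
To make the idea concrete, here is a rough sketch of what a status column plus a "value unavailable" marker could look like. Everything in it is an assumption for illustration: `MissingValue` is a hypothetical stand-in (not the actual MissingMappedField class), and `map_post`, `total_likes`, the column names and the example URLs are made up.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class MissingValue:
    """Hypothetical stand-in for a 'value unavailable' marker."""
    default: Any = ""


def map_post(url, post=None, reason=None):
    """Build one dataset row; `post` is None when the post could not be collected."""
    if post is None:
        return {
            "url": url,
            # Assumed column: why the post is missing, if known.
            "status": reason or "unavailable",
            # Mark likes as unknown rather than pretending they are 0.
            "likes": MissingValue(default=0),
        }
    return {"url": url, "status": "collected", "likes": post["likes"]}


def total_likes(rows):
    """Sum likes over rows with a known count; also report how many are unknown."""
    known = [r["likes"] for r in rows if not isinstance(r["likes"], MissingValue)]
    return sum(known), len(rows) - len(known)


rows = [
    map_post("https://example.com/video/1", post={"likes": 12}),
    map_post("https://example.com/video/2", reason="private"),
    map_post("https://example.com/video/3", reason="404"),
]

# The status column doubles as a filter for available posts only.
available = [r for r in rows if r["status"] == "collected"]
likes, unknown = total_likes(rows)
```

A processor that counts likes could then report the number of rows with unknown values next to the total, instead of silently treating them as zero.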

@stijn-uva
Member Author

Hm, yes, and it could be an option in the data source: include unavailable items, yes/no. If no, only log them.
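
Roughly, such an option could behave like the sketch below. This is only an illustration under assumptions: `collect`, the `include_unavailable` flag, the `(url, post, reason)` tuples and the example URLs are hypothetical, not the data source's existing code.

```python
import logging

log = logging.getLogger("tiktok-urls-sketch")


def collect(results, include_unavailable=False):
    """Yield dataset rows from (url, post_or_None, reason_or_None) tuples.

    Unavailable posts are always logged; they are only added to the
    dataset when include_unavailable is True.
    """
    for url, post, reason in results:
        if post is not None:
            yield {"url": url, "status": "collected", **post}
            continue

        log.warning("%s not available (%s), skipping", url, reason or "unknown reason")
        if include_unavailable:
            yield {"url": url, "status": reason or "unavailable"}


sample = [
    ("https://example.com/video/1", {"likes": 12}, None),
    ("https://example.com/video/2", None, "private"),
    ("https://example.com/video/3", None, None),
]

only_existing = list(collect(sample))
with_unavailable = list(collect(sample, include_unavailable=True))
```

With the option off, the dataset keeps only existing posts and the reason ends up in the log; with it on, the same reason is also stored per row.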


2 participants