Feature/fix url and add better ordering with numbers #3

Open: wants to merge 10 commits into base: master

@GatorQue GatorQue commented Nov 9, 2019

Thank you for creating the General Conference Downloader tool. I fixed a few issues and added a new feature. Please accept this pull request or comment on what you would like me to change.

@GatorQue
Author

While testing the full range I ran into a problem trying to download MP3 files from 2016. I'm going to investigate and try to find a fix for this.

@GatorQue
Author

OK, it should be fixed now. I'm waiting to see how it does on older talks.

This was referenced Nov 14, 2019
@jdshaeffer

any update on this?

@GatorQue
Author

I haven't heard anything from the original author, but I have been told that this branch works great.

@clarkshaeffer

Hi, I'm experiencing a problem with your branch:

Problem with http request (https://www.churchofjesuschrist.org/languages: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1108)>
The given language (eng) is not available. Please choose one of the following:

I'm running Python 3.8.0 on macOS 10.14.6.
Anything helps! I need more of President Nelson in my life!!

@GatorQue
Author

Thank you for reaching out. I did a quick Google search on your error and came up with the following:
https://stackoverflow.com/questions/22027418/openssl-python-requests-error-certificate-verify-failed
It suggests running this command:
pip install certifi
You might also look for the Install Certificates program described here:
https://stackoverflow.com/questions/52805115/certificate-verify-failed-unable-to-get-local-issuer-certificate
/Applications/Python\ 3.8/Install\ Certificates.command

The issue is that some of the "root" certificates are missing on your computer, so Python is unable to validate the SSL connection to the church's website. Installing these "root" certificates will enable you to run the program.
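
To see where Python looks for root certificates before and after applying the fix, a small diagnostic along these lines may help. It is only a sketch: `describe_ca_paths` is a name made up for this example, and the `certifi` lookup is optional because that package may not be installed.

```python
import ssl


def describe_ca_paths():
    """Report where this Python installation looks for trusted root certificates."""
    paths = ssl.get_default_verify_paths()
    info = {
        # cafile and capath are None when the file or directory does not exist.
        "cafile": paths.cafile,
        "capath": paths.capath,
        # Location OpenSSL was built to use by default, whether or not it exists.
        "openssl_cafile": paths.openssl_cafile,
        # Filled in below only if the certifi package is importable.
        "certifi_bundle": None,
    }
    try:
        import certifi  # third-party; installed by "pip install certifi"

        info["certifi_bundle"] = certifi.where()
    except ImportError:
        pass
    return info


for name, value in describe_ca_paths().items():
    print(f"{name}: {value}")
```

If both `cafile` and `capath` print as `None`, Python has no default certificate store to check against, which would be consistent with the CERTIFICATE_VERIFY_FAILED error above.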

@clarkshaeffer

Worked like a charm! Thanks!

@rafaelmx

Hello,
I'm a bit embarrassed that I have to ask this but... How do I run this script?
I'm totally new to this but I'm super excited to get this working.
So far I've done the following:

  1. Installed Python 3.8 on Windows 10
  2. Downloaded the original files and extracted them in a folder I created (C:\Users\rsanc\Documents\Rafa\LDS-GC)
  3. Modified the three files GatorQue edited (I didn't know how to download them with the changes, so I made the changes one by one).
    And this is what I think I'm not doing correctly:
  4. I opened a terminal and navigated to the folder with the extracted files of the script.
  5. I typed on the terminal window python pip install -r requirements.txt and didn't get any message (previously I didn't use "python" at the beginning but got an error message, with no message I assumed I was doing it right)
  6. I typed on the terminal window python gen_conf_downloader.py. Nothing. No message. I also tried python gen_conf_downloader.py -s 2018 -d C:\GC with no success.

I ran steps 4 through 6 both before and after modifying the files, with the same results. Any help?

@GatorQue
Author

Hello!
Welcome! I'm not 100% sure, but I suspect that your terminal window doesn't know how to find Python. Do you remember whether you checked the "Add Python 3.8 to PATH" checkbox on the first screen of the installer? If not, can you try uninstalling Python 3.8 and reinstalling it, making sure this checkbox is checked?
I think once you do this, the command "pip install -r requirements.txt" should work as expected (you will probably see a bunch of things downloaded and installed), and "python gen_conf_downloader.py -s 2018" should work too.
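
One way to confirm whether the terminal can find Python and pip is a short check like the one below. This is a sketch for illustration; `find_on_path` and `report` are names invented here, and the script has to be started by some working interpreter (for example by giving the full path to python.exe).

```python
import shutil
import sys


def find_on_path(names=("python", "pip")):
    """Map each command name to its full path, or None if PATH does not contain it."""
    return {name: shutil.which(name) for name in names}


def report(names=("python", "pip")):
    """Build a human-readable summary of what the terminal can find."""
    lines = [f"running under: {sys.executable}"]
    for name, location in find_on_path(names).items():
        lines.append(f"{name}: {location or 'NOT FOUND on PATH'}")
    return "\n".join(lines)


print(report())
```

If `pip` is missing from PATH but `python` is found, running `python -m pip install -r requirements.txt` is a common alternative to calling `pip` directly.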

@rafaelmx

rafaelmx commented Jan 31, 2020

Wow... it worked! Just as you imagined, I didn't check the "Add Python 3.8 to PATH" option, so I uninstalled it and installed it again. It worked perfectly. Thank you very much for your help. I'm impressed with the result.

One question: what will happen to the current files the next time I download new audio? For instance, I have downloaded only 2018 and 2019. What if I then want to download 2016? Will the script skip the files already downloaded?

@GatorQue
Author

GatorQue commented Feb 1, 2020

During the download, the Python script usually caches all the HTML pages it downloads, which lets it avoid re-downloading those files. As long as you don't remove the cache directory, I think it should work as you expect. I believe it does recreate the playlist files, though, since those are usually affected. If I recall correctly, playlist files are created by topic, speaker, and session.
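
As a rough illustration of the caching idea (not the script's actual code), a page cache keyed by URL might look like the sketch below. The function name `cached_fetch`, the hashed file naming, and the `fetch` callback are all assumptions made for this example.

```python
import hashlib
from pathlib import Path


def cached_fetch(url, cache_dir, fetch):
    """Return the page text for url, calling fetch(url) only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so any URL maps to a safe, fixed-length file name.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
    path = cache_dir / name
    if path.exists():
        return path.read_text(encoding="utf-8")
    text = fetch(url)
    path.write_text(text, encoding="utf-8")
    return text
```

With this shape, deleting the cache directory simply forces every page to be fetched again on the next run.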

@Jacobobber1087

There have been a few changes to the church website, is there any chance this gets an update?

@GatorQue
Author

GatorQue commented Mar 1, 2024

@Jacobobber1087, I have been keeping this tool updated in my GitHub fork of this project. Have you given that a try?
https://github.com/GatorQue/LDSGeneralConferenceDownloader/releases
I use it for myself after every conference. If my version isn't working, I will be happy to look into it.

@Jacobobber1087

@GatorQue Oh ok, thank you! It seems to work, but the destination folder is empty after it completes. Do you know what could cause this?

@GatorQue
Author

GatorQue commented Mar 2, 2024

@Jacobobber1087, I see the same results. Let me look into what is causing this and post a new version. Something must have changed in the format of the HTML to prevent the program from working right.

@GatorQue
Author

GatorQue commented Mar 2, 2024

@Jacobobber1087 - It seems that the church has hidden the MP3 download link behind the "Options" side panel, which only loads when you click the "Options" button (three dots) and then the Download arrow. JavaScript code loads the Options side panel, and clicking the Download arrow somehow loads the link. I haven't found a good way to do that with my current approach. To fix this, I will need to find a Python-based web browser capable of running the JavaScript needed to make the MP3 media link appear. I will keep looking into this, but it isn't going to be the easy fix I was hoping for.

@Jacobobber1087

@GatorQue Yeah, I was very curious how you were getting around the JavaScript in previous versions of this haha... I ended up writing an automation in Microsoft Power Automate Desktop that uses Firefox to iterate through the pages and click through to the download link. It technically worked, but it took forever and was super clunky. Do you know of any way to interact with JavaScript from a script?

@GatorQue
Author

GatorQue commented Mar 3, 2024

@Jacobobber1087,
Great question. Until recently, the MP3 URL could be found in the giant BASE64 content in the initial HTML download. This changed sometime after October 2023.
From my research today, I have found that if the following JavaScript lines can be executed after the page loads, they should produce a DOM that includes an element (the last one mentioned) whose href value is the MP3 file URL we want:
document.querySelector('[title="Options"]').click()
document.querySelector('button[data-testid="download-menu-button"]').click()
document.querySelector('a[data-testid="download-link-0"]').href

As far as tools are concerned, I first looked at Splash, a Docker image with Qt5 WebKit and an HTTP API for performing queries (usually paired with the Python scrapy-splash package). I also discovered requests-html, which uses a headless Chromium install downloaded by the pyppeteer Python package (but since that package has been abandoned, the download fails). There is also the Python package Selenium, which can likewise drive a headless Chromium for web scraping, but I haven't done anything with it yet. I think if we can somehow combine the above JavaScript lines with a headless install of Chromium, we might be able to retrieve the information we need.
Another approach would be to identify what the JavaScript downloads and how it modifies the DOM to create the "This Page (MP3)" download element when we click the Options button and the Download arrow. Yet another approach might be to "predict" the media URL by guessing the filename from the information in the initial HTML, but I suspect that would be a less stable (though certainly faster) approach.
Thoughts?

@Jacobobber1087

@GatorQue Ok cool. I hadn't heard of a headless browser before, that seems like a really good solution. Would the browser need to be in the foreground? I assume not if you're sending requests through JS?
Predicting the URL would be tricky: they use titles for some General Authorities (but not all), and you would have to know the MP3 bitrate. If this information is in the HTML, that could work really well.
How did you get the list of the links to each conference? I had to do that manually because of how the church groups the conferences on their website.
I wonder if there is any way to access /assets/general-conference/ on the media2.ldscdn.org site? It doesn't allow a direct visit, maybe wget?

@GatorQue
Author

GatorQue commented Mar 5, 2024

@Jacobobber1087,
A headless browser is one that doesn't provide a GUI/window. This means requests must be sent some other way, usually through a REST API or a similar technique. Splash uses a custom REST API that allows injecting additional JavaScript commands to run after the page loads (which I haven't gotten to work fully yet).
As for getting the list of conferences, I perform an HTTP GET request for /study/general-conference and parse the HTML using several regular expressions to extract each conference, session, and talk into Python tuples. Feel free to look at the gen_conf_downloader.py file in my repository for more details.
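
To give a feel for the regular-expression approach, here is a small sketch. It is not taken from gen_conf_downloader.py: the pattern, the `extract_talks` name, and the assumed link shape (/study/general-conference/YEAR/MONTH/slug) are illustrative guesses, and the sample HTML in the usage lines is made up.

```python
import re

# Assumed link shape: /study/general-conference/<year>/<month>/<slug>[?query]
TALK_LINK = re.compile(
    r'<a[^>]*href="/study/general-conference/(\d{4})/(\d{2})/([^"?#]+)[^"]*"[^>]*>'
    r"(.*?)</a>",
    re.DOTALL,
)


def extract_talks(html):
    """Return (year, month, slug, title) tuples for every talk link in html."""
    talks = []
    for year, month, slug, inner in TALK_LINK.findall(html):
        # Drop any nested tags so only the visible link text remains.
        title = re.sub(r"<[^>]+>", "", inner).strip()
        talks.append((int(year), int(month), slug, title))
    return talks


sample = (
    '<li><a class="x" href="/study/general-conference/2023/10/11example?lang=eng">'
    "<span>Example Talk</span></a></li>"
)
print(extract_talks(sample))
```

Regular expressions like this are brittle against markup changes, which matches the experience described in this thread.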
I will need to re-review the talk and conference HTML to see whether enough information can be extracted to predict the media2 URL for the MP3 file. As for the MP3 bitrate, we could just have it try a few different bitrates until it finds one that works.
Unfortunately, I have not found any way to "browse" a list of all files on the media2.ldscdn.org site. Perhaps there is a hidden index file that would give the complete list, but I haven't seen evidence of this yet. The wget program likely wouldn't yield any different results against the media2 website. I ran wget against a talk page, and because all the pages are interlinked it practically started downloading every conference year and talk, so I gave up, since I want people to be able to limit which conferences they download. I didn't let it run long enough to see whether the MP3 files could be discovered, but I suspect they couldn't because of the JavaScript menu.
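
The "try a few bitrates" idea could be sketched roughly as follows. Everything specific here is an assumption: the real media URL pattern is not known in this thread, so the template is passed in by the caller, and `url_exists` and `first_working_url` are names invented for this example.

```python
import urllib.error
import urllib.request


def url_exists(url, timeout=10):
    """Return True if a HEAD request for url succeeds with a 2xx status."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, TimeoutError):
        # HTTPError (404 and similar) is a subclass of URLError.
        return False


def first_working_url(template, bitrates, exists=url_exists):
    """Fill template with each bitrate in turn and return the first URL that exists."""
    for bitrate in bitrates:
        candidate = template.format(bitrate=bitrate)
        if exists(candidate):
            return candidate
    return None
```

A call might look like `first_working_url("https://example.invalid/talk-{bitrate}k.mp3", [128, 64])`, where both the template and the bitrate values are placeholders rather than real ones.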

@GatorQue
Author

GatorQue commented Mar 8, 2024

@Jacobobber1087,
I am happy to report that using Selenium was successful in obtaining the media URL. The results are cached to a file, which is enabled by default now, such that future downloads will be faster. I am doing some more testing but should have an updated release posted soon.
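
For readers curious what a Selenium-based lookup could look like, here is a minimal sketch. It is not the code from the release. It replays the three JavaScript steps quoted earlier in this thread through a driver's `execute_script` method. The names `find_mp3_url`, `make_headless_chrome`, and `_poll` are invented for this example, the selectors may stop matching if the site changes again, and `make_headless_chrome` assumes Selenium 4 with Chrome installed.

```python
import time

# Selenium runs each script as a function body, so "return" passes a value back.
CLICK_OPTIONS = """
const el = document.querySelector('[title="Options"]');
if (!el) { return false; }
el.click();
return true;
"""
CLICK_DOWNLOAD = """
const el = document.querySelector('button[data-testid="download-menu-button"]');
if (!el) { return false; }
el.click();
return true;
"""
READ_LINK = """
const el = document.querySelector('a[data-testid="download-link-0"]');
return el ? el.href : null;
"""


def _poll(driver, script, timeout=10.0, interval=0.25):
    """Run script repeatedly until it returns a truthy value or timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = driver.execute_script(script)
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("page element did not appear in time")
        time.sleep(interval)


def find_mp3_url(driver, talk_url, timeout=10.0):
    """Open talk_url, click Options and Download, and return the MP3 link's href."""
    driver.get(talk_url)
    _poll(driver, CLICK_OPTIONS, timeout)
    _poll(driver, CLICK_DOWNLOAD, timeout)
    return _poll(driver, READ_LINK, timeout)


def make_headless_chrome():
    """Create a headless Chrome driver (requires: pip install selenium)."""
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    return webdriver.Chrome(options=options)


# Example use (needs Chrome and network access):
#     driver = make_headless_chrome()
#     try:
#         print(find_mp3_url(driver, "<talk page URL>"))
#     finally:
#         driver.quit()
```

Because each lookup drives a real browser, it is slow, which is presumably why caching the resulting URLs to a file makes later runs faster.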

@Jacobobber1087

@GatorQue Sorry for the late reply. I am currently serving as a missionary for the church so I do not have reliable access to a computer. I will look forward to the next release.

5 participants