Skip to content

Conversation

@peter-tanner
Copy link

Apologies for putting multiple features in one PR, I was modifying it for my courses while scraping them so they were all added spontaneously.

Happy to split into multiple PRs, though I won't be able to for a while due to IRL stuff.

Features:

  • Subtitles are downloaded when the --subtitles flag is passed and only downloaded if they do not exist (Resolves transcript download? #77)
  • Lectures containing media files are automatically downloaded. The only one I've seen so far has a mediaType of Presentation and is used to include PDF slides with the lecture video.
  • Add --dump-json flag for debugging and for anyone who wants to know what courses were not read etc since the scraping process will mark all videos as read.

Fixes:

@soraxas
Copy link
Owner

soraxas commented Feb 11, 2025

Awesome! Thanks for the great PR with these fixes and functionalities!

I've added some comments on something to fix and some clarification needed on some parts.

@MRDGH2821
Copy link

Is this being merged?

cookie["name"]: cookie["value"] for cookie in self._driver.get_cookies()
}
response = requests.get(
"https://echo360.net.au/user/enrollments", cookies=cookies
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is hardcoding the hostname. Should this comes from the user provided link?

self._output_dir, "{0}".format(self._course.nice_name).strip()
)
# replace invalid character for folder
self.regex_replace_invalid.sub("_", self._output_dir)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this still be applied? to avoid invalid chars

# no information about the course.
for v in self.course_data["data"]:
try:
self._course_name = v["lesson"]["video"]["published"]["courseName"]
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the enrollment methods always works for all institution? Should it be used as a fallback (one way or the other) instead?

for ch in illegal_chars:
path = path.replace(ch, "_")

reserved_names = {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain what these means? what's the use of f_?


def strip_illegal_path(path: str) -> str:
illegal_chars = '<>:"/\\|?*' + "".join(chr(c) for c in range(0, 32))
for ch in illegal_chars:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't a regex be suitable for this?

def __init__(self, video_json, driver, hostname, alternative_feeds, subtitles):
self.hostname = hostname
self._driver = driver
self._path_prefix = video_json["path_prefix"]
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems this new dict item is set in line 212 inside an If branch. Does this always exists?

else:
response = requests.get(media_url, cookies=cookies)
if response.status_code == 200:
print("> Downloading media {}...".format(media_filename))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the content has already been fully downloaded after the requests.get(..) in line 330. Either (i) move this print statement to be before the GET request, or (ii) use a HEAD request (and if 200) followed by a GET instead.

head = requests.head(vtt_url, cookies=cookies)
if head.status_code == 200:
print(
f"Original subtitle name: {head.headers['Content-Disposition']}"
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does Original mean?

return path


PERSISTENT_SESSION_FOLDER = "_browser_persistent_session"
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we keep the constant near the top of the file?

@soraxas
Copy link
Owner

soraxas commented Oct 17, 2025

Hey sorry not sure why github needs extra action to submit the review; the comments had stuck in limbo since Feb, thanks for pinging me.

There's some changes needed, eg, some url / links are hard coded and will not work for other people in general. (e.g. Hostname in RP is using echo360.net.au)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

transcript download? lecture name [[UNTITLED]], content with in its folder keeps cleaned and overrided by newer downloaded video

3 participants