
Add Mangasee scraper #70

Open · matoro wants to merge 20 commits into master from mangasee
Conversation

@matoro commented Feb 8, 2019

Adds support for Mangasee (mangaseeonline.us)

Edit: the original goal of this pull request was to add a single scraper. Through normal-use testing it has evolved into a patchset with additional tweaks, enhancements, further use cases, and two complete scrapers. Logically distinct changes have been split into separate pull requests, and revisions to the same change are updated via force-push.

@matoro matoro force-pushed the mangasee branch 4 times, most recently from f43028a to 60a5b59 Compare February 8, 2019 20:32
Adds support for Mangasee (mangaseeonline.us)
This basically ignores the exception that requests throws.  My
understanding is that it is raised when you attempt to iter_content()
over the same content twice.  I don't understand how that situation
arises with the current code, but it did somehow.
https://stackoverflow.com/questions/45379903/
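The exception in question is requests' StreamConsumedError. A minimal sketch of the suppression described above (the function name and paths are illustrative, not the project's actual code):

```python
import requests

def write_content(resp, path):
    """Stream a response body to disk, ignoring a second consumption attempt."""
    try:
        with open(path, "wb") as f:
            for chunk in resp.iter_content(chunk_size=8192):
                f.write(chunk)
    except requests.exceptions.StreamConsumedError:
        # iter_content() was already called on this response; the body
        # is gone, so there is nothing more to write.
        pass
```

This simply swallows the error, matching the commit's "ignore it" approach rather than tracking down why the stream is consumed twice.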
Assume that if the download_dir is explicitly set, the user
wanted it exactly that way.  If they include bad characters
and it breaks things, that is their fault.
Remove debugging imports, add more tests to the Mangasee scraper,
add support for multi-volume/multi-season titles, fix 404
detection on the Mangasee scraper, and change BeautifulSoup element
parsing to use find() instead of find_all().
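The find() vs. find_all() change can be sketched as follows; the markup and selectors here are illustrative, not Mangasee's actual page structure:

```python
from bs4 import BeautifulSoup

html = '<div class="list"><a class="chapter" href="/read/1">Ch. 1</a></div>'
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching element (or None), which is clearer
# than find_all(...)[0] when exactly one element is expected.
link = soup.find("a", class_="chapter")
assert link is not None
print(link["href"])
```

Unlike indexing into find_all(), a missing element yields None here rather than an IndexError, so the absent case can be handled explicitly.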
As discovered by this fix, there is a lack of test coverage
guaranteeing that series/chapter URLs match the expected regexes;
this needs to be addressed, since directly invoking from_url()
does not check matches against <scraper class>.url_re
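One way to close that gap is to have from_url() validate against url_re itself. A hedged sketch, assuming a class-level url_re attribute and a from_url() constructor as described above (the class name and pattern are illustrative, not the project's actual code):

```python
import re

class MangaseeScraper:
    # Illustrative pattern for series pages on mangaseeonline.us.
    url_re = re.compile(r"https?://mangaseeonline\.us/manga/[\w-]+/?$")

    @classmethod
    def from_url(cls, url):
        # Guard direct invocation: reject URLs the scraper does not claim.
        if not cls.url_re.match(url):
            raise ValueError(f"URL does not match scraper pattern: {url}")
        return cls()
```

With the guard in place, a test suite can assert both that valid URLs construct a scraper and that invalid ones raise, instead of silently proceeding.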
@matoro matoro force-pushed the mangasee branch 8 times, most recently from 5a2fca2 to 665884a Compare March 7, 2020 20:22
Somehow I was still getting sqlalchemy.exc.IntegrityError
exceptions thrown despite the explicit catch, due to violating
the unique url constraint.  This only happened for Mangadex
chapters.  I still don't understand why, but this at least
fixes it.
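The insert-or-skip pattern behind a fix like this can be sketched with stdlib sqlite3 rather than the project's SQLAlchemy models (table and function names are illustrative): look the row up by its unique URL before inserting, and still catch the constraint error in case a duplicate slips through anyway.

```python
import sqlite3

def add_chapter(conn, url, title):
    """Insert a chapter row, treating a duplicate URL as a no-op."""
    if conn.execute("SELECT 1 FROM chapters WHERE url = ?", (url,)).fetchone():
        return False  # already present; nothing to do
    try:
        conn.execute("INSERT INTO chapters (url, title) VALUES (?, ?)",
                     (url, title))
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        # The pre-check missed it (e.g. a concurrent insert); roll back
        # and treat the row as already present.
        conn.rollback()
        return False
```

The belt-and-suspenders combination of a pre-check plus a catch mirrors the situation in the commit, where the catch alone was evidently not sufficient.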
Note that this requires an optional change to be made to
scrapers: if you want your scraper to support retrying,
you must pass the URL of the image in the worker payload
so that it can be re-requested if necessary.  If the worker
fails to download an image and the scraper has not passed
this parameter, an error message will be emitted and the
old behavior will be used, i.e., everything crashes.
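The opt-in contract described above might look like the following sketch. All names here are assumptions for illustration: `payload["request"]` stands in for whatever one-shot download the scraper originally queued, and `fetch` is a callable that downloads a URL and returns its bytes.

```python
def download_image(payload, fetch, retries=3):
    """Run the queued download; retry from payload['url'] if the scraper opted in."""
    try:
        return payload["request"]()
    except OSError:
        if "url" not in payload:
            # Scraper did not pass the image URL: emit an error and fall
            # back to the old behavior of letting the failure propagate.
            print("error: scraper passed no image url; cannot retry")
            raise
    # The scraper opted in, so re-request the image from its URL.
    last_exc = None
    for _ in range(retries):
        try:
            return fetch(payload["url"])
        except OSError as exc:
            last_exc = exc
    raise last_exc
```

Keying the retry behavior off the presence of `url` in the payload keeps existing scrapers working unchanged, which matches the commit's "optional change" framing.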
Mangahere has removed its legacy mobile interface that made
for easy scraping; they are now protected by CloudFlare
Bot Management (the desktop version already was).

In my testing, the heuristic measures implemented here
have managed to reliably bypass the protection.  However,
since their anti-bot measures are now heuristic-based, there
is no guarantee that it will work for every host from
every network location.

Feedback is appreciated.
Also added more flexibility as ad scripts are added to or removed
from the site.  I don't know whether something changed in
BeautifulSoup to cause this, but this fixes the issue.