
Add Mangasee scraper #70

Open · matoro wants to merge 20 commits into master from mangasee
Conversation

@matoro commented Feb 8, 2019

Adds support for Mangasee (mangaseeonline.us)

Edit: the original goal of this pull request was to add a single scraper. Through normal-use testing it has evolved into a patchset with additional tweaks, enhancements, further use cases, and two complete scrapers. Logically distinct changes have been split into separate pull requests, and revisions to the same change are updated via force-push.

@matoro matoro force-pushed the mangasee branch 4 times, most recently from f43028a to 60a5b59 Compare February 8, 2019 20:32
Adds support for Mangasee (mangaseeonline.us)
This basically ignores the exception that requests throws.  My
understanding is that it is raised when you attempt to iter_content()
over the same content twice.  I don't understand how that situation
arises with the current code, but it did somehow.
https://stackoverflow.com/questions/45379903/
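The exception in question is requests' StreamConsumedError. A minimal sketch of the suppression described above (the function name and paths are illustrative, not the project's actual code):

```python
import requests

def write_content(resp, path):
    """Stream a response body to disk, ignoring a second consumption attempt."""
    try:
        with open(path, "wb") as f:
            for chunk in resp.iter_content(chunk_size=8192):
                f.write(chunk)
    except requests.exceptions.StreamConsumedError:
        # iter_content() was already called on this response; the body
        # is gone, so there is nothing more to write.
        pass
```

This simply swallows the error, matching the commit's "ignore it" approach rather than tracking down why the stream is consumed twice.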
Assume that if the download_dir is explicitly set, the user
wanted it exactly that way.  If they include bad characters
and it breaks things, that is their fault.
Remove debugging imports, add more tests to the Mangasee scraper,
add support for multi-volume/multi-season titles, fix 404
detection on the Mangasee scraper, and change BeautifulSoup element
parsing to use find() instead of find_all().
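The find() vs. find_all() change can be sketched as follows; the markup and selectors here are illustrative, not Mangasee's actual page structure:

```python
from bs4 import BeautifulSoup

html = '<div class="list"><a class="chapter" href="/read/1">Ch. 1</a></div>'
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching element (or None), which is clearer
# than find_all(...)[0] when exactly one element is expected.
link = soup.find("a", class_="chapter")
assert link is not None
print(link["href"])
```

Unlike indexing into find_all(), a missing element yields None here rather than an IndexError, so the absent case can be handled explicitly.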
As discovered by this fix, there is a lack of test coverage
guaranteeing that series/chapter URLs match the expected regexes;
this needs to be addressed, since directly invoking from_url()
does not check matches against <scraper class>.url_re
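One way to close that gap is to have from_url() validate against url_re itself. A hedged sketch, assuming a class-level url_re attribute and a from_url() constructor as described above (the class name and pattern are illustrative, not the project's actual code):

```python
import re

class MangaseeScraper:
    # Illustrative pattern for series pages on mangaseeonline.us.
    url_re = re.compile(r"https?://mangaseeonline\.us/manga/[\w-]+/?$")

    @classmethod
    def from_url(cls, url):
        # Guard direct invocation: reject URLs the scraper does not claim.
        if not cls.url_re.match(url):
            raise ValueError(f"URL does not match scraper pattern: {url}")
        return cls()
```

With the guard in place, a test suite can assert both that valid URLs construct a scraper and that invalid ones raise, instead of silently proceeding.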
@matoro matoro force-pushed the mangasee branch 8 times, most recently from 5a2fca2 to 665884a Compare March 7, 2020 20:22
Somehow I was still getting sqlalchemy.exc.IntegrityError
exceptions thrown despite the explicit catch, due to violating
the unique url constraint.  This only happened for Mangadex
chapters.  I still don't understand why, but this at least
fixes it.
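The insert-or-skip pattern behind a fix like this can be sketched with stdlib sqlite3 rather than the project's SQLAlchemy models (table and function names are illustrative): look the row up by its unique URL before inserting, and still catch the constraint error in case a duplicate slips through anyway.

```python
import sqlite3

def add_chapter(conn, url, title):
    """Insert a chapter row, treating a duplicate URL as a no-op."""
    if conn.execute("SELECT 1 FROM chapters WHERE url = ?", (url,)).fetchone():
        return False  # already present; nothing to do
    try:
        conn.execute("INSERT INTO chapters (url, title) VALUES (?, ?)",
                     (url, title))
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        # The pre-check missed it (e.g. a concurrent insert); roll back
        # and treat the row as already present.
        conn.rollback()
        return False
```

The belt-and-suspenders combination of a pre-check plus a catch mirrors the situation in the commit, where the catch alone was evidently not sufficient.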
Note that this requires an optional change to be made to
scrapers: if you want your scraper to support retrying,
you must pass the URL of the image in the worker payload
so that it can be re-requested if necessary.  If the worker
fails to download an image and the scraper has not passed
this parameter, an error message will be emitted and the
old behavior will be used, i.e., everything crashes.
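The opt-in contract described above might look like the following sketch. All names here are assumptions for illustration: `payload["request"]` stands in for whatever one-shot download the scraper originally queued, and `fetch` is a callable that downloads a URL and returns its bytes.

```python
def download_image(payload, fetch, retries=3):
    """Run the queued download; retry from payload['url'] if the scraper opted in."""
    try:
        return payload["request"]()
    except OSError:
        if "url" not in payload:
            # Scraper did not pass the image URL: emit an error and fall
            # back to the old behavior of letting the failure propagate.
            print("error: scraper passed no image url; cannot retry")
            raise
    # The scraper opted in, so re-request the image from its URL.
    last_exc = None
    for _ in range(retries):
        try:
            return fetch(payload["url"])
        except OSError as exc:
            last_exc = exc
    raise last_exc
```

Keying the retry behavior off the presence of `url` in the payload keeps existing scrapers working unchanged, which matches the commit's "optional change" framing.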
Mangahere has removed its legacy mobile interface that made
for easy scraping; they are now protected by CloudFlare
Bot Management (the desktop version already was).

In my testing, the heuristic measures implemented here
have managed to reliably bypass the protection.  However,
since their anti-bot measures are now heuristic-based, there
is no guarantee that it will work for every host from
every network location.

Feedback is appreciated.
Also added more flexibility as ad scripts are added to or removed
from the site.  I don't know whether something changed in
BeautifulSoup to cause this, but this fixes the issue.