Add Mangasee scraper #70
Open
matoro wants to merge 20 commits into Hamuko:master from matoro:mangasee
matoro force-pushed the mangasee branch 4 times, most recently from f43028a to 60a5b59 (February 8, 2019 20:32)
Adds support for Mangasee (mangaseeonline.us)
This works around an exception raised by requests by ignoring it. My understanding is that it is raised when iter_content() is called on the same content twice. I don't understand how that situation arises with the current code, but it did somehow. https://stackoverflow.com/questions/45379903/
Assume that if download_dir is explicitly set, the user wants it exactly that way. If it includes bad characters and that breaks things, that is the user's responsibility.
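The rule above could be sketched roughly as follows. This is a minimal illustration, not the project's code: `resolve_download_dir`, `UNSAFE_CHARS`, and the `downloads` default are hypothetical names chosen for the example, and the set of "bad characters" is an assumption.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed set of characters that tend to break file paths (hypothetical).
UNSAFE_CHARS = re.compile(r'[<>:"/\\|?*]')


def resolve_download_dir(series_name: str,
                         download_dir: Optional[str] = None,
                         base: str = "downloads") -> Path:
    """Pick the directory to download into (hypothetical helper)."""
    if download_dir is not None:
        # Explicitly set by the user: use it verbatim, no sanitizing.
        return Path(download_dir)
    # Derived from the series title: replace unsafe characters.
    return Path(base) / UNSAFE_CHARS.sub("_", series_name)
```

Only the automatically derived name is sanitized; an explicit value passes through untouched, so any breakage from bad characters there is left to the user.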
Remove debugging imports, add more tests for the Mangasee scraper, add support for multi-volume/multi-season titles, fix 404 detection in the Mangasee scraper, and change BeautifulSoup element parsing to use find() instead of find_all().
As this fix revealed, there is no test coverage guaranteeing that series/chapter URLs match the scraper regexes. This needs to be addressed, since invoking from_url() directly does not check the URL against <scraper class>.url_re.
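A sketch of the kind of check that would close this gap is shown here. Everything in it is hypothetical: `ExampleSeries`, its URL pattern, and the helper `matching_scraper` are invented for illustration and are not taken from the project.

```python
import re


class ExampleSeries:
    """Hypothetical scraper class; the URL pattern below is invented."""
    url_re = re.compile(
        r"^https?://mangaseeonline\.us/manga/(?P<slug>[\w-]+)/?$")

    def __init__(self, slug):
        self.slug = slug

    @classmethod
    def from_url(cls, url):
        # Builds an instance without ever consulting url_re, which is
        # why calling from_url() directly in a test proves nothing
        # about the regex.
        return cls(url.rstrip("/").rsplit("/", 1)[-1])


def matching_scraper(url, scrapers):
    """Return the first scraper class whose url_re matches, else None."""
    for scraper in scrapers:
        if scraper.url_re.match(url):
            return scraper
    return None


def test_series_url_is_routed():
    url = "https://mangaseeonline.us/manga/Some-Title"
    # Assert on the regex match first, then on from_url().
    assert matching_scraper(url, [ExampleSeries]) is ExampleSeries
    assert ExampleSeries.from_url(url).slug == "Some-Title"
```

A test written this way fails as soon as a URL format changes and url_re stops matching, even if from_url() would still parse the URL.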
matoro force-pushed the mangasee branch 8 times, most recently from 5a2fca2 to 665884a (March 7, 2020 20:22)
Somehow sqlalchemy.exc.IntegrityError exceptions were still being thrown for violations of the unique url constraint, despite the explicit catch. This only happened for Mangadex chapters. I still don't understand why, but this fixes it at least.
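For illustration, the general pattern of tolerating a unique-constraint violation looks something like the sketch below. It uses the standard-library sqlite3 module as a stand-in rather than SQLAlchemy, and the `chapters` table and `insert_chapter` helper are hypothetical.

```python
import sqlite3


def insert_chapter(conn, url):
    """Insert a chapter URL; return False if it already exists."""
    try:
        # The connection context manager commits on success and
        # rolls back if the block raises.
        with conn:
            conn.execute("INSERT INTO chapters (url) VALUES (?)", (url,))
        return True
    except sqlite3.IntegrityError:
        # Unique constraint on url was violated: treat as a duplicate.
        return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chapters (id INTEGER PRIMARY KEY, url TEXT NOT NULL UNIQUE)")
```

One possible explanation for an exception escaping an explicit catch, which nothing here confirms, is that a SQLAlchemy session emits its INSERT at flush or commit time rather than when the object is added, so the try block has to cover the flush or commit as well.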
Note that this requires an optional change to be made to scrapers. If you want your scraper to support retrying, you must pass the URL of the image in the worker payload so that it can be re-requested if necessary. If the worker fails to download an image and the scraper has not passed this parameter, an error message will be emitted and the old behavior will be used, i.e., crash everything.
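The behavior described above might look roughly like this. It is a sketch under assumptions: the payload is modeled as a dict with a `read` callable and an optional `url` key, and `run_worker`, `fetch`, and `max_retries` are hypothetical names, not the project's actual worker interface.

```python
import logging

log = logging.getLogger("retry_sketch")


def run_worker(payload, fetch, max_retries=3):
    """Download one image, re-requesting by URL if the first read fails.

    payload["read"] is a callable that returns the image bytes from the
    original response; payload["url"] is optional (both hypothetical).
    """
    try:
        return payload["read"]()
    except OSError as first_error:
        url = payload.get("url")
        if url is None:
            # Scraper did not opt in: report it and keep the old
            # behavior of letting the failure propagate.
            log.error("image download failed and no url was passed; "
                      "cannot retry")
            raise
        last_error = first_error
        for attempt in range(1, max_retries + 1):
            try:
                return fetch(url)
            except OSError as error:
                last_error = error
                log.warning("retry %d/%d failed for %s",
                            attempt, max_retries, url)
        raise last_error
```

Scrapers that omit the URL keep working exactly as before, which is what makes the change optional.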
Mangahere has removed its legacy mobile interface, which made for easy scraping; the site is now protected by CloudFlare Bot Management (the desktop version already was). In my testing, the heuristic measures implemented here have managed to reliably bypass the protection. However, since their anti-bot measures are now heuristic-based, there is no guarantee that this will work for every host from every network location. Feedback is appreciated.
Also added more flexibility as ad scripts are added to or removed from the site.
I don't know if something changed in BeautifulSoup to cause this, but this fixes the issue.
Also, update scraper.
Adds support for Mangasee (mangaseeonline.us)

Edit: originally the goal of this pull request was to add an additional scraper. Through normal-use testing it has evolved into a patchset with additional tweaks, enhancements, additional use cases, and two complete scrapers. Logically distinct changes have been separated into individual pull requests, and tweaks to the same change have been updated via force-push.