-
@HU90m contributed their thoughts already in #185. See below.

A simplified view of the current flow could be illustrated as:

*[diagram of the current flow]*

I think an elegant implementation of this idea would be to remove the collector and move the extractors into the client.

```rust
pub struct Request {
    /// A valid Uniform Resource Identifier of a given endpoint, which can be
    /// checked with lychee
    pub uri: Uri,
    /// The resource which contained the given URI
    pub source: InputSource,
    // ...
    /// Depth of the link,
    /// where a depth of 0 is a link provided in the user input.
    pub depth: u32,
}
```

When the client handles a request, it will fetch the file (from the file system or the network). If successfully fetched, it will then parse the file, extracting links and optionally fragments, if the file isn't already in its cache. After that, the file will be added to the cache in the client along with its links and fragments. The extracted links will be converted to new `Request`s. This gives us recursive checks for free.

A simplistic illustration of the above:

*[diagram of the proposed flow]*
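A rough sketch in code of how the client could handle a request under this scheme (the types are simplified stand-ins, and `fetch`/`extract_*` are stubs for illustration, not lychee's real API):

```rust
use std::collections::HashMap;

// Simplified stand-ins for the real types.
#[derive(Clone)]
struct Request {
    uri: String,
    depth: u32,
}

// What the client caches per file: its extracted links and fragments.
struct CachedFile {
    links: Vec<String>,
    fragments: Vec<String>,
}

struct Client {
    max_depth: u32,
    cache: HashMap<String, CachedFile>,
}

impl Client {
    /// Handle one request: fetch and parse the file unless it's already
    /// cached, then turn its links into follow-up requests.
    fn handle(&mut self, request: &Request) -> Vec<Request> {
        if !self.cache.contains_key(&request.uri) {
            // Fetch from the file system or the network (stubbed out here),
            // then run the extractors over the body.
            let body = fetch(&request.uri);
            let parsed = CachedFile {
                links: extract_links(&body),
                fragments: extract_fragments(&body),
            };
            self.cache.insert(request.uri.clone(), parsed);
        }

        // Recursion falls out naturally: every extracted link becomes a new
        // request one level deeper, until the depth limit is reached.
        if request.depth >= self.max_depth {
            return Vec::new();
        }
        self.cache[&request.uri]
            .links
            .iter()
            .map(|link| Request { uri: link.clone(), depth: request.depth + 1 })
            .collect()
    }
}

// Stubs for illustration only.
fn fetch(_uri: &str) -> String { String::new() }
fn extract_links(_body: &str) -> Vec<String> { Vec::new() }
fn extract_fragments(_body: &str) -> Vec<String> { Vec::new() }

fn main() {
    let mut client = Client { max_depth: 2, cache: HashMap::new() };
    // A simple worklist drives the recursion.
    let mut queue = vec![Request { uri: "input.md".into(), depth: 0 }];
    while let Some(request) = queue.pop() {
        queue.extend(client.handle(&request));
    }
}
```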
-
@mre An alternative view here. The scope of 'dead hyperlink checker' seems to be naturally limited to interfacing with network hosts (most often, HTTP servers). Checking fragment identifiers etc. is within the application (MVC: view) layer. There are competent testing tools to test web apps etc. on these aspects. Lychee can never attain feature parity with such tools without greatly increasing application scope, to a scale that seems infeasible given the current dev team size. Note that, for example, anchors are often generated after rendering a page in current front-end web dev paradigms. Why fight yesterday's battle and support checking fixed anchors? I, therefore, propose to not change the software architecture for this and to not check fragment IDs.
-
@sanmai-NL, these are reasonable concerns. It is true that this is an uphill battle, but I do think that there's a middle ground. The two main features that people seem to need are checking anchors in Markdown files and checking fragments in HTTP links.
As you mentioned, what we need to steer clear of is assuming too much about the underlying logic of web apps. But what we can support is making a call to the correct URL to see if it is up. This is harder for Markdown than for anchors in URLs, because in Markdown the anchor that gets generated is not standardized. We already cover that part for GitHub, though, which is a quasi-standard supported by many platforms. We could still make that part configurable, and there really aren't that many rules (see the sketch below).

For HTTP links, it's not a parsing problem, but merely a limitation of the way we treat links. Fixing the architecture in that area will automatically fix that issue and simultaneously unlock many other features like recursion support and global rate limiting, so it's a big win either way.

In summary, the architecture discussion is mostly about planned features, while still keeping support for HTTP anchors in mind when doing the refactoring.
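For illustration, here's roughly what GitHub-style anchor generation looks like (a simplification of the github-slugger rules; the real algorithm also de-duplicates repeated headings and handles a few more characters):

```rust
/// Approximate GitHub-style slug for a Markdown heading: lowercase,
/// drop punctuation, keep alphanumerics/underscores/hyphens, and turn
/// spaces into hyphens.
fn github_anchor(heading: &str) -> String {
    heading
        .to_lowercase()
        .chars()
        .filter(|c| c.is_alphanumeric() || matches!(*c, ' ' | '-' | '_'))
        .collect::<String>()
        .replace(' ', "-")
}

fn main() {
    // `## Getting Started!` would be linkable as `#getting-started`.
    assert_eq!(github_anchor("Getting Started!"), "getting-started");
}
```

Other renderers (GitLab, mdBook, etc.) apply slightly different rules, which is why making the slug rules configurable seems workable.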
-
Moved to discussion because it's not strictly an issue with lychee. |
-
Sorry it's taken me so long to type up my thoughts. I really like your architecture suggestion and the move to thinking in interfaces.

The main thing that I'm not sure about is the placement of the cache(s). I think a unified cache sitting in front of the request relay would make sense. One could then move the extractor between the relay and the cache, which would mean the extractor only has to parse the response body once for everything (links and fragments at the moment).

I've written some pseudo-structures below to help with the explanation. Inputs are converted to `Request`s. The result also gets passed to the recursion check, which checks the `depth`:

```rust
struct Request {
    uri: Uri,
    /// Where the link originated from.
    /// This will either be a CLI input or a linked page if using recursion.
    source: Source,
    // ...
    retry_count: u32,
    redirect_count: u32,
    /// Depth of the link,
    /// where a depth of 0 is a link provided in the user input.
    depth: u32,
}

struct Response {
    request: Request,
    status: Status,
}

enum Status {
    /// Request was successful
    Ok(StatusOk),
    /// Failed request
    Error(ErrorKind),
    /// Request timed out
    Timeout(Option<StatusCode>),
    /// Got redirected to a different resource
    Redirected(StatusCode),
    // ...
}

struct StatusOk {
    code: StatusCode,
    links: Option<Vec<Uri>>,
    fragments: Option<Vec<Fragment>>,
}
```

I'm not sure about channels vs. streams. I'd guess it may make sense to use streams in all but the few places where we want to enforce rate limiting.
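To illustrate the streams side, a minimal sketch using `futures` and `tokio` (the pacing interval and concurrency limit are made-up numbers, and `check` is a stub standing in for the request relay):

```rust
use std::time::Duration;

use futures::{pin_mut, stream, StreamExt};

// Stub standing in for the request relay.
async fn check(uri: String) -> (String, bool) {
    (uri, true)
}

#[tokio::main]
async fn main() {
    let uris: Vec<String> = vec![
        "https://example.com".into(),
        "https://example.org".into(),
    ];

    let results = stream::iter(uris)
        // Rate-limit issuance: `then` runs one future at a time, so
        // sleeping here paces new requests to roughly one per 100 ms...
        .then(|uri| async move {
            tokio::time::sleep(Duration::from_millis(100)).await;
            uri
        })
        // ...while up to 16 checks may still be in flight at once.
        .map(check)
        .buffer_unordered(16);
    pin_mut!(results);

    while let Some((uri, ok)) = results.next().await {
        println!("{uri}: {ok}");
    }
}
```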
-
Here's our current architecture:

*[architecture diagram]*
In #185 we added preliminary support for anchor tags / fragments.
We discussed that supporting fragments in URLs (e.g. `https://foo/bar.html#frag`, as in option 3 in the link types) won't be a small change. We'd probably have to make a few changes to the architecture, because right now there's no way for the `check` step to "ask" for all links of a given input, or even just ask whether a link occurs in a given input (e.g. "is `https://foo/bar.html#frag` valid?").

One way to go about it might be to fully decouple input handling from link checking (where inputs can be files or websites).
The fragment cache is a basic version of that, but it's limited to fragments. I think we need a bigger cache for all inputs we encounter, plus a central entity which manages this cache. We can think of it as an abstraction on top of the network and the file system, purpose-built for our use case.
It could lazy-load resources on demand and store the parsed information from inputs, which would be used by the rest of the system; so our parsed representation would be the ground-truth for the rest of the link checking. For each input, it would contain a big map of the URI of the input (i.e. the path or URL) and its parsed links/fragments.
It should be fully async, and we will need read/write access throughout the program's runtime.
Maybe this is even a graph problem, but I don't feel comfortable going down that route.
In any case, we will need a lot of discussions to come up with a solid design.
However we model it, checking whether an input contains a link or fragment should be trivial for any other part of the program. We should not deal with ad-hoc resource fetching within the checking code.
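A minimal sketch of what I have in mind (all names hypothetical; assuming `tokio` for the async primitives):

```rust
use std::collections::HashMap;
use std::sync::Arc;
use tokio::sync::RwLock;

// Hypothetical parsed representation of one input.
#[derive(Clone, Default)]
struct ParsedInput {
    links: Vec<String>,
    fragments: Vec<String>,
}

/// A central, async cache over the file system and the network: inputs are
/// lazy-loaded on first use, and their parsed form becomes the ground truth
/// for the rest of the program.
#[derive(Clone, Default)]
struct InputCache {
    map: Arc<RwLock<HashMap<String, ParsedInput>>>,
}

impl InputCache {
    /// Return the parsed input for `uri`, fetching and parsing on demand.
    async fn get_or_load(&self, uri: &str) -> ParsedInput {
        if let Some(parsed) = self.map.read().await.get(uri) {
            return parsed.clone();
        }
        // Note: two tasks could race to load the same input here; a real
        // implementation would want per-key locking or request coalescing.
        let parsed = fetch_and_parse(uri).await;
        self.map
            .write()
            .await
            .insert(uri.to_string(), parsed.clone());
        parsed
    }

    /// The kind of query the checking code needs: does this input
    /// contain the given fragment?
    async fn has_fragment(&self, uri: &str, fragment: &str) -> bool {
        self.get_or_load(uri).await.fragments.iter().any(|f| f == fragment)
    }
}

// Stub: read from disk or the network, then run the extractors.
async fn fetch_and_parse(_uri: &str) -> ParsedInput {
    ParsedInput::default()
}
```

With something like this, "is `https://foo/bar.html#frag` valid?" becomes a single `has_fragment` call instead of ad-hoc fetching inside the checker.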
I'd be happy for any design feedback.