-
@HU90m contributed their thoughts already in #185. See below.

A simplified view of the current flow could be illustrated as:

*[diagram of the current flow]*

I think an elegant implementation of this idea would be to remove the collector and move the extractors into the client.

```rust
pub struct Request {
    /// A valid Uniform Resource Identifier of a given endpoint, which can be
    /// checked with lychee
    pub uri: Uri,
    /// The resource which contained the given URI
    pub source: InputSource,
    // ...
    /// Depth of the link,
    /// where a depth of 0 is a link provided in the user input.
    pub depth: u32,
}
```

When the client handles a request, it will fetch the file (from the file system or the network). If successfully fetched, it will then parse the file, extracting links and optionally fragments, if the file isn't already in its cache. After that, the file will be added to the cache in the client along with its links and fragments. The extracted links will be converted to new `Request`s. This gives us recursive checks for free.

A simplistic illustration of the above:

*[diagram of the proposed flow]*
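A rough sketch in code of how the client could handle a request under this scheme (the types are simplified stand-ins, and `fetch`/`extract_*` are stubs for illustration, not lychee's real API):

```rust
use std::collections::HashMap;

// Simplified stand-ins for the real types.
#[derive(Clone)]
struct Request {
    uri: String,
    depth: u32,
}

// What the client caches per file: its extracted links and fragments.
struct CachedFile {
    links: Vec<String>,
    fragments: Vec<String>,
}

struct Client {
    max_depth: u32,
    cache: HashMap<String, CachedFile>,
}

impl Client {
    /// Handle one request: fetch and parse the file unless it's already
    /// cached, then turn its links into follow-up requests.
    fn handle(&mut self, request: &Request) -> Vec<Request> {
        if !self.cache.contains_key(&request.uri) {
            // Fetch from the file system or the network (stubbed out here),
            // then run the extractors over the body.
            let body = fetch(&request.uri);
            let parsed = CachedFile {
                links: extract_links(&body),
                fragments: extract_fragments(&body),
            };
            self.cache.insert(request.uri.clone(), parsed);
        }

        // Recursion falls out naturally: every extracted link becomes a new
        // request one level deeper, until the depth limit is reached.
        if request.depth >= self.max_depth {
            return Vec::new();
        }
        self.cache[&request.uri]
            .links
            .iter()
            .map(|link| Request { uri: link.clone(), depth: request.depth + 1 })
            .collect()
    }
}

// Stubs for illustration only.
fn fetch(_uri: &str) -> String { String::new() }
fn extract_links(_body: &str) -> Vec<String> { Vec::new() }
fn extract_fragments(_body: &str) -> Vec<String> { Vec::new() }

fn main() {
    let mut client = Client { max_depth: 2, cache: HashMap::new() };
    // A simple worklist drives the recursion.
    let mut queue = vec![Request { uri: "input.md".into(), depth: 0 }];
    while let Some(request) = queue.pop() {
        queue.extend(client.handle(&request));
    }
}
```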
-
@mre An alternative view here. The scope of 'dead hyperlink checker' seems to be naturally limited to interfacing with network hosts (most often, HTTP servers). Checking fragment identifiers etc. is within the application (MVC: view) layer. There are competent testing tools to test web apps etc. on these aspects. Lychee can never attain feature parity with such tools without greatly increasing application scope, to a scale that seems infeasible given the current dev team size. Note that, for example, anchors are often generated after rendering a page in current front-end web dev paradigms. Why fight yesterday's battle and support checking fixed anchors? I, therefore, propose to not change the software architecture for this and to not check fragment IDs.
-
@sanmai-NL, these are reasonable concerns. It is true that this is an uphill battle, but I do think that there's a middle ground. The two main features that people seem to need are checking anchors in Markdown files and checking fragments in HTTP links.
As you mentioned, what we need to steer clear of is assuming too much about the underlying logic of web apps. But what we can support is making a call to the correct URL to see if it is up. This is harder for Markdown than for anchors in URLs, because in Markdown the anchor that gets generated is not standardized. We already cover that part for GitHub, though, which is a quasi-standard supported by many platforms. We could still make that part configurable, and there really aren't that many rules (see the sketch below).

For HTTP links, it's not a parsing problem, but merely a limitation of the way we treat links. Fixing the architecture in that area will automatically fix that issue and simultaneously unlock many other features like recursion support and global rate limiting, so it's a big win either way.

In summary, the architecture discussion is mostly about planned features, while still keeping support for HTTP anchors in mind when doing the refactoring.
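For illustration, here's roughly what GitHub-style anchor generation looks like (a simplification of the github-slugger rules; the real algorithm also de-duplicates repeated headings and handles a few more characters):

```rust
/// Approximate GitHub-style slug for a Markdown heading: lowercase,
/// drop punctuation, keep alphanumerics/underscores/hyphens, and turn
/// spaces into hyphens.
fn github_anchor(heading: &str) -> String {
    heading
        .to_lowercase()
        .chars()
        .filter(|c| c.is_alphanumeric() || matches!(*c, ' ' | '-' | '_'))
        .collect::<String>()
        .replace(' ', "-")
}

fn main() {
    // `## Getting Started!` would be linkable as `#getting-started`.
    assert_eq!(github_anchor("Getting Started!"), "getting-started");
}
```

Other renderers (GitLab, mdBook, etc.) apply slightly different rules, which is why making the slug rules configurable seems workable.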
-
Moved to discussion because it's not strictly an issue with lychee. |
-
Sorry it's taken me so long to type up my thoughts. I really like your architecture suggestion and the move to thinking in interfaces.

The main thing that I'm not sure about is the placement of the cache(s). I think a unified cache sitting in front of the request relay would make sense. One could then move the extractor between the relay and the cache, which would mean the extractor only has to parse the response body once for everything (links and fragments at the moment).

I've written some pseudo-structures below to help with the explanation. Inputs are converted to `Request`s. The result also gets passed to the recursion check, which checks the `depth`:

```rust
struct Request {
    uri: Uri,
    /// Where the link originated from.
    /// This will either be a CLI input or a linked page if using recursion.
    source: Source,
    // ...
    retry_count: u32,
    redirect_count: u32,
    /// Depth of the link,
    /// where a depth of 0 is a link provided in the user input.
    depth: u32,
}

struct Response {
    request: Request,
    status: Status,
}

enum Status {
    /// Request was successful
    Ok(StatusOk),
    /// Failed request
    Error(ErrorKind),
    /// Request timed out
    Timeout(Option<StatusCode>),
    /// Got redirected to a different resource
    Redirected(StatusCode),
    // ...
}

struct StatusOk {
    code: StatusCode,
    links: Option<Vec<Uri>>,
    fragments: Option<Vec<Fragment>>,
}
```

I'm not sure about channels vs. streams. I'd guess it may make sense to use streams in all but the few places where we want to enforce rate limiting.
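To illustrate the streams side, a minimal sketch using `futures` and `tokio` (the pacing interval and concurrency limit are made-up numbers, and `check` is a stub standing in for the request relay):

```rust
use std::time::Duration;

use futures::{pin_mut, stream, StreamExt};

// Stub standing in for the request relay.
async fn check(uri: String) -> (String, bool) {
    (uri, true)
}

#[tokio::main]
async fn main() {
    let uris: Vec<String> = vec![
        "https://example.com".into(),
        "https://example.org".into(),
    ];

    let results = stream::iter(uris)
        // Rate-limit issuance: `then` runs one future at a time, so
        // sleeping here paces new requests to roughly one per 100 ms...
        .then(|uri| async move {
            tokio::time::sleep(Duration::from_millis(100)).await;
            uri
        })
        // ...while up to 16 checks may still be in flight at once.
        .map(check)
        .buffer_unordered(16);
    pin_mut!(results);

    while let Some((uri, ok)) = results.next().await {
        println!("{uri}: {ok}");
    }
}
```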
-
Here's our current architecture:

*[architecture diagram]*
In #185 we added preliminary support for anchor tags / fragments.
We discussed that supporting fragments in URLs (e.g. `https://foo/bar.html#frag`, as in option 3 in the link types) won't be a small change. We'd probably have to make a few changes to the architecture, because right now there's no way for the `check` step to "ask" for all links of a given input, or even just ask whether a link occurs in a given input (e.g. "is `https://foo/bar.html#frag` valid?").

One way to go about it might be to fully decouple input handling from link checking (where inputs can be files or websites).
The fragment cache is a basic version of that, but it's limited to fragments. I think we need a bigger cache for all inputs we encounter, plus a central entity which manages this cache. We can think of it as an abstraction on top of the network and the file system, purpose-built for our use case.
It could lazy-load resources on demand and store the parsed information from inputs, which would be used by the rest of the system; so our parsed representation would be the ground-truth for the rest of the link checking. For each input, it would contain a big map of the URI of the input (i.e. the path or URL) and its parsed links/fragments.
It should be fully async, and we will need read/write access throughout the program's runtime.
Maybe this is even a graph problem, but I don't feel comfortable going down that route.
In any case, we will need a lot of discussions to come up with a solid design.
However we model it, checking whether an input contains a link or fragment should be trivial for any other part of the program. We should not deal with ad-hoc resource fetching within the checking code.
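A minimal sketch of what I have in mind (all names hypothetical; assuming `tokio` for the async primitives):

```rust
use std::collections::HashMap;
use std::sync::Arc;
use tokio::sync::RwLock;

// Hypothetical parsed representation of one input.
#[derive(Clone, Default)]
struct ParsedInput {
    links: Vec<String>,
    fragments: Vec<String>,
}

/// A central, async cache over the file system and the network: inputs are
/// lazy-loaded on first use, and their parsed form becomes the ground truth
/// for the rest of the program.
#[derive(Clone, Default)]
struct InputCache {
    map: Arc<RwLock<HashMap<String, ParsedInput>>>,
}

impl InputCache {
    /// Return the parsed input for `uri`, fetching and parsing on demand.
    async fn get_or_load(&self, uri: &str) -> ParsedInput {
        if let Some(parsed) = self.map.read().await.get(uri) {
            return parsed.clone();
        }
        // Note: two tasks could race to load the same input here; a real
        // implementation would want per-key locking or request coalescing.
        let parsed = fetch_and_parse(uri).await;
        self.map
            .write()
            .await
            .insert(uri.to_string(), parsed.clone());
        parsed
    }

    /// The kind of query the checking code needs: does this input
    /// contain the given fragment?
    async fn has_fragment(&self, uri: &str, fragment: &str) -> bool {
        self.get_or_load(uri).await.fragments.iter().any(|f| f == fragment)
    }
}

// Stub: read from disk or the network, then run the extractors.
async fn fetch_and_parse(_uri: &str) -> ParsedInput {
    ParsedInput::default()
}
```

With something like this, "is `https://foo/bar.html#frag` valid?" becomes a single `has_fragment` call instead of ad-hoc fetching inside the checker.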
I'd be happy for any design feedback.