Multi-threaded Link Checker
Let us use our new knowledge to create a multi-threaded link checker. It should start at a webpage and check that links on the page are valid. It should recursively check other pages on the same domain and keep doing this until all pages have been validated.
For this, you will need an HTTP client such as reqwest. Create a new Cargo project and add reqwest as a dependency with:
$ cargo new link-checker
$ cd link-checker
$ cargo add --features blocking,rustls-tls reqwest
If cargo add fails with "error: no such subcommand", then please edit the Cargo.toml file by hand. Add the dependencies listed below.
You will also need a way to find links. We can use scraper for that:
$ cargo add scraper
Finally, we’ll need some way of handling errors. We use thiserror for that:
$ cargo add thiserror
The cargo add calls will update the Cargo.toml file to look like this:
[dependencies]
reqwest = { version = "0.11.12", features = ["blocking", "rustls-tls"] }
scraper = "0.13.0"
thiserror = "1.0.37"
You can now download the start page. Try with a small site such as https://www.google.org/.
Your src/main.rs file should look something like this:
use reqwest::blocking::{get, Response};
use reqwest::Url;
use scraper::{Html, Selector};
use thiserror::Error;

#[derive(Error, Debug)]
enum Error {
    // `#[from]` lets the `?` operator convert a `reqwest::Error` into this variant.
    #[error("request error: {0}")]
    ReqwestError(#[from] reqwest::Error),
}

fn extract_links(response: Response) -> Result<Vec<Url>, Error> {
    // Copy the page URL before `text()` consumes the response.
    let base_url = response.url().to_owned();
    let document = response.text()?;
    let html = Html::parse_document(&document);
    // Select every anchor element on the page.
    let selector = Selector::parse("a").unwrap();

    let mut valid_urls = Vec::new();
    for element in html.select(&selector) {
        if let Some(href) = element.value().attr("href") {
            // Resolve relative links against the URL of the page itself.
            match base_url.join(href) {
                Ok(url) => valid_urls.push(url),
                Err(err) => {
                    println!("On {base_url}: could not parse {href:?}: {err} (ignored)");
                }
            }
        }
    }

    Ok(valid_urls)
}

fn main() {
    let start_url = Url::parse("https://www.google.org").unwrap();
    let response = get(start_url).unwrap();
    match extract_links(response) {
        Ok(links) => println!("Links: {links:#?}"),
        Err(err) => println!("Could not extract links: {err:#}"),
    }
}
Run the code in src/main.rs with
$ cargo run
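If you are curious what the thiserror attributes on the Error enum do, the sketch below writes out a rough hand-made equivalent using only the standard library. It is an approximation for illustration, not the exact code the macro generates. To keep it free of external crates, std::num::ParseIntError stands in for reqwest::Error, and the names ParseError and parse_port are hypothetical.

```rust
use std::fmt;
use std::num::ParseIntError;

// Stand-in for the `Error` enum in src/main.rs. `ParseIntError` plays the
// role of `reqwest::Error` so that this sketch needs no external crates.
#[derive(Debug)]
enum Error {
    ParseError(ParseIntError),
}

// Roughly what `#[error("...: {0}")]` provides: a `Display` implementation.
impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Error::ParseError(err) => write!(f, "parse error: {err}"),
        }
    }
}

// Roughly what `#[derive(Error)]` provides: the `std::error::Error` trait,
// with the wrapped error exposed as the source.
impl std::error::Error for Error {
    fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
        match self {
            Error::ParseError(err) => Some(err),
        }
    }
}

// Roughly what `#[from]` provides: a `From` conversion used by `?`.
impl From<ParseIntError> for Error {
    fn from(err: ParseIntError) -> Self {
        Error::ParseError(err)
    }
}

// Hypothetical helper that fails with the inner error type.
fn parse_port(text: &str) -> Result<u16, Error> {
    // On failure, `?` calls `Error::from`, much like `response.text()?`
    // does in `extract_links`.
    let port = text.parse::<u16>()?;
    Ok(port)
}

fn main() {
    for input in ["8080", "not a number"] {
        match parse_port(input) {
            Ok(port) => println!("Port: {port}"),
            Err(err) => println!("Could not parse {input:?}: {err}"),
        }
    }
}
```

With the derive macro, all three impl blocks collapse into the two attributes shown in src/main.rs, which is why the exercise uses thiserror instead of writing them by hand.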
Tasks
- Use threads to check the links in parallel: send the URLs to be checked to a channel and let a few threads check the URLs in parallel.
- Extend this to recursively extract links from all pages on the www.google.org domain. Put an upper limit of 100 pages or so, so that you don’t end up being blocked by the site.
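As a starting point for the first task, here is a minimal sketch of one possible worker-pool layout built on std::sync::mpsc. It is not the reference solution. The names check_link and check_in_parallel are hypothetical, and check_link is only a placeholder that performs no network request; in the real exercise it would fetch the page with reqwest. The sketch does not cover the recursive crawl or the page limit.

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical placeholder for the real check. It does no network I/O and
// simply treats "https://" URLs as valid so the sketch runs offline.
fn check_link(url: &str) -> bool {
    url.starts_with("https://")
}

// Sends every URL through a channel to `num_workers` threads and collects
// one `(url, is_valid)` pair per URL. Result order is not deterministic.
fn check_in_parallel(urls: Vec<String>, num_workers: usize) -> Vec<(String, bool)> {
    let (url_tx, url_rx) = mpsc::channel::<String>();
    let (result_tx, result_rx) = mpsc::channel::<(String, bool)>();
    // A std `Receiver` cannot be cloned, so the workers share it via a mutex.
    let url_rx = Arc::new(Mutex::new(url_rx));

    let mut workers = Vec::new();
    for _ in 0..num_workers {
        let url_rx = Arc::clone(&url_rx);
        let result_tx = result_tx.clone();
        workers.push(thread::spawn(move || loop {
            // The lock guard is dropped at the end of this statement, so the
            // mutex is held only while receiving, not while checking.
            let next = url_rx.lock().unwrap().recv();
            match next {
                Ok(url) => {
                    let ok = check_link(&url);
                    result_tx.send((url, ok)).unwrap();
                }
                // The sending side was dropped: no more work.
                Err(_) => break,
            }
        }));
    }
    // Drop our own sender so `result_rx` closes once all workers finish.
    drop(result_tx);

    for url in urls {
        url_tx.send(url).unwrap();
    }
    // Closing the URL channel tells the workers to stop.
    drop(url_tx);

    let results: Vec<(String, bool)> = result_rx.iter().collect();
    for worker in workers {
        worker.join().unwrap();
    }
    results
}

fn main() {
    let urls = vec![
        "https://www.google.org/".to_string(),
        "ftp://files.example/".to_string(),
    ];
    for (url, ok) in check_in_parallel(urls, 4) {
        println!("{url}: {}", if ok { "ok" } else { "broken" });
    }
}
```

Wrapping the receiver in Arc<Mutex<...>> is one common way to share a single std channel between several consumers. For the second task, the workers would also need to send newly discovered links back to a coordinator that tracks visited pages and enforces the page limit.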