Multi-threaded Link Checker

Let us use our new knowledge to create a multi-threaded link checker. It should start at a web page and check that the links on the page are valid. It should recursively check other pages on the same domain, and keep doing this until all pages have been validated.

For this, you will need an HTTP client such as reqwest. You will also need a way to find links; we can use scraper for that. Finally, we will need some way of handling errors; we will use thiserror.

Create a new Cargo project and add reqwest as a dependency with:

  cargo new link-checker
  cd link-checker
  cargo add --features blocking,rustls-tls reqwest
  cargo add scraper
  cargo add thiserror

If cargo add fails with error: no such subcommand, then edit the Cargo.toml file by hand and add the dependencies listed below.

The cargo add calls will update the Cargo.toml file to look like this:

  [package]
  name = "link-checker"
  version = "0.1.0"
  edition = "2021"
  publish = false

  [dependencies]
  reqwest = { version = "0.11.12", features = ["blocking", "rustls-tls"] }
  scraper = "0.13.0"
  thiserror = "1.0.37"

You can now download the start page. Try with a small site such as https://www.google.org/.

Your src/main.rs file should look something like this:

  use reqwest::blocking::Client;
  use reqwest::Url;
  use scraper::{Html, Selector};
  use thiserror::Error;

  #[derive(Error, Debug)]
  enum Error {
      #[error("request error: {0}")]
      ReqwestError(#[from] reqwest::Error),
      #[error("bad http response: {0}")]
      BadResponse(String),
  }

  #[derive(Debug)]
  struct CrawlCommand {
      url: Url,
      extract_links: bool,
  }

  fn visit_page(client: &Client, command: &CrawlCommand) -> Result<Vec<Url>, Error> {
      println!("Checking {:#}", command.url);
      let response = client.get(command.url.clone()).send()?;
      if !response.status().is_success() {
          return Err(Error::BadResponse(response.status().to_string()));
      }

      let mut link_urls = Vec::new();
      if !command.extract_links {
          return Ok(link_urls);
      }

      let base_url = response.url().to_owned();
      let body_text = response.text()?;
      let document = Html::parse_document(&body_text);

      // Collect the href attribute of every <a> element and resolve it
      // against the URL of the page it was found on.
      let selector = Selector::parse("a").unwrap();
      let href_values = document
          .select(&selector)
          .filter_map(|element| element.value().attr("href"));
      for href in href_values {
          match base_url.join(href) {
              Ok(link_url) => {
                  link_urls.push(link_url);
              }
              Err(err) => {
                  println!("On {base_url:#}: ignored unparsable {href:?}: {err}");
              }
          }
      }
      Ok(link_urls)
  }

  fn main() {
      let client = Client::new();
      let start_url = Url::parse("https://www.google.org").unwrap();
      let crawl_command = CrawlCommand { url: start_url, extract_links: true };
      match visit_page(&client, &crawl_command) {
          Ok(links) => println!("Links: {links:#?}"),
          Err(err) => println!("Could not extract links: {err:#}"),
      }
  }

Run the code in src/main.rs with:

  cargo run

Tasks

  • Use threads to check the links in parallel: send the URLs to be checked to a channel and let a few threads check the URLs in parallel (one possible structure is sketched after this list).
  • You can extend this to recursively extract links from all pages on the www.google.org domain. Put an upper limit of 100 pages or so, so that you do not end up being blocked by the site.
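
One possible way to structure the multi-threaded version is sketched below. It keeps the Error enum, CrawlCommand struct, and visit_page function from the listing above unchanged, spawns a fixed pool of worker threads that pull CrawlCommands from a shared mpsc channel, and lets main act as a coordinator that tracks visited URLs and in-flight requests. This is only a rough sketch under those assumptions, not the course's reference solution; the worker count and the 100-page limit are arbitrary illustrative values.

  // New imports, in addition to those already in src/main.rs:
  use std::collections::HashSet;
  use std::sync::{mpsc, Arc, Mutex};
  use std::thread;

  // Error, CrawlCommand, and visit_page stay exactly as above; only main changes.
  fn main() {
      let start_url = Url::parse("https://www.google.org").unwrap();
      let start_host = start_url.host_str().map(str::to_string);
      let max_pages = 100; // rough upper bound so we do not hammer the site
      let num_workers = 8;

      // Commands flow from the coordinator to the workers, results flow back.
      let (command_tx, command_rx) = mpsc::channel::<CrawlCommand>();
      let (result_tx, result_rx) = mpsc::channel::<Result<Vec<Url>, Error>>();
      let command_rx = Arc::new(Mutex::new(command_rx));

      for _ in 0..num_workers {
          let command_rx = Arc::clone(&command_rx);
          let result_tx = result_tx.clone();
          thread::spawn(move || {
              let client = Client::new();
              loop {
                  // Take the next command; stop once the sending side is gone.
                  let command = match command_rx.lock().unwrap().recv() {
                      Ok(command) => command,
                      Err(_) => break,
                  };
                  if result_tx.send(visit_page(&client, &command)).is_err() {
                      break;
                  }
              }
          });
      }
      drop(result_tx); // the coordinator only reads results

      // Coordinator: remember which URLs were already scheduled and stop
      // scheduling new ones after max_pages.
      let mut visited = HashSet::new();
      let mut in_flight = 0;
      visited.insert(start_url.clone());
      command_tx.send(CrawlCommand { url: start_url, extract_links: true }).unwrap();
      in_flight += 1;

      while in_flight > 0 {
          in_flight -= 1;
          match result_rx.recv().unwrap() {
              Ok(links) => {
                  for link in links {
                      // Only recurse into pages on the original domain.
                      let same_domain = link.host_str().map(str::to_string) == start_host;
                      if visited.len() < max_pages && visited.insert(link.clone()) {
                          command_tx
                              .send(CrawlCommand { url: link, extract_links: same_domain })
                              .unwrap();
                          in_flight += 1;
                      }
                  }
              }
              Err(err) => println!("Got crawling error: {err:#}"),
          }
      }
      println!("Checked {} URLs.", visited.len());
  }

Running this with cargo run should print one "Checking ..." line per URL as the workers make progress. Sharing the Receiver behind an Arc<Mutex<...>> is the simplest way to fan commands out to several threads; an alternative design is to give each worker its own channel and distribute commands round-robin.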