Case Study: A Web Scraper Using BeautifulSoup4

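We use the Amazon Kindle eBook bestseller list page for this demonstration: https://www.amazon.cn/gp/bestsellers/digital-text/116169071
[Screenshot: TIM截图20181025174133.png]
Using BeautifulSoup4 as the parser, we scrape each product's ASIN, title, price, star rating, number of reviews, and product link, and store them in a .csv file.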
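
The extraction calls used in this case study can be tried offline first. The sketch below parses a small hand-written HTML fragment. The markup, the `SAMPLE_HTML` name, and the `parse_items` helper are assumptions made up for illustration (the fragment only borrows the tag and class names the scraper looks for) and do not reproduce the real Amazon page. It uses Python's built-in `html.parser` so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, written by hand for illustration only.
SAMPLE_HTML = """
<ol id="zg-ordered-list">
  <li class="zg-item-immersion">
    <a target="_blank" href="/dp/B00EXAMPLE/ref=zg_bs_1">
      <div data-rows="1"> Sample Book One </div>
    </a>
    <span class="a-icon-alt">4.5 stars</span>
    <a class="a-size-small a-link-normal" href="/product-reviews/B00EXAMPLE">1,234</a>
    <span class="p13n-sc-price">9.99</span>
  </li>
</ol>
"""


def parse_items(html):
    """Return one dict per <li class="zg-item-immersion"> in the fragment."""
    soup = BeautifulSoup(html, 'html.parser')
    ordered_list = soup.find('ol', id='zg-ordered-list')
    items = []
    # A string as the second positional argument filters by CSS class.
    for li in ordered_list.find_all('li', 'zg-item-immersion'):
        items.append({
            'href': li.find('a', target='_blank')['href'],
            'title': li.select("div[data-rows='1']")[0].get_text().strip(),
            'price': li.find('span', 'p13n-sc-price').text,
            'star': li.find('span', 'a-icon-alt').text,
            'reviews': li.select("a[class='a-size-small a-link-normal']")[0].get_text(),
        })
    return items


if __name__ == '__main__':
    for item in parse_items(SAMPLE_HTML):
        print(item)
```

`find` and `find_all` match on tag name plus attributes, while `select` takes a CSS selector; the fragment exercises both styles.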

    import csv

    import requests
    from bs4 import BeautifulSoup


    def amazon():
        base_url = 'https://www.amazon.cn'
        url = 'https://www.amazon.cn/gp/bestsellers/digital-text/116169071'
        headers = {
            'Connection': 'keep-alive',
            'Cache-Control': 'max-age=0',
            'Upgrade-Insecure-Requests': '1',
            'DNT': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
            'Accept': 'image/webp,image/apng,image/*,*/*;q=0.8',
            # 'br' is left out: requests can only decode Brotli when the
            # optional brotli package is installed
            'Accept-Encoding': 'gzip, deflate',
            'Accept-Language': 'zh-HK,zh-CN;q=0.9,zh;q=0.8',
            'Referer': 'https://www.amazon.cn/gp/bestsellers/digital-text/116169071',
        }
        # Send the requests to Amazon; visit the home page of the same site
        # first so that the session picks up its cookies
        session = requests.session()
        session.get(base_url, headers=headers)
        res_html = session.get(url, headers=headers).content.decode('utf-8')
        # Turn the response text into a BeautifulSoup object
        html_soup = BeautifulSoup(res_html, 'lxml')
        # Get the <li> tags of all products
        all_goods_li = html_soup.find('ol', id='zg-ordered-list').find_all('li', 'zg-item-immersion')
        # Write the data to a file named amazon_book.csv; set newline='',
        # otherwise a blank line appears between every two rows
        with open('./amazon_book.csv', 'a', newline='', encoding='gb18030') as csv_file:
            writer = csv.writer(csv_file)
            for li in all_goods_li:
                # Product link
                link = base_url + li.find('a', target='_blank')['href']
                # Product ASIN
                asin = link.split('/dp/')[1].split('/')[0]
                # Title
                title = li.select("div[data-rows='1']")[0].get_text().strip()
                # Price
                price = li.find('span', 'p13n-sc-price').text
                # Star rating
                star = li.find('span', 'a-icon-alt').text
                # Number of reviews
                reviews = li.select("a[class='a-size-small a-link-normal']")[0].get_text()
                # Collect the scraped fields into one row and write it out
                goods_info_list = [asin, title, price, star, reviews, link]
                writer.writerow(goods_info_list)


    if __name__ == '__main__':
        amazon()
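
The script obtains the ASIN by splitting the link on `/dp/`, which raises an `IndexError` when a link has no such segment. A minimal sketch of a more defensive variant follows. The `extract_asin` helper and the `ASIN_PATTERN` name are hypothetical, and the pattern assumes that an ASIN is the 10-character uppercase alphanumeric path segment right after `/dp/`.

```python
import re

# Assumption: the ASIN is the 10-character segment that follows "/dp/".
ASIN_PATTERN = re.compile(r'/dp/([A-Z0-9]{10})(?:[/?]|$)')


def extract_asin(link):
    """Return the ASIN found in a product link, or None if there is none."""
    match = ASIN_PATTERN.search(link)
    return match.group(1) if match else None


if __name__ == '__main__':
    # A made-up product link that contains a "/dp/" segment.
    print(extract_asin('https://www.amazon.cn/dp/B00EXAMPLE/ref=zg_bs_1'))
    # The bestseller list URL itself has no "/dp/" segment.
    print(extract_asin('https://www.amazon.cn/gp/bestsellers/digital-text/116169071'))
```

Returning `None` instead of raising lets the caller decide whether to skip the item or record an empty field.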

Open amazon_book.csv in the current directory to see the result, as in this screenshot:
[Screenshot: TIM截图20181025174651.png]
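
The output can also be checked from Python instead of a spreadsheet program. The sketch below reads the file back with the same gb18030 encoding that was used for writing. The `read_rows` helper and the `COLUMNS` names are assumptions for illustration (the script writes no header row, so the names only follow the order in which the fields are appended), and the demo writes one made-up row to a temporary file so that it runs without network access.

```python
import csv
import os
import tempfile

# Assumed column names, in the order the scraper appends the fields.
COLUMNS = ['asin', 'title', 'price', 'star', 'reviews', 'link']


def read_rows(path, encoding='gb18030'):
    """Read the CSV file back as a list of dicts keyed by column name."""
    with open(path, newline='', encoding=encoding) as csv_file:
        return [dict(zip(COLUMNS, row)) for row in csv.reader(csv_file)]


if __name__ == '__main__':
    # Made-up sample row, written the same way the scraper writes its rows.
    sample = ['B00EXAMPLE', '示例图书', '9.99', '4.5', '1,234',
              'https://www.amazon.cn/dp/B00EXAMPLE']
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, 'amazon_book.csv')
        with open(path, 'a', newline='', encoding='gb18030') as csv_file:
            csv.writer(csv_file).writerow(sample)
        for row in read_rows(path):
            print(row)
```

Reading with a different encoding than the one used for writing would garble or reject the Chinese titles, so the two must match.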


Copyright © 黑五电商学院 amzfriday.com. All rights reserved.