When I’m writing webscrapers I mostly just pivot between selenium (because the website is too “fancy” and definitely needs a browser) and pure requests calls (both in conjunction with bs4).

But when reading about scrapers, scrapy is often the first mentioned Python package. What am I missing out on if I’m not using it?
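For context, the requests + bs4 combo I mean looks roughly like this (the URL and the `a.title` selector are made-up placeholders):

```python
import requests
from bs4 import BeautifulSoup

def scrape_titles(url):
    # Plain requests call; no browser needed for static pages.
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return extract_titles(resp.text)

def extract_titles(html):
    # bs4 does the parsing; "a.title" is an illustrative selector.
    soup = BeautifulSoup(html, "html.parser")
    return [a.get_text(strip=True) for a in soup.select("a.title")]
```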

  • Wats0ns@programming.dev · 11 months ago

    The huge feature of scrapy is its pipeline system: you scrape a page, pass it to the filtering stage, then to the deduplication stage, then to the DB, and so on

    Hugely useful when you’re scraping and extracting data; if you’re only saving raw pages then it’s less useful, I guess

    • qwertyasdef@programming.dev · 11 months ago

      Oh shit that sounds useful. I just did a project where I implemented a custom stream class to chain together calls to requests and beautifulsoup.

      • Wats0ns@programming.dev · 11 months ago

        Yep, try scrapy. It also handles the concurrency of your pipeline items for you, configuration for every part, …
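The wiring lives in the project settings; `ITEM_PIPELINES` and `CONCURRENT_REQUESTS` are real scrapy settings, but the module paths and priority numbers below are illustrative:

```python
# settings.py — lower priority number runs first in the pipeline chain
ITEM_PIPELINES = {
    "myproject.pipelines.FilterPipeline": 100,
    "myproject.pipelines.DedupPipeline": 200,
    "myproject.pipelines.DbPipeline": 300,
}

# scrapy manages request concurrency for you
CONCURRENT_REQUESTS = 16
```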