
Google2Csv is a simple Google scraper that saves the results to a CSV file

panos_sa | 68 points

If you're looking to do any serious search engine scraping, you'll find yourself needing to use proxies.

For my thesis, which required millions of data points, I used this tool: http://www.scrapebox.com/

elektor | 4 years ago

Nice, until the CSV part really. I mean you have a nice DataFrame there but then it gets serialized using probably the worst format out there for totally variable data like what you scrape from a site.

Unless the file uses the actual ASCII record separator (RS), you'll end up with a CSV file that only a handful of programs can read, and only after you tell them explicitly what the separator and quoting rules are. Even then it's hit or miss. And it most likely doesn't use the RS: even though the RS makes ambiguity far less likely and was actually meant for this purpose, software doesn't typically use it. When CSV was invented, its existence was unknown, or it was ignored because it's not human-readable (I guess, I don't really know, actually), and so the sad story began.
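To make the separator point concrete, here is a minimal sketch of ASCII-delimited text in Python. It is an illustration, not what Google2Csv does. It assumes the usual ASCII convention of the unit separator (0x1F) between fields and the record separator (0x1E) between records, and it assumes the scraped text never contains those two control characters. The helper names `dump_ascii_delimited` and `load_ascii_delimited` are made up for this example.

```python
US = "\x1f"  # ASCII unit separator: placed between fields
RS = "\x1e"  # ASCII record separator: placed between records


def dump_ascii_delimited(rows):
    """Join rows of string fields; no quoting or escaping is needed.

    Assumes no field contains US or RS itself.
    """
    return RS.join(US.join(fields) for fields in rows)


def load_ascii_delimited(text):
    """Split the text back into a list of rows of string fields."""
    return [record.split(US) for record in text.split(RS)]


# Fields with commas, quotes, semicolons and newlines: the usual CSV trouble.
rows = [
    ["title", "snippet"],
    ['He said "hi", then left', "line one\nline two; still one field"],
]

encoded = dump_ascii_delimited(rows)
decoded = load_ascii_delimited(encoded)
```

Because commas, quotes and newlines have no special meaning here, the reader needs no separator or quoting settings. The cost is the one mentioned above: the result is not pleasant to open in a text editor.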

As you can see: I'm not a fan of CSV :) Just today I again had to waste time because at one point in the development of this otherwise fine piece of software I'm working on - even though I knew I'd regret it - I allowed it to export CSV files. Customer moved software to another machine, forgot that they once told the CSV exporter part to use the system settings, and now has CSV files with a comma separator (you know, the C in CSV). Oh the irony, that's not what they wanted.
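One way to sidestep the "use the system settings" trap, sketched with Python's standard `csv` module, is to hard-code the separator and quoting rules in the exporter and use the same ones when reading. The function name `export_csv` and the sample rows are hypothetical; this is not how the software in the anecdote works.

```python
import csv
import io


def export_csv(rows):
    """Serialize rows with a fixed dialect, independent of machine settings."""
    buf = io.StringIO(newline="")
    writer = csv.writer(
        buf,
        delimiter=",",          # always a comma, never a locale-dependent choice
        quotechar='"',
        quoting=csv.QUOTE_ALL,  # quote every field so embedded commas are safe
        lineterminator="\r\n",
    )
    writer.writerows(rows)
    return buf.getvalue()


rows = [["query", "title"], ["a,b", 'say "x"']]
text = export_csv(rows)

# Reading back with the same explicit rules recovers the original rows.
parsed = list(csv.reader(io.StringIO(text, newline=""), delimiter=",", quotechar='"'))
```

This only helps when both ends agree on the pinned rules. A spreadsheet that guesses the separator from the locale can still misread the file.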

stinos | 4 years ago

> Scraping google search results is illegal.

IANAL, can someone please elaborate? This sounds wrong in several ways, one of which being that Google results are almost exclusively scraped from somewhere else already.

lazyjones | 4 years ago

Further to the practical/technical issue of being blocked, there is a legal issue: this way of coding it (= not using an API key) violates Google's Terms and Conditions.

jll29 | 4 years ago

I made a wonderful little Google scraper in Clojure once. I was surprised to see I got IP blocked after only 20 or so searches.

lbj | 4 years ago

This is a great way of getting your IP permanently blacklisted and swamped by captchas.

nurettin | 4 years ago

This is a cool project. I've used Scraperr[0] for years, but it's always great to have alternatives.

[0] http://scraperr.com/

Minor49er | 4 years ago

Neat!

Have you run into issues with getting blocked by Google / issued captchas?

xur17 | 4 years ago