Published in: Bash

This code is POC only -- actually using it would violate Google's TOS, which forbids scraping. It is published here for educational value only.
Hypothetically, the following command should return a list of the top 500 or so hits in Google for onemorebug.com.
The results will be prepended with digits, followed by a dot and some whitespace (Lynx adds these).
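If the numbering gets in the way, it can be stripped with sed. The following is only a sketch: the function name strip_lynx_numbering and the sample input are made up for illustration, and real Lynx output may be spaced differently.

```shell
# Hypothetical helper: remove a leading "  12. " style prefix
# (optional whitespace, digits, a dot, optional whitespace) from each line.
strip_lynx_numbering() {
  sed 's/^[[:space:]]*[0-9][0-9]*\.[[:space:]]*//'
}

# Sample input invented for illustration; nothing is fetched here.
printf '   1. http://onemorebug.com/\n  12. http://onemorebug.com/about\n' | strip_lynx_numbering
```

Appending such a filter after the final grep would leave only the bare URLs.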
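You must have Perl, Lynx and xargs installed on your system for this to work. Wget itself is not needed: WGET in the command is only the name of a Perl filehandle that pipes each URL to Lynx.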
Keep in mind that *nix shells expand $i inside double quotes before Perl ever sees it, so on those systems wrap the script in single quotes instead; see the comments.
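To see what the quoting does, here is a small sketch that stores a fragment of the script in a shell variable instead of handing it to perl. The variable names double_quoted and single_quoted are illustrative, and i is set to an empty string to mimic a shell session where it has no value.

```shell
# Assumption: i has no value in the calling shell (the usual case).
i=''

# Double quotes: the shell substitutes $i itself, so the program
# would receive a script with the variable names missing.
double_quoted="$i=0;while($i<1000)"

# Single quotes: the text is passed through untouched.
single_quoted='$i=0;while($i<1000)'

printf 'double: %s\n' "$double_quoted"
printf 'single: %s\n' "$single_quoted"
```

The double-quoted form leaves a script that begins with a bare "=", which fits the kind of syntax error reported in the comments.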
perl -e "$i=0;while($i<1000){sleep 1; open(WGET,qq/|xargs lynx -dump/);printf WGET qq{http://www.google.com/search?q=site:onemorebug.com&hl=en&start=$i&sa=N},$i+=10}" | grep "\/\/[^/]*onemorebug.com\/"
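The paging logic in the one-liner can be tried offline. The sketch below only prints the URLs the loop would request, stepping start= by 10 from 0 up to (but not including) a limit; it fetches nothing. The function name gen_urls and its two parameters are hypothetical, not part of the original command.

```shell
# Hypothetical helper: print paginated search URLs without contacting any server.
#   $1 = query string (e.g. site:onemorebug.com)
#   $2 = exclusive upper bound for start= (the one-liner uses 1000)
gen_urls() {
  query=$1
  limit=$2
  i=0
  while [ "$i" -lt "$limit" ]; do
    printf 'http://www.google.com/search?q=%s&hl=en&start=%s&sa=N\n' "$query" "$i"
    i=$((i + 10))
  done
}

# Same range as the one-liner: start=0, 10, ..., 990.
gen_urls 'site:onemorebug.com' 1000
```

With a limit of 1000 this yields 100 URLs, one per results page, which matches the loop bounds in the Perl command.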
Comments

syntax error at -e line 1, near "=" Unterminated operator at -e line 1.
??
@hemanthhm I don't know what to tell you -- it works fine for me, I just double-checked.
If a Perl one-liner wrapped in double quotes (i.e. perl -e ".....") produces a syntax error, try single quotes instead -> perl -e '.....' Applying that pattern to the script at hand, we get -> perl -e '$i=0;while($i
I think I finally get what the problem was here: *nix shells expand variables inside double quotes, so the script needs single quotes there, while the DOS/Windows shell needs double quotes. So you have to be aware of which platform you are on and wrap the argument to perl -e in the appropriate type of quotes.
Thanks for the nice perl command. That for sure is one more proof that perl is a spaghetti language but powerful :-)
While this perl/lynx code will work to get results, it won't really work well.
I recently stumbled upon an article called "Scraping Google for Fun and Profit" that goes much deeper into the subject. It shows how you can scrape not just a few hundred but millions of hits from Google. Free PHP code is included, covering filtering of advertisements and parsing the data (title, description, host, url, etc.) into an array.
Works for web and console.
Here is the article, hope you like it: http://google-scraper.squabbel.com