Recently I needed to scrape some results from Google, and I came across the popular “xgoogle” library, originally by Peteris Krumins. Google’s results format has changed since the library was last updated in 2010, and a large number of the examples simply didn’t work. Worse, some of the examples wouldn’t run at all! Even the simplest included example returned only 5 of the 50 expected search results.
I took it upon myself to fix these issues in my fork on GitHub. I also added a fully working example4.py, which fixes the problem with the previous example1 (only returning 5 out of 50 results), cycles through _all_ of the returned results, and shows how to be a “good Google citizen” by not spamming requests.
Let’s take a look at the new Example 4:
Each page of 50 results is retrieved with results = gs.get_results(). After each result page there is a random sleep of between 15 and 60 seconds, achieved with time.sleep(randint(15,60)).
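To make the paging pattern concrete, here is a minimal sketch of that kind of loop. It is not the actual example4.py: fetch_all_pages is a hypothetical helper of my own, and get_page stands in for any zero-argument callable that returns one page of results as a list and an empty list once nothing is left (I am assuming gs.get_results behaves that way).

```python
import time
from random import randint


def fetch_all_pages(get_page, min_delay=15, max_delay=60, sleep=time.sleep):
    """Collect results page by page, pausing a random time after each page.

    get_page is a hypothetical stand-in for something like gs.get_results:
    a zero-argument callable returning a list, empty when results run out.
    """
    all_results = []
    while True:
        page = get_page()
        if not page:
            # An empty page is taken to mean there are no more results.
            break
        all_results.extend(page)
        # Wait a random number of seconds so requests do not arrive in a burst.
        sleep(randint(min_delay, max_delay))
    return all_results
```

With a configured search object this would be called as fetch_all_pages(gs.get_results). The sleep parameter exists only so the loop can be tried out without actually waiting, for example by passing a function that records the delay instead of sleeping.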
I also enabled user-agent randomization by default; previously it was off unless you knew to change the setting. This is important: it helps us avoid being flagged as suspicious while fetching a large list of results, or multiple results in sequence.
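For readers unfamiliar with the idea, user-agent randomization roughly means picking a different User-Agent header for each request. The sketch below shows the general technique using Python 3's standard library only; it is not how xgoogle implements it internally. The USER_AGENTS strings are placeholders to be replaced with real browser strings, and build_request and the URL are hypothetical names used purely for illustration.

```python
import random
import urllib.request

# Placeholder values (assumption): swap in real browser User-Agent strings.
USER_AGENTS = [
    "ExampleBrowser/1.0 (placeholder A)",
    "ExampleBrowser/2.0 (placeholder B)",
    "ExampleBrowser/3.0 (placeholder C)",
]


def build_request(url):
    """Build a request object carrying a randomly chosen User-Agent header."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    # Only constructs the request; nothing is sent until it is opened.
    return urllib.request.Request(url, headers=headers)


# Hypothetical URL, used only to show the call.
request = build_request("https://www.example.com/search?q=test")
```

Because the header is chosen inside build_request, each new request gets its own random pick without any extra bookkeeping in the calling loop.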
Hopefully others will be able to use this work to create more complex querying, storing, and analysis loops. Let me know if you make use of it in the comments below!