Google shows two kinds of ad results, each with a different layout:
- Google Shopping Ads
- Google Standard Website Ads
Logic:
- Import the required libraries.
- Add a user-agent header to mimic a real user visit.
- Enter the search query.
- Get the HTML response.
- Parse the HTML code with BeautifulSoup.
- Find and specify the elements to extract data from.
- Iterate over the matching elements until none are left.
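The extraction and iteration steps above can be tried offline before sending any request. The following is a minimal sketch, not the scraper used later in this article: it swaps BeautifulSoup for the standard-library `html.parser` so that it runs without extra packages, and the sample markup, the class name `ad-title`, and the `AdTitleParser` class are all made up for illustration.

```python
from html.parser import HTMLParser

# Made-up sample markup; real Google result pages use different class names
# that change over time.
SAMPLE_HTML = """
<div class="ad-title"><a href="/aclk?id=1">First coffee ad</a></div>
<div class="organic">Not an ad</div>
<div class="ad-title"><a href="/aclk?id=2">Second coffee ad</a></div>
"""


class AdTitleParser(HTMLParser):
    """Collects (title, href) pairs from <div> elements with a given class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0       # > 0 while inside a matching <div>
        self.results = []    # collected (title, href) tuples
        self._text = []      # text fragments of the current container
        self._href = None    # first link found in the current container

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth:
            if tag == "div":
                self.depth += 1  # track nested <div> elements
            if tag == "a" and self._href is None:
                self._href = attrs.get("href")
        elif tag == "div" and self.target_class in (attrs.get("class") or "").split():
            self.depth = 1
            self._text, self._href = [], None

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1
            if self.depth == 0:  # the matching container has closed
                self.results.append(("".join(self._text).strip(), self._href))

    def handle_data(self, data):
        if self.depth:
            self._text.append(data)


parser = AdTitleParser("ad-title")
parser.feed(SAMPLE_HTML)

# Iterate over every container that was found
for title, href in parser.results:
    print(f"{title}\n{href}\n")
```

On a live page the same idea applies, but the class names have to be looked up in the browser's developer tools first.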
Google might block the requests if:
- It recognizes the script as a script, e.g. by the default python-requests user-agent.
- There are too many requests from a single IP address.
- The script does not behave like a human, which essentially covers everything above.
There are many ways to avoid being blocked by Google:
- Send a Referer header or use python-requests Session objects.
- Use custom headers, such as a User-Agent, and rotate through a list of different user agents.
- Use headless browsers or browser automation frameworks like Pyppeteer or Selenium.
- Use proxies and rotate them.
- Use CAPTCHA solving services.
- Add delays between requests to slow the request rate.
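Two of the measures above, rotating user agents and delaying requests, could be combined roughly as follows. This is only a sketch under assumptions: the helper names `build_headers`, `random_delay`, and `fetch_all`, the user-agent pool, and the 2 to 6 second delay range are illustrative choices, and the actual HTTP call is passed in as a `fetch` callable (for example a thin wrapper around `requests.get`) so that the sketch itself needs no network access.

```python
import random
import time

# Illustrative pool of user-agent strings; maintain your own up-to-date list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1 Safari/605.1.15",
]


def build_headers():
    """Return request headers with a randomly chosen user agent and a Referer."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Referer": "https://www.google.com/",
    }


def random_delay(min_seconds=2.0, max_seconds=6.0):
    """Return a random pause length so requests are not evenly spaced."""
    return random.uniform(min_seconds, max_seconds)


def fetch_all(queries, fetch, sleep=time.sleep):
    """Call fetch(query, headers) for each query, pausing between calls.

    `fetch` is any callable that performs the request and returns the HTML,
    e.g. lambda q, h: requests.get(url, params={"q": q}, headers=h).text
    """
    pages = []
    for index, query in enumerate(queries):
        if index > 0:
            sleep(random_delay())  # no pause before the first request
        pages.append(fetch(query, build_headers()))
    return pages
```

Passing `fetch` and `sleep` in as arguments also makes the loop easy to test without waiting or hitting Google.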
Shopping Ads
    import urllib.parse

    import requests
    from bs4 import BeautifulSoup  # the 'lxml' parser used below requires the lxml package

    # Adding User-agent (default user-agent from requests library is 'python-requests')
    # https://github.com/psf/requests/blob/589c4547338b592b1fb77c65663d8aa6fbb7e38b/requests/utils.py#L808-L814
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
    }

    # Search query
    params = {'q': 'coffee buy'}

    # Getting HTML response (requests builds the query string from params)
    html = requests.get('https://www.google.com/search', headers=headers, params=params).text

    # Getting HTML code from BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')

    # Looking for container that has all necessary data
    # (find_all() is the current name of findAll())
    for container in soup.find_all('div', class_='RnJeZd top pla-unit-title'):
        # Scraping title
        title = container.text

        # Creating beginning of the link to join afterwards
        start_of_link = 'https://www.googleadservices.com/pagead'
        # Scraping end of the link to join afterwards
        end_of_link = container.find('a')['href']
        # Combining (joining) relative and absolute URLs (adding beginning and end of the link)
        ad_link = urllib.parse.urljoin(start_of_link, end_of_link)

        # Printing each title and link on a new line
        print(f'{title}\n{ad_link}\n')

    # Output:
    # Jot Ultra Coffee Triple | Ultra Concentrated
    # https://www.googleadservices.com/aclk?sa=l&ai=DChcSEwiP0dmfvcbwAhX48OMHHYyRBuoYABABGgJ5bQ&sig=AOD64_0x-PlrWek-JFlDTSo7E9Z7YhUOjg&ctype=5&q=&ved=2ahUKEwjhr9GfvcbwAhXHQs0KHQCbCAUQww96BAgCED4&adurl=
    #
    # MUD\WTR | A Healthier Coffee Alternative, 30 servings
    # https://www.googleadservices.com/aclk?sa=l&ai=DChcSEwiP0dmfvcbwAhX48OMHHYyRBuoYABAJGgJ5bQ&sig=AOD64_3gltZJ6kPrxic5o8yUO5cuJrHXnw&ctype=5&q=&ved=2ahUKEwjhr9GfvcbwAhXHQs0KHQCbCAUQww96BAgCEEg&adurl=
    #
    # Jot Ultra Coffee Double | 2 bottles = 28 cups
    # https://www.googleadservices.com/aclk?sa=l&ai=DChcSEwiP0dmfvcbwAhX48OMHHYyRBuoYABAHGgJ5bQ&sig=AOD64_3hD0JWZSLr8NUgoTW5K0HMzdFvng&ctype=5&q=&ved=2ahUKEwjhr9GfvcbwAhXHQs0KHQCbCAUQww96BAgCEE4&adurl=
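The link-joining step in the Shopping Ads code relies on how `urllib.parse.urljoin` resolves a relative reference against a base URL. The short sketch below shows that behavior in isolation; the `href` values are shortened, hypothetical stand-ins for the long tracking links Google returns.

```python
from urllib.parse import urljoin

base = "https://www.googleadservices.com/pagead"

# Hypothetical href values; real ones carry long tracking parameters.
root_relative = "/aclk?sa=l&ctype=5"           # starts at the site root
path_relative = "aclk?sa=l&ctype=5"            # relative to the base path
absolute = "https://example.com/landing"       # already a full URL

# A root-relative href replaces the whole base path.
joined_root = urljoin(base, root_relative)
# A path-relative href replaces only the last path segment ("pagead").
joined_path = urljoin(base, path_relative)
# An absolute href is returned unchanged; the base is ignored.
joined_abs = urljoin(base, absolute)

print(joined_root)
print(joined_path)
print(joined_abs)
```

In the first two cases the `pagead` segment of the base disappears from the result, which is why the links in the output above start with `https://www.googleadservices.com/aclk`.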
Note: Sometimes the script returns zero results because Google did not show any ads at the time it ran. If that happens, run it again.
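Re-running by hand can be replaced with a small retry loop. This is a sketch under assumptions: `retry_until_results` is a hypothetical helper, `scrape` stands for any function that returns a list of scraped ads (empty when Google showed none), and the attempt count and delay are arbitrary defaults.

```python
import time


def retry_until_results(scrape, attempts=3, delay_seconds=5.0, sleep=time.sleep):
    """Call scrape() until it returns a non-empty result or attempts run out."""
    for attempt in range(1, attempts + 1):
        results = scrape()
        if results:
            return results
        if attempt < attempts:
            sleep(delay_seconds)  # wait before asking Google again
    return []  # no ads were shown on any attempt


# Usage with a stand-in scraper that finds ads only on its third call
calls = {"count": 0}


def fake_scrape():
    calls["count"] += 1
    return ["ad link"] if calls["count"] == 3 else []


found = retry_until_results(fake_scrape, attempts=3, sleep=lambda seconds: None)
print(found)
```

Keeping the number of attempts small matters here, since repeated requests in quick succession are one of the things that can get a script blocked.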
Standard Website Ads
    import requests
    from bs4 import BeautifulSoup  # the 'lxml' parser used below requires the lxml package

    # Adding user-agent to fake real user visit
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
    }

    # Search query
    params = {'q': 'coffee buy'}

    # HTML response (requests builds the query string from params)
    html = requests.get('https://www.google.com/search', headers=headers, params=params).text

    # HTML code from BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')

    # Looking for container that has needed data and iterating over it
    for container in soup.find_all('span', class_='Zu0yb LWAWHf qzEoUe'):
        # Using .text since the 'span' contains no text other than the link
        ad_link = container.text

        # Printing links
        print(ad_link)

    # Output:
    # https://www.coffeeam.com/
    # https://www.sfbaycoffee.com/
    # https://www.onyxcoffeelab.com/
    # https://www.enjoybettercoffee.com/
    # https://www.klatchroasting.com/
    # https://www.pachamamacoffee.com/
    # https://www.bulletproof.com/
Use Google Ads Results API
Alternatively, you can get the same data with the Google Ad Results API from X-Byte. With it, you don't have to solve CAPTCHAs when you send many requests or find proxies. It also reduces development complexity and makes the data easier to manipulate.
This is a paid API.
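Code to integrate (the example imports GoogleSearch from the serpapi Python package and reads the API key from the API_KEY environment variable):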
    import os

    from serpapi import GoogleSearch

    params = {
        "engine": "google",
        "q": "kitchen table",
        "api_key": os.getenv("API_KEY"),
        "no_cache": "true"  # add this param if it throws an error
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    # .get() avoids a KeyError when the response contains no ads
    for ad in results.get('ads', []):  # shopping ads -> 'shopping_results'
        ad_link = ad['tracking_link']  # shopping ads -> ad['link']
        print(ad_link)

    # Output for regular ads:
    # https://www.google.com/aclk?sa=l&ai=DChcSEwje1bnojtHwAhWRhMgKHY0kC1oYABAPGgJxdQ&ae=2&sig=AOD64_2ZH32FlwxW1XqO9V49i2L8J5qy2A&q&adurl
    # https://www.google.com/aclk?sa=l&ai=DChcSEwje1bnojtHwAhWRhMgKHY0kC1oYABAMGgJxdQ&ae=2&sig=AOD64_2l1PVJAqbVmrcu8UpkGPVk-VK3UA&q&adurl
    # https://www.google.com/aclk?sa=l&ai=DChcSEwje1bnojtHwAhWRhMgKHY0kC1oYABAQGgJxdQ&sig=AOD64_2DDuyRZUcFi04jfneAzwnOQBuLtw&q&adurl

    # Output for shopping ads:
    # https://www.google.com/aclk?sa=l&ai=DChcSEwijuI27jtHwAhVA5uMHHUUWAWkYABAEGgJ5bQ&ae=2&sig=AOD64_2zCyytR6tDeB3BjdOX5sFQQKwOAA&ctype=5&q=&ved=2ahUKEwjh9oO7jtHwAhUId6wKHa8mByUQ5bgDegQIARA8&adurl=
    # https://www.google.com/aclk?sa=l&ai=DChcSEwijuI27jtHwAhVA5uMHHUUWAWkYABAFGgJ5bQ&ae=2&sig=AOD64_2HeGVTNF91vkSHjg-wRDtC1ouATw&ctype=5&q=&ved=2ahUKEwjh9oO7jtHwAhUId6wKHa8mByUQ5bgDegQIARBI&adurl=
    # https://www.google.com/aclk?sa=l&ai=DChcSEwijuI27jtHwAhVA5uMHHUUWAWkYABAGGgJ5bQ&ae=2&sig=AOD64_1n4ztvwQxiSMInwgntgY-WyVc2eQ&ctype=5&q=&ved=2ahUKEwjh9oO7jtHwAhUId6wKHa8mByUQ5bgDegQIARBY&adurl=
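Since regular ads and shopping ads sit under different keys in the response, it can be handy to collect both in one pass. The sketch below works on a plain dictionary and only assumes the key names that appear in the code comments above (`ads` with `tracking_link`, `shopping_results` with `link`); the `extract_ad_links` helper and the sample response are made up for illustration and do not reflect the full API response.

```python
def extract_ad_links(results):
    """Collect links for regular and shopping ads from an API response dict."""
    regular = [
        ad["tracking_link"]
        for ad in results.get("ads", [])           # key may be absent: no ads shown
        if ad.get("tracking_link")
    ]
    shopping = [
        item["link"]
        for item in results.get("shopping_results", [])
        if item.get("link")
    ]
    return {"ads": regular, "shopping_results": shopping}


# Made-up, heavily shortened response for illustration only
sample_results = {
    "ads": [
        {"tracking_link": "https://www.google.com/aclk?sa=l&ai=EXAMPLE1"},
        {"title": "Ad without a tracking link"},
    ],
    "shopping_results": [
        {"link": "https://www.google.com/aclk?sa=l&ai=EXAMPLE2&ctype=5"},
    ],
}

links = extract_ad_links(sample_results)
print(links["ads"])
print(links["shopping_results"])
```

Using `.get()` with an empty default means a response without any ads yields empty lists instead of raising a `KeyError`.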
If you have any questions, if something isn't working properly, or if you need other code written, feel free to contact X-Byte Enterprise Crawling or ask for a free quote!