A Python program that automates browser actions for web scraping!
I wrote this program during an internship in the summer after my junior year of high school. The start-up I was interning at provided a point-of-sale service for restaurants as their product, and they thought it would be useful to scrape the menus of specific locations of a chain, since the menu can vary by location.
The program uses the Python language bindings for Selenium WebDriver to automate a Chrome browser and interact with the webpage. This way, we can scrape JavaScript-generated content that we can't get from normal scraping (using requests.get()). It also uses BeautifulSoup to parse the HTML received from the browser.
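To see why the "normal" approach falls short here: requests only receives the server-rendered HTML, so anything the page injects later with JavaScript never appears in the parsed tree. A minimal illustration (the HTML string below is made up to stand in for what requests.get(url).text would return from a JavaScript-heavy site):

```python
from bs4 import BeautifulSoup

# Server-rendered HTML, as requests.get(url).text would return it.
# The <div id="menu"> is empty: the site fills it in with JavaScript
# after the page loads, so a plain HTTP request never sees the items.
server_html = """
<html><body>
  <div id="menu"></div>
  <noscript>Please enable JavaScript.</noscript>
</body></html>
"""

soup = BeautifulSoup(server_html, 'html.parser')
menu = soup.find('div', id='menu')
print(repr(menu.text))  # the menu items simply aren't in the HTML
```

Driving a real browser with Selenium sidesteps this, because page_source is captured after the JavaScript has run.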
Note: The program normally runs in headless mode. This was disabled in the video above for demonstration purposes.
Python Code Snippets (full code available here)
Getting options of an item in the menu:
def get_item(browser, id):  # id is the html id
    """ given an id, scrape a menu item and all of its options """
    button = browser.find_element_by_id(id)
    # click on the item to open options chooser:
    browser.execute_script("arguments[0].click();", button)
    time.sleep(1)
    innerHTML = browser.page_source
    html = BeautifulSoup(innerHTML, 'html.parser')  # feed html to parser
    _options = {}
    # divide into option sections
    options = html.find_all('div', class_='menuItemModal-options')
    for option in options:
        name = option.find(class_='menuItemModal-choice-name').text
        choices = option.find_all('span', class_='menuItemModal-choice-option-description')
        if ' + ' in choices[0].text:
            # divide into option, price pairs
            _choices = {choice.text.split(' + ')[0]: choice.text.split(' + ')[1] for choice in choices}
        else:
            _choices = [choice.text for choice in choices]
        _options[name] = _choices
    return _options
Getting page HTML with selenium:
chrome_options = Options()
chrome_options.add_argument("--headless") # run in headless mode
browser = webdriver.Chrome(options=chrome_options)
browser.get(url)
time.sleep(10) # give page time to load everything
innerHTML = browser.page_source
Compiling menu and writing to JSON file:
(cat_titles, cat_items, prices, and ids were defined earlier in the program)
full_menu = {}
for ind, title in enumerate(cat_titles):  # category titles
    all_items = []
    # iterate through all items in a category
    for ind2, itm_name in enumerate(cat_items[ind]):
        item = {}
        item['name'] = itm_name
        item['price'] = prices[ind][ind2]
        item['options'] = get_item(browser, ids[ind][ind2])
        all_items.append(item)
    full_menu[title] = all_items

# getting path of directory to find JSON file path
path = '/'.join(os.path.realpath(__file__).split('/')[:-1])
with open(f'{path}/data.json', 'w') as f:
    json.dump(full_menu, f, indent=4)  # writing to file with pretty printing
print('[Finished]')
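The resulting data.json nests categories, then items, then each item's options. A miniature sketch of the shape the loop above produces (all names and prices here are made up):

```python
import json

# A made-up, miniature version of the structure the scraper writes out:
# category title -> list of items -> each item has name, price, options.
full_menu = {
    'Drinks': [
        {
            'name': 'Iced Tea',
            'price': '$2.49',
            'options': {'Size': {'Small': '$0.00', 'Large': '$1.00'}},
        }
    ]
}

print(json.dumps(full_menu, indent=4))
```

Keeping prices as the display strings scraped from the page (rather than parsing them into numbers) preserves exactly what the restaurant shows its customers.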