Honestly, most of us would admit that we don’t know what a reasonable price is for the groceries we buy regularly. Apart from a few high-priced favorites (e.g., salmon and ribeyes) that you watch for sales, you probably have no idea what the regular price is or what counts as a good deal.
It is a terrible way of managing your grocery budget: you probably spring for the “deals”, which seem to run all the time. So, being a data scientist, you start to wonder: what if you could keep tabs on that? You could build a database of historic prices for different items, and then you would know when a sale is actually worth jumping on.
Well, you need some data to start with. You can programmatically extract prices from your local store’s online weekly ads by scraping them. In this blog, you will walk through how to set up grocery store ad scraping.
Explore Data Formats When You Scrape Grocery Store Ads
The first thing to do is open Chrome’s developer tools and monitor the HTTP traffic to the Kroger website while you load the weekly ad for your local store (here, Kroger is the grocery store). The weekly ad is served from a URL that looks like this: https://wklyads-krogermidatlantic.kroger.com/flyers/krogermidatlantic-weekly?type=2&store_code=00342&chrome=broadsheet&flyer_run_id=##### where 00342 is the store ID of the local Kroger and ##### is the ID of the current “run” (whatever that means) of the flyers Kroger distributes.
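To make the URL structure concrete, here is a minimal sketch that assembles it from its parts. The helper name build_flyer_url is hypothetical, and the run ID 12345 is a made-up placeholder; the host, path, and parameter names are copied from the URL above.

```php
<?php
// Hypothetical helper: builds the weekly-flyer URL from a store code and a run ID.
// Host, path, and parameter names come from the URL observed in the browser.
function build_flyer_url(string $store_code, string $flyer_run_id): string {
    $base = "https://wklyads-krogermidatlantic.kroger.com/flyers/krogermidatlantic-weekly";
    $params = array(
        "type" => 2,
        "store_code" => $store_code,
        "chrome" => "broadsheet",
        "flyer_run_id" => $flyer_run_id,
    );
    // Pass "&" explicitly so the result does not depend on php.ini settings.
    return $base . "?" . http_build_query($params, "", "&");
}

// 12345 is a placeholder, not a real run ID.
echo build_flyer_url("00342", "12345"), "\n";
```

Keeping the store code as a parameter makes it easy to point the same script at a different store later.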
Loading that page turns up something easy and convenient: a bit of JavaScript that defines an object holding all of the flyer’s data. The code starts with window['flyerData'] = and then lists a huge JavaScript object containing all the information you need!
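As a rough illustration of pulling that assignment out of a page, the sketch below runs against a tiny made-up page string. The function name extract_assigned_object and the sample data are assumptions, and it assumes the whole object sits on a single line after the marker.

```php
<?php
// Hypothetical stand-in for the downloaded page; the real flyer object is far larger.
$page = "<script>\nwindow['flyerData'] = {\"items\": [{\"name\": \"Salmon\", \"current_price\": \"7.99\"}]};\n</script>\n";

// Returns the text that follows $marker on the same line, or null if the marker is absent.
function extract_assigned_object(string $page, string $marker): ?string {
    // "." does not match newlines by default, so (.*) stops at the end of the line.
    $pattern = "/" . preg_quote($marker, "/") . "(.*)/";
    if (preg_match($pattern, $page, $m) !== 1) {
        return null;
    }
    // Drop surrounding whitespace and the trailing semicolon of the JS statement.
    return rtrim(trim($m[1]), ";");
}

$literal = extract_assigned_object($page, "window['flyerData'] = ");
echo $literal, "\n";
```

If the site ever spreads the object over several lines, this single-line assumption would need to be revisited.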
You will also discover that it doesn’t seem to matter what you put for the flyer run ID; “asdf” works just fine. Still, you don’t want to assume that will always be true, so you need a way to look up the ID. Loading https://wklyads-krogermidatlantic.kroger.com/flyers/krogermidatlantic?type=2&store_code=00342&chrome=broadsheet solves the problem. It contains a small JavaScript body starting with window['hostedStack'] = that lists the currently available flyers, their types (seasonal vs. weekly), and their flyer run IDs.
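Once that list is decoded, picking out the weekly ad could look something like the sketch below. The sample array is invented for illustration; only the "name" and "flyer_run_id" keys are taken from the script later in this post, and the function name find_weekly_run_id is hypothetical.

```php
<?php
// Hypothetical sample of the decoded hostedStack list; the values are made up.
$flyers = array(
    array("name" => "Holiday Seasonal Ad", "flyer_run_id" => 111111),
    array("name" => "Weekly Ad", "flyer_run_id" => 222222),
);

// Returns the run ID of the first flyer whose name contains "weekly", or null.
function find_weekly_run_id(array $flyers) {
    foreach ($flyers as $flyer) {
        // stripos() is a case-insensitive substring search.
        if (stripos($flyer["name"], "weekly") !== false) {
            return $flyer["flyer_run_id"];
        }
    }
    return null;
}

var_dump(find_weekly_run_id($flyers));
```

This version returns the first match, whereas the full script keeps the last one it sees; with a single weekly ad in the list the two behave the same.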
There are significant differences between JSON and JavaScript object literals, but here Kroger’s objects happen to be formatted as valid JSON. (All the keys are enclosed in double quotes, which JSON requires but JavaScript only requires when a key contains special characters.) That means you can simply parse the data as JSON and turn it into whatever object you want.
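To see why the quoting matters, here is a small sketch with two made-up literals: one with quoted keys, which json_decode accepts, and one with bare keys, which is valid JavaScript but not valid JSON.

```php
<?php
// Both strings are invented examples, not real Kroger data.
$quoted = '{"name": "Salmon", "current_price": "7.99"}';
$unquoted = '{name: "Salmon", current_price: "7.99"}';

// Quoted keys: valid JSON, decoded into an associative array.
$ok = json_decode($quoted, true);

// Bare keys: json_decode returns null and records a syntax error.
$bad = json_decode($unquoted, true);
$bad_error = json_last_error();

var_dump($ok["name"]);
var_dump($bad);
var_dump($bad_error === JSON_ERROR_SYNTAX);
```

Checking json_last_error() after decoding is a cheap way to notice if the site ever stops emitting JSON-compatible objects.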
Scrape Grocery Store Ads with PHP
PHP may not get as much attention as Python at the moment, but it is still a very decent scripting language, so you can code things up with it. The script is well commented and fairly self-explanatory, so there is no need to go into much detail here.
Fundamentally, the code hits the main ad page for your local Kroger to get the list of ads, finds the ID of the ad run with the word “weekly” in its name, requests that ad’s data, collects the data from the relevant fields, and writes it all out to a CSV.
<?php
// Make sure we allow 30 seconds for execution
set_time_limit(30);
// Extract the list of available ads from the Kroger main page
$ad_id_data = get_json("https://wklyads-krogermidatlantic.kroger.com/flyers/krogermidatlantic?type=2&store_code=00342&chrome=broadsheet", "window['hostedStack'] = ");
// Loop through the ads looking for the ID of an ad with a name matching "weekly."
$id = false;
foreach ($ad_id_data as $ad) {
    if (strpos(strtolower($ad["name"]), "weekly") !== false) {
        $id = $ad["flyer_run_id"];
    }
}
if ($id === false) die("Error finding ID of weekly ad.");
// Extract the list of items from the weekly Kroger ad
$ad_data = get_json("https://wklyads-krogermidatlantic.kroger.com/pub/krogermidatlantic?chrome=broadsheet&locale=en-US&store_code=00342&type=2&flyer_run_id=" . $id, "window['flyerData'] = ");
// Extract all of the items and add their details to an output array
$output = array();
$headings = array("brand", "display_name", "name", "description", "pre_price_text", "current_price", "price_text");
$output[] = $headings;
foreach ($ad_data["items"] as $item) {
    $current = array();
    foreach ($headings as $heading) {
        // Use an empty string when an item lacks a field, to avoid undefined-key warnings
        $current[] = $item[$heading] ?? "";
    }
    $output[] = $current;
}
// Write the array to a CSV
$listFile = fopen("kroger.csv", "w");
foreach($output as $current) fputcsv($listFile, $current);
fclose($listFile);
// Function to query a page and extract a JavaScript object.
// (Technically, it's not JSON, but in this case it's formatted like JSON, so this works.)
function get_json($url, $marker) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_TIMEOUT, 20);
    // Note: this disables TLS certificate verification; drop it if your CA bundle is configured
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    $page_data = curl_exec($ch);
    if (curl_error($ch) != "") die("CURL error grabbing " . $url . ".");
    curl_close($ch);
    $results = array();
    // Grab everything from the marker to the end of that line
    if (preg_match("/" . preg_quote($marker, "/") . ".*\n/", $page_data, $results) === 1) {
        // Strip the marker and the trailing semicolon, leaving only the object
        $results = str_replace($marker, "", $results[0]);
        $results = str_replace(";\n", "", $results);
        return json_decode($results, true);
    } else {
        die("Error finding requested data in " . $url . ".");
    }
}
?>
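Since the goal is a record of historic prices, and the script opens kroger.csv in "w" mode (which overwrites it on every run), one possible extension is to append each week's rows to a running history file with a date column. The sketch below is only an illustration: the helper name append_history and the file name kroger_history.csv are assumptions, not part of the script above.

```php
<?php
// Hypothetical helper: appends rows to a history CSV, prefixing each with a date.
// $rows is expected in the same shape as $output: a header row followed by item rows.
// Returns the number of item rows written.
function append_history(string $path, array $rows, string $date): int {
    clearstatcache();
    // Only write the header when the file is missing or empty.
    $is_new = !file_exists($path) || filesize($path) === 0;
    $file = fopen($path, "a");
    if ($file === false) {
        return 0;
    }
    $written = 0;
    foreach ($rows as $i => $row) {
        $is_header = ($i === 0);
        if ($is_header && !$is_new) {
            continue; // the header is already in the file
        }
        array_unshift($row, $is_header ? "date" : $date);
        fputcsv($file, $row, ",", '"', "\\");
        if (!$is_header) {
            $written++;
        }
    }
    fclose($file);
    return $written;
}

// Possible usage at the end of the scraper (file name is an assumption):
// append_history("kroger_history.csv", $output, date("Y-m-d"));
```

Run weekly, this would build up one long file that can be filtered by item name to see how a price moves over time.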