— Scene II —
How many pizzas are made during peak periods at Plato's Pizzas?

Intro

This scene showcases my data skills with R and the RStudio IDE. R is my primary statistical and data science programming language due to its domain specificity and heavy use within data analytics.

The source for this scene is available in my GitHub. If you're interested in the RMarkdown file then here's a shortcut.

The question to answer is How many pizzas are made during peak periods at Plato's Pizzas? from the Maven Pizza challenge. Full details can be found in the Pizza Challenge's full description.

Skills

  • Basic SQL (pseudo)
  • Translating between languages (SQL to R)
  • The importance of functions
  • General R competency
    • RMarkdown files
    • Using packages and the Tidyverse
    • Reading data
    • Manipulating data
      • Piping
      • Mutating
      • Filtering
      • Summarising
    • Plotting
      • ggplot2
      • geom_col
      • geom_text
      • geom_label
      • theme

Menu

— Act I —

The data

The data for this project is the same as the Spreadsheets project. Therefore, I'll omit my previous commentary and repeat only the metadata for reference. You may find the data files in the GitHub repository.

Entity relationship diagram of the data tables

orders

order_id: Unique identifier for each order placed by a table
date: Date the order was placed (entered into the system prior to cooking & serving)
time: Time the order was placed (entered into the system prior to cooking & serving)

order_details

order_details_id: Unique identifier for each pizza placed within each order (pizzas of the same type and size are kept in the same row, and the quantity increases)
order_id: Foreign key that ties the details in each order to the order itself
pizza_id: Foreign key that ties the pizza ordered to its details, like size and price
quantity: Quantity ordered for each pizza of the same type and size

pizzas

pizza_id: Unique identifier for each pizza (constituted by its type and size)
pizza_type_id: Foreign key that ties each pizza to its broader pizza type
size: Size of the pizza (Small, Medium, Large, X Large, or XX Large)
price: Price of the pizza in USD

pizza_types

pizza_type_id: Unique identifier for each pizza type
name: Name of the pizza as shown in the menu
category: Category that the pizza falls under in the menu (Classic, Chicken, Supreme, or Veggie)
ingredients: Comma-delimited ingredients used in the pizza as shown in the menu (they all include Mozzarella Cheese, even if not specified; and they all include Tomato Sauce, unless another sauce is specified)

— Act II —

The packages

I find dependencies to be one of the biggest pain points when programming, so I've decided to install all dependencies upfront. From what I've read, many R users prefer to install and import as they go along instead. I'm likely to do this too in future.

install.packages("tidyverse") 
install.packages("lubridate") 
install.packages("hms") 
install.packages("ggplot2")
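
As a sketch of the install-as-you-go alternative mentioned above (the package names are just the ones used in this scene), a common idiom is to install a package only when it's missing before attaching it:

```r
# Install each package only if it's missing, then attach it.
for (pkg in c("tidyverse", "lubridate", "hms", "ggplot2")) {
	if (!requireNamespace(pkg, quietly = TRUE)) {
		install.packages(pkg)
	}
	library(pkg, character.only = TRUE)
}
```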
— Act III —

The read

This dataset is pretty small, so I'll read all files into memory upfront. Often it's preferable to keep data loading and data use together, but I wanted to explore it all using RStudio.

As a rule of thumb, do what is quickest, simplest, and easiest to change. Put another way, prioritise the ability to adapt and improve based on emerging use cases over pre-programming for hypothetical scenarios that often never materialise.

library("readr") 
 
data_dictionary <- read_csv("../data/data_dictionary.csv") 
orders          <- read_csv("../data/orders.csv") 
order_details   <- read_csv("../data/order_details.csv") 
pizzas          <- read_csv("../data/pizzas.csv") 
pizza_types     <- read_csv("../data/pizza_types.csv")
— Act IV —

Peak periods

Before we can answer the question How many pizzas are made during peak periods? we need to define what a peak period is.

The simplest and least effortful approach is to directly specify time ranges for lunch and dinner. In the Scene I: Spreadsheets project I highlighted some approximate peak periods within the Scene I Act III: Visualise the day graph. We can use those time ranges as a starting point for filtering orders, then play around with the ranges as we go.

— Act V —

Peak times

For reference, here's the graph from Scene I Act III: Visualise the day:

Line graph of average pizza orders for each hour in the day

Peak lunch time is from about 11:30 to 13:45 and dinner from 15:45 to 19:30. We can define these times as parameters to a function (pizzas_made_at_peak_times), allowing us to quickly try different peak periods once it's implemented.

— Act VI —

The behaviour

We will need to combine the orders and order_details tables to find the total pizzas. orders contains the time we need to filter on and order_details contains the individual pizza quantities to sum.

As I'm not as familiar with R as I am with SQL, typing out some pseudo SQL helps me figure out where the key battles lie so I can tackle them first. SQL is a personal pillar of strength from which I can confidently handle the complexity of describing a solution without the distraction of needing to learn the syntax and production rules of R.

I can do the language translation as a separate step to better utilise my limited capability. As I learn the equivalent operations in R I will be able to dispense with the preliminary SQL step and describe the solution directly.

SELECT 
	SUM(details.quantity) as pizzas_made 
FROM 
	order_details AS details 
INNER JOIN 
	orders ON details.order_id = orders.order_id 
WHERE -- Lunch 
	(orders.time >= '11:15:00' AND orders.time <= '13:45:00') 
OR -- Dinner 
	(orders.time >= '15:45:00' AND orders.time <= '19:30:00')

To those proficient in SQL it should be clear how we could adapt the query to filter for lunch or dinner, by date or range of dates, or on fields within order_details such as pizza_id. In a following project I intend to create a SQL database then use Go to query, transform, and plot data. I've wanted to use Go and SQLite in an analysis project for some time but lacked an appropriate opportunity.
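
For instance, a hypothetical dinner-only variant restricted to a single month (the dates here are illustrative) might read:

```sql
SELECT 
	SUM(details.quantity) AS pizzas_made 
FROM 
	order_details AS details 
INNER JOIN 
	orders ON details.order_id = orders.order_id 
WHERE -- Dinner only 
	(orders.time >= '15:45:00' AND orders.time <= '19:30:00') 
AND -- December only 
	(orders.date >= '2015-12-01' AND orders.date <= '2015-12-31')
```

The shape of the query stays the same; only the predicates change.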

— Act VII —

The interface

As a heuristic I combine behaviour driven and interface first strategies when programming. This means I define the required behaviour and desired interface first, then implement accordingly: a top down approach.

The SQL code block in the prior section is a description of the behaviour. The following code block is an example of interface design-by-example. It is an example call to the function I want to write.

print(pizzas_made_at_peak_times( 
	# Lunch 
	'11:30:00', '13:45:00', 
	# Dinner 
	'15:45:00', '19:30:00' 
)) 
 
# Example print out: 12345

If copying these R snippets into an .R or .Rmd file, make sure to place the above snippet after the one on the next page.

From this I can identify a function signature. For the output I'm planning on returning a simple number. Because this is R, I may end up returning a data table that contains the result instead.

— Act VIII —

The translation

library("dplyr") 
 
pizzas_made_at_peak_times <- function( 
	peak_lunch_start, 
	peak_lunch_end, 
	peak_dinner_start, 
	peak_dinner_end 
) { 
 
	# FROM order_details 
	x <- order_details %>% 
 
		# INNER JOIN orders ON order_details.order_id = orders.order_id 
		merge(orders, by=c("order_id")) %>% 
 
		# WHERE (orders.time >= '11:15:00' AND orders.time <= '13:45:00') 
		# OR (orders.time >= '15:45:00' AND orders.time <= '19:30:00') 
		mutate(time_str = format(time, "%H:%M:%S")) %>% 
		filter( 
			(time_str >= peak_lunch_start & time_str <= peak_lunch_end) | 
			(time_str >= peak_dinner_start & time_str <= peak_dinner_end) 
		) %>% 
 
		# SELECT SUM(details.quantity) as pizzas_made 
		summarise( 
			peak_lunch_start, 
			peak_lunch_end, 
			peak_dinner_start, 
			peak_dinner_end, 
			total_orders = n_distinct(order_id), 
			pizzas_made = sum(quantity) 
		) 
 
	return (x) 
}

Above is the translation of the SQL into R, with comments approximately mapping the SQL clauses. It is a working implementation of the function with the desired behaviour. As I thought, a data table was returned instead of just a number. However, this turned out to be advantageous as it enabled me to include total_orders and other values as well.

It is likely that more concise, readable, and performant implementations exist, but I feel this is good enough for now. The beauty of isolating meaningful snippets of code within functions like this, rather than writing them at the top level, is that:

  1. I can delay implementation but write calls to the function;
  2. I can come back and reimplement it later without having to untangle it from unrelated code;
  3. I can copy, paste, and modify the functionality into new projects with ease and confidence;
  4. and I can easily test it and automate such testing.
— Act IX —

A sneak peek

A print out detailing the pizzas made using the R code snippet from Act VII, 'The interface'

Invoking the function in RStudio as per the Act VII code snippet, I get the above output.

I've included total_orders to check that the pizzas_made value was greater than the number of orders. I glanced at the data beforehand and saw quantity was greater than one for a number of pizza orders in the subset of data, so if pizzas_made was not greater than total_orders then I probably made a mistake somewhere.
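
That glance can also be performed programmatically. A small sketch using the tables read earlier, counting the order lines where more than one of the same pizza was ordered:

```r
library("dplyr") 
 
# Number of order_details rows with a quantity above one 
order_details %>% 
	filter(quantity > 1) %>% 
	nrow()
```

A count above zero confirms that summing quantity can yield more pizzas than order lines.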

— Act X —

Piping hot

I'm going off the path a little in this section to make the new functionality more accessible and easier to integrate with other R code.

Piping

Piping is a neat R feature used to pass the result of one expression as the first argument to the next. This allows chaining an ordered set of expressions into a simple workflow. R can be written without employing pipes. Is it better to use pipes? It depends who you ask, and let's leave it at that.
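
To illustrate, the two snippets below are equivalent (a sketch using the orders table read earlier):

```r
library("dplyr") 
 
# Without pipes: nested calls read inside-out 
summarise(group_by(orders, date), orders_placed = n()) 
 
# With pipes: the same operations read top-to-bottom 
orders %>% 
	group_by(date) %>% 
	summarise(orders_placed = n())
```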

I use pipes a lot. Unfortunately, my initial design of the pizzas_made_at_peak_times function is not compatible with R pipes. I could redesign it but I feel that would be pushing too much complexity into one place. I do love conciseness but there is a point where it starts to work against readability rather than with it.

Wrapping

append_result <- function(acc, res) { 
	if (is.null(acc)) { 
		return (res) 
	} 
	return (union(acc, res)) 
} 
 
pizzas_made_at_peak_times_pipe <- function( 
	peak_lunch_start, 
	peak_lunch_end, 
	peak_dinner_start, 
	peak_dinner_end 
) { 
	x <- NULL 
 
	for (i in 1:length(peak_lunch_start)) { 
		res <- pizzas_made_at_peak_times( 
			peak_lunch_start[i], 
			peak_lunch_end[i], 
			peak_dinner_start[i], 
			peak_dinner_end[i] 
		) 
 
		x <- append_result(x, res) 
	} 
 
	return (x) 
}

I've written another function called pizzas_made_at_peak_times_pipe to wrap the former. This wrapping function accepts a set of R vectors representing multiple peak time ranges, invokes pizzas_made_at_peak_times for each, and finally returns the results combined into a single data frame.

Indefinitive

library("tidyr") 
 
pizzas_made_for_various_peak_times <- function(peak_times) { 
	x <- peak_times %>% 
		mutate(results = pizzas_made_at_peak_times_pipe( 
			peak_lunch_start, 
			peak_lunch_end, 
			peak_dinner_start, 
			peak_dinner_end 
		)) %>% 
		select(results) %>% 
		unnest(results) 
 
	return (x) 
}

This final wrapping function accepts a table of peak lunch and dinner times and returns the results of invoking the recently constructed pizzas_made_at_peak_times_pipe function. I'm sure there is a much nicer way to do this that doesn't involve needing two functions. Due to the pluralistic and functional nature of R and its piping mechanism, it will take a bit of time to learn the many optimisations instinctive to veterans. I tend to favour more imperative and procedural programming paradigms.

— Act XI —

Piping peaks

These two convenience functions are used for building a set of peak times for the pizzas_made_for_various_peak_times function. My initial attempt at writing a code chunk in the following act ended up quite verbose and somewhat hard to read. This act is merely the result of refactoring.

new_peak_time_tibble <- function() { 
	peak_times <- tibble( 
		peak_lunch_start = character(), 
		peak_lunch_end = character(), 
		peak_dinner_start = character(), 
		peak_dinner_end = character() 
	) 
 
	return (peak_times) 
}
append_peak_time_row <- function( 
	peak_times, 
	peak_lunch_start, 
	peak_lunch_end, 
	peak_dinner_start, 
	peak_dinner_end 
) { 
	peak_times <- peak_times %>% 
		add_row( 
			peak_lunch_start, 
			peak_lunch_end, 
			peak_dinner_start, 
			peak_dinner_end 
		) 
 
	return (peak_times) 
}
— Act XII —

Picking peaks

A line graph of average orders per hour with coloured time range annotations that represent possible peak order periods

The plot above is an annotated copy of the Scene I Act III: Visualise the day graph. The green, orange, and red bands visualise possible peak time ranges based upon average orders per hour. This was the easiest and most obvious choice in the circumstances.

new_peak_time_tibble() %>% 
	# Green 
	append_peak_time_row('11:15:00', '14:00:00', '15:00:00', '20:30:00') %>% 
	# Orange 
	append_peak_time_row('11:30:00', '13:45:00', '15:45:00', '19:30:00') %>% 
	# Red 
	append_peak_time_row('11:45:00', '13:15:00', '16:30:00', '18:45:00') %>% 
	pizzas_made_for_various_peak_times() %>% 
	print()

I've codified the three possible time ranges in the snippet above. Adding new peak time possibilities is as simple as inserting a new row prior to pizzas_made_for_various_peak_times(). This is handy as key stakeholders may want to define peak times themselves after we've presented information about average orders over a typical day.

— Act XIII —

The results

Executing the snippet in RStudio produces the following output:

A print out of the results table produced by executing the snippet in the prior scene

As we would expect, the number of pizzas made shrinks as the defined peak periods become stricter, which is evidence supporting the correctness of the calculations.

Two things come to mind looking at these results. Firstly, a visualisation would make them more comprehensible. Secondly, there is no reference point against which to compare them, that is, we don't know if these values are large or small in the context of all orders. I will now calculate a reference for the pizzas_made value before producing a visualisation.

— Act XIV —

Point of reference

new_peak_time_tibble() %>% 
	# Green 
	append_peak_time_row('11:15:00', '14:00:00', '15:00:00', '20:30:00') %>% 
	# Orange 
	append_peak_time_row('11:30:00', '13:45:00', '15:45:00', '19:30:00') %>% 
	# Red 
	append_peak_time_row('11:45:00', '13:15:00', '16:30:00', '18:45:00') %>% 
	pizzas_made_for_various_peak_times() %>% 
	# New percentage column created below 
	mutate(of_total_pizzas_made = pizzas_made / sum(order_details$quantity)) %>% 
	print()

I've built upon the prior snippet by adding a new column to the results. It contains the percentage of pizzas made against the yearly total in decimal form. With the updated output below, we now have something worth plotting.

A print out of the results table produced by executing the snippet in the prior section
— Act XV —

ggplot2

The following snippets plot the graph far below. I don't like the verbosity but a good portion of it is me trying to remove default graph elements. I'm sure I could halve the amount of code but I don't see a reason to expend the effort right now.

As a minimalist, I like my plots stripped of all but essential and high value graphical elements so the meaningful insights shine out. For future R projects I may devise my own referenceable theme for ease of application and readability. A custom theme will also help maintain a consistent style and make tweaking that style easier.

library("ggplot2")
format_labels_for_plot <- function(results) { 
	results <- results %>% 
		mutate( 
			peak_lunch_start = str_replace(peak_lunch_start, ":00", ""), 
			peak_lunch_end = str_replace(peak_lunch_end, ":00", ""), 
			peak_dinner_start = str_replace(peak_dinner_start, ":00", ""), 
			peak_dinner_end = str_replace(peak_dinner_end, ":00", ""), 
			label = paste( 
				pizzas_made, " (", (floor(100 * of_total_pizzas_made)), "%)\n", 
				"\n", 
				"Peak lunch\n", 
				peak_lunch_start, " to ", peak_lunch_end, "\n", 
				"\n", 
				"Peak dinner\n", 
				peak_dinner_start, " to ", peak_dinner_end, 
				sep="" 
			) 
		) 
 
	return (results) 
}
results <- new_peak_time_tibble() %>% 
	# Green 
	append_peak_time_row('11:15:00', '14:00:00', '15:00:00', '20:30:00') %>% 
	# Orange 
	append_peak_time_row('11:30:00', '13:45:00', '15:45:00', '19:30:00') %>% 
	# Red 
	append_peak_time_row('11:45:00', '13:15:00', '16:30:00', '18:45:00') %>% 
	pizzas_made_for_various_peak_times() %>% 
	mutate(of_total_pizzas_made = pizzas_made / sum(order_details$quantity)) %>% 
	format_labels_for_plot()
fill_colours <- tibble( 
	name = c("green", "orange", "red"), 
	colour = c("#44c17b", "#FFA500", "#ff4d4d") 
)
ggplot(results) + 
	geom_col( 
		position='dodge', 
		width=0.8, 
		aes( 
			x=label, 
			y=of_total_pizzas_made, 
			fill=fill_colours$name 
		) 
	) + 
	geom_text( 
		aes(x=label, y=of_total_pizzas_made, label=label), 
		size=3, 
		position=position_dodge(width=0.9), 
		vjust=0.85 
	) + 
	geom_label( 
		aes( 
			x=c(2), 
			y=c(1), 
			label=c(paste("Total pizzas made: ", sum(order_details$quantity), sep="")) 
		), 
		size=3 
	) + 
	scale_y_continuous(labels=scales::percent, limits=c(0, 1)) + 
	scale_fill_manual(values=fill_colours$colour) + 
	theme_bw() + 
	theme( 
		legend.position="none", 
		plot.title=element_text(hjust=0.5), 
		plot.subtitle=element_text(hjust=0.5), 
		plot.caption=element_text(hjust=0.5), 
		axis.text.x=element_blank(), 
		axis.title=element_blank(), 
		axis.ticks=element_blank(), 
		panel.grid.minor=element_blank(), 
		panel.grid.major.x=element_blank(), 
		panel.background=element_blank(), 
		panel.border=element_blank() 
	) + 
	labs( 
		title="How many pizzas are made during peak periods?", 
		subtitle="Depends on what you define as peak periods" 
	) 
 
ggsave("~/Downloads/pizzas-made-at-peak-times.png")
— Act XVI —

The plot

A bar chart of the pizzas made at peak periods for various definitions of peak lunch and dinner times

If we are loose with our definition then over 80% of pizzas are made during peak periods. If we are strict then about 43%. This shows just how big an impact our assumptions and definitions have on the results of analysis.

Imagine Plato's Pizza is struggling in a recession and must optimise to survive. The leadership must make decisions on how to reduce waste and make every open hour count. It is likely that labour costs are the biggest outgoing, so the staffing schedule will be a critical factor in the success of the business. 81% gives a much greater weight to the importance of peak lunch and dinner times than 43% does. The difference can affect the conclusions drawn, which in turn affects the decisions made about staffing strategy. Such decisions can make or break businesses and livelihoods.

But would this really be the case? If we communicate the complexities of the insights to decision makers they can factor it all in, right? We can certainly make an attempt. Some stakeholders are effortful thinkers, that is, they take or have the time to listen, grasp, and think about the various factors in depth. Unfortunately, many people prefer intuitive decision making, which often involves picking up the numbers and making snap decisions with little thought about what the numbers truly mean.

When we make assumptions, create definitions, present results, visualise insights, and communicate it all we must keep in mind the variation in how people interpret and apply knowledge to make decisions. It's a very hard thing to get right, but I'm sure with more exposure to such situations I can become adept at it.

I think I could improve this plot by placing time on the x-axis and having the bars overlap. Then I could remove the text within each bar. Once done there's no reason to stick to bars; I could use semicircles instead.

— Act XVII —

Summary of discovery

To summarise what we've discovered about the making of pizzas at Plato's Pizzeria in 2015:

  • A total of 49,574 pizzas were made.
  • Smallish adjustments to the definition of peak lunch and dinner times have a significant impact on how many peak period pizzas are made.
  • A moderate definition of 11:30 to 13:45 for lunch and 15:45 to 19:30 for dinner gives us 31,967 pizzas made or approximately 64% of the yearly total during peak periods.
  • But the quantity varies quite a lot with the definition of peak period: around 43% (21,727) for a strict one and around 81% (40,348) for a loose one.