Why I’m starting a Data Science Masters program

Today is the first day of my first semester as a Data Science Masters student.

Why am I starting a Data Science Masters program?

Because I want to be able to unlock insights about whatever data problem I’m faced with. (And aren’t they all data problems? I’m joking—kind of…) Here are two general problems I hope to learn to solve:

  1. improve online conversion rates
  2. balance game system designs

How did I get here?

I got into data via art. In the mid-2000s, I followed the work of artists like Joshua Davis. He inspired me to start playing around with Flash and Processing. I fancied myself a digital artist. I might have called the images I made around this time “data visualizations,” but they were actually visual art (that happened to have some data under the hood). I hadn’t yet encountered the idea of making sense of data and telling a story with it.

Then I read Ben Fry’s Visualizing Data, which opened up a whole new world. I considered studying with Ben Fry’s former mentor, John Maeda, at the MIT Media Lab. Instead, I ended up following a different path, mostly focusing on designing websites. That led to an interest in organizational processes and business, and eventually to earning an MBA.

Now I find myself ready to commit to data science. I work with lots of interesting analytics experts at my day job. I keep wanting to jump in and learn how they do what they do. This Data Science program is how I plan to finally do that.

I plan to write about what I learn—at least once a week—and post those writings here. Please follow along and feel free to comment or ask questions.

Create Table Lens Display with R and ggplot2

I’m trying to recreate this graph (from the cover of Show Me the Numbers) with R and ggplot2:

Show-Me-the-Numbers-cover

A kind fellow (named Vivek Patil) on the ggplot2 mailing list got me started with some code:

# Sample data: sales, plan, and variance (Sales - Plan) for two products
tea=c("Arabian","French Roast")
Sales=c(10000, 15000)
Plan=c(12000,12000)
Variance=c(-2000,3000)
# Store tea as a factor so as.numeric() gives bar positions (1, 2) rather than NA
df=data.frame(tea=factor(tea),Sales=Sales,Plan=Plan,Variance=Variance)
library(ggplot2)
library(gridExtra)
library(ggthemes)
# Left panel: sales bars, with a short line marking the plan for each product
# (newer ggthemes versions may expect the palette name "Medium" instead of "medium")
salesplan=ggplot(df,aes(x=tea,y=Sales,fill=tea))+geom_bar(stat="identity")+
geom_segment(aes(x=as.numeric(tea)-.475,xend=as.numeric(tea)+.475,y=Plan,yend=Plan))+
ggtitle("Sales vs Plan")+ coord_flip() +theme_few()+scale_fill_few("medium")+
theme(legend.position="none", axis.title.y=element_blank(),axis.title.x=element_blank())
# Right panel: variance to plan, with the product labels hidden
varianceplan=ggplot(df,aes(x=tea,y=Variance,fill=tea))+geom_bar(stat="identity")+
coord_flip()+ggtitle("Variance to Plan")+theme_few()+scale_fill_few("medium")+
theme(legend.position="none",axis.text.y = element_blank(),axis.title.y=element_blank(),
axis.title.x=element_blank())
# Place the two panels side by side
grid.arrange(salesplan,varianceplan,ncol=2)

That code generates this graph:

Graph from starter code

I built on that by adding more data, saving it to a CSV file, and then running the ggplot2 code again. Here is the CSV data:

,Product,Sales,Plan,Variance,Variance2,Type
1,Arabian,10000,12000,-2000,-0.1667,Espresso
2,French Roast,15000,12000,3000,0.2500,Coffee
3,Green Tea,18000,15000,3000,0.2000,Tea
4,Mint,19000,16000,3000,0.1875,Herbal Tea
5,Italian Roast,19000,17000,2000,0.1176,Espresso
6,Sumatra,33000,37000,-4000,-0.1081,Coffee
7,Earl Grey,36000,30000,6000,0.2000,Tea
8,Darjeeling,38000,33000,5000,0.1515,Tea
9,Chamomile,40000,35000,5000,0.1429,Herbal Tea
10,Garuda,42000,40000,2000,0.0500,Espresso
11,Mocha-Java,45000,47000,-2000,-0.0426,Espresso
12,Lemon,50000,42000,8000,0.1905,Herbal Tea
13,Columbian,65000,68000,-3000,-0.0441,Coffee
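
Judging from the numbers, the Variance column looks like Sales minus Plan, and Variance2 looks like Variance divided by Plan, rounded to four decimal places. That is an inference from the data rather than a documented rule. A minimal sketch in base R, using only the first three rows of the table:

```r
# First three rows of the CSV, with only the raw columns
df <- data.frame(
  Product = c("Arabian", "French Roast", "Green Tea"),
  Sales   = c(10000, 15000, 18000),
  Plan    = c(12000, 12000, 15000)
)

# Assumed derivations (inferred from the data, not stated in the source)
df$Variance  <- df$Sales - df$Plan                # absolute difference from plan
df$Variance2 <- round(df$Variance / df$Plan, 4)   # difference as a fraction of plan

print(df)
```

Because Variance2 is a fraction, it would need to be multiplied by 100 to be read as a percentage.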

Here is the code:

# Read the table; the unnamed first column holds the row numbers
df <- read.csv(file="DisplayTableData.csv", header=TRUE, sep=",")
# Make Product a factor so as.numeric() gives bar positions rather than NA
df$Product <- factor(df$Product)
library(ggplot2)
library(gridExtra)
library(ggthemes)
# Left panel: sales bars, with a short line marking the plan for each product
salesplan=ggplot(df,aes(x=Product,y=Sales,fill=Product))+geom_bar(stat="identity")+
geom_segment(aes(x=as.numeric(Product)-.475,xend=as.numeric(Product)+.475,y=Plan,yend=Plan))+
ggtitle("Sales vs Plan")+ coord_flip() +theme_few()+scale_fill_few("medium")+
theme(legend.position="none", axis.title.y=element_blank(),axis.title.x=element_blank())
# Right panel: variance as a fraction of plan (Variance2), product labels hidden
varianceplan=ggplot(df,aes(x=Product,y=Variance2,fill=Product))+geom_bar(stat="identity")+
coord_flip()+ggtitle("Variance to Plan %")+theme_few()+scale_fill_few("medium")+
theme(legend.position="none",axis.text.y = element_blank(),axis.title.y=element_blank(),
axis.title.x=element_blank())
# Place the two panels side by side
grid.arrange(salesplan,varianceplan,ncol=2)

And here is the graph:

Graph from starter code2

As you can see, I’ve got some more work to do. I’m reading through the docs and trying things out. Hopefully, my next post will show complete working code.
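
One possible piece of that remaining work, assuming the goal is to draw the bars in order of sales rather than alphabetically: ggplot2 orders a discrete axis by factor levels, so setting the levels of Product by Sales should change the bar order. This is a sketch on a small made-up subset of the table, not the finished solution:

```r
library(ggplot2)

# Three rows from the table, deliberately out of sales order
df <- data.frame(
  Product = c("Sumatra", "Arabian", "Green Tea"),
  Sales   = c(33000, 10000, 18000),
  Plan    = c(37000, 12000, 15000)
)

# Set the factor levels in ascending order of Sales
df$Product <- factor(df$Product, levels = df$Product[order(df$Sales)])

# Bars now follow the level order instead of alphabetical order
sorted_plot <- ggplot(df, aes(x = Product, y = Sales)) +
  geom_bar(stat = "identity") +
  coord_flip()

print(levels(df$Product))
```

Because as.numeric() on a factor returns the level index, plan markers positioned with as.numeric(Product) should stay aligned with their bars after the reordering.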