Data Visualization with Python
Pandas Built-in Data Visualization
In this lecture we will learn about pandas' built-in capabilities for data visualization. They are built on top of matplotlib, but baked into pandas for easier usage.
Hopefully you can see why this method of plotting is a lot easier to use than full-on matplotlib: it balances ease of use with control over the figure. Many of the plot calls also accept the additional arguments of their parent matplotlib plt. call.
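As a minimal sketch of that last point, the call below hands `alpha` and `edgecolor` through to matplotlib's histogram. The frame `demo` and its column `A` are made up for illustration, and the Agg backend line is only there so the snippet also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import numpy as np
import pandas as pd

# Hypothetical stand-in data: 500 standard-normal samples in column 'A'
rng = np.random.default_rng(0)
demo = pd.DataFrame({"A": rng.standard_normal(500)})

# alpha and edgecolor are not pandas options; they are forwarded
# to the underlying matplotlib hist() call
ax = demo["A"].plot.hist(bins=20, alpha=0.6, edgecolor="black")

# The return value is a regular matplotlib Axes, so it can be refined further
ax.set_xlabel("A")
ax.set_title("Histogram drawn through pandas")
```

Because the result is an ordinary Axes object, anything matplotlib offers (labels, limits, annotations) remains available after the pandas call.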
The data we'll use in this part:
import numpy as np
import pandas as pd
%matplotlib inline
df1 = pd.read_csv('Df1.csv',index_col=0)
df2 = pd.read_csv('Df2.csv')
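Df1.csv and Df2.csv come with the course materials. If they are not at hand, a rough stand-in can be generated. The shapes, the date index, and the column names below (A to D for df1, a to d for df2) are assumptions inferred from the calls used later in this section, not the actual file contents.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Assumed layout: standard-normal columns A-D on a daily date index
df1 = pd.DataFrame(
    rng.standard_normal((1000, 4)),
    index=pd.date_range("2000-01-01", periods=1000),
    columns=list("ABCD"),
)

# Assumed layout: a few non-negative columns a-d (area and stacked
# bar plots read best with non-negative values)
df2 = pd.DataFrame(rng.random((10, 4)), columns=list("abcd"))

print(df1.head())
print(df2.head())
```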
| |||
| Method/Operator | Description/Example | Output/Figure | |
|---|---|---|---|
Style Sheets |
plt.style.use('')
|
Matplotlib has style sheets you can use to make your plots look a little nicer. These style sheets include bmh, fivethirtyeight, ggplot and more. They basically create a set of style rules that your plots follow. I recommend using them: they give all your plots the same look and make them feel more professional. You can even create your own if you want your company's plots to all have the same look (it is a bit tedious to create one, though).
Here is how to use them. Before plt.style.use() your plots look like this:
df1['A'].hist()
|
![]() |
Call the style:
import matplotlib.pyplot as plt
plt.style.use('ggplot')
df1['A'].hist()
|
![]() | ||
plt.style.use('bmh')
df1['A'].hist()
|
![]() | ||
plt.style.use('dark_background')
df1['A'].hist()
|
![]() | ||
plt.style.use('fivethirtyeight')
df1['A'].hist()
|
![]() | ||
Plot Types |
There are several plot types built-in to pandas, most of them statistical plots by nature:
|
||
Area
|
df2.plot.area(alpha=0.4)
|
![]() | |
Barplots
|
df2.plot.bar()
|
![]() | |
df2.plot.bar(stacked=True)
|
![]() | ||
Histograms
|
df1['A'].plot.hist(bins=50)
|
![]() | |
Line Plots
|
df1.plot.line(y='B',figsize=(12,3),lw=1) # x defaults to the index; recent pandas versions reject x=df1.index
![]() | ||
Scatter Plots
|
df1.plot.scatter(x='A',y='B')
|
![]() | |
You can use c to color the points based on another column's value, and cmap to choose the colormap. For all the colormaps, check out: http://matplotlib.org/users/colormaps.html
df1.plot.scatter(x='A',y='B',c='C',cmap='coolwarm')
|
![]() | ||
Or use s to set the marker size based on another column. Passing an array, as below, works across pandas versions; older releases do not accept just the name of a column for s:
df1.plot.scatter(x='A',y='B',s=df1['C']*200)
|
![]() | ||
BoxPlots
|
df2.plot.box() # Can also pass a by= argument for groupby
|
![]() | |
Hexagonal Bin Plot
|
Useful for bivariate data, as an alternative to a scatter plot:
df = pd.DataFrame(np.random.randn(1000, 2), columns=['a', 'b'])
df.plot.hexbin(x='a',y='b',gridsize=25,cmap='Oranges')
|
![]() | |
Kernel Density Estimation plot (KDE)
|
df2['a'].plot.kde()
|
![]() | |
df2.plot.density()
|
![]() | ||
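Coming back to the style sheets from the start of this section: `plt.style.use()` changes the style for the rest of the session. The sketch below shows two related matplotlib features, `plt.style.available` for listing the installed styles and `plt.style.context()` for applying a style temporarily. The data is random and only for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Names that can be passed to plt.style.use()
print(plt.style.available)

demo = pd.Series(np.random.default_rng(1).standard_normal(300))

default_face = plt.rcParams["axes.facecolor"]

# The style applies only inside the with-block
with plt.style.context("ggplot"):
    styled_face = plt.rcParams["axes.facecolor"]
    ax = demo.hist()

# Outside the block the previous settings are back
restored_face = plt.rcParams["axes.facecolor"]
```

A temporary context is handy when a single figure should look different without restyling every later plot.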
Data visualization with Matplotlib
Matplotlib is the "grandfather" library of data visualization with Python. It was created by John Hunter to replicate the plotting capabilities of MATLAB (another programming language) in Python. So if you happen to be familiar with MATLAB, matplotlib will feel natural to you.
It is an excellent 2D and 3D graphics library for generating scientific figures.
Some of the major pros of Matplotlib are:
- Generally easy to get started for simple plots
- Support for custom labels and texts
- Great control of every element in a figure
- High-quality output in many formats
- Very customizable in general
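To illustrate the point about output formats, the sketch below writes one figure as PNG and as SVG. It saves into in-memory buffers so that nothing is written to disk; passing a file name such as `'figure.png'` (a made-up name) to `savefig` works the same way.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(4, 3))
ax.plot([0, 1, 2, 3], [0, 1, 4, 9])
ax.set_xlabel("x")
ax.set_ylabel("x squared")

# Raster output: dpi controls the resolution
png_buffer = io.BytesIO()
fig.savefig(png_buffer, format="png", dpi=200)

# Vector output: scales without loss of quality
svg_buffer = io.BytesIO()
fig.savefig(svg_buffer, format="svg")

print(len(png_buffer.getvalue()), "bytes of PNG")
print(len(svg_buffer.getvalue()), "bytes of SVG")
```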
References:
- The project web page for matplotlib: http://www.matplotlib.org
- The source code for matplotlib: https://github.com/matplotlib/matplotlib
- A large gallery showcasing various types of plots matplotlib can create. Highly recommended!: http://matplotlib.org/gallery.html
- A good matplotlib tutorial: http://www.loria.fr/~rougier/teaching/matplotlib
Most likely you'll be passing numpy arrays or pandas columns (which essentially behave like arrays) to matplotlib. However, you can also use plain lists.
Matplotlib allows you to create reproducible figures programmatically. Let's learn how to use it! Before continuing this lecture, I encourage you just to explore the official Matplotlib web page: http://matplotlib.org/
Installation
conda install matplotlib
Or without conda:
pip install matplotlib
Importing:
import matplotlib.pyplot as plt
You'll also need to use this line to see plots in the notebook:
%matplotlib inline
That line is only for Jupyter notebooks. If you are using another editor, call plt.show() at the end of all your plotting commands to have the figure pop up in another window.
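A minimal sketch of a first plot, using a NumPy array and a plain list as input. The variable names and the data are illustrative only, and the Agg backend line is there just so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 5, 11)          # numpy array
y = x ** 2
y_list = [0, 5, 10, 15, 20, 25]    # plain Python list

fig, ax = plt.subplots()
ax.plot(x, y, label="numpy array")
ax.plot(range(6), y_list, label="plain list")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
```

In a script run with an interactive backend, a final `plt.show()` would open the figure window.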
Advanced Matplotlib Concepts
In this lecture we cover some more advanced topics which you won't usually use as often. You can always reference the documentation for more resources!
Further reading:
Data visualization with Seaborn
Seaborn is a statistical visualization library designed to work well with pandas dataframes.
import seaborn as sns
%matplotlib inline
Built-in data sets
Seaborn comes with built-in data sets!
tips = sns.load_dataset('tips')
tips.head()
# Output:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
Distribution Plots
| |||||
| Description/Example | Output/Figure | ||||
|---|---|---|---|---|---|
Distribution of a univariate set of observations |
distplot
|
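The distplot shows the distribution of a univariate set of observations. Seaborn 0.11 and later deprecate distplot in favor of histplot and displot, which is where the warnings come from:
sns.distplot(tips['total_bill'])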
# Safe to ignore warnings
sns.distplot(tips['total_bill'],kde=False,bins=30)
|
![]() ![]() | ||
Match up two distplots for bivariate data |
jointplot()
|
jointplot() essentially matches up two distplots for bivariate data, with the kind parameter choosing how the two variables are compared:
sns.jointplot(x='total_bill',y='tip',data=tips,kind='scatter')
sns.jointplot(x='total_bill',y='tip',data=tips,kind='hex')
sns.jointplot(x='total_bill',y='tip',data=tips,kind='reg')
|
![]() ![]() ![]() | ||
Plot pairwise relationships across an entire dataframe |
pairplot
|
pairplot plots pairwise relationships across an entire dataframe (for the numerical columns) and supports a color hue argument (for categorical columns):
sns.pairplot(tips)
sns.pairplot(tips,hue='sex',palette='coolwarm')
| |||
Draw a dash mark for every point on a univariate distribution |
rugplot
|
rugplots are actually a very simple concept: they just draw a dash mark for every point on a univariate distribution. They are the building block of a KDE plot:
sns.rugplot(tips['total_bill'])
|
![]() | ||
Kernel Density Estimation plots |
kdeplot
|
kdeplots are Kernel Density Estimation plots. These KDE plots replace every single observation with a Gaussian (Normal) distribution centered around that value. For example:
# Don't worry about understanding this code!
# It's just for the diagram below
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
#Create dataset
dataset = np.random.randn(25)
# Create another rugplot
sns.rugplot(dataset);
# Set up the x-axis for the plot
x_min = dataset.min() - 2
x_max = dataset.max() + 2
# 100 equally spaced points from x_min to x_max
x_axis = np.linspace(x_min,x_max,100)
# Set up the bandwidth, for info on this:
url = 'http://en.wikipedia.org/wiki/Kernel_density_estimation#Practical_estimation_of_the_bandwidth'
bandwidth = ((4*dataset.std()**5)/(3*len(dataset)))**.2
# Create an empty kernel list
kernel_list = []
# Plot each basis function
for data_point in dataset:
    # Create a kernel for each point and append to list
    kernel = stats.norm(data_point,bandwidth).pdf(x_axis)
    kernel_list.append(kernel)
    # Scale for plotting
    kernel = kernel / kernel.max()
    kernel = kernel * .4
    plt.plot(x_axis,kernel,color = 'grey',alpha=0.5)
plt.ylim(0,1)
|
![]() | ||
# To get the kde plot we can sum these basis functions.
# Plot the sum of the basis function
sum_of_kde = np.sum(kernel_list,axis=0)
# Plot figure
fig = plt.plot(x_axis,sum_of_kde,color='indianred')
# Add the initial rugplot
sns.rugplot(dataset,color = 'indianred')
# Get rid of y-tick marks
plt.yticks([])
# Set title
plt.suptitle("Sum of the Basis Functions")
|
![]() | ||||
So with our tips dataset:
sns.kdeplot(tips['total_bill'])
sns.rugplot(tips['total_bill'])
|
![]() | ||||
sns.kdeplot(tips['tip'])
sns.rugplot(tips['tip'])
|
![]() | ||||
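The basis-function construction above can be condensed into a few lines of NumPy. One difference from the diagram code: a kernel density estimate averages the Gaussian kernels rather than summing them, so that the curve integrates to 1. The sketch below makes that assumption explicit, reuses the same bandwidth formula, and runs on a made-up random sample.

```python
import numpy as np

rng = np.random.default_rng(0)
dataset = rng.standard_normal(25)

# Same rule-of-thumb bandwidth as in the diagram code
bandwidth = ((4 * dataset.std() ** 5) / (3 * len(dataset))) ** 0.2

x_axis = np.linspace(dataset.min() - 3, dataset.max() + 3, 400)

# One Gaussian pdf per observation, evaluated on the grid via broadcasting:
# rows are observations, columns are grid points
z = (x_axis[None, :] - dataset[:, None]) / bandwidth
kernels = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))

# Averaging (not summing) the kernels gives a proper density
density = kernels.mean(axis=0)

# Trapezoid-rule check that the area under the curve is close to 1
area = np.sum((density[1:] + density[:-1]) / 2 * np.diff(x_axis))
print(round(area, 3))
```

Plotting `density` against `x_axis` should give a curve with the same shape as the summed basis functions, only rescaled.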
Categorical Data Plots
Now let's discuss using seaborn to plot categorical data! There are a few main plot types for this:
- factorplot (renamed catplot in seaborn 0.9)
- boxplot
- violinplot
- stripplot
- swarmplot
- barplot
- countplot
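A short sketch of two of these plot types, `boxplot` and `countplot`, on a tiny hand-made frame. The column names mimic the tips data set but the values are invented, so the snippet does not depend on downloading anything.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Invented values; only the column names follow the tips data set
demo = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri", "Sat", "Sat", "Sat", "Sun"],
    "total_bill": [12.5, 18.0, 22.1, 9.8, 30.4, 25.0, 17.3, 21.7],
})

fig, (ax_box, ax_count) = plt.subplots(1, 2, figsize=(8, 3))

# Distribution of a numeric column within each category
sns.boxplot(x="day", y="total_bill", data=demo, ax=ax_box)

# Number of rows per category
sns.countplot(x="day", data=demo, ax=ax_count)
```

The other functions in the list follow the same `x=`, `y=`, `data=` calling pattern.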
Matrix Plots
Matrix plots allow you to plot data as color-encoded matrices and can also be used to indicate clusters within the data (later in the machine learning section we will learn how to formally cluster data).
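As a sketch of the idea, `heatmap` colors each cell of a matrix by its value, and a correlation matrix is a common input. The frame below holds random, illustrative data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.standard_normal((50, 4)), columns=list("ABCD"))

# The data must already be in matrix form: here, column-by-column correlations
corr = demo.corr()

# Fix the color scale to the full correlation range so colors are comparable
ax = sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
```

`sns.clustermap` takes the same kind of input and additionally reorders rows and columns by similarity, which is the cluster view mentioned above.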
Grids
Grids are general types of plots that allow you to map plot types to rows and columns of a grid. This helps you create similar plots separated by features.
Regression plots
Seaborn has many built-in capabilities for regression plots. However, we won't really discuss regression until the machine learning section of the course, so we will only cover the lmplot() function for now.
lmplot allows you to display linear models, and it also conveniently lets you split those plots up by feature and color them (hue) by feature.
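A sketch of both options on invented data shaped like the tips data set: `hue` colors the points and fitted lines by one feature, and `col` splits the figure into panels by another.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
bill = rng.uniform(5, 50, size=120)
demo = pd.DataFrame({
    "total_bill": bill,
    "tip": 0.15 * bill + rng.normal(0, 1, size=120),  # roughly linear, made up
    "sex": np.tile(["Male", "Female"], 60),
    "time": np.repeat(["Lunch", "Dinner"], 60),
})

# One panel per value of 'time'; within each panel, one fit per value of 'sex'
g = sns.lmplot(x="total_bill", y="tip", data=demo, hue="sex", col="time")
```

lmplot returns a FacetGrid, so the grid concepts from the previous section apply to it as well.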
Style and Color
Check out the documentation page for more info on these topics: https://stanford.edu/~mwaskom/software/seaborn/tutorial/aesthetics.html
Data visualization with Plotly and Cufflinks
Plotly is a library that allows you to create interactive plots that you can use in dashboards or websites (you can save them as html files or static images).
Check out the plotly.py documentation and gallery to learn more: https://plot.ly/python/
Plotly plots can be easily saved online and shared at https://chart-studio.plot.ly. Take a look at this example: https://chart-studio.plot.ly/~jackp/671/average-effective-tax-rates-by-income-percentiles-1960-2004/#/
Installation
In order for this all to work, you'll need to install plotly, plus cufflinks to call plots directly off of a pandas dataframe. Cufflinks is not currently available through conda but is available through pip. Install the libraries at your command line/terminal using:
pip install plotly
pip install cufflinks
Imports and Set-up
import pandas as pd
import numpy as np
%matplotlib inline
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
print(__version__) # requires version >= 1.9.0
import cufflinks as cf
# For Notebooks
init_notebook_mode(connected=True)
# For offline use
cf.go_offline()
Data
df = pd.DataFrame(np.random.randn(100,4),columns='A B C D'.split())
df2 = pd.DataFrame({'Category':['A','B','C'],'Values':[32,43,50]})
df.head()
# Output:
A B C D
0 1.878725 0.688719 1.066733 0.543956
1 0.028734 0.104054 0.048176 1.842188
2 -0.158793 0.387926 -0.635371 -0.637558
3 -1.221972 1.393423 -0.299794 -1.113622
4 1.253152 -0.537598 0.302917 -2.546083
df2.head()
# Output:
Category Values
0 A 32
1 B 43
2 C 50
| Method/Operator | Description/Example | Output/Figure | |
|---|---|---|---|
Using Cufflinks and iplot() |
Scatter | df.iplot(kind='scatter',x='A',y='B',mode='markers',size=10) |
|
Bar Plots |
df2.iplot(kind='bar',x='Category',y='Values') |
||
Boxplots |
df.iplot(kind='box') |
||
3d Surface |
df3 = pd.DataFrame({'x':[1,2,3,4,5],'y':[10,20,30,20,10],'z':[5,4,3,2,1]})
df3.iplot(kind='surface',colorscale='rdylbu') |
||
Spread |
df[['A','B']].iplot(kind='spread') |
||
Histogram |
df['A'].iplot(kind='hist',bins=25) |
||
Bubble |
df.iplot(kind='bubble',x='A',y='B',size='C') |
||
Scatter_matrix |
df.scatter_matrix()
# Similar to sns.pairplot() |