Pivot Table: Practical Use for Data Analysis in Pandas

Mari Galdina
Published in Analytics Vidhya
4 min read · Feb 8, 2021

Every project brings a Data Scientist plenty of questions and challenges. When we face a new data set, we don't yet know how to describe it clearly, which features correlate with each other, how many outliers it contains, and so on. We start to understand the data set with quick exploratory data analysis (EDA), where we need to summarize hundreds of rows and columns. For this step, a simple and powerful tool is a plus.
One tool that helps us save time and avoid frustration is the pivot table. It lets us slice, filter, and group data at the speed of inquiry, and it presents the information in a visually appealing way.
In this blog, I want to show you how to work with pivot tables in Pandas.

Group of pandas.

The simplest definition for pivot tables

A pivot table is a table of statistics that summarizes the data from the original table. It sounds like a simple groupby operation, but there is one difference.

The Pivot Table takes simple column-wise data as input and groups the entries into a two-dimensional table, which provides a multi-dimensional summarization of the data. ©

— definition from Jake VanderPlas “Python Data Science Handbook”

To illustrate how pivot_table works, I use the Titanic dataset from Kaggle. This dataset has enough columns to practice analysis, and it is quite simple to understand.

Definition of Columns for Titanic Dataset

Let's come up with questions about this dataset that will help us see how pivot_table actually works.

How high is the survival rate for different groups of passengers?

1. Simple example for pivot table

If we check the documentation for groupby and pivot_table, we can write the same query both ways and compare the syntax.

Code example
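Since the code screenshot may not render everywhere, here is a sketch of the comparison. The data values are a tiny made-up sample standing in for the Kaggle Titanic data, not the real dataset:

```python
import pandas as pd

# Tiny hypothetical sample standing in for the Kaggle Titanic data
df = pd.DataFrame({
    'Sex': ['female', 'female', 'male', 'male', 'male'],
    'Survived': [1, 1, 0, 1, 0],
})

# groupby version
by_sex_gb = df.groupby('Sex')['Survived'].mean()

# pivot_table version: same numbers, arguably clearer intent
by_sex_pt = df.pivot_table('Survived', index='Sex')

print(by_sex_gb)
print(by_sex_pt)
```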

The first thing that comes to mind is to compute the survival rate by the column "Sex". As we can see, groupby and pivot_table give us the same results. The syntax is simple and readable. The line for the pivot table can be changed to:

df.pivot_table('Survived', index='Sex', aggfunc='mean')
# aggfunc can even be omitted here, because the default is numpy.mean()

So, for both of these functions, we can use all the usual statistical functions: sum(), mean(), median(), min(), max(), quantile(), etc.

Still, both result tables tell us that three of every four females on board survived, while only one in five males survived!

For the next step, we dive deeper into the data. Does this survival rate depend on ticket class?

Slice the data frame by two variables.

I wrote two lines of code, one for groupby and one for pivot_table. As we can see, they both give equivalent results, but pivot_table is more readable.

2. Use different aggregation functions with the pivot table

We can use different aggregation functions for pivot tables. The aggfunc keyword controls which function is applied. For example, let's count the survivors.

df.pivot_table('Survived', index='Sex', columns='Pclass', aggfunc=sum)

As a result, we have the same data frame shape, but the cells are filled with the number of survivors in each class. Instead of one function, we can also pass a list of functions:

df.pivot_table('Survived', index='Sex', columns='Pclass', aggfunc=[sum, 'mean'])
What we see when using a list of aggregation functions
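The screenshot shows that the columns become a two-level index, with one top-level entry per aggregation function. A rough sketch on a made-up sample:

```python
import pandas as pd

# Hypothetical mini sample
df = pd.DataFrame({
    'Sex': ['female', 'male', 'male'],
    'Pclass': [1, 1, 3],
    'Survived': [1, 1, 0],
})

# One sub-table per function, glued together as a column MultiIndex
multi = df.pivot_table('Survived', index='Sex', columns='Pclass',
                       aggfunc=['sum', 'mean'])

# Top level of the columns is the function name, second level is Pclass
print(multi)
```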

More magic from aggregation functions!

We can specify the result columns with a dictionary mapping each column to any of these options. For example, I use the sum function to count the survivors and the mean function for the passenger fare.
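A sketch of that dictionary form, on a hypothetical sample that also carries a Fare column:

```python
import pandas as pd

# Hypothetical mini sample with made-up fares
df = pd.DataFrame({
    'Sex': ['female', 'male', 'male'],
    'Pclass': [1, 1, 3],
    'Survived': [1, 1, 0],
    'Fare': [80.0, 50.0, 8.0],
})

# A different aggregation per column: count survivors, average the fare
per_column = df.pivot_table(index='Sex', columns='Pclass',
                            aggfunc={'Survived': 'sum', 'Fare': 'mean'})
```

Note that `values` is omitted here; the keys of the dictionary tell pivot_table which columns to aggregate.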

3. Totals for pivot table

The margins keyword computes totals along each grouping. It automatically gives us the totals, so we don't need to create additional columns or functions.
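A minimal sketch of margins in action, on the same kind of made-up sample; the added "All" row and column hold the totals:

```python
import pandas as pd

# Hypothetical mini sample
df = pd.DataFrame({
    'Sex': ['female', 'female', 'male', 'male'],
    'Pclass': [1, 3, 1, 3],
    'Survived': [1, 0, 1, 0],
})

# margins=True appends an 'All' row and column with the grand totals
totals = df.pivot_table('Survived', index='Sex', columns='Pclass',
                        aggfunc='sum', margins=True)
print(totals)
```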

Summary

I have tried to show how to use pivot_table with simple examples. You can argue with me about using some functions (like unstack), but they give a better visual effect when showing all the results in one data frame. I also plan to write a blog post about the pivot function for SQL tables.
