In Pandas, the pandas.crosstab() function is used to compute a cross-tabulation of two (or more) factors. It is a convenient way to analyze the relationship between two or more categorical variables in a DataFrame.
In this article, we will first understand the syntax and the parameters of pandas.crosstab() function, then we will look at some examples to demonstrate it. In each example, we will pass different parameter combinations to understand how this function works in different scenarios. Let’s get started.
Syntax of pandas.crosstab() Function
By specifying the index and columns, this function creates a table that shows the frequency distribution of observations across various categories. Additional parameters such as values, aggfunc, margins, and normalize provide flexibility in tailoring the output.
Syntax:
pandas.crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name='All', dropna=True, normalize=False)
Parameters:
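- index: The values to group by in the rows. This can be an array-like object, a Series (such as a DataFrame column), or a list of arrays/Series.
- columns: The values to group by in the columns. It accepts the same kinds of input as index.
- values (optional): An array-like object holding the values to aggregate according to the factors. It must be used together with aggfunc.
- rownames (optional): A sequence of names for the row levels. If provided, it must match the number of row arrays passed.
- colnames (optional): A sequence of names for the column levels. If provided, it must match the number of column arrays passed.
- aggfunc (optional): Aggregation function to apply to values (e.g., ‘sum’, ‘mean’, ‘count’). It must be used together with values. If neither values nor aggfunc is given, the function computes a frequency table (counts).
- margins (optional): If True, add a row and a column of totals (subtotals).
- margins_name (optional): Name of the row/column that will contain the totals when margins is True. The default is ‘All’.
- dropna (optional): If True, do not include columns whose entries are all NaN.
- normalize (optional): If True or ‘all’, divide every count by the grand total. ‘index’ normalizes within each row and ‘columns’ within each column. The results are proportions between 0 and 1 rather than counts.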
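As a minimal sketch of the parameters above, the following example passes two NumPy arrays directly instead of DataFrame columns and labels the axes with rownames and colnames. The arrays sex and hand are hypothetical data made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two parallel arrays, one entry per observation.
sex = np.array(["Female", "Male", "Male", "Female", "Male"])
hand = np.array(["Left", "Right", "Left", "Right", "Left"])

# No DataFrame is needed; rownames/colnames label the axes of the result.
table = pd.crosstab(sex, hand, rownames=["Sex"], colnames=["Handedness"])
print(table)
```

Because neither values nor aggfunc is passed, each cell simply counts how many observations fall into that combination.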
Implementation of pandas.crosstab() Function
To demonstrate the pandas.crosstab() function, let’s first create a DataFrame.
import pandas as pd
df = pd.read_excel("survey.xls")
df
Here we imported Pandas as pd, read the Excel file survey.xls into a DataFrame, and then displayed the DataFrame.
Output:
Let’s now implement pandas.crosstab() function by passing different parameters into it for the above Pandas DataFrame.
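If survey.xls is not at hand, a small stand-in DataFrame can be built in code. The sketch below is a hypothetical dataset, not the contents of the original file. Its column names (Sex, Handedness, Nationality, Age) follow the ones used in the examples, and the rows are made up so that the counts agree with the figures quoted in this article. Note also that reading a legacy .xls file with pd.read_excel() typically relies on the optional xlrd package being installed.

```python
import pandas as pd

# Hypothetical stand-in for survey.xls: 12 made-up respondents.
df = pd.DataFrame({
    "Sex": ["Female"] * 5 + ["Male"] * 7,
    "Handedness": ["Left", "Left", "Right", "Right", "Right",
                   "Left", "Left", "Left", "Left", "Left", "Right", "Right"],
    "Nationality": ["Bangladesh", "India", "USA", "USA", "India",
                    "Bangladesh", "India", "India", "China", "China", "USA", "China"],
    "Age": [22, 31, 27, 35, 29, 41, 19, 33, 26, 38, 45, 24],
})
print(df)
```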
Example 1: Passing ‘index’ and ‘columns’ Parameters
pd.crosstab(df.Nationality, df.Handedness)
Here we have passed df.Nationality as the index and df.Handedness as the columns. This means each row of the table corresponds to a nationality and each column corresponds to a handedness category.
Output:
The table shows that two individuals from Bangladesh are left-handed. Similarly, the USA has a value of 3 in the right-handed column, which means three right-handed people are from the USA.
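A self-contained sketch of this kind of frequency table is shown below. It uses a hypothetical stand-in dataset (not the original survey.xls) and reads single cells with .loc to confirm the counts.

```python
import pandas as pd

# Hypothetical stand-in data; the real survey.xls is not reproduced here.
df = pd.DataFrame({
    "Handedness": ["Left", "Left", "Right", "Right", "Right",
                   "Left", "Left", "Left", "Left", "Left", "Right", "Right"],
    "Nationality": ["Bangladesh", "India", "USA", "USA", "India",
                    "Bangladesh", "India", "India", "China", "China", "USA", "China"],
})

# Rows are nationalities, columns are handedness categories, cells are counts.
table = pd.crosstab(df["Nationality"], df["Handedness"])
print(table)

# Individual cells can be read with .loc[row_label, column_label].
print(table.loc["Bangladesh", "Left"])
print(table.loc["USA", "Right"])
```

Combinations that never occur in the data, such as left-handed respondents from the USA in this made-up sample, appear as 0 in the table.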
Example 2: Passing ‘index’, ‘columns’, and ‘margins’ Parameters
When the margins parameter is True, pandas.crosstab() adds an extra row and an extra column (named All by default) that hold the column and row totals.
pd.crosstab(df.Sex, df.Handedness, margins=True)
In the above code, we have taken the Sex variable for the rows and Handedness for the columns, which gives us the number of males and females who are left-handed and right-handed. We also passed margins=True, which adds the totals for each sex and for each handedness category.
Output:
In the output, we can see that the total number of females is 5 and the total number of males is 7. The All row also gives the total number of left-handed and right-handed people.
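The sketch below illustrates margins on a hypothetical stand-in dataset (not the original survey.xls). It also sets margins_name to "Total" to show how the label of the totals row and column can be changed.

```python
import pandas as pd

# Hypothetical stand-in data: 5 females and 7 males.
df = pd.DataFrame({
    "Sex": ["Female"] * 5 + ["Male"] * 7,
    "Handedness": ["Left", "Left", "Right", "Right", "Right",
                   "Left", "Left", "Left", "Left", "Left", "Right", "Right"],
})

# margins=True appends a totals row and column; margins_name renames them.
table = pd.crosstab(df["Sex"], df["Handedness"],
                    margins=True, margins_name="Total")
print(table)
```

The bottom-right cell of the result holds the grand total, which for this sample is the number of rows in the DataFrame.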
Example 3: Using an Array for the ‘columns’ Parameter
To add Nationality as a second variable on the column axis, we pass a list containing df.Handedness and df.Nationality as the columns argument.
pd.crosstab(df.Sex, [df.Handedness, df.Nationality], margins=True)
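Output: the columns now form a two-level header, with Handedness on the top level and Nationality beneath it, and margins=True adds an All row and an All column of totals.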
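As a rough illustration of the two-level columns, the sketch below uses a hypothetical stand-in dataset (not the original survey.xls) and shows how cells in the result are addressed with a (Handedness, Nationality) tuple.

```python
import pandas as pd

# Hypothetical stand-in data; the real survey.xls is not reproduced here.
df = pd.DataFrame({
    "Sex": ["Female"] * 5 + ["Male"] * 7,
    "Handedness": ["Left", "Left", "Right", "Right", "Right",
                   "Left", "Left", "Left", "Left", "Left", "Right", "Right"],
    "Nationality": ["Bangladesh", "India", "USA", "USA", "India",
                    "Bangladesh", "India", "India", "China", "China", "USA", "China"],
})

# Passing a list as `columns` produces a MultiIndex on the column axis.
table = pd.crosstab(df["Sex"], [df["Handedness"], df["Nationality"]],
                    margins=True)
print(table)

# The column axis has two levels, so a cell is addressed with a tuple.
print(table.columns.nlevels)
print(table.loc["Male", ("Left", "China")])
```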
Example 4: Passing ‘index’, ‘columns’, and ‘normalize’ Parameters
Sometimes proportions are more useful than raw counts. For this we use the normalize argument: with normalize=True (or ‘all’) every count is divided by the grand total, with ‘index’ each count is divided by its row total, and with ‘columns’ each count is divided by its column total.
pd.crosstab(df.Sex, df.Handedness, normalize='index')
We have passed normalize='index', which calculates the proportions within each row, so every row sums to 1.
Output:
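The output shows that 40% of females are left-handed and 60% are right-handed, while roughly 71% of males are left-handed and 29% are right-handed. The values are returned as fractions between 0 and 1 (0.4, 0.6, and so on), so multiply them by 100 to read them as percentages.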
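The three normalization modes can be compared side by side. The sketch below uses a hypothetical stand-in dataset (not the original survey.xls) with 5 females and 7 males.

```python
import pandas as pd

# Hypothetical stand-in data: 2 of 5 females and 5 of 7 males are left-handed.
df = pd.DataFrame({
    "Sex": ["Female"] * 5 + ["Male"] * 7,
    "Handedness": ["Left", "Left", "Right", "Right", "Right",
                   "Left", "Left", "Left", "Left", "Left", "Right", "Right"],
})

# 'index': each row sums to 1 (share of each handedness within a sex).
by_row = pd.crosstab(df["Sex"], df["Handedness"], normalize="index")

# 'columns': each column sums to 1 (share of each sex within a handedness).
by_col = pd.crosstab(df["Sex"], df["Handedness"], normalize="columns")

# 'all': the whole table sums to 1 (share of all respondents).
overall = pd.crosstab(df["Sex"], df["Handedness"], normalize="all")

# Multiply by 100 and round to present the proportions as percentages.
print((by_row * 100).round(1))
print((by_col * 100).round(1))
print((overall * 100).round(1))
```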
Example 5: Passing ‘index’, ‘columns’, ‘values’, and ‘aggfunc’ Parameters
Here we will find the average age of left-handed and right-handed males and females in the DataFrame loaded from survey.xls by passing values=df.Age and aggfunc=np.average to the pandas.crosstab() function.
import numpy as np
pd.crosstab(df.Sex, df.Handedness, values=df.Age, aggfunc=np.average)
Output: a table of average ages, with Sex as the rows and Handedness as the columns.
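The sketch below shows the same kind of aggregation on a hypothetical stand-in dataset (not the original survey.xls, and the ages are made up). It uses the string 'mean' as the aggregation function, which for this purpose plays the same role as np.average, and it also shows that passing values without aggfunc raises a ValueError.

```python
import pandas as pd

# Hypothetical stand-in data with made-up ages.
df = pd.DataFrame({
    "Sex": ["Female"] * 5 + ["Male"] * 7,
    "Handedness": ["Left", "Left", "Right", "Right", "Right",
                   "Left", "Left", "Left", "Left", "Left", "Right", "Right"],
    "Age": [22, 31, 27, 35, 29, 41, 19, 33, 26, 38, 45, 24],
})

# Each cell holds the mean Age of the respondents in that Sex/Handedness group.
mean_age = pd.crosstab(df["Sex"], df["Handedness"],
                       values=df["Age"], aggfunc="mean")
print(mean_age.round(1))

# values and aggfunc must be supplied together; values alone is rejected.
error_message = None
try:
    pd.crosstab(df["Sex"], df["Handedness"], values=df["Age"])
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```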
Summary
Now that we have reached the end of this article, we hope it has elaborated on how to use the pandas.crosstab() function from the Pandas library.
Reference
https://pandas.pydata.org/docs/reference/api/pandas.crosstab.html#