Table of Contents
Related Searches: pandas groupby apply, python groupby, pandas groupby, group by, df groupby, groupby pandas, pd group by, pandas datafrome groupby, pandas groupby index, groupby two columns pandas, pandas groupby transform
Introduction to Pandas groupby
Generally, python is one of the best languages for doing data analysis because of its various packages. Pandas is one of those packages which makes data analyzing much easier and it is known for its various important features and functionalities. In this tutorial, we will learn about pandas groupby feature which is used for grouping data according to the categories and apply different functions to the categories. We will cover splitting of data, a grouping of data, and selecting data from pandas groupby. By the end of this tutorial, you will have a solid knowledge of pandas groupby and you can easily save your groupby as CSV or Excel files.
Pre-requisites
By default pandas module may not be installed on your environment so you can use pip to install the pandas module:
Getting start with Pandas groupby
Pandas groupby is a simple and useful concept. It helps us to create a group of categories and apply a function to those categories. This concept is useful when in real data science, you will deal with large amounts of data but in this section, we will take small data and apply different concepts of pandas groupby over it. Pandas groupby is really important because of its ability to aggregate data efficiently.
Our datafarme.csv file contains the following data:
student_id,Name,Gender,marks
5,Erlan,Male,34
10,Alex,Male,44
15,soro,Female,46
20,Khan,Male,33
25,ateeq,Male,28
30,MD,Female,49
35,JK,Male,32
40,alam,Male,12
45,kashif,Male,45
50,savri,Female,31
Now let us print the data from dataframe.csv
using pandas.
# importing pandas
import pandas as pd
# printing dataset
<b>print</b>(pd.<b>read_csv</b>("dataframe.csv"))
Output:
student_id Name Gender marks
0 5 Erlan Male 34
1 10 Alex Male 44
2 15 soro Female 46
3 20 Khan Male 33
4 25 ateeq Male 28
5 30 MD Female 49
6 35 JK Male 32
7 40 alam Male 12
8 45 kashif Male 45
9 50 savri Female 31
Splitting data into groups in pandas groupby
There are multiple ways to split pandas objects into sub-objects. The following are the simple syntax of different ways of splitting data.
Creating group based on one column
pandas_object.groupby(‘key’)
Creating a group of multiple columns
pandas_object.groupby([‘key1’,’key2’])
Now let us explain each of the above methods of splitting data by pandas groupby by taking an example. See the following example which takes the csv files, stores the dataset, then splits the dataset using the pandas groupby method.
# importing pandas
import pandas as pd
# printing dataset
my_dataframe = pd.<b>read_csv</b>("dataframe.csv")
# splitting data
<b>print</b>(my_dataframe.<b>groupby</b>("Gender"))
Output:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fe58a5fff10>
The output shows us that we have successfully created a group of pandas objects. Let us now view the group that we have just created.
Viewing group in pandas groupby
In the above example, we had seen that when we try to print the grouped data, it prints the object not the data inside the object. In this section, we will see how we can print out the groups that we have created using the pandas groupby method.
# importing pandas
import pandas as pd
# printing dataset
my_dataframe = pd.<b>read_csv</b>("dataframe.csv")
# splitting data
<b>print</b>(my_dataframe.<b>groupby</b>("Gender").groups)
Output:
{'Female': [2, 5, 9], 'Male': [0, 1, 3, 4, 6, 7, 8]}
The pandas groupby had created a group in the form of a dictionary and indices number in the form of set as values.
Now let us create multiple groups by giving a list of names.
# importing pandas
import pandas as pd
# printing dataset
my_dataframe = pd.<b>read_csv</b>("dataframe.csv")
# splitting data
<b>print</b>(my_dataframe.<b>groupby</b>(["Gender", "Name"]).groups)
Output:
{('Female', ' soro'): [2], ('Female', ' MD'): [5], ('Female', ' savri'): [9], ('Male', ' Alex'): [1], ('Male', ' Erlan'): [0], ('Male', ' JK'): [6], ('Male', ' Khan'): [3], ('Male', ' alam'): [7], ('Male', ' ateeq'): [4], ('Male', ' kashif'): [8]}
You can notice in the above example that we had created a group of gender and names along with their indices numbers.
Iterating through groups in pandas groupby
Now we know how to create a group of dataframe by pandas groupby method. We know that the groupby method returns a pandas object which is iterable. Let us now iterate through the object using a for loop and print the data. See the example below:
import pandas as pd
# printing dataset
my_dataframe = pd.<b>read_csv</b>("dataframe.csv")
# creating group using pandas groupby method
group = my_dataframe.<b>groupby</b>("Gender")
# iterating over group
for gender, value in group:
<b>print</b>(gender)
<b>print</b>(value)
Output:
Female
student_id Name Gender marks
2 15 soro Female 46
5 30 MD Female 49
9 50 savri Female 31
Male
student_id Name Gender marks
0 5 Erlan Male 34
1 10 Alex Male 44
3 20 Khan Male 33
4 25 ateeq Male 28
6 35 JK Male 32
7 40 alam Male 12
8 45 kashif Male 45
Notice that the pandas groupby method has created different groups of data based on Gender. Now let us create a group of multiple columns and iterate over the object using for loop.
# importing pandas
import pandas as pd
# printing dataset
my_dataframe = pd.<b>read_csv</b>("dataframe.csv")
# creating group using pandas groupby method
group = my_dataframe.<b>groupby</b>(["Gender", "Name"])
# iterating over group
for gender, value in group:
<b>print</b>(gender)
<b>print</b>(value)
Output:
('Female', ' soro')
student_id Name Gender marks
2 15 soro Female 46
('Female', ' MD')
student_id Name Gender marks
5 30 MD Female 49
('Female', ' savri')
student_id Name Gender marks
9 50 savri Female 31
('Male', ' Alex')
student_id Name Gender marks
1 10 Alex Male 44
('Male', ' Erlan')
student_id Name Gender marks
0 5 Erlan Male 34
('Male', ' JK')
student_id Name Gender marks
6 35 JK Male 32
('Male', ' Khan')
student_id Name Gender marks
3 20 Khan Male 33
('Male', ' alam')
student_id Name Gender marks
7 40 alam Male 12
('Male', ' ateeq')
student_id Name Gender marks
4 25 ateeq Male 28
('Male', ' kashif')
student_id Name Gender marks
8 45 kashif Male 45
Notice that it created a group for each person because their names are different from each other.
Pandas groupby percentage
Pandas are known for their powerful features and one of them is groping based on percentage or finding percentage of each element in a group. In this section we will apply the percentage method to find the percentage of females and males in each group, by using lambda. See the example below:
# importing pandas
import pandas as pd
# printing dataset
my_dataframe = pd.<b>read_csv</b>("dataframe.csv")
# creating group using pandas groupby method
group = my_dataframe.<b>groupby</b>(["Gender", "Name"]).<b>size</b>()
# Give the percentage on the level of Rank:
per = group.<b>groupby</b>(level=0).<b>apply</b>(lambda x: 100 * x / float(x.sum()))
<b>print</b>(per)
Output:
You can see that first we grouped our data frame and then calculated the percentage of male and female in each group.
Sorting Pandas groupby
Now let us sort the marks by ascending order. Pandas allow us to arrange our data in ascending order. See the following example where we arranged our data ( marks) in ascending order.
# importing pandas
import pandas as pd
# reading dataset
my_dataframe = pd.<b>read_csv</b>("dataframe.csv")
# sorting dataframe in groupby
sorted = my_dataframe.<b>groupby</b>("marks").<b>sum</b>()
<b>print</b>(sorted)
Output:
Notice that the marks are arranged in ascending orders. One more important feature of pandas is that we can create a group of specified numbers/data. For example, we can create a group of all the students who have marks 44. See the example below:
# importing pandas
import pandas as pd
# reading dataset
my_dataframe = pd.<b>read_csv</b>("dataframe.csv")
# create specific data group
sorted = my_dataframe.<b>groupby</b>("marks").<b>get_group</b>(44)
<b>print</b>(sorted)
Output:
student_id Name Gender marks
1 10 Alex Male 44
Selecting a group with multiple keys in pandas groupby
Now as we already are familiar with the basics operations which we can perform on pandas groupby. Now let us first create a group by the pandas groupby method and select a small group from that one. This is possible using the get_group()
method. See the example below:
# importing pandas
import pandas as pd
# reading dataset
my_dataframe = pd.<b>read_csv</b>("dataframe.csv")
# create specific data group
grouped = my_dataframe.<b>groupby</b>(["Name", "Gender"])
# printing group by multiple keys
<b>print</b>(grouped.<b>get_group</b>(("soro","Female")))
Output:
student_id Name Gender marks
2 15 soro Female 46
See that we were able to get a small group by providing multiple keys.
Important methods and pandas groupby
We already have the basic knowledge to work with pandas groupby. However, pandas are not only limited to these features only, but it also provides huge functionality that we can apply on different things. Here in this section, we will cover some more methods which are very useful when dealing with big groups.
We will cover following method in this section:
- Aggregation
- Transformation
- Filtration
Let us say we have the following data in our dataframe.csv
file.
Facebook,Instagram,Tiktok
2390,320,400
300,23,230
4000,340,230
In the following sections we will use different methods on this dataset.
Aggregation
Pandas aggregation provides us with different functions for example mathematical or logical operations on our dataset and it returns a summary of function. Usually, that summary is a single data. Aggregation in pandas can be used to get a summary of columns in our dataset like getting sum, min, maximum, mode, mean, etc from a particular column of our dataset.
Here is a list of different functions associated with aggregation:
sum()
:Compute sum of column valuesmin()
:Compute min of column valuesmax()
:Compute max of column valuesmean()
:Compute mean of columnsize()
:Compute column sizesdescribe()
:Generates descriptive statisticsfirst()
:Compute first of group valueslast()
:Compute last of group valuescount()
:Compute count of column values
Examples: pandas groupby aggregate multiple columns
Now let us implement the above method using practical examples of datasets. See the following example which prints the sum()
of all the cells in each column.
# importing pandas
import pandas as pd
# reading dataset
my_dataframe = pd.<b>read_csv</b>("dataframe.csv")
# printing the sum
<b>print</b>(my_dataframe.<b>sum</b>())
Output:
In a similar way we can get the max and min values as well. See the following example.
# importing pandas
import pandas as pd
# reading dataset
my_dataframe = pd.<b>read_csv</b>("dataframe.csv")
# printing the minimum value from dataset
<b>print</b>(my_dataframe.<b>min</b>())
Output:
Now let us try one more method and get the mean values. See the example below:
# importing pandas
import pandas as pd
# reading dataset
my_dataframe = pd.<b>read_csv</b>("dataframe.csv")
# printing the mean value from dataset
<b>print</b>(my_dataframe.<b>mean</b>())
Output:
More importantly, pandas allows us to perform multiple aggregation at once. See the example below which prints out sum, minimum, maximum and mean at the same time.
# importing pandas
import pandas as pd
# importing numpy
import numpy as np
# reading dataset
my_dataframe = pd.<b>read_csv</b>("dataframe.csv")
# multiple functions
<b>print</b>(my_dataframe.<b>agg</b>([np.<b>sum</b>, np.max, np.min, np.<b>mean</b>]))
Output:
Transformation
The transform method in pandas groupby returns an object which should have the same indexing as that of the grouped one. It is a process in which we perform some computations on a specific group. Let us explain the transformation process with examples. In the example, we have found the percentage of followers on Instagram and TikTok and we assumed that our total target is 500.
my_dataframe = pd.<b>read_csv</b>("dataframe.csv")
# groupby method
dataframe = my_dataframe.<b>groupby</b>('Facebook')
# defining transformation function
trans = lambda x: (x/500)*100
# printing
<b>print</b>(dataframe.<b>transform</b>(trans))
Output:
Instagram Tiktok
0 64.0 80.0
1 4.6 46.0
2 68.0 46.0
In a similar way, we can add our own custom function and transform the whole group into a new one.
Let us now create one new column named total and transform our data from Facebook column to total column. See the example below:
# importing pandas
import pandas as pd
# reading dataset
my_dataframe = pd.<b>read_csv</b>("dataframe.csv")
# Transformation
my_dataframe['Total']=my_dataframe.<b>groupby</b>("Facebook")['Facebook'].<b>transform</b>('sum')
# printing
<b>print</b>(my_dataframe)
Output:
Facebook Instagram Tiktok Total
0 2390 320 400 2390
1 300 23 230 300
2 4000 340 230 4000
Filtration
Now filtration is a process in which we simply discard some groups, based on group wise computation that evaluates True or False. In order to filter a group we use a filter method and apply different conditions to filter the group. Let us first take very simple example of filtration.
# importing pandas
import pandas as pd
# reading dataset
my_dataframe = pd.<b>read_csv</b>("dataframe.csv")
# Filter data to print only Facebook column
<b>print</b>(my_dataframe.<b>filter</b>(["Facebook"]))
Output:
Facebook
0 2390
1 300
2 4000
Now we will apply filtration on groupby by defining a condition using lambda function.
# importing pandas
import pandas as pd
# reading dataset
my_dataframe = pd.<b>read_csv</b>("dataframe.csv")
# groupby method
dataframe = my_dataframe.<b>groupby</b>('Facebook')
# defining transformation function
sc = lambda x: <b>len</b>(x)==1
# printing dataframe
<b>print</b>(dataframe.<b>filter</b>(sc))
Output:
Facebook Instagram Tiktok
0 2390 320 400
1 300 23 230
2 4000 340 230
In a similar way we can create any lambda function to filter our data and apply it through the filter method. Let us take one more example and apply filtration on the Facebook column and print out the rows where the sum of followers exceeds 3000. See the example below:
# importing pandas
import pandas as pd
# reading dataset
my_dataframe = pd.<b>read_csv</b>("dataframe.csv")
# fiters data
filtered = my_dataframe.<b>groupby</b>(['Facebook']).<b>filter</b>(lambda x: <b>sum</b>(x['Facebook']) > 3000)
# printing
<b>print</b>(filtered)
Output:
Facebook Instagram Tiktok
2 4000 340 230
Saving grouped dataframe in different formats
So far we have learned much about pandas groupby and different operations that we can perform over them. Now let us see how we can save our grouped data in different formats. Mainly , we will learn about saving grouped data as CSV file and Excel file in this section.
We will use pandas_to_csv
and pandas_to_excel
methods to save the groupby objects as CSV and excel files respectively. And let say our CSV_file_one.csv
contains the following data:
name,subject1,subject2
bashir,20,34
alam,23,45
khan,35,5
Mk,44,45
Saving pandas groupby as CSV file
Now let us save our pandas groupby objects as a CSV file. See the example below which saves our grouped dataframe in CSV file:
import pandas as pd
# reading dataset
my_dataframe = pd.<b>read_csv</b>("CSV_file_one.csv")
# create specific data group
grouped = my_dataframe.<b>groupby</b>(["name"])
# resetting index and creating group of sum()
dataframe = grouped.<b>sum</b>().<b>reset_index</b>()
# to csv file
dataframe.<b>to_csv</b>("grouped_file.csv", index= True)
Now let us read the newly created csv file to see if we have successfully created it or not. See the example below which reads csv file and prints the data.
# reading newly created file
new = pd.<b>read_csv</b>("grouped_file.csv")
# printing
<b>print</b>(new)
Output:
name subject1 subject2
0 Mk 44 45
1 alam 23 45
2 bashir 20 34
3 khan 35 5
Saving pandas groupby as Excel file
Now let us save the grouped data in an excel file. It is simpler to the csv one but this time we will use the .to_excel()
method. See the example below which saves the pandas group in excel.
# importing pandas
import pandas as pd
# reading dataset
my_dataframe = pd.<b>read_csv</b>("CSV_file_one.csv")
# create specific data group
grouped = my_dataframe.<b>groupby</b>(["name"])
# resetting index and creating group of sum()
dataframe = grouped.<b>sum</b>().<b>reset_index</b>()
# to csv file
dataframe.<b>to_excel</b>("grouped_file.xlsx", index=False)
Let us now read the newly created file and see if we have successfully created an excel file.
# reading newly created file
new = pd.<b>read_excel</b>("grouped_file.xlsx")
# printing
<b>print</b>(new)
Output:
name subject1 subject2
0 Mk 44 45
1 alam 23 45
2 bashir 20 34
3 khan 35 5
Summary
Groupby splits our actual data into different groups depending on our choice. The pandas groupby function returns a groupby object which can be iterable. In this tutorial we learned about the pandas groupby method, splitting data into different groups, and iterating over those groups. Moreover, we also learned about selecting groups and viewing groups. At the same time, we also discussed different ways to save our groups in different file formats including CSV and excel
Further Reading
pandas groupby
pandas groupby in more details
pandas groupby documentation