Pandas groupby methods explained with SIMPLE examples


Written By - Bashir Alam
Advertisement

Related Searches: pandas groupby apply, python groupby, pandas groupby, group by, df groupby, groupby pandas, pd group by, pandas datafrome groupby, pandas groupby index, groupby two columns pandas, pandas groupby transform

 

Introduction to Pandas groupby

Generally, python is one of the best languages for doing data analysis because of its various packages. Pandas is one of those packages which makes data analyzing much easier and it is known for its various important features and functionalities. In this tutorial, we will learn about pandas groupby feature which is used for grouping data according to the categories and apply different functions to the categories. We will cover splitting of data, a grouping of data, and selecting data from pandas groupby. By the end of this tutorial, you will have a solid knowledge of pandas groupby and you can easily save your groupby as CSV or Excel files.

 

Pre-requisites

By default pandas module may not be installed on your environment so you can use pip to install the pandas module:

Pandas groupby methods explained with SIMPLE examples

 

Getting start with Pandas groupby

Pandas groupby is a simple and useful concept. It helps us to create a group of categories and apply a function to those categories. This concept is useful when in real data science, you will deal with large amounts of data but in this section, we will take small data and apply different concepts of pandas groupby over it. Pandas groupby is really important because of its ability to aggregate data efficiently.

Our datafarme.csv file contains the following data:

student_id,Name,Gender,marks
5,Erlan,Male,34
10,Alex,Male,44
15,soro,Female,46
20,Khan,Male,33
25,ateeq,Male,28
30,MD,Female,49
35,JK,Male,32
40,alam,Male,12
45,kashif,Male,45
50,savri,Female,31

Now let us print the data from dataframe.csv  using pandas.

# importing pandas
import pandas as pd

# printing dataset
<b>print</b>(pd.<b>read_csv</b>("dataframe.csv"))

Output:

   student_id     Name   Gender    marks
0           5    Erlan     Male       34
1          10     Alex     Male       44
2          15     soro   Female       46
3          20     Khan      Male      33
4          25    ateeq      Male      28
5          30       MD    Female      49
6          35       JK      Male      32
7          40     alam      Male      12
8          45   kashif      Male      45
9          50    savri    Female      31

 

Splitting data into groups in pandas groupby

There are multiple ways to split pandas objects into sub-objects. The following are the simple syntax of different ways of splitting data.

Advertisement

Creating group based on one column

pandas_object.groupby(‘key’)

Creating a group of multiple columns

pandas_object.groupby([‘key1’,’key2’])

Now let us explain each of the above methods of splitting data by pandas groupby by taking an example.  See the following example which takes the csv files, stores the dataset, then splits the dataset using the pandas groupby method.

# importing pandas
import pandas as pd

# printing dataset
my_dataframe = pd.<b>read_csv</b>("dataframe.csv")

# splitting data
<b>print</b>(my_dataframe.<b>groupby</b>("Gender"))

Output:

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fe58a5fff10>

The output shows us that we have successfully created a group of pandas objects. Let us now view the group that we have just created.

 

Viewing group in pandas groupby

In the above example, we had seen that when we try to print the grouped data, it prints the object not the data inside the object. In this section, we will see how we can print out the groups that we have created using the pandas groupby method.

# importing pandas
import pandas as pd

# printing dataset
my_dataframe = pd.<b>read_csv</b>("dataframe.csv")

# splitting data
<b>print</b>(my_dataframe.<b>groupby</b>("Gender").groups)

Output:

{'Female': [2, 5, 9], 'Male': [0, 1, 3, 4, 6, 7, 8]}

The pandas groupby had created a group in the form of a dictionary and indices number in the form of set as values.

Now let us create multiple groups by giving a list of names.

# importing pandas
import pandas as pd

# printing dataset
my_dataframe = pd.<b>read_csv</b>("dataframe.csv")

# splitting data
<b>print</b>(my_dataframe.<b>groupby</b>(["Gender", "Name"]).groups)

Output:

Advertisement
{('Female', '  soro'): [2], ('Female', ' MD'): [5], ('Female', ' savri'): [9], ('Male', '  Alex'): [1], ('Male', ' Erlan'): [0], ('Male', ' JK'): [6], ('Male', ' Khan'): [3], ('Male', ' alam'): [7], ('Male', ' ateeq'): [4], ('Male', ' kashif'): [8]}

You can notice in the above example that we had created a group of gender and names along with their indices numbers.

 

Iterating through groups in pandas groupby

Now we know how to create a group of dataframe by pandas groupby method. We know that the groupby method returns a pandas object which is iterable. Let us now iterate through the object using a for loop and print the data. See the example below:

import pandas as pd
# printing dataset
my_dataframe = pd.<b>read_csv</b>("dataframe.csv")

# creating group using pandas groupby method
group = my_dataframe.<b>groupby</b>("Gender")

# iterating over group
for gender, value in group:
   <b>print</b>(gender)
   <b>print</b>(value)

Output:

Female
   student_id    Name  Gender   marks
2          15    soro  Female      46
5          30      MD  Female      49
9          50   savri  Female      31
Male
   student_id     Name Gender   marks
0           5    Erlan   Male      34
1          10     Alex   Male      44
3          20     Khan   Male      33
4          25    ateeq   Male      28
6          35       JK   Male      32
7          40     alam   Male      12
8          45   kashif   Male      45

Notice that the pandas groupby method has created different groups of data based on Gender.  Now let us create a group of multiple columns and iterate over the object using for loop.

# importing pandas
import pandas as pd

# printing dataset
my_dataframe = pd.<b>read_csv</b>("dataframe.csv")

# creating group using pandas groupby method
group = my_dataframe.<b>groupby</b>(["Gender", "Name"])

# iterating over group
for gender, value in group:
   <b>print</b>(gender)
   <b>print</b>(value)

Output:

('Female', '  soro')
   student_id    Name  Gender   marks
2          15    soro  Female      46
('Female', ' MD')
   student_id Name  Gender   marks
5          30   MD  Female      49
('Female', ' savri')
   student_id    Name  Gender   marks
9          50   savri  Female      31
('Male', '  Alex')
   student_id    Name Gender   marks
1          10    Alex   Male      44
('Male', ' Erlan')
   student_id    Name Gender   marks
0           5   Erlan   Male      34
('Male', ' JK')
   student_id Name Gender   marks
6          35   JK   Male      32
('Male', ' Khan')
   student_id   Name Gender   marks
3          20   Khan   Male      33
('Male', ' alam')
   student_id   Name Gender   marks
7          40   alam   Male      12
('Male', ' ateeq')
   student_id    Name Gender   marks
4          25   ateeq   Male      28
('Male', ' kashif')
   student_id     Name Gender   marks
8          45   kashif   Male      45

Notice that it created a group for each person because their names are different from each other.

Advertisement

 

Pandas groupby percentage

Pandas are known for their powerful features and one of them is groping based on percentage or finding percentage of each element in a group. In this section we will apply the percentage method to find the percentage of females and males in each group, by using lambda. See the example below:

# importing pandas
import pandas as pd

# printing dataset
my_dataframe = pd.<b>read_csv</b>("dataframe.csv")

# creating group using pandas groupby method
group = my_dataframe.<b>groupby</b>(["Gender", "Name"]).<b>size</b>()

# Give the percentage on the level of Rank:
per = group.<b>groupby</b>(level=0).<b>apply</b>(lambda x: 100 * x / float(x.sum()))
<b>print</b>(per)

Output:

Pandas groupby methods explained with SIMPLE examples

You can see that first we grouped our data frame and then calculated the percentage of male and female in each group.

 

Sorting Pandas groupby

Now let us sort the marks by ascending order.  Pandas allow us to arrange our data in ascending order. See the following example where we arranged our data ( marks) in ascending order.

# importing pandas
import pandas as pd

# reading dataset
my_dataframe = pd.<b>read_csv</b>("dataframe.csv")

# sorting dataframe in groupby
sorted = my_dataframe.<b>groupby</b>("marks").<b>sum</b>()
<b>print</b>(sorted)

Output:

Pandas groupby methods explained with SIMPLE examples

Notice that the marks are arranged in ascending orders. One more important feature of pandas is that we can create a group of specified numbers/data. For example, we can create a group of all the students who have marks 44. See the example below:

# importing pandas
import pandas as pd

# reading dataset
my_dataframe = pd.<b>read_csv</b>("dataframe.csv")

# create specific data group
sorted = my_dataframe.<b>groupby</b>("marks").<b>get_group</b>(44)
<b>print</b>(sorted)

Output:

   student_id    Name Gender  marks
1          10    Alex   Male     44

 

Selecting a group with multiple keys in pandas groupby

Now as we already are familiar with the basics operations which we can perform on pandas groupby. Now let us first create a group by the pandas groupby method and select a small group from that one.  This is possible using the get_group() method. See the example below:

# importing pandas
import pandas as pd

# reading dataset
my_dataframe = pd.<b>read_csv</b>("dataframe.csv")

# create specific data group
grouped = my_dataframe.<b>groupby</b>(["Name", "Gender"])

# printing group by multiple keys
<b>print</b>(grouped.<b>get_group</b>(("soro","Female")))

Output:

   student_id  Name  Gender  marks
2          15  soro  Female     46

See that we were able to get a small group by providing multiple keys.

Advertisement

 

Important methods and pandas groupby

We already have the basic knowledge to work with pandas groupby. However, pandas are not only limited to these features only, but it also provides huge functionality that we can apply on different things. Here in this section, we will cover some more methods which are very useful when dealing with big groups.

We will cover following method in this section:

  • Aggregation
  • Transformation
  • Filtration

Let us say we have the following data in our dataframe.csv file.

Facebook,Instagram,Tiktok
2390,320,400
300,23,230
4000,340,230

In the following sections we will use different methods on this dataset.

 

Aggregation

Pandas aggregation provides us with different functions for example mathematical or logical operations on our dataset and it returns a summary of function. Usually, that summary is a single data. Aggregation in pandas can be used to get a summary of columns in our dataset like getting sum, min, maximum, mode, mean, etc from a particular column of our dataset.

Here is a list of different functions associated with aggregation:

  • sum() :Compute sum of column values
  • min() :Compute min of column values
  • max() :Compute max of column values
  • mean() :Compute mean of column
  • size() :Compute column sizes
  • describe() :Generates descriptive statistics
  • first() :Compute first of group values
  • last() :Compute last of group values
  • count() :Compute count of column values

 

Examples: pandas groupby aggregate multiple columns

Now let us implement the above method using practical examples of datasets. See the following example which prints the sum() of all the cells in each column.

# importing pandas
import pandas as pd

# reading dataset
my_dataframe = pd.<b>read_csv</b>("dataframe.csv")

# printing the sum
<b>print</b>(my_dataframe.<b>sum</b>())

Output:

Pandas groupby methods explained with SIMPLE examples

In a similar way we can get the max and min values as well. See the following example.

# importing pandas
import pandas as pd

# reading dataset
my_dataframe = pd.<b>read_csv</b>("dataframe.csv")

# printing the minimum value from dataset
<b>print</b>(my_dataframe.<b>min</b>())

Output:

Advertisement

Pandas groupby methods explained with SIMPLE examples

Now let us try one more method and get the mean values. See the example below:

# importing pandas
import pandas as pd

# reading dataset
my_dataframe = pd.<b>read_csv</b>("dataframe.csv")

# printing the mean value from dataset
<b>print</b>(my_dataframe.<b>mean</b>())

Output:

Pandas groupby methods explained with SIMPLE examples

More importantly, pandas allows us to perform multiple aggregation at once. See the example below which prints out sum, minimum, maximum and mean at the same time.

# importing pandas
import pandas as pd

# importing numpy
import numpy as np

# reading dataset
my_dataframe = pd.<b>read_csv</b>("dataframe.csv")

# multiple functions
<b>print</b>(my_dataframe.<b>agg</b>([np.<b>sum</b>, np.max, np.min, np.<b>mean</b>]))

Output:

Pandas groupby methods explained with SIMPLE examples

 

Transformation

The transform method in pandas groupby returns an object which should have the same indexing as that of the grouped one.  It is a process in which we perform some computations on a specific group. Let us explain the transformation process with examples. In the example, we have found the percentage of followers on Instagram and TikTok and we assumed that our total target is 500.

my_dataframe = pd.<b>read_csv</b>("dataframe.csv")

# groupby method
dataframe = my_dataframe.<b>groupby</b>('Facebook')

# defining transformation function
trans = lambda x: (x/500)*100

# printing
<b>print</b>(dataframe.<b>transform</b>(trans))

Output:

   Instagram  Tiktok
0       64.0    80.0
1        4.6    46.0
2       68.0    46.0

In a similar way, we can add our own custom function and transform the whole group into a new one.

Let us now create one new column named total and transform our data from Facebook column to total column. See the example below:

# importing pandas
import pandas as pd

# reading dataset
my_dataframe = pd.<b>read_csv</b>("dataframe.csv")

# Transformation
my_dataframe['Total']=my_dataframe.<b>groupby</b>("Facebook")['Facebook'].<b>transform</b>('sum')

# printing
<b>print</b>(my_dataframe)

Output:

  Facebook  Instagram  Tiktok  Total
0      2390        320     400   2390
1       300         23     230    300
2      4000        340     230   4000

Filtration

Now filtration is a process in which we simply discard some groups, based on group wise computation that evaluates True or False. In order to filter a group we use a filter method and apply different conditions to filter the group. Let us first take very simple example of filtration.

# importing pandas
import pandas as pd

# reading dataset
my_dataframe = pd.<b>read_csv</b>("dataframe.csv")

# Filter data to print only Facebook column
<b>print</b>(my_dataframe.<b>filter</b>(["Facebook"]))

Output:

   Facebook
0      2390
1       300
2      4000

Now we will apply filtration on groupby by defining a condition using lambda function.

# importing pandas
import pandas as pd

# reading dataset
my_dataframe = pd.<b>read_csv</b>("dataframe.csv")

# groupby method
dataframe = my_dataframe.<b>groupby</b>('Facebook')

# defining transformation function
sc = lambda x: <b>len</b>(x)==1

# printing dataframe
<b>print</b>(dataframe.<b>filter</b>(sc))

Output:

Advertisement
   Facebook  Instagram  Tiktok
0      2390        320     400
1       300         23     230
2      4000        340     230

In a similar way we can create any lambda function to filter our data and apply it through the filter method.  Let us take one more example and apply filtration on the Facebook column and print out the rows where the sum of followers exceeds 3000. See the example below:

# importing pandas
import pandas as pd

# reading dataset
my_dataframe = pd.<b>read_csv</b>("dataframe.csv")

# fiters data
filtered = my_dataframe.<b>groupby</b>(['Facebook']).<b>filter</b>(lambda x: <b>sum</b>(x['Facebook']) > 3000)

# printing
<b>print</b>(filtered)

Output:

   Facebook  Instagram  Tiktok
2      4000        340     230

 

Saving grouped dataframe in different formats

So far we have learned much about pandas groupby and different operations that we can perform over them. Now let us see how we can save our grouped data in different formats. Mainly , we will learn about saving grouped data as CSV file and Excel file in this section.

We will use pandas_to_csv and pandas_to_excel methods to save the groupby objects as CSV and excel files respectively.  And let say our CSV_file_one.csv contains the following data:

name,subject1,subject2
bashir,20,34
alam,23,45
khan,35,5
Mk,44,45

 

Saving pandas groupby as CSV file

Now let us save our pandas groupby objects as a CSV file. See the example below which saves our grouped dataframe in CSV file:

import pandas as pd

# reading dataset
my_dataframe = pd.<b>read_csv</b>("CSV_file_one.csv")

# create specific data group
grouped = my_dataframe.<b>groupby</b>(["name"])

# resetting index and creating group of sum()
dataframe = grouped.<b>sum</b>().<b>reset_index</b>()

# to csv file
dataframe.<b>to_csv</b>("grouped_file.csv", index= True)

Now let us read the newly created csv file to see if we have successfully created it or not. See the example below which reads csv file and prints the data.

# reading newly created file
new = pd.<b>read_csv</b>("grouped_file.csv")

# printing
<b>print</b>(new)

Output:

     name  subject1  subject2
0      Mk        44        45
1    alam        23        45
2  bashir        20        34
3    khan        35         5

 

Saving pandas groupby as Excel file

Now let us save the grouped data in an excel file. It is simpler to the csv one but this time we will use the .to_excel() method. See the example below which saves the pandas group in excel.

# importing pandas
import pandas as pd

# reading dataset
my_dataframe = pd.<b>read_csv</b>("CSV_file_one.csv")

# create specific data group
grouped = my_dataframe.<b>groupby</b>(["name"])

# resetting index and creating group of sum()
dataframe = grouped.<b>sum</b>().<b>reset_index</b>()

# to csv file
dataframe.<b>to_excel</b>("grouped_file.xlsx", index=False)

Let us now read the newly created file and see if we have successfully created an excel file.

# reading newly created file
new = pd.<b>read_excel</b>("grouped_file.xlsx")

# printing
<b>print</b>(new)

Output:

     name  subject1  subject2
0      Mk        44        45
1    alam        23        45
2  bashir        20        34
3    khan        35         5

 

Summary

Groupby splits our actual data into different groups depending on our choice. The pandas groupby function returns a groupby object which can be iterable. In this tutorial we learned about the pandas groupby method, splitting data into different groups, and iterating over those groups. Moreover, we also learned about selecting groups and viewing groups. At the same time, we also discussed different ways to save our groups in different file formats including CSV and excel

 

Further Reading

pandas groupby
pandas groupby in more details
pandas groupby documentation

 

Didn't find what you were looking for? Perform a quick search across GoLinuxCloud

If my articles on GoLinuxCloud has helped you, kindly consider buying me a coffee as a token of appreciation.

Buy GoLinuxCloud a Coffee

For any other feedbacks or questions you can either use the comments section or contact me form.

Thank You for your support!!

Leave a Comment