How to Use the Pandas Groupby Method?

Contents

  • What is The Pandas Groupby Function?

  • Sample Data We’ll Be Using

  • Splitting Data into Groups Using Pandas Groupby

    • Creating a Group Based on One Column

    • Creating Groups Based on Two Columns

    • Grouping By Three Columns

  • Getting a Specific Group

  • Pandas Groupby Output Format

  • Iterating Through Groups in Pandas Groupby

  • Finding the Percentage of Grouped Data

  • Different Built-in Functions

    • Aggregation

    • Filtering

    • Transformation

    • Apply

  • Saving Grouped Pandas DataFrame in Different Formats

    • Saving a Pandas DataFrame As a CSV File

    • Saving Pandas DataFrame as an Excel file

  • Summary

  • You May Also Enjoy

What is The Pandas Groupby Function?

The groupby function in Pandas is a powerful and versatile tool in Python. Using this method, we may split our data, apply different operations to different subsets, and then merge the final results. This may be used to process enormous amounts of data for various computations.

This tutorial will cover splitting, grouping, and selecting data from Pandas groupby. By the end, you will have a solid knowledge of Pandas groupby. You can easily apply this method to your dataset and save it as CSV and Excel files.

As part of this tutorial, we will use a CSV file with not so much data for training purposes. You can download this file later and practice on your own.

Sample Data We’ll Be Using

Our dataset csv file (AQ.csv) contains the following data:

Country,Air Quality Index,Year,Status,Continent
Bangladesh,173,2021,Poor,South Asia
Chad,147,2021,Poor,Africa
Pakistan,131,2021,Poor,South Asia
Tajikistan,116,2021,Poor,Central Asia
Mongolia,107,2021,Poor,Asia
Uganda,99,2021,Moderate,Africa
United Arab Emirates,95,2021,Moderate,Asia
Uzbekistan,90,2021,Moderate,Central Asia
India,88,2021,Moderate,South Asia
Kyrgyzstan,61,2021,Moderate,Central Asia

Now let’s open our example CSV file and see our data using the Pandas read_csv function.

Let’s use the Pandas read_csv function to look at our sample CSV file and its contents.

# importing pandas
import pandas as pd

# openning our csv file
df = pd.read_csv("AQ.csv")

# showing the content of our CSV file
df

Output:

![Table

Description automatically generated](https://lh5.googleusercontent.com/Yz_G_OM2Rz9P7lHSPdmlol4fA4V2lW6otWmBBYzxx6lhXtWajbMuszs6kcfNyF9grlVHudEm2oZNXRSvwnuEgF-3lZF8LhU_AxwMGVny-hVvCay3a68nWrRVSx0vf7Cn5a7JytoKTlmmJ6MDmjKu3-fmVqGwpIZjpnb8PPtfxeWeT5TyapaTS8ND0Q)

Splitting Data into Groups Using Pandas Groupby

In the result above, we can see that we have at least three categories by which we can group our data: “Year”, “Status” and “Continent”. To group our data by, for instance, the continent, we can type df.groupby('Continent') given that your DataFrame is called “df” and that the column is called “Continent”.

We are going to use the groupby method to group a DataFrame based on one, two, three, or more columns.

Creating a Group Based on One Column

Now let’s start with the simplest example – grouping data by one column.

# grouping data by one column
cont = df.groupby('Continent')

# printing the dataset after using 'groupby' by 'Continent'
print(cont)

Output:

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000018F426AC760>

The result shows us that we have successfully created a group of pandas objects, but, in general, it gives us nothing. In the output, we can see that whenever we try to use the print function to see our grouped data, it prints the object, not the data itself inside the object.

Let’s now look at the group we just created using the ‘group method’.

# printing the data from the object using groups method
print(cont.groups)

Output:

{'Africa': [1, 5], 'Asia': [4, 6], 'Central Asia': [3, 7, 9], 'South Asia': [0, 2, 8]}

Notice that in the output above we see the grouped data with indexes. Now, let’s show it in the table:

# showing the groups in the table using 'count' method
print(cont.count())

Output:

![A picture containing text

Description automatically generated](https://lh4.googleusercontent.com/qJufc7IcMLbMjKK5m6HORSXlwGehwK0cQRdyW63MdPLS3ZrKEeulzPQI6j1slMSehsYey_MW85cJLYXTbrXIkx3rMoSVuNVKHZNHxdA8EOL6QVA8-JwGX0LWx0qB8IeyRgHRsh3iKng6sak-Heh3tFZOAACBJnT2luWBi38gzmtAroJAO8KxmM1Shw)

This was all about how to group DataFrame based on one column. Let’s proceed with the grouping based on multiple columns.

Creating Groups Based on Two Columns

# grouping data by two columns 'Status' and 'Continent'
AQ = df.groupby(['Status','Continent'])

# printing grouped data
print(AQ.groups)

Output:

{('Moderate', 'Africa'): [5], ('Moderate', 'Asia'): [6], ('Moderate', 'Central Asia'): [7, 9], ('Moderate', 'South Asia'): [8], ('Poor', 'Africa'): [1], ('Poor', 'Asia'): [4], ('Poor', 'Central Asia'): [3], ('Poor', 'South Asia'): [0, 2]}

To show it in the table using count method like in the code below:

# showing the groups based on two columns in the table using 'count' method
print(AQ.count())

Output:

![Table

Description automatically generated](https://lh5.googleusercontent.com/POCM5YVdku0TeUk2KDEI6FpC-Fe8LLsjsVsdYC_8tJw4DrgYiR3-BI6ZkhT72UX9Kw5qQtOXYVMJx7gQ9TFXYL2sUfD4necpExGhPMR7Z8kky5gSio8Wam-J0pEhvr1EgJGxMPOBBM18_vl_n5ZxajANttdgv00qiBRZDETYkf_fwkLH2W2V4GHppA)

Grouping By Three Columns

Use the same idea for grouping by three columns and showing the grouped data in a table.

# grouping data by three columns 'Status','Continent' and 'Country'
AQ = df.groupby(['Status','Continent','Country'])

# printing grouped data using ‘count’ method
print(AQ.count())

Output:

![Text

Description automatically generated with medium confidence](https://lh5.googleusercontent.com/916l7Q2oJf8-MXVESkGxJmk-NoLgdtdkT9T4rMIBjvjzTHOuqdt2_9JWqk2rCVPgCA3h8rmxROssudhEKQHeqh1dEtKyPW30y1_Z_yz3dcjNnnIZncyJV6SJrPwME1kyKxKYX46RAUJSiPAteMP4ol-EOhLDciUFbkd9XqKF--dcREgo03gF54I_dw)

Getting a Specific Group

If we have multiple groups, we can select one of these groups as shown below.

# calling 'get_group' method to access a specific group from all the groups
groups.get_group('Asia')

Output:

![Table

Description automatically generated](https://lh3.googleusercontent.com/_eOPsoA_MkOI8O_UXcNdx4H6nhui1vw-27t91z8xS5jh210dzcvxWkB6X2nL7UnsXm9vdaKeGh9XP57yx5TA1KDmyKqW6kgufWVxvTTDJtc8JRW-uFAXO5yVNrUndmSF9z-kf9XPWbDRze3ksmQvXi2NFGomoHsI9JVlTU74ntrcAMAoTk9p6mUBHw)

Using get_group to multiple values:

# creating groups by two columns
groups = df.groupby(["Continent","Status"])

# calling 'get_group' method to access to two grouped columns values
asia = groups.get_group(("South Asia","Poor"))

# printing the result
print(asia)

Output:

![Text, letter

Description automatically generated](https://lh3.googleusercontent.com/Q2GSHQXuvvr_PQt2swGsjbW-jLZXSVF7r_o_qBX9VE9pyV-qWg4hx4rLA-FIjKOxDjESyaKbiHjqETy_4-3BQLjBWJu-NIHX-cpOsgs_SR2c0pvWqSgOfztia413rrfw5Koa1UVeGvl9e984l_XwwkwvV0wy8uCutPGQzWT-kMxldFUVwQW8GKoOMQ)

The results show us that we printed only countries with the “Poor” status from the “Continent” group’s column “South Asia”.

We can also apply different aggregations to a specific group using the same get_group method and any other aggregation functions.

# calling 'get_group' method to access a specific group from all the groups
asia = groups.get_group('Asia')

# applying mean() function to the specific column
mean = asia['Air Quality Index'].mean()

# pinting the mean
print("Air quality index's mean value of Asia is", mean)

Output:

Pandas Groupby Output Format

The output of groupby and aggregate operations differs between Pandas Series and Pandas DataFrames, which can be confusing.

Typically, if you compute the result with more than one column, your result will be a DataFrame. For single-column results, aggregation functions will generate a Series by default. We can change this setting by selecting a different operation column:

# producing Pandas Series
# using 'mean()' method to find the average air quality index for each continent
# rounding the mean using 'round()' function

round(df.groupby('Continent')['Air Quality Index'].mean(), 1)

Output:

![Table

Description automatically generated](https://lh4.googleusercontent.com/Zf7-Q6lBzpegUmhEqNFIV1r_gP-hraKbKYk3YcNJeC9FScnTkmUDIuXeczxzevZHT_kF9Y-ZnTKFJRMp1g4YJ3D5ZBkSHUpicyf0hZyIOo41FKNY2C1qZBz5zeZ5_bz_OWi8i4F5SvDB4eFiM7movzZkougKd8XbZKWpaI3TyB-PZIIDP2Bfy09oUQ)

Output in this case is in Pandas Series.

Now let’s try to get Pandas DataFrame as a result:

# producing Pandas DataFrame
# using 'mean()' method to find the average air quality index for each continent
# rounding the mean using 'round()' function

round(df.groupby('Continent')[['Air Quality Index']].mean(), 1)

Output:

![Table

Description automatically generated](https://lh5.googleusercontent.com/DMDBChdUI5oEd-Ew09Ou_RDPf3WB-KqFz5-idj_ft3E_ZXPGKiq6NAa3sEBX91Rvbnt0ZLJwOPoi7MVxRpldjW2YEJjRCSSkXmwCm0GymcF2pu3wxh-sV4xXTaTxShVTca_FmqWZEN8A8t_siKkI_-JUfHUrFqarSKRaIvEu9pZLcyyGbJM2y_bm0g)

This way we can get different output formats depending on our request.

Iterating Through Groups in Pandas Groupby

We already know how to use Pandas groupby function with one and multiple groups (columns), which is returning pandas object, which is iterable.

Now let’s iterate through the objects using the for loop and print the data.

# impoting pandas
import pandas as pd

# openning dataset
df = pd.read_csv("AQ.csv")

# creating group based on one column using pandas groupby method
groups = df.groupby("Continent")

# iterating over groups using for loop
for continent, value in groups:
    print(continent)
    print(value)

Output:

![A screenshot of a computer

Description automatically generated with low confidence](https://lh3.googleusercontent.com/VSCrtflKpURrnDz8FUPmDFeRmgoVamjzh9K-c1Mr37JW5flpp-vHMbgjTaA-0zTJtg3M7TemilTJySt8YveoKP07JCIfIqnbc6Uby_T7DuBePk2n4xkNw0iaQCVpUjBxV2d7LbhglkSCG3BUMEegI0N9WybdhWS_UtYoPG_4KOZ5-zp55bfzGhj8ZA)

Now let’s see what the result will be if we will iterate through two groups:

# creating group based on two columns using pandas groupby method
groups = df.groupby(["Continent","Status"])

# iterating over groups using for loop
for continent, value in groups:
    print(continent)
    print(value)

Output:

![Text

Description automatically generated](https://lh5.googleusercontent.com/tnkUa61gB04d8kRLQOv7AnXGd-F0vXyooHTcC1Tnm2TdqI-_C4tUupzSu4RUxcIFqCKJZiaf0mPWeleH2tYJ1RKlP2l6TeVQG8SB_6FerHrYeVw5K7iJ7BPoM82wKeegn3tr13bC4x3YP5R1ox-uZtUOW_QvoEJy3STha78RkYSAiCaqUfQ9-qwEpw)

In the same way, let’s try to use for loop for the three groups.

# creating group based on three columns using pandas groupby method
groups = df.groupby(["Continent","Country","Status"])

# iterating over groups using for loop
for continent, value in groups:
    print(continent)
    print(value)

Output:

![A close-up of a document

Description automatically generated with medium confidence](https://lh6.googleusercontent.com/c4Fy-2pK5NO0dx_9IET07AT4Jo1dOa2ibuOrYdkz-kDAIYpjkN2Cgzap4gASbSWgKVAiGhEjhL40qerbD6zSJteVQdJkIBdp6hrUoijYn01VdLtjlCRjnZCSW9ctm2-kx3jWtPfTq0xyEVWpBoyr3LRsfjcFfDZb_UujpwLx4tADpRvc31VqWpyAcA)

That was all about using for loop through one and multiple columns.

Finding the Percentage of Grouped Data

Moreover, we can use different functions over our grouped data. For example, let’s use a lambda function to see the percentage of the grouped data.

# creating groups by three columns

# applying 'size()' property to represent the number of elements in this object
groups = df.groupby(["Continent","Status"]).size()

# calculating the percentage on the level of status for each continent using lambda function
perc = groups.groupby(level=0).apply(lambda x:100*x/float(x.sum()))

# showing the result
print(perc)

Output:

![Table

Description automatically generated](https://lh3.googleusercontent.com/g0keTahtmN3mbWM5anK9ly0PamrNdGXTe3In64cRQ64ccq3_6PavfSytVHy7OgAWwNu7aA8_vBodWKwpkmsRH8LcwoStv1Rj-J_YdxY1uW94AQ7u1nm2KHbupBLndh_qg3DPPijYhgbtfcEw0LaLjJVsWVg_42p8IGxcZ4b0qvKImuREPCnsqHq4yg)

Different Built-in Functions

Up to this point, we used mean, count, and other functions, which are built-in Pandas aggregation, but they’re not the only ones.

The following table summarizes some other built-in Pandas aggregations:

AggregationDescription
count()Total number of items
first(), last()First and last item
mean(), median()Mean and median
min(), max()Minimum and maximum
std(), var()Standard deviation and variance
mad()Mean absolute deviation
prod()Product of all items
sum()Sum of all items

Take a look at this example:

# using describe function over each continents' air quality indexes

df.groupby('Continent')['Air Quality Index'].describe()

In this code, we’re using [‘Air Quality Index’] to specify over which group we want to apply a describe function.

Output:

![Table

Description automatically generated](https://lh5.googleusercontent.com/TNFzvs9FdIRNDqBHgHYUqZjDl5sAYXt_dMCKW6oQ-0R0_9-34J485HCvaMAKedch2n5BSSjb7zpz-trsT8w0t9YkxMqUn-QClYGELaDqTlCETF3wJFiajqiMCygBC2LghsJz3wd2fqiZigHBUPy0y8zzl3ObO1qty1R4km5ggYdvjloglFOwKXkKtA)

Before it, we mainly used built-in functions separately like mean(), sum(), etc. Still, groupby objects have aggregate(), filter(), transform(), and apply() methods that easily execute a bunch of useful operations before combining the grouped data.

Aggregation

The aggregate() method allows us to use several different aggregations and compute all of them at once.

# importing pandas 
import pandas as pd

# importing numpy
import numpy as np

# using multiple aggregations like min(),mean(),max()
agg = df.groupby(["Continent","Year"]).agg([np.min,np.mean,np.max])

# printing the result
print(agg)

Output:

![Table

Description automatically generated with medium confidence](https://lh5.googleusercontent.com/39zXn0s3u5DHU6vfKnViCkUdVZoAnSaBM4Y_jWw-d5dap_wOpphPhEqY36TcG56MINifV7PHxV96lCGXrAqb6RyJRF1y8C3mYkrOpnFHippdOUlbHSUp0iwhvGI7bea7A0CfrIu8ttGHboJ8EpuQhDDQhD9cyLEho96gkL_YWmWtrRXeBtvFBpeNmw)

In the output, we see that we implemented three functions at once to get each continent’s minimum, mean, and maximum values of the air quality index.

Filtering

The filter() operation allows us to discard data based on group properties. For example, we may want to keep only one needed group with some needed properties.

# importing pandas 
import pandas as pd

# reading dataset
df = pd.read_csv("AQ.csv")

# applying filter() method to our dataset
groups = df.filter(['Continent', 'Air Quality Index'])

# grouping by continents
groups.groupby("Continent").mean()

Output:

![Table

Description automatically generated](https://lh6.googleusercontent.com/jNDpfxSM8DjCWq-0CxxdlgIu7Hjeo2MpNY4vWu6qGSY9UfvcBp7OzZEVYZIPuHFyjk9grSZgtilzrh1m5TfIMSTrLAUg-cm2Xb7I9m5o0tFXK9mdczlJhjfOGnHYkNYL3eHbwxI_ZIBJDvPjLsAKOZJt80p3c_A2xV5Ucswa0IuYOr6evRRGmFw08w)

Transformation

The transform() function returns a transformed version of the entire data for the recombination. The output result has the same form as the input before.

# defining function of square 
def function(x):
    return x**2

# getting air quality index from a data frame
AQI = df['Air Quality Index']

# using transfor function over air quality index
print(AQI.transform(function))

Output:

![Table

Description automatically generated with medium confidence](https://lh5.googleusercontent.com/jn8KseUc3jUCrxBbxjhxxa_YyqZBP-0l_v0-EbmIqMQASWdVcX_zVkCkHSe7zbJarMf-ilfJwEaU35MV7lEBgTeuXosuSFSJmyCfteMfIPC52PTrITr61Ikj0cOdUMQWdzZ7lbEGwfjc7P9cGKYkIVSbX7UCw2h4O7S5qglKJg463-EZT4QKoev5hg)

Apply

Apply() function allows us to apply various other functions to the groups.

# applying lambda function 
AQI = df.groupby(['Continent'])['Air Quality Index'].apply(lambda x: x - x.min())

# printing the result
print(AQI)

Output:

![A picture containing table

Description automatically generated](https://lh3.googleusercontent.com/jG3wLGuvSWGkZBv3XVoY5Z4COnTeQZUWWMtTnZdeFG9HjDJRtmwF6qFtigwiAtITtMe8YZFs-KmeMYUbxyVME0ktUA2jOJIT7M_PDGoQDDlubFOp10njqYhh9CUbTWM44ldxnccn3N1UDaPRukfiNoDUj240UtinZuOK7J3_D-LHN27EadetA87TAg)

Saving Grouped Pandas DataFrame in Different Formats

In the last section of this tutorial, we’re going to learn how to convert our groupby to CSV and Excel files. However, do not forget that we must reset our indexes for both types of data savings.

Saving a Pandas DataFrame As a CSV File

To save our data as a CSV file, we will use to_csv method. See the example below:

# importing pandas
import pandas as pd

# reading our datset
df = pd.read_csv("AQ.csv")

# grouping by the continent
groups = df.groupby(['Continent'])

# resetting index and finding size of the elements in each continent
my_df = groups.size().reset_index()

# converting groupby to csv file using to_csv function
my_df.to_csv("grouped_file.csv", index = True)

# checking saved file 
check = pd.read_csv("grouped_file.csv")
print(check)

Output:

![Table

Description automatically generated](https://lh6.googleusercontent.com/6U4XJjNFbdSepLZnaq9jxrOcs0eAL1uImW9nFUlWcL_W_5mfy1hgZAoqr6Q2cl2pRSYs8E4lFRKeszS1Ki5mzhr0V95wGorSsJJ9eIoNs-L6MFsd8ClbAhn5FXhR0zeiDSe-tJZ1DRFDjuCpKXsrc4mskdCxtHFqtgLvmhIvcQYP7m0PPec9D-B7dQ)

Saving Pandas DataFrame as an Excel file

To save our data as an Excel file we will use to_excel method.

See the example below:

# importing pandas
import pandas as pd

# reading our datset
df = pd.read_csv("AQ.csv")

# grouping by the continent
groups = df.groupby(['Continent'])

# resetting index and finding size of the elements in each continent
my_df = groups.size().reset_index()

# converting groupby to excel file using to_xlsx function
my_df.to_excel("grouped_file.xlsx", index = False)

# checking saved file
check = pd.read_csv("grouped_file.csv")
print(check)

Output:

![Table

Description automatically generated](https://lh4.googleusercontent.com/ztJfBT2aEreLPI0hMV41wukzsQymYh8R5RR-YuOKLIpqC4r-3_7EfkiTby4keilRqJhd9mbwgP7lsf-ZypAUyX2a1jF3oAcwHm7rgar7WrQe9Oarf5YYPDlu1Ko71b79SqMwb1PHOuPlej6_VZ2AYgHy8cpzG2xzcIz_nT3WbkzRQ_j5ZBFs3Alprw)

Summary

In this tutorial, we looked at the groupby function, which divides our data into different groups by category. The result of the groupby function is iterable, indicating that we can iterate the groups independently, using various functions at each step of the process. We’ve looked at how to select a particular group and how to work with them. We also learned how to apply different aggregations to one or more groups and finally learned how to save our data as CSV or Excel files.

You May Also Enjoy