How to drop rows in pandas DataFrame [Practical Examples]


Written by - Deepak Prasad

When working with data in Python, pandas is a must-know library. It offers a suite of powerful data manipulation tools, including the ability to drop rows in a DataFrame, which is essential for any data cleaning process. In this article, we'll delve into various ways to drop rows in pandas DataFrame, helping you to clean, prepare, and make your data more manageable and efficient for analysis.

  1. Basic Drop Method: This involves using the DataFrame.drop() method to remove rows based on their index. It's the most straightforward way to remove specific rows from your DataFrame.
  2. Conditional Drop: Sometimes, you might want to remove rows based on a certain condition. pandas allows this through a combination of Boolean indexing and the drop method.
  3. Handling Missing Data: The DataFrame.dropna() method comes in handy to remove rows with missing or NaN values, a common issue in many real-world datasets.
  4. Dropping Duplicates: pandas' DataFrame.drop_duplicates() method allows you to efficiently remove duplicate rows based on identical values in one or more columns.
  5. Dropping by Index Range: This involves removing a range of rows based on their index values, which can be achieved using slicing and the drop method.
  6. Inplace Dropping: The inplace parameter in various drop methods allows you to alter the original DataFrame directly, without creating a new one.
  7. Dropping by Column's Datatype: Sometimes, it might be necessary to remove rows based on the datatype of a specific column. pandas allows for this with a combination of the drop method and DataFrame dtype property.

 

Create pandas DataFrame with example data

DataFrame is a data structure used to store the data in two dimensional format. It is similar to table that stores the data in rows and columns. Rows represents the records/ tuples and columns refers to the attributes.

We can create the DataFrame by using pandas.DataFrame() method.

Syntax:

pandas.DataFrame(input_data,columns,index)

Parameters:

It will take mainly three parameters

  1. input_data is represents a list of data
  2. columns represent the columns names for the data
  3. index represent the row numbers/values

We can also create a DataFrame using dictionary by skipping columns and indices.

Let’s see an example.

Example:

Python Program to create a dataframe for market data from a dictionary of food items by specifying the column names.

import pandas as pd
import numpy as np

data = {
    'name': ['John', 'Anna', 'Peter', 'Linda', 'James'],
    'age': [28, 22, np.nan, 45, 30],
    'city': ['New York', 'Los Angeles', 'Chicago', np.nan, 'Houston']
}

df = pd.DataFrame(data)

Our DataFrame df looks like this:

    name   age         city
0   John  28.0     New York
1   Anna  22.0  Los Angeles
2  Peter   NaN      Chicago
3  Linda  45.0          NaN
4  James  30.0      Houston

 

1. Basic Drop Method

To drop a row from a DataFrame, we use the drop() function and pass in the index of the row we want to remove.

df.drop([1])  # Drop the row with index 1

This will output:

    name   age      city
0   John  28.0  New York
2  Peter   NaN   Chicago
3  Linda  45.0       NaN
4  James  30.0   Houston

Now let's drop the row with index label 2:

df.drop(2, inplace=True)

Our DataFrame df now looks like this:

   name   age         city
0  John  28.0     New York
1  Anna  22.0  Los Angeles
3 Linda  45.0          NaN
4 James  30.0      Houston

Now, let's drop the rows with index labels 1 and 4:

df.drop([1, 4], inplace=True)

Our DataFrame df now looks like this:

   name   age      city
0  John  28.0  New York
3 Linda  45.0       NaN

 

2. Dropping rows with specific conditions

If you want to drop rows based on certain conditions, you can use Boolean indexing. For instance, let's drop all rows where age is less than 30.

df.drop(df[df['age'] < 30].index)

This will output:

    name   age     city
3  Linda  45.0      NaN
4  James  30.0  Houston

Let's say we want to drop all rows where 'age' is NaN or 'city' is 'Los Angeles'. We could do:

df_dropped = df.drop(df[(df['age'].isna()) | (df['city'] == 'Los Angeles')].index)

Here, (df['age'].isna()) | (df['city'] == 'Los Angeles') is a Boolean condition that checks if the 'age' is NaN or the 'city' is 'Los Angeles'. The drop operation happens on the DataFrame indices that satisfy this condition. The resulting DataFrame df_dropped will be:

   name   age      city
0  John  28.0  New York
3 Linda  45.0       NaN
4 James  30.0   Houston

Now assuming we want to drop the rows at positions 1 and 3:

df.drop(df.index[[1, 3]], inplace=True)

Our DataFrame df now looks like this:

    name   age      city
0   John  28.0  New York
2  Peter   NaN   Chicago
4  James  30.0   Houston

 

3. Dropping rows with missing data

The dropna() function can be used to drop rows that contain NaN values. For instance:

df.dropna()  # Drop rows with NaN values

This will output:

   name   age         city
0  John  28.0     New York
1  Anna  22.0  Los Angeles
4 James  30.0      Houston

 

4. Dropping Rows Based on Duplicate Values

Suppose we've mistakenly added a duplicate entry to our DataFrame:

import pandas as pd
import numpy as np

data = {
    'name': ['John', 'Anna', 'Peter', 'Linda', 'James', 'Anna'],
    'age': [28, 22, np.nan, 45, 30, 22],
    'city': ['New York', 'Los Angeles', 'Chicago', np.nan, 'Houston', 'Los Angeles']
}

df = pd.DataFrame(data)

Our DataFrame df looks like this:

    name   age         city
0   John  28.0     New York
1   Anna  22.0  Los Angeles
2  Peter   NaN      Chicago
3  Linda  45.0          NaN
4  James  30.0      Houston
5   Anna  22.0  Los Angeles

We can see that the last entry is a duplicate of the second entry. We can remove this using the drop_duplicates() function:

df.drop_duplicates(inplace=True)

Our DataFrame df now looks like this:

    name   age         city
0   John  28.0     New York
1   Anna  22.0  Los Angeles
2  Peter   NaN      Chicago
3  Linda  45.0          NaN
4  James  30.0      Houston

The duplicate row is gone.

 

5. Dropping Rows by Index Range

Let's say we want to drop all rows between indices 1 and 3 inclusive. We can do this with the drop() function and the range() function to generate the indices:

df.drop(range(1, 4), inplace=True)

Our DataFrame df now looks like this:

   name   age      city
0  John  28.0  New York
4 James  30.0   Houston

Rows with indices 1, 2, and 3 have been dropped.

To drop rows by position, we first need to reset the index:

df.reset_index(drop=True, inplace=True)

Our DataFrame df now looks like this:

   name   age      city
0  John  28.0  New York
1 Linda  45.0       NaN

Now we can drop the row at position 1:

df.drop(df.index[1], inplace=True)

Our DataFrame df now looks like this:

   name   age      city
0  John  28.0  New York

 

6. Inplace Dropping

The inplace parameter is used in many pandas functions to determine whether the operation should be performed on the DataFrame and the result returned (inplace=False) or if the operation should be performed in-place and nothing returned (inplace=True).

For instance, if we want to drop the row with index 4 from df and want this change to reflect in df itself, we can use inplace=True:

df.drop(4, inplace=True)

After this operation, our DataFrame df is:

   name   age      city
0  John  28.0  New York

With inplace=True, the drop() function directly modifies df and doesn't return anything. If we used inplace=False (or just omitted it, as False is the default value), df would be unchanged and the function would return a new DataFrame where the row has been dropped.

 

7. Dropping rows based on a column's datatype

In some cases, you might need to drop rows based on the datatype of a certain column. This is a bit more complex, because pandas doesn't provide a built-in way to do this. However, you can achieve this by combining a few pandas techniques.

Suppose we have the following DataFrame:

df = pd.DataFrame({
    'name': ['John', 'Anna', 'Peter', 'Linda', 'James'],
    'age': [28, '22', np.nan, 45, 30],
})

Notice that one of the 'age' entries is a string, and we want to keep only the rows where 'age' is a float. We can do this as follows:

df = df[df['age'].map(lambda x: isinstance(x, float))]

Here, we're using the map() function to apply a lambda function to each value in the 'age' column. The lambda function checks if the value is an instance of float, and returns a Boolean Series. This Series is used to index the DataFrame, effectively dropping the rows where 'age' is not a float.

 

Performance Considerations

When dropping rows, especially in large DataFrames, performance can be a concern. The time complexity of dropping rows in pandas is roughly O(N), where N is the number of rows in the DataFrame. This is because dropping a row involves creating a new DataFrame and copying over all the data from the old DataFrame, excluding the dropped rows.

Therefore, if you're working with a large DataFrame and want to exclude a small number of rows, dropping those rows is fine. But if you want to exclude a large portion of your DataFrame, it may be faster to create a new DataFrame with just the rows you want to keep. This is called "filtering" the DataFrame.

 

Error Handling

If you try to drop a row that doesn't exist (i.e., a row with an index that's not in the DataFrame), pandas will raise a KeyError. To avoid this, you can use the errors parameter in the drop() function:

df.drop([1000, 1001], errors='ignore')

In this example, if the rows with indices 1000 and 1001 don't exist, pandas will ignore the error and return the DataFrame without those rows. If those rows do exist, it will drop them.

 

Subsetting vs Dropping

"Dropping" rows and "subsetting" a DataFrame are two sides of the same coin - both involve creating a new DataFrame with a reduced set of rows. The difference lies in the perspective:

  • When we "drop" rows, we're starting with a DataFrame and removing certain rows from it.
  • When we "subset" a DataFrame, we're starting with a set of criteria and creating a new DataFrame that meets these criteria.

In terms of code, these two operations are often identical. For example, df.drop(df[df['age'] < 30].index) and df[df['age'] >= 30] do the same thing: they both return a new DataFrame without the rows where 'age' is less than 30.

The key difference is in the intent: if you're thinking about the rows you want to remove, you're "dropping". If you're thinking about the rows you want to keep, you're "subsetting".

 

Summary

Dropping rows from a DataFrame is a crucial part of data preprocessing in Python's pandas library. There are several ways to drop rows based on various conditions:

  1. Basic Drop Method: This method allows you to drop a single or multiple rows in a DataFrame using the row index label. The drop() function is used, where the argument is the index label or a list of index labels.
  2. Dropping Rows with Specific Conditions: You can drop rows based on certain conditions applied to the columns of the DataFrame. This involves creating a Boolean mask, where True indicates the rows you want to keep, and False the rows to drop.
  3. Dropping Rows with Missing Data: The dropna() function is used to drop all rows that contain NaN values. If you only want to drop rows where specific columns have NaN values, you can provide these column names as an argument.
  4. Dropping Rows Based on Duplicate Values: The drop_duplicates() function allows you to drop all rows that have identical values in all columns or a subset of columns.
  5. Dropping Rows by Index Range: The drop() function is also used to drop a range of rows based on their index values.
  6. Inplace Dropping: The inplace parameter in DataFrame drop functions allows you to drop rows in the original DataFrame without returning a new one.
  7. Dropping Rows Based on a Column's Datatype: This involves using the map() function to create a Boolean Series indicating which rows to keep, based on the datatype of a certain column.
  8. Performance Considerations: Dropping rows in pandas involves creating a new DataFrame, which can be time-consuming for large DataFrames. In cases where a large portion of the DataFrame is to be excluded, it's better to filter the DataFrame instead of dropping rows.
  9. Error Handling: If you try to drop a row that doesn't exist, pandas will raise a KeyError. You can handle this by using the errors parameter in the drop() function.
  10. Subsetting vs Dropping: Subsetting a DataFrame and dropping rows from it are essentially the same operation, but viewed from different perspectives. If you're thinking about the rows you want to remove, you're "dropping". If you're thinking about the rows you want to keep, you're "subsetting".

 

Further Reading

Pandas - drop()

 

Deepak Prasad

He is the founder of GoLinuxCloud and brings over a decade of expertise in Linux, Python, Go, Laravel, DevOps, Kubernetes, Git, Shell scripting, OpenShift, AWS, Networking, and Security. With extensive experience, he excels in various domains, from development to DevOps, Networking, and Security, ensuring robust and efficient solutions for diverse projects. You can reach out to him on his LinkedIn profile or join on Facebook page.

Can't find what you're searching for? Let us assist you.

Enter your query below, and we'll provide instant results tailored to your needs.

If my articles on GoLinuxCloud has helped you, kindly consider buying me a coffee as a token of appreciation.

Buy GoLinuxCloud a Coffee

For any other feedbacks or questions you can send mail to admin@golinuxcloud.com

Thank You for your support!!

Leave a Comment