When working with data in Python, pandas is a must-know library. It offers a suite of powerful data manipulation tools, including the ability to drop rows in a DataFrame, which is essential for any data cleaning process. In this article, we'll delve into various ways to drop rows in pandas DataFrame, helping you to clean, prepare, and make your data more manageable and efficient for analysis.
- Basic Drop Method: This involves using the
DataFrame.drop()
method to remove rows based on their index. It's the most straightforward way to remove specific rows from your DataFrame. - Conditional Drop: Sometimes, you might want to remove rows based on a certain condition. pandas allows this through a combination of Boolean indexing and the drop method.
- Handling Missing Data: The
DataFrame.dropna()
method comes in handy to remove rows with missing or NaN values, a common issue in many real-world datasets. - Dropping Duplicates: pandas'
DataFrame.drop_duplicates()
method allows you to efficiently remove duplicate rows based on identical values in one or more columns. - Dropping by Index Range: This involves removing a range of rows based on their index values, which can be achieved using slicing and the drop method.
- Inplace Dropping: The
inplace
parameter in various drop methods allows you to alter the original DataFrame directly, without creating a new one. - Dropping by Column's Datatype: Sometimes, it might be necessary to remove rows based on the datatype of a specific column. pandas allows for this with a combination of the drop method and DataFrame dtype property.
Create pandas DataFrame with example data
DataFrame is a data structure used to store the data in two dimensional format. It is similar to table that stores the data in rows and columns. Rows represents the records/ tuples and columns refers to the attributes.
We can create the DataFrame by using pandas.DataFrame() method.
Syntax:
pandas.DataFrame(input_data,columns,index)
Parameters:
It will take mainly three parameters
- input_data is represents a list of data
columns
represent the columns names for the dataindex
represent the row numbers/values
We can also create a DataFrame using dictionary by skipping columns and indices.
Let’s see an example.
Example:
Python Program to create a dataframe for market data from a dictionary of food items by specifying the column names.
import pandas as pd
import numpy as np
data = {
'name': ['John', 'Anna', 'Peter', 'Linda', 'James'],
'age': [28, 22, np.nan, 45, 30],
'city': ['New York', 'Los Angeles', 'Chicago', np.nan, 'Houston']
}
df = pd.DataFrame(data)
Our DataFrame df
looks like this:
name age city
0 John 28.0 New York
1 Anna 22.0 Los Angeles
2 Peter NaN Chicago
3 Linda 45.0 NaN
4 James 30.0 Houston
1. Basic Drop Method
To drop a row from a DataFrame, we use the drop()
function and pass in the index of the row we want to remove.
df.drop([1]) # Drop the row with index 1
This will output:
name age city
0 John 28.0 New York
2 Peter NaN Chicago
3 Linda 45.0 NaN
4 James 30.0 Houston
Now let's drop the row with index label 2:
df.drop(2, inplace=True)
Our DataFrame df
now looks like this:
name age city
0 John 28.0 New York
1 Anna 22.0 Los Angeles
3 Linda 45.0 NaN
4 James 30.0 Houston
Now, let's drop the rows with index labels 1 and 4:
df.drop([1, 4], inplace=True)
Our DataFrame df
now looks like this:
name age city
0 John 28.0 New York
3 Linda 45.0 NaN
2. Dropping rows with specific conditions
If you want to drop rows based on certain conditions, you can use Boolean indexing. For instance, let's drop all rows where age
is less than 30.
df.drop(df[df['age'] < 30].index)
This will output:
name age city
3 Linda 45.0 NaN
4 James 30.0 Houston
Let's say we want to drop all rows where 'age' is NaN or 'city' is 'Los Angeles'. We could do:
df_dropped = df.drop(df[(df['age'].isna()) | (df['city'] == 'Los Angeles')].index)
Here, (df['age'].isna()) | (df['city'] == 'Los Angeles')
is a Boolean condition that checks if the 'age' is NaN or the 'city' is 'Los Angeles'. The drop operation happens on the DataFrame indices that satisfy this condition. The resulting DataFrame df_dropped
will be:
name age city
0 John 28.0 New York
3 Linda 45.0 NaN
4 James 30.0 Houston
Now assuming we want to drop the rows at positions 1 and 3:
df.drop(df.index[[1, 3]], inplace=True)
Our DataFrame df
now looks like this:
name age city
0 John 28.0 New York
2 Peter NaN Chicago
4 James 30.0 Houston
3. Dropping rows with missing data
The dropna()
function can be used to drop rows that contain NaN
values. For instance:
df.dropna() # Drop rows with NaN values
This will output:
name age city
0 John 28.0 New York
1 Anna 22.0 Los Angeles
4 James 30.0 Houston
4. Dropping Rows Based on Duplicate Values
Suppose we've mistakenly added a duplicate entry to our DataFrame:
import pandas as pd
import numpy as np
data = {
'name': ['John', 'Anna', 'Peter', 'Linda', 'James', 'Anna'],
'age': [28, 22, np.nan, 45, 30, 22],
'city': ['New York', 'Los Angeles', 'Chicago', np.nan, 'Houston', 'Los Angeles']
}
df = pd.DataFrame(data)
Our DataFrame df
looks like this:
name age city
0 John 28.0 New York
1 Anna 22.0 Los Angeles
2 Peter NaN Chicago
3 Linda 45.0 NaN
4 James 30.0 Houston
5 Anna 22.0 Los Angeles
We can see that the last entry is a duplicate of the second entry. We can remove this using the drop_duplicates()
function:
df.drop_duplicates(inplace=True)
Our DataFrame df
now looks like this:
name age city
0 John 28.0 New York
1 Anna 22.0 Los Angeles
2 Peter NaN Chicago
3 Linda 45.0 NaN
4 James 30.0 Houston
The duplicate row is gone.
5. Dropping Rows by Index Range
Let's say we want to drop all rows between indices 1 and 3 inclusive. We can do this with the drop()
function and the range()
function to generate the indices:
df.drop(range(1, 4), inplace=True)
Our DataFrame df
now looks like this:
name age city
0 John 28.0 New York
4 James 30.0 Houston
Rows with indices 1, 2, and 3 have been dropped.
To drop rows by position, we first need to reset the index:
df.reset_index(drop=True, inplace=True)
Our DataFrame df
now looks like this:
name age city
0 John 28.0 New York
1 Linda 45.0 NaN
Now we can drop the row at position 1:
df.drop(df.index[1], inplace=True)
Our DataFrame df
now looks like this:
name age city
0 John 28.0 New York
6. Inplace Dropping
The inplace
parameter is used in many pandas functions to determine whether the operation should be performed on the DataFrame and the result returned (inplace=False
) or if the operation should be performed in-place and nothing returned (inplace=True
).
For instance, if we want to drop the row with index 4 from df
and want this change to reflect in df
itself, we can use inplace=True
:
df.drop(4, inplace=True)
After this operation, our DataFrame df
is:
name age city
0 John 28.0 New York
With inplace=True
, the drop()
function directly modifies df
and doesn't return anything. If we used inplace=False
(or just omitted it, as False
is the default value), df
would be unchanged and the function would return a new DataFrame where the row has been dropped.
7. Dropping rows based on a column's datatype
In some cases, you might need to drop rows based on the datatype of a certain column. This is a bit more complex, because pandas doesn't provide a built-in way to do this. However, you can achieve this by combining a few pandas techniques.
Suppose we have the following DataFrame:
df = pd.DataFrame({
'name': ['John', 'Anna', 'Peter', 'Linda', 'James'],
'age': [28, '22', np.nan, 45, 30],
})
Notice that one of the 'age' entries is a string, and we want to keep only the rows where 'age' is a float. We can do this as follows:
df = df[df['age'].map(lambda x: isinstance(x, float))]
Here, we're using the map()
function to apply a lambda function to each value in the 'age' column. The lambda function checks if the value is an instance of float
, and returns a Boolean Series. This Series is used to index the DataFrame, effectively dropping the rows where 'age' is not a float.
Performance Considerations
When dropping rows, especially in large DataFrames, performance can be a concern. The time complexity of dropping rows in pandas is roughly O(N), where N is the number of rows in the DataFrame. This is because dropping a row involves creating a new DataFrame and copying over all the data from the old DataFrame, excluding the dropped rows.
Therefore, if you're working with a large DataFrame and want to exclude a small number of rows, dropping those rows is fine. But if you want to exclude a large portion of your DataFrame, it may be faster to create a new DataFrame with just the rows you want to keep. This is called "filtering" the DataFrame.
Error Handling
If you try to drop a row that doesn't exist (i.e., a row with an index that's not in the DataFrame), pandas will raise a KeyError
. To avoid this, you can use the errors
parameter in the drop()
function:
df.drop([1000, 1001], errors='ignore')
In this example, if the rows with indices 1000 and 1001 don't exist, pandas will ignore the error and return the DataFrame without those rows. If those rows do exist, it will drop them.
Subsetting vs Dropping
"Dropping" rows and "subsetting" a DataFrame are two sides of the same coin - both involve creating a new DataFrame with a reduced set of rows. The difference lies in the perspective:
- When we "drop" rows, we're starting with a DataFrame and removing certain rows from it.
- When we "subset" a DataFrame, we're starting with a set of criteria and creating a new DataFrame that meets these criteria.
In terms of code, these two operations are often identical. For example, df.drop(df[df['age'] < 30].index)
and df[df['age'] >= 30]
do the same thing: they both return a new DataFrame without the rows where 'age' is less than 30.
The key difference is in the intent: if you're thinking about the rows you want to remove, you're "dropping". If you're thinking about the rows you want to keep, you're "subsetting".
Summary
Dropping rows from a DataFrame is a crucial part of data preprocessing in Python's pandas library. There are several ways to drop rows based on various conditions:
- Basic Drop Method: This method allows you to drop a single or multiple rows in a DataFrame using the row index label. The
drop()
function is used, where the argument is the index label or a list of index labels. - Dropping Rows with Specific Conditions: You can drop rows based on certain conditions applied to the columns of the DataFrame. This involves creating a Boolean mask, where
True
indicates the rows you want to keep, andFalse
the rows to drop. - Dropping Rows with Missing Data: The
dropna()
function is used to drop all rows that containNaN
values. If you only want to drop rows where specific columns haveNaN
values, you can provide these column names as an argument. - Dropping Rows Based on Duplicate Values: The
drop_duplicates()
function allows you to drop all rows that have identical values in all columns or a subset of columns. - Dropping Rows by Index Range: The
drop()
function is also used to drop a range of rows based on their index values. - Inplace Dropping: The
inplace
parameter in DataFrame drop functions allows you to drop rows in the original DataFrame without returning a new one. - Dropping Rows Based on a Column's Datatype: This involves using the
map()
function to create a Boolean Series indicating which rows to keep, based on the datatype of a certain column. - Performance Considerations: Dropping rows in pandas involves creating a new DataFrame, which can be time-consuming for large DataFrames. In cases where a large portion of the DataFrame is to be excluded, it's better to filter the DataFrame instead of dropping rows.
- Error Handling: If you try to drop a row that doesn't exist, pandas will raise a
KeyError
. You can handle this by using theerrors
parameter in thedrop()
function. - Subsetting vs Dropping: Subsetting a DataFrame and dropping rows from it are essentially the same operation, but viewed from different perspectives. If you're thinking about the rows you want to remove, you're "dropping". If you're thinking about the rows you want to keep, you're "subsetting".
Further Reading