Pandas concat() Function Tutorial [Practical Examples]


Python Pandas

The pandas.concat() function is a part of the Pandas library in Python, and it is used for concatenating two or more Pandas objects along a particular axis, either row-wise (axis=0) or column-wise (axis=1). This function is extremely useful when you have data spread across multiple tables, files, or arrays and you want to combine them into a single DataFrame for easier analysis and manipulation. The concat function is quite flexible, allowing for different types of set logic for indexes and relational algebra functionality between the objects you want to concatenate.

 

1. Pandas concat() Syntax Explained

The general syntax for pandas.concat() is:

pandas.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=True)

Here's a brief overview of some key parameters:

  • objs: A sequence or mapping of Series or DataFrame objects to be concatenated. This is the primary argument, and the objects you want to concatenate must be passed here.
  • axis: Specifies the axis along which the objects will be concatenated. axis=0 indicates row-wise concatenation, while axis=1 indicates column-wise concatenation.
  • join: Indicates the type of join operation to perform. You can choose between 'inner' and 'outer' joins.
  • ignore_index: When set to True, this will reset the index for the resulting DataFrame. This is useful if you don't want to keep the index labels from the original DataFrames.
  • keys: Allows you to create hierarchical index levels when you concatenate the DataFrames.
  • verify_integrity: If set to True, this will check for duplicate index values and will raise an exception if found.
  • sort: Sorts the result DataFrame's index if set to True.

An example to concatenate two DataFrames row-wise:

import pandas as pd

# Create two DataFrames
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2'],
                    'C': ['C0', 'C1', 'C2']},
                    index=[0, 1, 2])

df2 = pd.DataFrame({'A': ['A3', 'A4', 'A5'],
                    'B': ['B3', 'B4', 'B5'],
                    'C': ['C3', 'C4', 'C5']},
                    index=[3, 4, 5])

# Concatenate DataFrames
result = pd.concat([df1, df2])

In the example above, result will be a DataFrame containing rows from both df1 and df2, concatenated along the rows (axis=0 by default).

 

1.1 Example - Row-wise Concatenation (Axis=0)

Row-wise concatenation in Pandas is done using the axis=0 parameter. This concatenates two or more DataFrames along the rows, stacking them on top of each other. Here's a simple example to demonstrate this:

# Import the Pandas library
import pandas as pd

# Create two simple DataFrames
df1 = pd.DataFrame({'A': [1, 2, 3],
                    'B': ['a', 'b', 'c']})

df2 = pd.DataFrame({'A': [4, 5, 6],
                    'B': ['d', 'e', 'f']})

# Perform row-wise concatenation
result = pd.concat([df1, df2], axis=0)

# Display the resulting DataFrame
print(result)

The output of this code will be:

   A  B
0  1  a
1  2  b
2  3  c
0  4  d
1  5  e
2  6  f

Explanation of Result

As you can see, the DataFrames df1 and df2 are stacked on top of each other, and the resulting DataFrame result contains all the rows from both. Note the following:

  • Index Labels: The index labels are retained from the original DataFrames. If you want to ignore these and have a new continuous index, you can set the ignore_index parameter to True.
  • Column Matching: The columns from each DataFrame ('A' and 'B' in this example) match, so there are no NaN values. If the columns didn't match, the missing ones would be filled with NaN in the resulting DataFrame.
  • Order of DataFrames: The order in which DataFrames are listed in the pd.concat function determines the order of the rows in the concatenated DataFrame. In the example, rows from df1 appear first, followed by rows from df2.

 

1.2 Example - Column-wise Concatenation (Axis=1)

Column-wise concatenation in Pandas is done using the axis=1 parameter. This stacks DataFrames side by side, appending the columns of one DataFrame next to another. Here's a simple example:

# Import the Pandas library
import pandas as pd

# Create two simple DataFrames
df1 = pd.DataFrame({'A': [1, 2, 3],
                    'B': ['a', 'b', 'c']})

df2 = pd.DataFrame({'C': [4, 5, 6],
                    'D': ['d', 'e', 'f']})

# Perform column-wise concatenation
result = pd.concat([df1, df2], axis=1)

# Display the resulting DataFrame
print(result)

The output will look like this:

   A  B  C  D
0  1  a  4  d
1  2  b  5  e
2  3  c  6  f

Explanation of Result

The DataFrames df1 and df2 are concatenated side-by-side, resulting in a new DataFrame result that includes all the columns from both. Let's discuss some of the details:

  • Index Matching: By default, pandas.concat aligns the DataFrames based on their index labels. If the index labels in the DataFrames match, as they do in this example, the rows will be correctly aligned.
  • New Columns: All the columns from df1 ('A' and 'B') appear first in result, followed by all the columns from df2 ('C' and 'D'). The order depends on the sequence of DataFrames in the pd.concat function.
  • No NaN Values: Since both df1 and df2 have matching indices and cover all the rows, there are no NaN (Not a Number) values in result.

 

2. Important Parameters in Pandas concat()

The pandas.concat function offers several parameters that provide you with fine-grained control over how DataFrames are concatenated. Let's delve into some of these important parameters, complete with examples.

1. objs: Specifying the DataFrames

The objs parameter is where you specify the sequence of DataFrame or Series objects that you want to concatenate.

Example:

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'A': [3, 4]})

result = pd.concat(objs=[df1, df2])

# Output: A DataFrame with values from df1 and df2

2. axis: Control over Rows or Columns

The axis parameter determines whether the concatenation is row-wise (axis=0) or column-wise (axis=1).

Example:

Row-wise:

result_row = pd.concat([df1, df2], axis=0)

Column-wise:

result_column = pd.concat([df1, df2], axis=1)

3. join: Inner vs Outer Joins

The join parameter allows you to specify how to handle overlapping column names. You can opt for an 'outer' join, which includes all columns, or an 'inner' join, which includes only overlapping columns.

Example:

df3 = pd.DataFrame({'B': [1, 2]})

# Inner Join
result_inner = pd.concat([df1, df3], axis=1, join='inner')

# Outer Join
result_outer = pd.concat([df1, df3], axis=1, join='outer')

4. ignore_index: Re-indexing

When set to True, the ignore_index parameter will reset the index of the resulting DataFrame. This is often useful when the index doesn't contain meaningful data.

Example:

result_ignore_index = pd.concat([df1, df2], ignore_index=True)

5. keys: Multi-level Indexing

The keys parameter allows you to create a hierarchical index (multi-level index) by providing an array or sequence of keys.

Example:

result_keys = pd.concat([df1, df2], keys=['dataset_1', 'dataset_2'])

6. sort: Sorting Index

The sort parameter, when set to True, will sort the columns or index of the resulting DataFrame.

Example:

result_sort = pd.concat([df1, df3], sort=True)

 

3. Handling Indexes in Concatenation

When using pandas concat to concatenate DataFrames, managing indexes can be crucial for data integrity and further data manipulations. There are several options to handle indexes, which include retaining indexes, ignoring indexes, and creating multi-level indexes.

1. Retaining Indexes

By default, pandas concat retains the index labels from the original DataFrames when performing the concatenation. This is most useful when the indexes have meaningful data.

Example:

import pandas as pd

# Create two DataFrames with index labels
df1 = pd.DataFrame({'A': ['A0', 'A1']}, index=['a', 'b'])
df2 = pd.DataFrame({'A': ['A2', 'A3']}, index=['c', 'd'])

# Retaining Indexes
result_retain = pd.concat([df1, df2])

# Output retains the index labels from df1 and df2
print(result_retain)

2. Ignoring Indexes

If you want to ignore the index labels from the original DataFrames and instead want to use a new continuous index, you can set the ignore_index parameter to True.

Example:

# Ignoring Indexes
result_ignore = pd.concat([df1, df2], ignore_index=True)

# Output will have a new continuous index
print(result_ignore)

3. Creating Multi-level Indexes

When you're dealing with multiple DataFrames and you want to remember which row came from which DataFrame, you can create multi-level indexes using the keys parameter.

Example:

# Creating Multi-level Indexes
result_multi = pd.concat([df1, df2], keys=['df1', 'df2'])

# Output will have a hierarchical index
print(result_multi)

Through these different index-handling options, pandas concat provides a flexible way to manage how your concatenated DataFrame is indexed.

 

4. Handling Missing Data in Concatenation

When concatenating DataFrames using pandas concat, it's common to encounter missing data, especially if the DataFrames you are combining don't have the same columns or rows. Knowing how to handle these situations is essential for maintaining data integrity.

 

4.1 What Happens When There Are Missing Columns or Rows?

If you concatenate DataFrames that don't have the same columns, Pandas will introduce NaN (Not a Number) values for the columns that are missing in some DataFrames.

Example:

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'C': [7, 8]})

# Columns 'B' and 'C' are not common between df1 and df2
result_missing_columns = pd.concat([df1, df2])

# Output will contain NaN values in columns 'B' and 'C'
print(result_missing_columns)

Similarly, if you perform a column-wise concatenation (axis=1) of DataFrames with differing row indices, Pandas will introduce NaN values for the rows that don't align.

 

4.2 Using fillna to Replace NaN Values

Once you have NaN values in your DataFrame, you can choose to replace them with specific values or estimates using the fillna method.

Example:

# Replace NaN values with 0
result_filled = result_missing_columns.fillna(0)

# Output will have 0s instead of NaN
print(result_filled)

You can also replace NaN values in specific columns:

# Replace NaN values in column 'B' with the mean of column 'A'
result_missing_columns['B'] = result_missing_columns['B'].fillna(result_missing_columns['A'].mean())

# Output will have mean of column 'A' in place of NaN in column 'B'
print(result_missing_columns)

 

5. Advanced Concatenation with Pandas concat() Function

Once you're comfortable with the basics of pandas concat, you might find yourself needing more advanced concatenation techniques for complex data manipulation tasks. Here, we'll cover some advanced topics, such as dealing with different column names, different row indexes, and how to combine concat with other Pandas functions like groupby, merge, and join.

 

5.1 Example - Concatenation with Different Column Names

Sometimes the DataFrames you're working with have different column names but hold similar types of data. You can still use pandas concat to concatenate these DataFrames.

Example:

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'X': [5, 6], 'Y': [7, 8]})

# Concatenating DataFrames with different column names
result_diff_columns = pd.concat([df1, df2], axis=0)

# Output will contain NaN values due to different column names
print(result_diff_columns)

 

5.2 Example - Concatenation with Different Row Indexes

If you concatenate DataFrames column-wise (axis=1) and they have different row indexes, pandas concat will introduce NaN values where the indexes don't align.

Example:

df3 = pd.DataFrame({'A': [9, 10]}, index=['x', 'y'])

# Concatenating DataFrames with different row indexes
result_diff_index = pd.concat([df1, df3], axis=1)

# Output will contain NaN values where indexes don't align
print(result_diff_index)

 

5.3 Example - Combining concat with groupby, merge, and join

The real power of pandas concat emerges when you combine it with other operations like groupby, merge, and join.

Example with groupby:

# Assume each DataFrame represents data from different years
df1['Year'] = 2020
df2['Year'] = 2021

# Concatenate and then group by 'Year'
result_groupby = pd.concat([df1, df2], keys=['2020', '2021'])
grouped = result_groupby.groupby('Year').sum()

# Output will be the sum of values grouped by 'Year'
print(grouped)

Example with merge:

# Assume you have a third DataFrame with additional info
df_info = pd.DataFrame({'A': [1, 5], 'Info': ['Type1', 'Type2']})

# Concatenate df1 and df2, and then merge with df_info
result_merge = pd.concat([df1, df2], ignore_index=True)
result_merge = pd.merge(result_merge, df_info, on='A', how='left')

# Output will contain additional 'Info' column
print(result_merge)

Example with join:

# Assume df1 and df3 contain related data but have different indexes
result_join = df1.join(df3, how='outer', lsuffix='_df1', rsuffix='_df3')

# Output will be an outer join of df1 and df3
print(result_join)

 

6. Practical Examples and Use Cases

In this section, we'll explore some of the most common challenges that users face while using Pandas concat() Function

1. Concatenating DataFrames with Different Column Names

Problem: Users often ask how to concatenate DataFrames that have different column names.

Example: Let's say DataFrame df1 has columns ['A', 'B'] and DataFrame df2 has columns ['X', 'Y'].

Solution: You can concatenate them and get a DataFrame with all four columns, but the values will be filled with NaN where information is not available.

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'X': [5, 6], 'Y': [7, 8]})

result = pd.concat([df1, df2], axis=0)

# Expected Output:
#     A    B    X    Y
# 0  1.0  3.0  NaN  NaN
# 1  2.0  4.0  NaN  NaN
# 0  NaN  NaN  5.0  7.0
# 1  NaN  NaN  6.0  8.0
print(result)

2. How to Keep the Original Indices After Concatenation

Problem: Users often want to keep the original index values after using pandas concat.

Example: DataFrame df1 and df2 both have an index ranging from 0 to 2.

Solution: Using the default settings of pd.concat() will retain the original indices.

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

result = pd.concat([df1, df2])

# Expected Output:
#    A  B
# 0  1  3
# 1  2  4
# 0  5  7
# 1  6  8
print(result)

Output: The resulting DataFrame will have duplicated indices, each ranging from 0 to 2, unless ignore_index is set to True.

3. Concatenating a List of DataFrames in a Loop

Problem: Users sometimes need to concatenate a list of DataFrames in a loop, which can be inefficient.

Example: You have a list of DataFrames [df1, df2, df3].

Solution: It's more efficient to concatenate them all in a single line instead of using a loop.

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
df3 = pd.DataFrame({'A': [9, 10], 'B': [11, 12]})

result = pd.concat([df1, df2, df3])

# Expected Output:
#     A   B
# 0   1   3
# 1   2   4
# 0   5   7
# 1   6   8
# 0   9  11
# 1  10  12
print(result)

Output: This produces a single DataFrame that contains all the rows from df1, df2, and df3.

4. Concatenating DataFrames with Different Row Indices

Problem: How to concatenate DataFrames that have different row indices?

Example: DataFrame df1 has a row index of ['a', 'b'] and df2 has a row index of ['x', 'y'].

Solution: If you use pandas concat with axis=0, the row indices will be retained, leading to a DataFrame with indices ['a', 'b', 'x', 'y'].

df1 = pd.DataFrame({'A': [1, 2]}, index=['a', 'b'])
df2 = pd.DataFrame({'A': [3, 4]}, index=['x', 'y'])

result = pd.concat([df1, df2], axis=0)

# Expected Output:
#    A
# a  1
# b  2
# x  3
# y  4
print(result)

Output: The resulting DataFrame will have indices ['a', 'b', 'x', 'y'].

5. Handling NaN Values After Concatenation

Problem: What to do if the concatenated DataFrame has NaN values?

Example: DataFrame df1 has columns ['A', 'B'] and df2 has columns ['A', 'C'].

Solution: You can use fillna to replace NaN values after concatenation.

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'C': [7, 8]})

result = pd.concat([df1, df2], axis=0)
result.fillna(0, inplace=True)

# Expected Output:
#     A    B    C
# 0  1.0  3.0  0.0
# 1  2.0  4.0  0.0
# 0  5.0  0.0  7.0
# 1  6.0  0.0  8.0
print(result)

Output: The NaN values in the resulting DataFrame will be replaced with 0.

 

7. Best Practices for Pandas concat() function

Understanding when to use pandas concat versus other functions like merge and join, as well as considerations for performance, can help you write efficient and maintainable code.

 

7.1 When to Use pandas.concat vs merge and join

Use pandas concat when you want simple stacking of DataFrames either vertically or horizontally.

# Vertical stacking with `concat`
result_vertical = pd.concat([df1, df2], axis=0)

Use merge when you want to combine DataFrames based on a common column.

Example:

# Merging on a common column 'A'
result_merge = pd.merge(df1, df2, on='A', how='inner')

Use join when you want to combine DataFrames based on index.

Example:

# Joining on index
result_join = df1.join(df2, how='outer', lsuffix='_df1', rsuffix='_df2')

 

7.2 Performance Considerations

Preallocate Resulting DataFrame: When you are concatenating a large number of DataFrames in a loop, it's generally more efficient to preallocate the resulting DataFrame and populate it, rather than using concat within the loop.Example:

# Inefficient
result = pd.DataFrame()
for df in list_of_dfs:
    result = pd.concat([result, df])

# Efficient
result = pd.concat(list_of_dfs, axis=0)

Avoid Using ignore_index=True Unnecessarily: Resetting the index by using ignore_index=True may lead to additional computational overhead. Use it only when you're sure you don't need the existing index.

Example:

# If retaining index is not necessary
result_ignore = pd.concat([df1, df2], ignore_index=True)

Be Mindful of Data Types: When using pandas concat, ensure that the data types of columns are consistent across all DataFrames to avoid type upcasting, which may consume more memory.

Example:

# Ensure consistent data types
df1['A'] = df1['A'].astype('int32')
df2['A'] = df2['A'].astype('int32')

 

8. Common Pitfalls and How to Avoid Them

While pandas concat is an incredibly useful function, there are some pitfalls that users commonly encounter. Understanding these can save you a lot of time and trouble.

1. Duplicated Indices

When using pandas concat, it's possible to end up with duplicated index labels, especially if you are concatenating DataFrames with overlapping indices and you've not specified to ignore or regenerate the indices.

Example:

# Create DataFrames with overlapping indices
df1 = pd.DataFrame({'A': [1, 2]}, index=['x', 'y'])
df2 = pd.DataFrame({'B': [3, 4]}, index=['x', 'y'])

# Using `concat` without resetting the index
result_duplicated = pd.concat([df1, df2], axis=0)

# Output will have duplicated indices
print(result_duplicated)

How to Avoid:

You can avoid this issue by setting the ignore_index parameter to True, or by using the verify_integrity parameter to raise an error if there are overlapping indices.

# Using `ignore_index` to reset index
result_ignore = pd.concat([df1, df2], ignore_index=True, axis=0)

# Using `verify_integrity` to check for overlapping indices
try:
    result_verify = pd.concat([df1, df2], verify_integrity=True, axis=0)
except ValueError as e:
    print(f"ValueError encountered: {e}")

2. Memory Issues

Concatenating very large DataFrames can lead to memory issues. This is particularly true if you're working in a constrained memory environment.

How to Avoid:

You can avoid memory issues by using the copy parameter. Setting it to False tells Pandas not to copy data unnecessarily, saving memory.

# Using `copy=False` to save memory
result_memory = pd.concat([df1, df2], copy=False)

3. Data Type Preservation

When using pandas concat, the data type of columns should ideally be consistent across all DataFrames to avoid type upcasting, which can consume more memory and lead to unexpected behaviors.

# DataFrame with integers
df1 = pd.DataFrame({'A': [1, 2]})

# DataFrame with floats
df2 = pd.DataFrame({'A': [1.1, 2.2]})

# Using `concat`
result_dtype = pd.concat([df1, df2])

# Column 'A' has been upcast to float
print(result_dtype.dtypes)

How to Avoid:

Ensure that the data types are consistent before using pandas concat.

# Ensure data types are consistent
df1['A'] = df1['A'].astype('float64')
df2['A'] = df2['A'].astype('float64')

result_dtype_fixed = pd.concat([df1, df2])

# Now the data types are consistent
print(result_dtype_fixed.dtypes)

 

9. Frequently Asked Questions

Can I use pandas concat to merge two DataFrames vertically and horizontally at the same time?

No, pandas concat only allows concatenation either row-wise (axis=0) or column-wise (axis=1) at a time. You cannot concatenate both vertically and horizontally in a single operation.

How do I handle duplicate index values when using pandas concat?

You have several options to deal with duplicate index values:
1. Use the ignore_index=True parameter to reset the index.
2. Use the keys parameter to create a multi-level index that distinguishes between the source DataFrames.
3. Use the verify_integrity=True parameter to raise an error if the operation would result in duplicate indices.

What happens when there are missing columns or rows?

If the DataFrames have different columns or indexes and you're using the default outer join, pandas concat will include all columns or rows but will fill the missing entries with NaN (Not a Number).

How can I replace NaN values that appear after concatenation?

You can use the fillna method on the concatenated DataFrame to replace NaN values with a value of your choice. For example, if your DataFrame is stored in a variable called df, you could use df.fillna(0) to replace all NaN values with zeros.

Is pandas concat the same as merge and join?

No, each function has its specific use-cases:
1. Use pandas concat for simple stacking of DataFrames either vertically or horizontally.
2. Use merge to combine DataFrames based on a common column.
3. Use join to combine DataFrames based on index.

How can I improve the performance of pandas concat?

If you're concatenating multiple DataFrames, consider doing it in a single concat call rather than using multiple calls in a loop. Also, using the copy=False parameter can save memory.

Can I concatenate DataFrames with different data types?

Yes, you can, but be aware that this might lead to type upcasting, which could consume more memory and lead to unexpected behavior.

What are the most common pitfalls to avoid?

The most common pitfalls include duplicated indices, memory issues due to large DataFrames, and data type inconsistencies. Understanding these and how to avoid them can make your usage of pandas concat more effective.

 

10. Summary

  • pandas concat is a powerful function for concatenating DataFrames either row-wise (axis=0) or column-wise (axis=1).
  • Important parameters like axis, join, ignore_index, and keys offer control over how DataFrames are concatenated.
  • Handling of missing data, indexes, and duplicate indices are key areas where caution is required.
  • Real-world examples illustrated the utility of pandas concat in scenarios such as merging monthly sales data, combining datasets with different columns, and using multi-level indexing in financial data.

When to Use pandas concat

  • Use pandas concat when you need to simply stack DataFrames either vertically or horizontally.
  • It's especially useful when the DataFrames to be combined don't share a common key, or when you need to maintain the existing indices.

Other Alternatives

  • merge: When you want to combine DataFrames based on a common column.
  • join: For combining DataFrames based on their indexes.

References: The official Pandas documentation offers in-depth explanations, examples, and best practices related to pandas concat.

 

Deepak Prasad

Deepak Prasad

He is the founder of GoLinuxCloud and brings over a decade of expertise in Linux, Python, Go, Laravel, DevOps, Kubernetes, Git, Shell scripting, OpenShift, AWS, Networking, and Security. With extensive experience, he excels in various domains, from development to DevOps, Networking, and Security, ensuring robust and efficient solutions for diverse projects. You can connect with him on his LinkedIn profile.

Can't find what you're searching for? Let us assist you.

Enter your query below, and we'll provide instant results tailored to your needs.

If my articles on GoLinuxCloud has helped you, kindly consider buying me a coffee as a token of appreciation.

Buy GoLinuxCloud a Coffee

For any other feedbacks or questions you can send mail to admin@golinuxcloud.com

Thank You for your support!!

Leave a Comment