Pandas merge, concat, append, join dataframe - Examples

Related Searches: pandas merge, pandas concat, pd concat, pd merge, pd join, pandas append, dataframe append, pandas left join, pandas merge dataframes, merge two dataframes pandas, pandas merge on index, pandas join dataframes, append dataframe pandas, pandas concatenate, pandas append dataframe, dataframe merge, pandas dataframe append, pandas concat two dataframes, df merge, dataframe join, pandas merge on multiple columns, pandas join two dataframes, python merge, pd merge on multiple columns, join two dataframes pandas, pandas dataframe merge, combine two dataframes pandas, merge dataframe pandas, merge dataframes, python concat, dataframe concat, pandas dataframe join, concat dataframe pandas, pandas join vs merge, pandas merge multiple dataframes, concatenate two dataframes pandas, join dataframes pandas, pandas inner join, combine dataframes pandas

Introduction to pandas merge method

The dataframe and series in pandas are one of the powerful tools to explore and analyze data. One of the most important features of these data sets is their approach to combine separate datasets. With pandas, we can merge and concatenate different datasets together. In this tutorial, we will cover how we can merge and concatenate different datasets in pandas. we will learn about different techniques of merging inlcuding let merging, right merging, inner merging, and outer merging by working on practical examples. Moreover, we will also discuss pandas concat method as well which is another way of merging different datasets together.

Install Python Panda Module

By default pandas module may not be installed on your distro so you may get following error while trying to execute your python code using pandas module:

ModuleNotFoundError: No module named 'pandas'

So you must first install this module. One RHEL/Rocky Linux/CentOS/Fedora distributions you can use dnf while for Debia/Ubuntu you can use apt package manager:

# dnf -y install python3-pandas.x86_64

HINT

On RHEL 8 environment, I was getting nothing provides libqhull.so.7()(64bit) needed by python3-matplotlib-3.0.3-3.el8.x86_64 error while installing python3-panda. To overcome this we need to enable "codeready-builder" repo using subscription-manager repos --enable "codeready-builder-for-rhel-8-{ARCH}-rpms". Here, replace {ARCH} with your architecture value

Alternatively you can install pip3 and then use pip to install panda module:

# dnf -y install python3-pip

Now you can use pip3 to install the panda module:

Getting start with pandas merge method

Pandas merge method allows us to create better datasets which helps us to get efficient and accurate results when trying to analyze data. merge() method is mostly used to combine data objects based on one or more keys in a similar way to a relational database. In simple words merge() is useful when we want to combine rows that share data. Let us suppose that we have two csv files containing different data.

HINT

How to read and write in CSV File using Python Programming

CSV_file_one.csv contains the following data:

 student_id, Name, Gender
 5, Erlan, Male
10, Alex, Male 
15, soro, Female 
 20, Khan, Male
25, ateeq, Male
30, MD, Female
35, JK, Male

And CSV_file_two.csv contains the following data:

student_id, major
10, CS
20, MBBS
30, CS
40, CS
50, MBBS

In this section, we will cover different ways to combine the above data using the pandas merge method.

Example of left merging in pandas

Pandas left merge is mostly concerned about the data on the left side and add data from the right side if it has some of the dame ids. To apply left merging, we chop up the rows in the right dataframe and glue a piece of it onto the left dataframe. If something in the right dataframe does not match or doesn’t exist, pandas keep the same length of columns by adding NaN to fill the missing column. Here is a pictorial representation of left merging.

Pandas merge, concat, append, join dataframe - Examples

Pandas Left Merge

The simple syntax of left merging in pandas looks like this:

pd.merge(left=left_dataframe_name, right=right_dataframe_name, on=”ID”, how=”left”)

Let us take a practical example and see how left merging works in pandas. We will merge the two csv file given above.

# importing pandas
import pandas as pd

# reading the csv files
left_data = pd.<b>read_csv</b>("CSV_file_one.csv")
right_data = pd.<b>read_csv</b>("CSV_file_two.csv")

# Left merge on "student_ID" column
merged_data = pd.<b>merge</b>(left_data,right_data, on="student_id", how="left")

# printing
<b>print</b>(merged_data)

Output:

 student_id  Name  Gender  major
0      5  Erlan   Male   NaN
1     10  Alex   Male   CS
2     15  soro  Female   NaN
3     20  Khan   Male  MBBS
4     25  ateeq   Male  NaN
5     30   MD  Female   CS
6     35   JK   Male  NaN

You can see that the left merge takes all data from the left dataframe and takes only required data ( data that matches) with the left one.

Example of right merging in pandas

The right merge is very much similar to the left one with the difference that it mostly cares about the right side and adds the data from the left if it matches with the IDs. Here is the pictorial representation of right merging.

Pandas Right Merge

The simple syntax of right merging is very much similar to left merging.

pd.merge(left=left_dataframe_name, right=right_dataframe_name, on=”ID”, how=”right”)

Everything is the same, except we have to replace left with right. Now let us take example of right merging and merging data in the two csv files.

python

# importing pandas
import pandas as pd

# reading the csv files
left_data = pd.<b>read_csv</b>("CSV_file_one.csv")
right_data = pd.<b>read_csv</b>("CSV_file_two.csv")

# right merge on "student_ID" column
merged_data = pd.<b>merge</b>(left_data,right_data, on="student_id", how="right")

# printing
<b>print</b>(merged_data)

Output:

 student_id  Name Gender  major
0     10  Alex  Male   CS
1     20  Khan   Male  MBBS
2     30   MD  Female   CS
3     40   NaN   NaN   CS
4     50   NaN   NaN  MBBS

This time , the right merging methods take all the data from the right dataframe and take only required ( same id ones) from the left and merge them together.

Example of inner merging in pandas

Inner merging is very much similar to the intersection process. In this merging, pandas takes both data frames and merge the staff/data that matches. If an ID dont found in both data frames, pandas will not add that data, nor add NaN. Here is a pictorial representation of inner merging in pandas.

Pandas Inner Merge

Now let us look at the syntax of pandas inner merging.

pd.merge(left=left_dataframe_name, right=right_dataframe_name, on=”ID”, how=”inner”)

We just have to add inner to the code, the rest will be similar to right and left merging. See the example below:

# importing pandas
import pandas as pd

# reading the csv files
left_data = pd.<b>read_csv</b>("CSV_file_one.csv")
right_data = pd.<b>read_csv</b>("CSV_file_two.csv")

# inner merge on "student_ID" column
merged_data = pd.<b>merge</b>(left_data,right_data, on="student_id", how="inner")

# printing
<b>print</b>(merged_data)

Output:

 student_id  Name Gender  major
0     10  Alex  Male   CS
1     20  Khan   Male  MBBS
2     30   MD  Female   CS

You can see the inner merging only takes and gives an output of data that is common(matches) in both data frames.

Example of outer merging in pandas

Outer merging in pandas is very much similar to union, which takes all the data from both data frames and adds NaN to fill the blanks, if any. See the pictorial representation of outer merging.

Pandas Outer Merge

The syntax of outer merging is very much similar to other ones. See the syntax below:

pd.merge(left=left_dataframe_name, right=right_dataframe_name, on=”ID”, how=”outer”)

Now let us take one example and merge the two data frames using the outer method.

# importing pandas
import pandas as pd

# reading the csv files
left_data = pd.<b>read_csv</b>("CSV_file_one.csv")
right_data = pd.<b>read_csv</b>("CSV_file_two.csv")

# outer merge on "student_ID" column
merged_data = pd.<b>merge</b>(left_data,right_data, on="student_id", how="outer")

# printing
<b>print</b>(merged_data)

Output:

 student_id  Name  Gender  major
0      5  Erlan   Male   NaN
1     10  Alex   Male   CS
2     15  soro  Female   NaN
3     20  Khan   Male  MBBS
4     25  ateeq   Male  NaN
5     30   MD  Female   CS
6     35   JK   Male  NaN
7     40   NaN    NaN   CS
8     50   NaN    NaN  MBBS

Notice that it takes all the data from both dataframes, and merges them together and adds NaNs to the blanks if any.

Example of merging on index in pandas

o far we have used the specified column in the “on” parameter in the merge. Instead of specifying columns, we can merge two dataframe based on indexes. The syntax of the merging on index pandas looks like this:

merged_data = pd.merge(left_data,right_data, left_index=True, right_index=True)

Now let us take the example of the two csv files and merge them based on indexing.

# importing pandas
import pandas as pd

# reading the csv files
left_data = pd.<b>read_csv</b>("CSV_file_one.csv")
right_data = pd.<b>read_csv</b>("CSV_file_two.csv")

# inner merging using indexing method
merged_data = pd.<b>merge</b>(left_data,right_data, how="inner", left_index=True, right_index=True)

# printing
<b>print</b>(merged_data)

Output:

 student_id_x  Name  Gender  student_id_y major
0       5  Erlan   Male       10   CS
1      10  Alex   Male       20  MBBS
2      15  soro  Female       30   CS
3      20  Khan   Male      40   CS
4      25  ateeq   Male      50  MBBS

Pandas concat method to merge datasets

There is a little difference between pandas merging and concatenation methods. With the help of merging, we can get the result dataset in the form of rows from the parent datasets mixed together. Sometimes we might also lose rows that do not have matches in the dataset depending on the type of merging. While concatenation just stitches data together along an axis either in the row axis or column axis.

The simple syntax of pandas concatenation looks like this:

concated_data = pd.concat([dataframe_one, dataframe_two])

Now let us take an example of the two csv files and merge them together using pandas concat() method.

import pandas as pd

# reading the csv files
left_data = pd.<b>read_csv</b>("CSV_file_one.csv")
right_data = pd.<b>read_csv</b>("CSV_file_two.csv")

# pandas merging using concat() method
merged_data = pd.<b>concat</b>([left_data, right_data])

# printing
<b>print</b>(merged_data)

Output:

0      5  Erlan   Male   NaN
1     10  Alex   Male   NaN
2     15  soro  Female   NaN
3     20  Khan   Male  NaN
4     25  ateeq   Male  NaN
5     30   MD  Female  NaN
6     35   JK   Male  NaN
0     10   NaN    NaN   CS
1     20   NaN    NaN  MBBS
2     30   NaN    NaN   CS
3     40   NaN    NaN   CS
4     50   NaN    NaN  MBBS

Notice that it takes all the rows (data) and columns along with the indexing number, adds them together and puts NaN in the missing data.

By default, concatenation is a set of a union, where all data is merged together. The concat() method can take some optional parameters as well. The following list gives us information about some of those parameters.

axis : help us to specify the axis position. By default the value is 0, which concatenates along the index while 1 concatenates along columns.
objs : this takes any sequence of series or dataframe objects to be concatenated.
join : by default the value is outer, we can specify and make it inner merging as well.
key: this parameter allows us to construct a hierarchical index.

Now let us take the same example of two csv files and concatenate them by explicitly mentioning the method(inner). See the example below:

# importing pandas
import pandas as pd

# reading the csv files
left_data = pd.<b>read_csv</b>("CSV_file_one.csv")
right_data = pd.<b>read_csv</b>("CSV_file_two.csv")

# pandas merging using concat() method
merged_data = pd.<b>concat</b>([left_data, right_data], join="inner")

# printing
<b>print</b>(merged_data)

Output:

This prints all the rows with only one column (the student_id) because by default the axis 0 which means row. See the example below which concatenates column wise.

# importing pandas
import pandas as pd

# reading the csv files
left_data = pd.<b>read_csv</b>("CSV_file_one.csv")
right_data = pd.<b>read_csv</b>("CSV_file_two.csv")

# pandas merging using concat() method
merged_data = pd.<b>concat</b>([left_data, right_data], join="inner", axis=1)

# printing
<b>print</b>(merged_data)

Output:

 student_id  Name  Gender  student_id major
0      5  Erlan   Male      10   CS
1     10  Alex   Male      20  MBBS
2     15  soro  Female      30   CS
3     20  Khan   Male     40   CS
4     25  ateeq   Male     50  MBBS

Pandas join method to merge dataframe

We know that merge() is a modular function while join() is an object function which lives on a dataframe that helps us to specify only one dataframe which will join. Syntax is similar to the panda merge method. See the syntax below:

dataframe1_name.join( dataframe2_name, optional_parameters)

Now let us take a practical example of the two csv files and merged them using join method.

# importing pandas
import pandas as pd

# reading the csv files
left_data = pd.<b>read_csv</b>("CSV_file_one.csv")
right_data = pd.<b>read_csv</b>("CSV_file_two.csv")

# pandas merging using join method
merged_data = left_data.<b>join</b>(right_data)

# printing
<b>print</b>(merged_data)

Output:

 student_id  Name  Gender  student_id major
0      5  Erlan   Male     10.0   CS
1      10  Alex   Male     20.0  MBBS
2      15  soro  Female     30.0   CS
3      20  Khan   Male    40.0   CS
4      25  ateeq   Male    50.0  MBBS
5      30   MD  Female     NaN  NaN
6      35   JK   Male     NaN  NaN

You can see that the join method merges two dataframes and adds NaN to the black areas.

Here is a description of the parameters that join method can take:

other : this is a required parameter which specifies the name of another dataframe. We can also specify a list of dataframe as well.
on : Optional parameter which specifies an optional column or index name for the left datframe. By default the value is set to None
how: as name suggests, is the same as how the parameter in merge works with a difference that this time it is indexed-based.
lsuffix and rsuffix : similar to the suffixes in merge one. It specifies a suffix to add to any overlapping columns.

As you are now familiar with different parameters that we can use with the join method. Let us take the same example and perform an inner join using the join method.

# importing pandas
import pandas as pd

# reading the csv files
left_data = pd.<b>read_csv</b>("CSV_file_one.csv")
right_data = pd.<b>read_csv</b>("CSV_file_two.csv")

# pandas merging using join method
merged_data = left_data.<b>join</b>(right_data, how="inner")

# printing
<b>print</b>(merged_data)

Output:

 student_id  Name  Gender  student_id major
0      5  Erlan   Male      10   CS
1      10  Alex   Male      20  MBBS
2      15  soro  Female      30   CS
3      20  Khan   Male     40   CS
4      25  ateeq   Male     50  MBBS

Pandas append method to merge dataframes

The append method in pandas is used to append rows of one dataframe to the end of a given dataframe, and return a new dataframe object. Columns which are not in the original dataframe are added as new columns and NaN is added in new cells. Here is a simple syntax of append method in pandas.

dataframe1_name.append(dataframe2_name)

Now let us take the practical example of the two csv files and append one file to the end of another.

# importing pandas
import pandas as pd

# reading the csv files
left_data = pd.<b>read_csv</b>("CSV_file_one.csv")
right_data = pd.<b>read_csv</b>("CSV_file_two.csv")

# pandas merging using join method
merged_data = left_data.<b>append</b>(right_data)

# printing
<b>print</b>(merged_data)

Output:

 student_id  Name  Gender  student_id major
0     5.0  Erlan   Male     NaN  NaN
1     10.0  Alex   Male     NaN  NaN
2     15.0  soro  Female     NaN  NaN
3     20.0  Khan   Male     NaN  NaN
4     25.0  ateeq   Male     NaN  NaN
5     30.0   MD  Female     NaN  NaN
6     35.0   JK   Male     NaN  NaN
0     NaN   NaN    NaN    10.0   CS
1     NaN   NaN    NaN    20.0  MBBS
2     NaN   NaN    NaN    30.0   CS
3     NaN   NaN    NaN    40.0   CS
4     NaN   NaN    NaN    50.0  MBBS

Notice that it added new columns which were not in the previous one ( major, student_id) and added NaN to the cells.

Summary

Pandas merge and pandas concat methods allow us to create better datasets for analyzing and filtering data. There are different methods of merging available in pandas for example left merging, right merging, inner merging, and outer merging. In this tutorial, we learned about all these pandas merging methods along with examples. We also covered the pandas concat method to merge data by working on practical examples. In a nutshell, we cover everything that you need to know in order to understand and use pandas merge method in practical problems.