Pandas dataframe explained with simple examples


 

Introduction to pandas dataframe

Pandas has two powerful data structures: the dataframe and the series. A dataframe is a table with multiple columns, much like a SQL table or an Excel sheet. Pandas allows us to perform different operations on dataframes, such as filtering, aggregating, selecting, and deleting data. Dataframes are particularly useful because of the many methods built into them. They can be created from a number of different data structures and files, including lists, dictionaries, CSV files, and Excel files. In this tutorial, we will learn to create pandas dataframes from lists, dictionaries, and NumPy arrays. We will also cover operations that we can perform on a pandas dataframe, including selecting, deleting, and adding rows and columns.


 

Getting started with pandas dataframe

A pandas dataframe is a data structure that organizes data in two dimensions: rows and columns. The pandas module is not part of the Python standard library, so we have to install it in our environment before using it. We can install pandas with the pip command from the terminal. See the example below:

pip install pandas

Once pandas is installed, we are ready to use it. In this section, we will see how we can create a pandas dataframe from various data sets.

 

Difference between pandas dataframe and series

Before jumping into pandas dataframes, let us first clear up the difference between a dataframe and a series:

  • A series is one-dimensional, while a dataframe is two-dimensional.
  • A series holds a single sequence of values with an index, whereas a dataframe can be made up of more than one series.
  • A series has at most a single name, whereas every column of a dataframe has its own label (the header).
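The differences listed above can be checked directly. The following is a minimal sketch; the variable names ages, names, and people are chosen only for illustration.

```python
import pandas as pd

# a Series: one-dimensional, a single sequence of values with an index
ages = pd.Series([21, 23, 19])
names = pd.Series(["Bashir", "Alam", "Arlen"])

# a DataFrame: two-dimensional, built here from two Series of equal length
people = pd.DataFrame({"names": names, "age": ages})

print(ages.ndim)                     # number of dimensions of the Series
print(people.ndim)                   # number of dimensions of the DataFrame
print(type(people["age"]).__name__)  # a single column comes back as a Series
```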


 

Create pandas dataframe with a dictionary

We can create a pandas dataframe from scratch using a dictionary. The keys of the dictionary become the column labels, and the dictionary values become the data in the corresponding dataframe columns.

Here is a simple syntax of python pandas to convert a dictionary to a dataframe.

Data_frame_name = pd.DataFrame(dic_name)

See the following example which creates a pandas dataframe using a dictionary.

# import pandas
import pandas as pd

# python dictionary
my_dic = {"names": ["Bashir", "Alam", "Arlen"], "age":[21, 23, 19]}

# creating pandas dataframe from dictionary
my_dataframe = pd.DataFrame(my_dic)

# printing
print(my_dataframe)

Output:

    names  age
0  Bashir   21
1    Alam   23
2   Arlen   19

We use the pd.DataFrame() constructor to convert the data set into a pandas dataframe. Notice that the first line of the output holds the column labels, taken from the dictionary keys, and that each data row starts with an automatically generated index. Because we did not provide any row labels, pandas numbers the rows 0, 1, 2, and so on. We can replace this default index with our own labels. See the example below:

# import pandas
import pandas as pd

# python dictionary
my_dic = {"names": ["Bashir", "Alam", "Arlen"], "age":[21, 23, 19]}

# creating pandas dataframe from dictionary
my_dataframe = pd.DataFrame(my_dic, index=["row1", "row2", "row3"])

# printing
print(my_dataframe)

Output:

       names  age
row1  Bashir   21
row2    Alam   23
row3   Arlen   19

To change the default indexing, we pass one more argument, index, to pd.DataFrame().
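To see what the index argument changes, we can inspect the index and columns attributes of both dataframes. This is a small sketch; default_df and custom_df are names introduced here for illustration.

```python
import pandas as pd

my_dic = {"names": ["Bashir", "Alam", "Arlen"], "age": [21, 23, 19]}

# without index=, pandas generates integer row labels
default_df = pd.DataFrame(my_dic)

# with index=, the given labels are used instead
custom_df = pd.DataFrame(my_dic, index=["row1", "row2", "row3"])

print(list(default_df.index))    # row labels generated by pandas
print(list(custom_df.index))     # row labels passed through index=
print(list(custom_df.columns))   # column labels come from the dictionary keys
```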

 

Create pandas dataframe with a list

Another way to create a pandas dataframe from scratch is to use nested lists or a list of dictionaries. With nested lists, each inner list becomes one row of the dataframe. The simple syntax of creating a pandas dataframe from a list looks like this:

name_of_dataframe = pd.DataFrame(name_of_list, columns=list_containing_names)

Now let us take a practical example and create a pandas dataframe from a nested list.

# import pandas
import pandas as pd

# python nested list
my_list = [[1, 2, 3], [4, 5, 6],[7, 8, 9]]

# creating pandas dataframe from nested list
labels = ["data1", "data2", "data3"]

my_dataframe = pd.DataFrame(my_list, columns=labels)

# printing
print(my_dataframe)

Output:

   data1  data2  data3
0      1      2      3
1      4      5      6
2      7      8      9

In a similar way, we can create a pandas dataframe from a list of dictionaries as well. See the example below:

# import pandas
import pandas as pd

# python list containing dictionaries
my_list = [{"data1": 1, "data2": 2, "data3": 3},
           {"data1": 4, "data2": 5, "data3": 6},
           {"data1": 7, "data2": 8, "data3": 9}]

# creating pandas dataframe from list containing dictionaries
my_dataframe = pd.DataFrame(my_list)

# printing
print(my_dataframe)

Output:

   data1  data2  data3
0      1      2      3
1      4      5      6
2      7      8      9
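With a list of dictionaries, the dictionaries do not all need to have the same keys. The sketch below, which uses a made-up records list, shows that pandas fills the missing entries with NaN.

```python
import pandas as pd

# hypothetical records: each dictionary is missing one of the three keys
records = [{"data1": 1, "data2": 2}, {"data1": 4, "data3": 6}]

sparse_dataframe = pd.DataFrame(records)

# keys that are absent from a dictionary show up as NaN in that row
print(sparse_dataframe)

# count the missing values in the whole dataframe
print(sparse_dataframe.isna().sum().sum())
```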

In the same way, we can create a pandas dataframe from a list of tuples. See the example below, which creates a pandas dataframe from a list containing tuples.

# import pandas
import pandas as pd

# python list containing tuples
my_list = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]

# creating pandas dataframe from list containing tuples
labels = ["data1", "data2", "data3"]
my_dataframe = pd.DataFrame(my_list, columns=labels)

# printing
print(my_dataframe)

Output:

   data1  data2  data3
0      1      2      3
1      4      5      6
2      7      8      9

 

Create pandas dataframe with NumPy array

To create a pandas dataframe from a NumPy array, we first need a NumPy array. NumPy is a required dependency of pandas, so pip normally installs it together with pandas; if it is missing, we can install it with pip install numpy. Once we have created a NumPy array, we can pass it to pd.DataFrame(). Here is the simple syntax for creating a dataframe from a NumPy array.

Name_of_dataframe = pd.DataFrame(numpy_array, columns = [list_of_columns_name])

Now let us create a pandas dataframe from a numpy array. See the example below:

# import pandas
import pandas as pd

# importing numpy
import numpy as np

# creating numpy array
numpy_array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
labels = ["data1", "data2", "data3"]

# creating dataframe from numpy array
my_dataframe = pd.DataFrame(numpy_array, columns=labels)

# printing
print(my_dataframe)

Output:

   data1  data2  data3
0      1      2      3
1      4      5      6
2      7      8      9

We can change the row indexing in a similar way as we did before by adding an indexing argument and passing a list containing indices. See the example below:

# import pandas
import pandas as pd

# importing numpy
import numpy as np

# creating numpy array
numpy_array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
labels = ["data1", "data2", "data3"]

# creating dataframe from numpy array
my_dataframe = pd.DataFrame(numpy_array, index=["row1", "row2", "row3"], columns=labels)

# printing
print(my_dataframe)

Output:

      data1  data2  data3
row1      1      2      3
row2      4      5      6
row3      7      8      9
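As a quick check, the dataframe keeps the shape of the array it was built from and can be converted back with to_numpy(). This is only a sketch; round_trip is a name chosen here for illustration.

```python
import numpy as np
import pandas as pd

numpy_array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
my_dataframe = pd.DataFrame(
    numpy_array,
    index=["row1", "row2", "row3"],
    columns=["data1", "data2", "data3"],
)

# (number of rows, number of columns)
print(my_dataframe.shape)

# convert the dataframe back to a NumPy array and compare with the original
round_trip = my_dataframe.to_numpy()
print(np.array_equal(round_trip, numpy_array))
```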

 

Selecting operations on pandas dataframe

Now we know several ways to create a pandas dataframe. In this section, we will learn how to perform selection operations on rows and columns and select specific data from the dataframe.

Let us say we have the following pandas dataframe.

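# import pandas and numpy
import numpy as np
import pandas as pd

# pandas dataframe
numpy_array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
labels = ["data1", "data2", "data3"]
my_dataframe = pd.DataFrame(numpy_array, index=["row1", "row2", "row3"], columns=labels)

# printing
print(my_dataframe)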

Output:

      data1  data2  data3
row1      1      2      3
row2      4      5      6
row3      7      8      9

Let us now apply different selection operations on the given dataframe.

 

Select a column in pandas dataframe

Selecting a particular column in a pandas dataframe is simple: we refer to the column by its name inside square brackets. The simple syntax of selecting a column looks like this:

column_name = name_of_dataframe['name_of_column']

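Now let us select the second column, which is named data2 in the above example. Writing the name inside a list (double square brackets) returns the result as a one-column dataframe, which is the form shown in the output below.

# selecting specific column
second_column = my_dataframe[['data2']]

# printing
print(second_column)

Output:

      data2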
row1      2
row2      5
row3      8
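The difference between single and double square brackets can be seen in the type of the result. The following is a minimal sketch; as_series and as_frame are names introduced here for illustration.

```python
import pandas as pd

my_dataframe = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    index=["row1", "row2", "row3"],
    columns=["data1", "data2", "data3"],
)

# single brackets with one name return a Series
as_series = my_dataframe["data2"]

# a list of names (double brackets) returns a DataFrame
as_frame = my_dataframe[["data2"]]

print(type(as_series).__name__)
print(type(as_frame).__name__)
print(as_series.tolist())
```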

In the same way we can also select multiple columns at the same time by writing the names in the form of a list. See the example below:

# selecting multiple columns
selected_columns = my_dataframe[['data2', 'data3']]

# printing
print(selected_columns)

Output:

      data2  data3
row1      2      3
row2      5      6
row3      8      9

 

Select a row from pandas dataframe

Selecting a row in a pandas dataframe is different from column selection. The .loc[] indexer selects rows by their index label. Rows can also be selected by integer position with the .iloc[] indexer. Both are used with square brackets rather than parentheses. The simple syntax of row selection in pandas looks like this:

row_name = dataframe_name.loc['index_of_row']

Now let us take the same example and select the first row using .loc[].

# selecting row in dataframe
row = my_dataframe.loc[['row1']]

# printing
print(row)

Output:

      data1  data2  data3
row1      1      2      3

In a similar way, we can select multiple rows at a time by providing a list of names/indices of rows. See the example below:

# selecting multiple row in dataframe
row = my_dataframe.loc[['row1', "row2"]]

# printing
print(row)

Output:

      data1  data2  data3
row1      1      2      3
row2      4      5      6
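When a single label is passed without a list, the row comes back as a series instead of a dataframe, and the same row can be reached by its integer position. The sketch below illustrates this; the variable names are chosen here only for illustration.

```python
import pandas as pd

my_dataframe = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    index=["row1", "row2", "row3"],
    columns=["data1", "data2", "data3"],
)

# a single label (no list) returns the row as a Series
row_as_series = my_dataframe.loc["row1"]

# .iloc[] selects by integer position; position 0 is the first row
first_row_by_position = my_dataframe.iloc[0]

print(type(row_as_series).__name__)
print(row_as_series.equals(first_row_by_position))
```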

 

Delete and Insert data in pandas dataframe

Pandas provides us with a number of techniques to insert and delete rows or columns. In this section, we will see through various examples how to add and delete rows and columns in a pandas dataframe. Let us say we have the same dataframe, my_dataframe, which contains the following data.

      data1  data2  data3
row1      1      2      3
row2      4      5      6
row3      7      8      9

Now let us see how we can delete and add new rows and columns.

 

Delete and add a column

Pandas provides a method called drop(), which removes the specified rows or columns and, by default, returns a new dataframe without changing the original. We specify the index label or column name to delete. The simple syntax of deleting a column from a pandas dataframe looks like this:

dataframe_name.drop(['column_name'], axis=1, optional_arguments)

The drop() method takes the following arguments:

  • labels: a single label or a list of labels referring to row or column names.
  • axis: 0 or 'index' for rows (the default), 1 or 'columns' for columns.
  • index or columns: a single label or a list of labels. They are an alternative to labels combined with axis and cannot be used together with labels.
  • level: specifies the level when the dataframe has a multi-level index.
  • inplace: modifies the original dataframe if True.
  • errors: when set to 'ignore', labels that do not exist are skipped without raising an error and the remaining labels are dropped.

Now let us take an example and delete the data2 column from the dataframe above.

# deleting specific column from dataframe
print(my_dataframe.drop(["data2"], axis=1))

Output:

      data1  data3
row1      1      3
row2      4      6
row3      7      9
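The sketch below illustrates two of the arguments described above: columns as an alternative to labels with axis, and errors. It also shows that drop() leaves the original dataframe untouched unless inplace=True is passed. The label no_such_column is made up for the example.

```python
import pandas as pd

my_dataframe = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    index=["row1", "row2", "row3"],
    columns=["data1", "data2", "data3"],
)

# columns= is an alternative to passing labels together with axis=1
without_data2 = my_dataframe.drop(columns=["data2"])
print(list(without_data2.columns))

# the original dataframe still has all three columns
print(list(my_dataframe.columns))

# errors="ignore" skips labels that do not exist instead of raising an error
unchanged = my_dataframe.drop(columns=["no_such_column"], errors="ignore")
print(list(unchanged.columns))
```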

Now let us see how we can add a new column to pandas dataframe. We can create a new list as a column and then add that list to the existing pandas dataframe.

The simple syntax of adding a new column as a list looks like this.

dataframe_name['name_of_new_column'] = list_to_be_added

Now let us add “data4” to the already existing dataframe.

# list of data
data_list = [10, 11, 12]

# adding new column
my_dataframe["data4"] = data_list

# printing
print(my_dataframe)

Output:

      data1  data2  data3  data4
row1      1      2      3     10
row2      4      5      6     11
row3      7      8      9     12
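Assigning a column this way always appends it at the end, and the list must have one value per row. As a sketch, the insert() method places a column at a chosen position, and a list of the wrong length raises a ValueError. The column names data0 and too_short are made up for the example.

```python
import pandas as pd

my_dataframe = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    index=["row1", "row2", "row3"],
    columns=["data1", "data2", "data3"],
)

# insert(position, name, values) adds the column at the given position in place
my_dataframe.insert(0, "data0", [0, 0, 0])
print(list(my_dataframe.columns))

# a list with fewer values than rows cannot be assigned as a column
try:
    my_dataframe["too_short"] = [1, 2]
except ValueError as error:
    print("could not add column:", error)
```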

 

Delete and add a new row

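Adding a new row to a pandas dataframe is a little bit tricky. We can concatenate the existing dataframe with a new one-row dataframe. Here the new row is listed first, so it ends up at the top, and reset_index(drop=True) replaces the row labels with a fresh integer index. See the simple syntax of adding a new row to the dataframe.

dataframe_name = pd.concat([new_row, dataframe_name]).reset_index(drop=True)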

Now let us take the same example of my_dataframe and add one more row to the dataframe.

# creating new row as dataframe
new_row = pd.DataFrame({'data1': 10, 'data2': 20, 'data3': 20}, index=[0])

# concatenating new dataframe with old one, new row first
my_dataframe = pd.concat([new_row, my_dataframe]).reset_index(drop=True)

print(my_dataframe)

Output:

   data1  data2  data3
0     10     20     20
1      1      2      3
2      4      5      6
3      7      8      9
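If we want the new row at the bottom and want to keep the existing row labels, we can list the old dataframe first and skip reset_index(). This is a minimal sketch; the label row4 and the values in the new row are made up for the example.

```python
import pandas as pd

my_dataframe = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    index=["row1", "row2", "row3"],
    columns=["data1", "data2", "data3"],
)

# the new row gets its own label so it fits the existing index
new_row = pd.DataFrame({"data1": [10], "data2": [20], "data3": [30]}, index=["row4"])

# listing the old dataframe first puts the new row at the bottom
extended = pd.concat([my_dataframe, new_row])

print(extended)
```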

We use the same drop() method to remove a row from the dataframe. The following example starts again from the original my_dataframe with the row labels row1, row2, and row3, and removes the last row.

# deleting last row
my_dataframe.drop('row3', inplace=True)

print(my_dataframe)

Output:

      data1  data2  data3
row1      1      2      3
row2      4      5      6
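When we do not know the label of the last row, we can look it up through the index. The sketch below also drops several rows at once with the index argument; without_last and only_row2 are names chosen here for illustration.

```python
import pandas as pd

my_dataframe = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    index=["row1", "row2", "row3"],
    columns=["data1", "data2", "data3"],
)

# my_dataframe.index[-1] is the label of the last row
without_last = my_dataframe.drop(my_dataframe.index[-1])
print(list(without_last.index))

# index= takes a list of row labels to drop
only_row2 = my_dataframe.drop(index=["row1", "row3"])
print(list(only_row2.index))
```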

 

Access and modify data in pandas dataframe

So far we have learned how to access a specific column and row. Pandas also provides several powerful accessors that help us retrieve data from a dataframe, among them .loc[], .iloc[], and .at[]. In this section, we will cover these accessors and see how we can use them to get different rows and columns.

 

Getting data with accessor from pandas dataframe

Let us use .loc[ ] and .iloc[ ] to get data from pandas dataframe. See the example below:

# printing data using .loc[]
print(my_dataframe.loc[["row2"]])

Output:

      data1  data2  data3
row2      4      5      6

Now let us use loc[ ] to get data from multiple rows. We just need to provide the list containing names of rows.

# printing data using .loc[]
print(my_dataframe.loc[["row2", "row1"]])

Output:

      data1  data2  data3
row2      4      5      6
row1      1      2      3

The powerful feature of .loc is that we can get specific data by specifying columns and rows at the same time. Let us now specify column and row and get specific data.

# printing data using .loc[]
print(my_dataframe.loc[["row1"], ["data1"]])

Output:

      data1
row1      1

To get access to the specific data, all we need to do is to provide two lists, one containing labels of rows and other containing labels of columns as shown in the above example.

Unlike .loc[ ] which takes labels, the .iloc[ ] takes the index number and returns data accordingly. See the example below.

# printing data using .iloc[]
print(my_dataframe.iloc[[1]])

Output:

      data1  data2  data3
row2      4      5      6

In a similar way, we can get data from multiple rows at a time by providing a list of indices. See the example below:

# printing data using .iloc[]
print(my_dataframe.iloc[[0, 2]])

Output:

      data1  data2  data3
row1      1      2      3
row3      7      8      9

We can also get specific data by specifying column index and row index. See the example below.

# printing data using .iloc[]
print(my_dataframe.iloc[[0], [1]])

Output:

      data2
row1      2

There is another very simple way to get a single value from a pandas dataframe without using .loc[] or .iloc[]. The .at[] accessor returns the value at a given row label and column label. See the example below:

# printing data using .at[]
print(my_dataframe.at["row1", "data1"])

Output:

1

Here we get the value stored at row1 and data1, which is 1, by simply specifying the row label and the column label inside .at[].
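Two related forms are worth a quick sketch: .iat[] is the position-based counterpart of .at[], and .loc[] also accepts label slices, which include both endpoints. The variable name block is chosen here only for illustration.

```python
import pandas as pd

my_dataframe = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    index=["row1", "row2", "row3"],
    columns=["data1", "data2", "data3"],
)

# .iat[] takes integer positions: first row, first column
print(my_dataframe.iat[0, 0])

# label slices with .loc[] include the start and the end label
block = my_dataframe.loc["row1":"row2", "data1":"data2"]
print(block)
```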

 

Modify data with accessors in pandas dataframe

Accessors not only let us read data, they also let us modify the data in a pandas dataframe, which makes them one of the key tools for working with dataframes. See the following example, which modifies data using .loc[].

# printing data using .loc
print("before modifying:\n {} ".format(my_dataframe.loc[["row1"]]))

# modify the data
my_dataframe.loc[["row1"]] = 100

# printing modified data
print("After modified: \n{}".format(my_dataframe.loc[["row1"]]))

Output:

before modifying:
       data1  data2  data3
row1      1      2      3 
After modified: 
      data1  data2  data3
row1    100    100    100

Notice that every value in row1 has been updated to 100 because we did not specify any column names. We can update each element separately by specifying the row and column labels at the same time. See the example below:

# printing data using .loc
print("before modifying:\n {} ".format(my_dataframe.loc[["row1"]]))

# modify the data
my_dataframe.loc[["row1"], ["data1", "data2", "data3"]] = [100, 200, 300]

# printing modified data
print("After modified: \n{}".format(my_dataframe.loc[["row1"]]))

Output:

before modifying:
       data1  data2  data3
row1      1      2      3 
After modified: 
      data1  data2  data3
row1    100    200    300

In a similar way, we can use .iloc[] to update data in a pandas dataframe. The only difference is that we provide integer positions instead of labels. See the example below:

# printing data using .iloc
print("before modifying:\n {} ".format(my_dataframe.iloc[[1]]))

# modify the data
my_dataframe.iloc[[1]] = 100

# printing modified data
print("After modified: \n{}".format(my_dataframe.iloc[[1]]))

Output:

before modifying:
       data1  data2  data3
row2      4      5      6 
After modified: 
      data1  data2  data3
row2    100    100    100

All data in row2 is updated to 100 because we did not specify the column positions. Let us now give each column in that row its own value.

# printing data using .iloc
print("before modifying:\n {} ".format(my_dataframe.iloc[[1]]))

# modify the data
my_dataframe.iloc[[1], [0, 1, 2]] = [100, 200, 300]

# printing modified data
print("After modified: \n{}".format(my_dataframe.iloc[[1]]))

Output:

before modifying:
       data1  data2  data3
row2      4      5      6 
After modified: 
      data1  data2  data3
row2    100    200    300
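The same accessors can also change a single cell or a whole column. The following is a minimal sketch that starts from the original values; the new values 100 and 0 are arbitrary.

```python
import pandas as pd

my_dataframe = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    index=["row1", "row2", "row3"],
    columns=["data1", "data2", "data3"],
)

# .at[] changes exactly one cell
my_dataframe.at["row1", "data1"] = 100

# a colon in the row position of .loc[] selects every row of the column
my_dataframe.loc[:, "data3"] = 0

print(my_dataframe)
```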

 

More operations with pandas dataframe

So far we have covered all the basic and necessary information and operations that are important to start working with pandas dataframe. In this section, we will cover some more operations that we can perform on pandas dataframe. We will cover arithmetic operations and filtering of data in pandas dataframe.

 

Arithmetic operations on pandas dataframe

Applying arithmetic operations to a pandas dataframe is very similar to applying them to any other data. The important difference is that we can apply an operation to a whole row or column at once without handling each value individually. For example, if we want to add two columns, we do not need to add the values one by one; pandas does it element by element for us.

See the example below, which adds two columns.

# printing
print(my_dataframe["data1"])
print(my_dataframe["data2"])

# addition
print("after addition\n{}".format(my_dataframe["data1"] + my_dataframe["data2"]))

Output:

row1    1
row2    4
row3    7
Name: data1, dtype: int64
row1    2
row2    5
row3    8
Name: data2, dtype: int64
after addition
row1     3
row2     9
row3    15
dtype: int64

In a similar way we can apply other arithmetic operations as well.
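As a sketch of a few other operations on the original values: subtracting two columns, multiplying the whole dataframe by a number, and summing each row with sum(axis=1). The variable names are chosen here only for illustration.

```python
import pandas as pd

my_dataframe = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    index=["row1", "row2", "row3"],
    columns=["data1", "data2", "data3"],
)

# subtraction works element by element, just like addition
difference = my_dataframe["data2"] - my_dataframe["data1"]
print(difference)

# multiplying by a number applies to every value in the dataframe
doubled = my_dataframe * 2
print(doubled)

# axis=1 sums across the columns, giving one total per row
row_totals = my_dataframe.sum(axis=1)
print(row_totals)
```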

 

Filtering data from dataframe

Another powerful feature of pandas is that it allows us to filter data and get only the required result. Now let us take an example and see how data filtering works in pandas.

# printing
print(my_dataframe[["data1", "data2"]])

# after filtering
print(my_dataframe[["data1", "data2"]] < 5)

Output:

      data1  data2
row1      1      2
row2      4      5
row3      7      8
      data1  data2
row1   True   True
row2   True  False
row3  False  False

A comparison on a dataframe returns True for every value that meets the condition and False for every value that does not. Pandas also allows us to combine conditions with logical operators such as & and |, and to use the resulting boolean values to select rows. See the example below:

# printing
print(my_dataframe[["data1", "data2"]])

# after filtering
print(my_dataframe[(my_dataframe["data1"] < 5) & (my_dataframe["data2"] > 1)])

Output:

      data1  data2
row1      1      2
row2      4      5
row3      7      8
      data1  data2  data3
row1      1      2      3
row2      4      5      6

The above example prints out the rows where value in data1 is less than five and value in data2 is greater than 1. In a similar way we can use other logical operators and arithmetic operations to solve complex problems and filter required data.
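The sketch below shows the other common logical operators on the original values: | for "or" and ~ for "not". Each condition needs its own parentheses; the variable names are chosen here only for illustration.

```python
import pandas as pd

my_dataframe = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    index=["row1", "row2", "row3"],
    columns=["data1", "data2", "data3"],
)

# a comparison on one column gives a boolean Series, one value per row
mask = my_dataframe["data1"] < 5

# | keeps the rows where at least one condition is True
either = my_dataframe[(my_dataframe["data1"] < 2) | (my_dataframe["data3"] > 8)]
print(either)

# ~ inverts the mask, keeping the rows where the condition is False
negated = my_dataframe[~mask]
print(negated)
```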

Summary

Pandas dataframes are powerful data structures that allow us to perform many different operations, such as selecting, inserting, deleting, and filtering data. In this tutorial, we learned about the pandas dataframe and the difference between a dataframe and a series. We also came across different ways to create a pandas dataframe from scratch, for example from dictionaries, lists, and NumPy arrays. Finally, we covered some important operations, such as adding, selecting, and deleting columns and rows, accessing and modifying data with .loc[], .iloc[], and .at[], and applying arithmetic and filtering.

 


 
