Handling Row Duplication
Most of the time, the datasets you receive or have access to will not have been fully cleaned. They usually have issues that need to be fixed, and one of these could be duplicated rows. Row duplication means that several observations contain exactly the same information in the dataset. With the pandas package, it is extremely easy to find these cases.
Let's use the example that we saw in Chapter 10, Analyzing a Dataset.
Start by importing the dataset into a DataFrame:
import pandas as pd
file_url = 'https://github.com/PacktWorkshops/The-Data-Science-Workshop/blob/master/Chapter10/dataset/Online%20Retail.xlsx?raw=true'
df = pd.read_excel(file_url)
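The duplicated() method from pandas checks whether any of the rows are duplicates and returns a boolean value for each row: True if the row is a duplicate of an earlier row and False if not (by default, the first occurrence of a repeated row is not flagged). On our DataFrame, the call is df.duplicated().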
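Before looking at the full dataset, it can help to see the behavior on a few rows. The following is a minimal sketch that assumes a small, made-up DataFrame; toy_df, dup_mask, and the values in it are hypothetical and are not taken from the Online Retail file:

```python
import pandas as pd

# Hypothetical toy data (not from the Online Retail dataset):
# the rows at index 0 and index 2 are identical.
toy_df = pd.DataFrame({
    'InvoiceNo': ['536365', '536366', '536365', '536367'],
    'Quantity': [6, 2, 6, 1],
})

# duplicated() returns a boolean Series aligned with the DataFrame's index.
# With the default keep='first', only the later copy of a repeated row is flagged.
dup_mask = toy_df.duplicated()
print(dup_mask.tolist())  # [False, False, True, False]

# Summing the mask counts the flagged rows, since True counts as 1.
print(int(dup_mask.sum()))  # 1
```

The same call on the retail DataFrame, df.duplicated(), returns one boolean per row of df.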
You should get the following output: