The in-class exercise is based on textbook chapter 7, Data Cleaning and Preparation. Read the chapter carefully before working on the exercise. To use pandas functions, first import the package with import pandas as pd.
⦁ Filter Out Missing Data. The most straightforward way to deal with missing observations in a data frame is to remove the corresponding rows entirely. This can be done with the pandas dropna() function. First, create a 10 x 4 data frame of random numbers and replace some of the values with NA using the following code:
import pandas as pd
import numpy as np
from numpy import nan as NA  # nan is a floating-point constant in numpy, not a module
df = pd.DataFrame(np.random.randn(10, 4))
df.iloc[:3, 0] = NA  # Replace the first three values (rows 0-2) in column 0 with NA.
df.iloc[2:5, 1] = NA  # Replace the third to fifth values (rows 2-4) in column 1 with NA.
df.iloc[:4, 2] = NA  # Replace the first four values (rows 0-3) in column 2 with NA.
df.iloc[2:6, 3] = NA  # Replace the third to sixth values (rows 2-5) in column 3 with NA.
print(df) #Print out the results.
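Before dropping anything, it can help to check where the NAs actually are. The following is a minimal sketch, not part of the original exercise: it rebuilds df as above (the fixed seed is an added assumption so the output is repeatable) and counts the missing values per column and the actual values per row.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # assumption: fixed seed, only to make the random values repeatable
df = pd.DataFrame(np.random.randn(10, 4))
df.iloc[:3, 0] = np.nan
df.iloc[2:5, 1] = np.nan
df.iloc[:4, 2] = np.nan
df.iloc[2:6, 3] = np.nan

# Number of NAs in each column.
missing_per_column = df.isna().sum()
# Number of actual (non-NA) values in each row.
valid_per_row = df.notna().sum(axis=1)

print(missing_per_column.tolist())  # [3, 3, 4, 4]
print(valid_per_row.tolist())       # [2, 2, 0, 1, 2, 3, 4, 4, 4, 4]
```

The per-row counts are what the notes in questions 1.1 to 1.3 rely on: rows 6 to 9 are complete, row 2 has no actual values, and row 3 has only one.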
⦁ Remove the rows from df that contain NA and store the new data frame as df2. Print out df2. Note: this question asks you to remove every row that has at least one NA. Thus, your df2 should only have the last four rows of the original data frame df.
Hint: by default, the dropna() function removes every row that contains at least one NA. To define a new data frame, use an assignment: the new name goes on the left of the equal sign and the value on the right. For example, df2 = df.dropna().
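As a quick check of the hint, here is a small sketch that rebuilds df with the same NA pattern and applies dropna(). The seed is an added assumption for repeatability; only the index labels matter for the check.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # assumption: fixed seed for repeatable values
df = pd.DataFrame(np.random.randn(10, 4))
df.iloc[:3, 0] = np.nan
df.iloc[2:5, 1] = np.nan
df.iloc[:4, 2] = np.nan
df.iloc[2:6, 3] = np.nan

# Default behavior: drop every row with at least one NA.
df2 = df.dropna()
print(df2)
print(list(df2.index))  # [6, 7, 8, 9]
```

Note that dropna() returns a new data frame and leaves df unchanged, which is why the result has to be assigned to df2.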
⦁ Only keep the rows from df that have at least 2 actual values and store the new data frame as df3. Print out df3. Note: if you check the original data, only two rows have fewer than 2 actual values: rows 2 and 3. Thus, you should only identify and remove these two rows from df.
Hint: To set the minimum number of actual (non-NA) values a row must have in order to be kept, pass the thresh argument to the dropna() function. For example, df3 = df.dropna(thresh=2).
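The thresh argument can be tried out the same way. This sketch again rebuilds df (seed added as an assumption) and keeps only the rows with at least two actual values.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # assumption: fixed seed for repeatable values
df = pd.DataFrame(np.random.randn(10, 4))
df.iloc[:3, 0] = np.nan
df.iloc[2:5, 1] = np.nan
df.iloc[:4, 2] = np.nan
df.iloc[2:6, 3] = np.nan

# Keep a row only if it has at least 2 non-NA values.
df3 = df.dropna(thresh=2)
print(df3)
print(list(df3.index))  # [0, 1, 4, 5, 6, 7, 8, 9]
```

Rows 2 and 3 are gone because they hold zero and one actual value respectively; rows 0, 1, and 4 stay because each still has exactly two.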
⦁ Drop rows that are all NA in df and store the new data frame as df4. Note: if you check the original data, only one row is entirely NA: row 2. Thus, you should only identify and remove that row from df.
Hint: To drop only the rows that are all NA, pass the how='all' argument to the dropna() function. For example, df4 = df.dropna(how='all').
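A short sketch of the how='all' case, under the same assumptions as before (df rebuilt with a fixed seed that is not in the original exercise):

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # assumption: fixed seed for repeatable values
df = pd.DataFrame(np.random.randn(10, 4))
df.iloc[:3, 0] = np.nan
df.iloc[2:5, 1] = np.nan
df.iloc[:4, 2] = np.nan
df.iloc[2:6, 3] = np.nan

# Drop a row only when every value in it is NA.
df4 = df.dropna(how='all')
print(df4)
print(list(df4.index))  # [0, 1, 3, 4, 5, 6, 7, 8, 9]
```

The original index labels are kept, so df4 jumps from label 1 to label 3. The remaining rows can still contain NAs, which the next part fills in.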
⦁ Filling in Missing Data. Dropping missing values could reduce the data size dramatically. Thus, people often choose to fill in the missing values. The fillna() function in pandas can fill in the missing data in many ways.
⦁ Fill in the missing values in df4, created in question 1.3, and store the new data frame as df5. Specifically, fill in 0s for the missing values in column 0, fill in column 1’s missing values with its mean, and fill in column 2’s missing values with its median. Note: column 3’s missing values will be handled by the next question.
Hint: To fill different columns with different values, pass a dictionary to the fillna() function. In the dictionary, the keys are the column labels and the values are the fill values. For example: df5 = df4.fillna({0: 0, 1: df4[1].mean(), 2: df4[2].median()}).
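The dictionary form of fillna() can be sketched as follows. The data frame is rebuilt from scratch so the snippet runs on its own; the seed is an added assumption, and the printed NA counts do not depend on it.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # assumption: fixed seed for repeatable values
df = pd.DataFrame(np.random.randn(10, 4))
df.iloc[:3, 0] = np.nan
df.iloc[2:5, 1] = np.nan
df.iloc[:4, 2] = np.nan
df.iloc[2:6, 3] = np.nan
df4 = df.dropna(how='all')

# Keys are column labels; values are what to put in place of NA in that column.
# mean() and median() skip NAs by default, so they use only the actual values.
fill_values = {0: 0, 1: df4[1].mean(), 2: df4[2].median()}
df5 = df4.fillna(fill_values)

print(df5)
print(df5.isna().sum().tolist())  # [0, 0, 0, 3]
```

Column 3 is not listed in the dictionary, so its three NAs are left alone for the next question.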
⦁ Fill in the missing values in df5, created in question 2.1, and store it as df6. Specifically, fill in column 3's missing values using the forward fill method.
Hint: forward fill is a method for filling missing values by propagating the last valid observation forward. It is done by calling the fillna() function with the method='ffill' argument. For example, df6 = df5.fillna(method='ffill'). Note: recent pandas versions deprecate the method argument of fillna(); there, df6 = df5.ffill() does the same thing.
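To see forward fill end to end, the sketch below repeats the earlier steps and then fills column 3. It calls df5.ffill(), the dedicated forward-fill method, instead of fillna(method='ffill'), because newer pandas releases deprecate the method argument. The seed is again an added assumption.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # assumption: fixed seed for repeatable values
df = pd.DataFrame(np.random.randn(10, 4))
df.iloc[:3, 0] = np.nan
df.iloc[2:5, 1] = np.nan
df.iloc[:4, 2] = np.nan
df.iloc[2:6, 3] = np.nan

df4 = df.dropna(how='all')
df5 = df4.fillna({0: 0, 1: df4[1].mean(), 2: df4[2].median()})

# Forward fill: each NA takes the last valid value above it in the same column.
df6 = df5.ffill()

print(df6)
print(df6.isna().sum().sum())  # 0
```

In df5 the NAs in column 3 sit at labels 3, 4, and 5, and the last valid value above them is at label 1 (row 2 was dropped earlier), so all three take the value from row 1.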