
Dropping Rows and Columns with Missing Values: Order Matters

I personally like working with datasets from the healthcare and finance domains. I have not run into the missing data problem in finance yet. In healthcare, however, it is a real pain in the neck.


For various reasons, many real-world healthcare datasets contain missing values, often encoded as blanks, NaNs, or other placeholders (e.g., "Not checked"). Honestly, this is understandable: such data is collected in a clinical environment, which is busy and stressful. Scikit-learn estimators, on the other hand, assume that all values in a dataset (that is, a numpy array) are numerical and that each one holds meaning.


There are several strategies for handling missing data. I am going to cover ways to fill in missing values, i.e., imputation, in other posts.


For the moment, let us talk about getting rid of entire rows/columns with missing values, which is the simplest strategy for handling missing data. Although I am a minimalist, this is not a good strategy: it will very likely result in losing precious data. If you study your data carefully, though, you might be able to minimize that loss.


Sorry for the spoiler. Let me show you with an example. The source code is linked at the very bottom of this post; you can skip to it if you prefer reading code.


Assume you have a Pandas dataframe df whose shape is (100, 10), and that it is filled with integers from 0 to 9. That is, there are no missing values yet.
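
The exact setup lives in the linked script; the following is just a rough sketch of how such a dataframe could be built (the column names and the random seed are my own choices, not from the original code):

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)                       # seed chosen only for reproducibility
df = pd.DataFrame(rng.integers(0, 10, size=(100, 10)),
                  columns=list('abcdefghij'))         # columns a through j
print(df.shape)                                       # (100, 10)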


You can easily check the count of missing values in each column with the following code.

df.isnull().sum()

We are OK. All columns (a-j) have zero missing values.

a    0
b    0
c    0
d    0
e    0
f    0
g    0
h    0
i    0
j    0
dtype: int64

Let us introduce some missing values by simply replacing randomly picked cells with np.nan. The frac parameter controls the fraction of rows and columns that get affected. Details are in the code.


import numpy as np

# numPass, missingColumns, and missingRows are set in the full script
for i in range(numPass):
    for col in df.sample(frac=missingColumns, axis='columns').columns:
        df.loc[df.sample(frac=missingRows).index, col] = np.nan

Let us check if it worked.


df.isnull().sum()

Yes, it did. There are 27 missing values in column g. Columns a, c, and e have 10 missing values each.


a    10
b     0
c    10
d     0
e    10
f     0
g    27
h     0
i     0
j     0
dtype: int64

I would like to implement two basic strategies for removing rows or columns.


As the first strategy, I will remove all rows or columns with a missing value.


To remove all rows that contain a missing value, you can use the following.

df = df.dropna(axis=0)
Removing all rows containing a missing value
Your dataframe has 53 rows and 10 columns now
You lost 47.0% of your samples.

Now, let us remove columns with a missing value instead (starting again from the original dataframe).

df.dropna(axis=1)
Removing all columns containing a missing value
Your dataframe has 100 rows and 6 columns now
You lost 40.0% of your columns.
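
As a side note, the loss percentages printed above simply compare the dataframe shapes before and after dropping. A rough sketch of such a report (the helper below is mine, not from the linked script):

def report_loss(before, after):
    # before and after are (rows, columns) shapes of the dataframe
    lost_rows = 100 * (1 - after[0] / before[0])
    lost_cols = 100 * (1 - after[1] / before[1])
    print(f"You lost {lost_rows:.1f}% of your rows and {lost_cols:.1f}% of your columns.")

report_loss((100, 10), (53, 10))   # strategy 1, dropping rows with missing values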

So, removing every row or column with a missing value does not seem like the best strategy. It resulted in a huge data loss: 47% of the rows or 40% of the columns, depending on the axis. Maybe we should try another strategy.


My second strategy is to remove the column with the highest missing value rate first, then remove rows with a missing value.
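
If you do not want to eyeball the counts, the column with the most missing values can also be picked programmatically; a minimal sketch:

worst_col = df.isnull().sum().idxmax()   # 'g' in our example
print(worst_col)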


Let us apply this strategy to column g. Remember, it has a 27% missing rate. Strictly speaking, 27 is the count of missing values, but I can safely call it 27% since the number of rows equals 100 :-)


df = df.drop(['g'], axis=1).dropna(axis=0)
Removing the columns with the most missing values and then, all rows containing a missing value
Your dataframe has 72 rows and 9 columns now
You lost 28.0% of your rows and 10.0% of your columns.

Good! We have already reduced the loss from 47% of rows and 40% of columns to 28% and 10%, respectively.


Let us try removing one more column. This time I will drop the two columns with the highest missing value rates. Column g is the obvious first pick. There are three candidates for the second (a, c, and e, with 10 missing values each); I randomly pick column c.


df = df.drop(['g', 'c'], axis=1).dropna(axis=0)
Removing two columns with high missing values and then, all rows containing a missing value
Your dataframe has 82 rows and 8 columns now
You lost 18.0% of your rows and 20.0% of your columns.

Better, or not? We obviously reduced the loss on rows from 28% to 18%. However, we lost twice as many columns, i.e., 20% vs. 10%.


Gaining rows vs. columns? Tough choice.


Long story short, the second strategy seems to work better. However, it is still a trade-off: your choice depends on how many rows/columns you can afford to lose. The missing value counts on each column and each row must be inspected carefully; a quick way to do that is shown below.
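
A minimal sketch of that kind of inspection (nothing here is from the linked script, just plain pandas):

df.isnull().sum()                        # missing values per column
df.isnull().sum(axis=1).value_counts()   # how many rows have 0, 1, 2, ... missing cells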



 

Thank you for reading this post. If you have anything to add, object to, or correct, please drop a comment down below.


The code used in this post is available at GitHub. Feel free to use, distribute, or contribute.











