
Never Trust a Dataset!


If you follow my posts, you might remember that I have been writing about imputation recently.


To prepare a series of blog posts on imputation, I searched for a convenient dataset on the Internet and found the Heart Failure Prediction Dataset on Kaggle.


It was convenient for me for a few reasons. First, it was from the healthcare domain, which I am interested in. Second, it was complete. Finally, it had both numeric and categorical variables that I could use in my demonstrations.


It seemed perfect. Except that I had to create random missing values in the dataset to demonstrate how imputation works :-)


Anyhow. To create missing values in the dataset, I randomly picked a column: the Cholesterol column.


I must say it was not totally random. Honestly, it was the variable that seemed most familiar to me, since my in-laws have cholesterol problems. So, I picked it and created missing values in the Cholesterol column at a missing rate of 5%.
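By the way, blanking out those values takes only a few lines of pandas. The sketch below is roughly how it can be done; the file name heart.csv and the random seed are just assumptions for illustration.


import numpy as np
import pandas as pd

# Load the Kaggle dataset (file name assumed to be heart.csv).
df = pd.read_csv("heart.csv")

# Keep the original values aside so they can be compared with the imputed ones later.
original = df["Cholesterol"].copy()

# Randomly blank out 5% of the Cholesterol values.
rng = np.random.default_rng(42)
missing_idx = rng.choice(df.index, size=int(len(df) * 0.05), replace=False)
df.loc[missing_idx, "Cholesterol"] = np.nan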


Then, I imputed the Cholesterol column using different strategies and tried to evaluate how effective each one was.
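In case you are curious, the univariate part can be done with scikit-learn's SimpleImputer. Here is a minimal sketch, continuing from the snippet above and using mean absolute error against the saved originals as one possible way to evaluate the strategies:


from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error

# Impute the Cholesterol column with each univariate strategy and
# score it against the values removed earlier.
for strategy in ["most_frequent", "median", "mean"]:
    imputer = SimpleImputer(strategy=strategy)
    imputed = pd.Series(
        imputer.fit_transform(df[["Cholesterol"]]).ravel(), index=df.index
    )
    mae = mean_absolute_error(original.loc[missing_idx], imputed.loc[missing_idx])
    print(f"{strategy:>13}: MAE = {mae:.2f}")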


I spent the evening writing it down. I submitted my post and went to bed happy :-)


The next day, I started to prepare the second post in the series. It was about multivariate imputation. I eagerly wrote some code and got some results.
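If you want to try it yourself, scikit-learn's IterativeImputer handles the multivariate part. A minimal sketch, continuing from the snippets above:


# IterativeImputer is still experimental, so it must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Multivariate imputation models Cholesterol from the other columns,
# so it needs the numeric part of the dataset, not just one column.
numeric_df = df.select_dtypes(include="number")

iterative = IterativeImputer(random_state=42)
imputed_df = pd.DataFrame(
    iterative.fit_transform(numeric_df),
    columns=numeric_df.columns,
    index=df.index,
)

# The multivariate estimates for the rows that were blanked out.
print(imputed_df.loc[missing_idx, "Cholesterol"].head())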


My main objective was to compare the results of the univariate (most_frequent, median, and mean strategies) and multivariate (iterative) imputation approaches. The table below shows the imputed values, plus the original value. Remember, at the beginning I had removed some values to create missing data and saved the original values somewhere else.


Let us read the first line of the table: the original value was 339. It was converted to np.nan so that it could be imputed. Univariate imputation produced 0.0, 221.5, and 198.25 using the most_frequent, median, and mean strategies, respectively. Iterative imputation, on the other hand, produced 203.86.


iterative    most_frequent    median      mean    original
   203.86              0.0     221.5    198.25         339
   265.22              0.0     221.5    198.25         237
   221.55              0.0     221.5    198.25         211
   273.41              0.0     221.5    198.25         273
   233.47              0.0     221.5    198.25         260

In this post, I am not interested in which imputation choice works better. I am trying to explain what I did wrong when choosing the column to impute.


Remember, I told you why I picked the Cholesterol column for imputation. I made a major mistake, though, and I only noticed it when I printed the table above.


The most frequent value in the Cholesterol column is zero, which I think is practically impossible. Obviously, the column already had missing data, encoded as zeros. Lots of them.


df['Cholesterol'].value_counts().head()
0.0      165
254.0     11
220.0     10
204.0      9
211.0      9
Name: Cholesterol, dtype: int64

There are 165 zeros in the column; divided by the total number of rows, i.e., 918, that is roughly 18%. This is already higher than the 5% missing rate I had assumed. In other words, in the worst case, I was imputing a column with a missing value rate of about 23%. Besides, since I was only imputing np.nan values, not zeros, my iterative imputer did not work to the best of its abilities.
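A quick way to quantify the damage, and to recode those zeros as proper missing values, is something like this (again using the df from the snippets above):


# How much of the column is effectively missing once the zeros are counted in?
n_zero = (df["Cholesterol"] == 0).sum()   # zeros already present in the data
n_nan = df["Cholesterol"].isna().sum()    # the values I blanked out myself
print(f"Effective missing rate: {(n_zero + n_nan) / len(df):.1%}")

# Treat the zeros as what they really are: missing values.
df["Cholesterol"] = df["Cholesterol"].replace(0, np.nan)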


I could have avoided this nasty situation by simply looking at the number of distinct values in each column or at the description of the dataset, as below. The minimum of zero and the high standard deviation for the Cholesterol column could have raised a red flag.


df['Cholesterol'].describe()
count    872.000000
mean     198.231651
std      109.694656
min        0.000000
25%      171.750000
50%      222.500000
75%      267.000000
max      603.000000
Name: Cholesterol, dtype: float64
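In hindsight, a two-line sanity check on the freshly loaded dataset would have exposed the problem right away; for example:


# Count suspicious zeros and distinct values in every numeric column.
numeric_cols = df.select_dtypes(include="number")
print((numeric_cols == 0).sum())   # zeros per column
print(numeric_cols.nunique())      # distinct values per column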

Long story short, I simply trusted the dataset and overlooked details that I could have spotted easily. Now, I have to go back to my previous post and correct the related code.




 

Thank you for reading this post. If you have anything to add, object to, or correct, please drop a comment below.








