Indian Journal of Medical Informatics, Vol 4, No 1 (2009)

A Semi-Supervised Clustering by λ–Cut for Imputation of Missing Data in Type II Diabetes Databases

Ilango Paramasivam, Hemalatha Thiagarajan, Nickolas Savarimuthu

Abstract


Data mining is used extensively in healthcare to build predictive models from patient data that are sound, make reliable predictions, and help physicians improve their prognosis, diagnosis, or treatment planning procedures. Pathological data in medicine often contain missing values for various reasons. Accurate and robust methods for estimating missing data are needed because the performance of data mining algorithms depends heavily on the quality of the dataset. In this paper, an imputation method, Semi-Supervised Clustering by λ-Cut (λ-CUT_CLUST), is proposed. In the proposed method, similar records are clustered using a weight as the similarity measure. A high degree of intra-cluster similarity is achieved by selecting only those records whose weights fall in the threshold range 0.6 ≤ λ ≤ 1. Missing data are imputed with the mean value of the respective attribute within the cluster. The method is evaluated on the Pima Indian Type II Diabetes dataset, and its performance is compared with that of other imputation methods. The comparative analysis demonstrates that the method imputes missing data with lower imputation error and produces stable results across different percentages of missing data.
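The idea described in the abstract (cluster the records whose weight passes the λ threshold, then impute with the cluster mean) can be illustrated with a minimal Python sketch. The abstract does not define how the weight is computed, so the `similarity` function below is an assumption: one minus the mean range-normalised absolute difference over attributes observed in both records. The function names, the fallback to the column mean when no record passes the cut, and the toy data are likewise hypothetical and are not taken from the paper.

```python
from statistics import mean


def similarity(a, b, ranges):
    """Assumed weight in [0, 1]: 1 minus the mean range-normalised absolute
    difference over attributes observed in both records (not the paper's
    definition, which the abstract does not give)."""
    diffs = [abs(x - y) / r if r else 0.0
             for x, y, r in zip(a, b, ranges)
             if x is not None and y is not None]
    return 1.0 - mean(diffs) if diffs else 0.0


def lambda_cut_impute(records, lam=0.6):
    """Fill None entries with the mean of the same attribute over the
    complete records whose similarity to the incomplete record is >= lam.
    Assumes every attribute has at least one observed value."""
    n_attrs = len(records[0])
    # Observed values per attribute, used for ranges and as a fallback.
    cols = [[r[j] for r in records if r[j] is not None] for j in range(n_attrs)]
    ranges = [max(c) - min(c) for c in cols]
    complete = [r for r in records if None not in r]

    imputed = []
    for rec in records:
        if None not in rec:
            imputed.append(list(rec))
            continue
        # The lambda-cut: keep only sufficiently similar complete records.
        cluster = [c for c in complete if similarity(rec, c, ranges) >= lam]
        filled = list(rec)
        for j, value in enumerate(rec):
            if value is None:
                # Assumed fallback: column mean if no record passes the cut.
                donors = [c[j] for c in cluster] or cols[j]
                filled[j] = mean(donors)
        imputed.append(filled)
    return imputed


# Toy data (made up): the last record is missing its third attribute.
data = [
    [1.0, 10.0, 100.0],
    [1.2, 11.0, 110.0],
    [5.0, 50.0, 500.0],
    [1.1, 10.5, None],
]
result = lambda_cut_impute(data, lam=0.6)
print(result[3])  # [1.1, 10.5, 105.0]
```

In this toy example only the first two records pass the 0.6 cut for the incomplete record, so the missing value is the mean of 100.0 and 110.0; the dissimilar third record does not influence the estimate.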
