Indian Journal of Medical Informatics. 2009; 4(1): 1
Article

A Semi-Supervised Clustering by λ_Cut for Imputation of Missing Data in Type II Diabetes Databases

Ilango Paramasivam1, Hemalatha Thiagarajan2 and Nickolas Savarimuthu3

1 PhD Research Scholar, Department of Computer Applications, National Institute of Technology, Trichirappalli, India. Assistant Professor, School of Computing Sciences, VIT University, Vellore, India. ilangosarojini@gmail.com
2 Professor, Department of Mathematics, National Institute of Technology, Trichirappalli. hema@nitt.edu
3 Assistant Professor, Department of Computer Applications, National Institute of Technology, Trichirappalli. nickolas@nitt.edu
Abstract: Data mining is used extensively in healthcare to mine patient data and construct predictive models that are sound, make reliable predictions and help physicians improve their prognosis, diagnosis or treatment planning procedures. Pathological data in medicine often contain missing values for various reasons. Accurate and robust methods for estimating missing data are needed, since the performance of data mining algorithms depends heavily on the quality of the dataset. In this paper, an imputation method, Semi-Supervised Clustering by λ-Cut (λ-CUT_CLUST), is proposed. In the proposed method, similar records are clustered using weight as the similarity measure. A high degree of intra-cluster similarity is achieved by selecting those records with weights in the threshold range 0.6 ≤ λ ≤ 1. The missing data are imputed with the mean value of the respective attributes of the cluster. The method is evaluated on the Pima Indian Type II Diabetes dataset and its performance is compared with other imputation methods. The comparative analysis demonstrates that the method imputes the missing data with low imputation error and produces stable results over different percentages of missing data.

Keywords: Missing data, λ-Cut, Cluster, Imputation Error, Type II Diabetes Dataset.
I. Introduction

Recent advancements in the healthcare industry have both necessitated and made feasible the adoption of tools for data analysis and knowledge discovery. The quality of data in a medical database can enhance the quality of medical diagnosis to a major extent. The quality factors of data in these databases include accuracy, completeness, suitability for usage and clarity. Data quality affects the stages of retrieval, analysis and usage. To attain data quality, various tools are deployed to facilitate access and analysis, so that meaningful patterns emerge and lead to knowledge discovery. Medical databases help in diagnosis, treatment and follow-up, and thereby support medical practitioners in making precise diagnoses, providing healthcare, monitoring progress and advancing clinical research [1]. Data mining in medical databases helps the physician reach a quality diagnosis, especially in the presence of missing data and numerous features and parameters [2]. Knowledge discovered through data mining aids the interpretation of patient data, so that healthcare professionals can arrive at informed decisions regarding their clients' health. In practice, however, human and systemic errors are possible, owing to the multiplicity of data sources and the enormity of data size. These errors can render the required data deficient, 'noisy' and contradictory. Errors accumulate and result in a serious distortion of facts, which adversely affects the knowledge derived through data mining. Knowledge derived from inadequately processed raw data can be not only ineffective but also counterproductive. Hence, cleaning the data to enhance its reliability is a precursor to data mining. Data defectiveness remains unavoidable to an extent, particularly in large-scale applications [3].
The bulk of the time and effort in data mining is reported to be devoted to the task of data preparation [4-6]. Similarly, over 80% of data mining initiatives tend to focus on getting the data ready for usage [7-9]. Data cleaning, which seeks to improve the quality of the data and make it more reliable for mining purposes, is therefore an important preliminary step in data mining. Data cleaning algorithms attempt to smooth noise in the data, identify and eliminate inconsistencies, and remove missing values or replace them with values imputed from the rest of the data. Equipment malfunction, failure to recognize the significance of certain data, failure to enter data and removal of vital data are some of the causes cited for the existence of missing data in medical databases. Missing values can introduce bias and consequently degrade the quality of the data mining process [10, 12]. Imputation is a popular strategy because it predicts missing values by using as much information as feasible [10, 11]. Some existing algorithms assume the absence of missing values and are therefore ill suited to certain application domains. Though there are various methods to handle data with missing values, no single one can be judged superior to the others. An open issue in this context is the estimation of missing values, whose erroneous treatment leads to distorted results for methods such as hierarchical clustering and K-means clustering.
Estimation of missing data can require voluminous, accurate data in order to make reliable imputations. Diagnostic quality depends on the worthiness of the imputed data. Hence, accurate and more consistent estimation of missing data is warranted in medical data mining. In this paper, we propose a method, λ-CUT_CLUST, to impute missing data by grouping, for every incomplete patient record, the similar complete instances into a cluster. The complete Pima Indian Type II Diabetes dataset is used for evaluating the method. The imputed values are compared with the observed values and the average imputation error is estimated. The performance of the proposed method is compared with other imputation methods, namely 10-NN, NORM, EMImpute_Columns, LSImpute_Rows and Mean Imputation. The background of the research work in imputation of missing data is given in Section II. Section III describes the proposed problem, the methodology and the dataset used for evaluating the proposed method. Section IV presents the results of the proposed method and compares it with the other imputation methods in terms of their imputation errors. Finally, Section V concludes with the research findings.

II. Background

Medical databases usually show a considerable amount of missing data, which may arise from procedural errors, refusal of response or non-applicability of responses, lack of test results and equipment malfunction, among other things. Most healthcare datasets contain many missing values, which occur mostly in conditional attributes. Many approaches to deal with missing values have been described in the literature [11], including ignoring objects that contain missing values; filling in the missing value manually; substituting the missing values with the mean or mode of the respective conditional attribute; substituting a global constant; or using the most probable value to fill in the missing values. Each of these approaches has its own shortcomings.
While the first approach usually loses a great deal of useful information, the second is too time-consuming and logically inconsistent, which makes the results irrelevant and infeasible for most applications. The third approach is a poor choice as it distorts other statistical properties of the data and does not consider the dependencies between attributes. The fourth assumes that all missing values share the same value, which may lead to considerable distortions in the data distribution. Another approach is to create a new value, such as "missing", to represent missing data in the dataset. However, this has the unfortunate effect that data mining algorithms may try to use "missing" as a legal value, which is likely to be inappropriate, and it sometimes artificially inflates the accuracy of some data mining algorithms on some datasets [13]. Conventional methods of imputing missing values are of two major types, namely parametric and non-parametric imputation. The former is considered better if a dataset can be adequately modeled parametrically, or if users can correctly specify the parametric forms for the dataset. For instance, linear regression methods usually treat well a continuous target attribute that is a linear combination of the conditional attributes. However, when the exact relation between the conditional attributes and the target attribute is unknown, the performance of linear regression for imputing missing values is very poor. Imputation of missing data by regression uses the predicted values derived from a regression equation based on variables in the dataset that contain no missing data [14]. Regression imputation may not perform well for all datasets, as it assumes the existence of a specific relationship between the attributes.
In real-time applications, if the model is mis-specified and the distribution of the real dataset is unknown, the estimates of a parametric method may be highly biased and the optimal control factor settings may be miscalculated. Imputation techniques such as K-Nearest Neighbor (KNN), LSImpute_Rows and EMImpute_Columns have been used in many applications [15, 16]. LSImpute_Rows and EMImpute_Columns are feature-based extraction methods. When imputing missing data using the KNN method, a distance is assigned to every pair of points in the dataset. The Euclidean distance is used as the distance between two points, and the distances between all pairs of points (X, Y) form the distance matrix. Every data point within the dataset holds a class from the set C = {C1, ..., Cn}. The K nearest data points are identified using the distance matrix. The most common class among the nearest data points is identified and assigned to the data point being analyzed. However, this results in a recursive process when two or more classes occur equally often for a specific data point. The LSImpute method uses the least squares principle to estimate missing data. It uses the correlations between the reference record and the other records, followed by minimizing the sum of squared errors of a regression model. Hot-deck imputation is used to fill missing data in incomplete sample surveys [16]. It uses values from other rows of the database that are similar to the row with the missing data, and depends highly on the measure of similarity between rows. Different authors have recommended different methods, including machine learning methods such as auto-associative neural networks [17], decision tree imputation [18], case-wise deletion [19], the 'lazy decision tree' [13] and 'dynamic path generation' [20]. Further, a few machine learning methods, such as C4.5, handle only discrete values.
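As a rough illustration of nearest-neighbour imputation for numeric attributes, the sketch below fills each missing attribute with the mean over the k closest complete records, using Euclidean distance on the observed attributes only. This is a common variant and differs from the class-vote description above; the function name `knn_impute` and the use of `None` to mark a missing value are assumptions made for this example, not part of the cited methods.

```python
import math


def knn_impute(incomplete, complete_rows, k=3):
    """Fill None entries of `incomplete` with the mean of its k nearest complete rows.

    Hypothetical helper: distance is Euclidean over the observed attributes only.
    """
    observed = [i for i, v in enumerate(incomplete) if v is not None]

    def distance(row):
        # Compare only the attributes that are present in the incomplete record.
        return math.sqrt(sum((incomplete[i] - row[i]) ** 2 for i in observed))

    neighbours = sorted(complete_rows, key=distance)[:k]
    return [
        v if v is not None else sum(r[i] for r in neighbours) / len(neighbours)
        for i, v in enumerate(incomplete)
    ]


rows = [[1.0, 10.0], [2.0, 20.0], [10.0, 100.0]]
print(knn_impute([1.5, None], rows, k=2))
```

With k = 2 the two closest rows are the first two, so the missing second attribute is replaced by the mean of 10.0 and 20.0.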
In these methods, continuous attributes are broken down into intervals, and the imputation result may nullify the original distribution of the dataset. Because they tend to obliterate the original distribution while imputing, they are not considered ideal methods for handling missing values. A usual approach to tackling missing data is the creation of data mining algorithms that "internally" handle missing data and still produce good results. For example, the CART decision-tree learning algorithm internally handles missing data, essentially using an implicit form of imputation based on regression [21]. Introduction of bias into the data and limited applicability are cited as the faults of some simple pre-processing methods of handling missing data [22]. Cluster analysis is used to identify homogeneous and well-separated groups of objects [23]. Clustering is described as the division of a dataset into subsets known as clusters, in such a manner that the data belonging to each subset have commonalities [24, 25]. Under this method, a dataset is subdivided so that the most similar elements are grouped into the same cluster while dissimilar elements are placed in different clusters. Similarity measures based on distance, such as the Euclidean and Manhattan distances, are used in many clustering algorithms. Generally, clustering methods fall under the categories of hierarchical, mixture-model, learning-network, objective-function-based and partition clustering. The proposed method exploits the concept of grouping similar records into clusters.

III. Methodology

The imputation method λ-CUT_CLUST measures the similarity of the complete records with the incomplete record by assigning weights. The method identifies the similar records and forms a cluster by considering the incomplete record as the centre point.
The λ-Cut indicates the optimal threshold value applied to group the similar records while forming the cluster. Finally, the missing values in the incomplete records are obtained by computing the mean value of the respective attribute(s) of the cluster. The experiments are performed on the Pima Indian Type II Diabetes dataset [26], which originally does not have any missing data. The dataset contains 768 complete cases described by 8 attributes and a predictive class. The attributes are labeled as number of times pregnant, glucose tolerance test, diastolic blood pressure, triceps skin fold thickness, 2-hour serum insulin, body mass index, diabetes pedigree function and age. The predictive class value '1' is interpreted as "tested positive for diabetes" and the class value '0' as "tested negative for diabetes". Class value '0' was found in 500 instances and '1' in 268 instances. The dataset has only numerical attributes and each transaction is viewed as a multi-dimensional vector. Different percentages of missing data (from 5% to 80%) were generated using random labeling with simulation. The method is evaluated by estimating the average imputation error, which is based on the difference between the imputed value and the original value. The performance is compared with other existing imputation methods, namely 10-NN, NORM, EMImpute_Columns, LSImpute_Rows and Mean Imputation. The λ-CUT_CLUST method comprises four phases.

1. In the first phase, the testing dataset is formulated by randomly removing the values of some of the attributes.
2. For cluster generation, a weight is assigned to every record in the training dataset. The weight specifies the number of similar attributes between a complete record in the training dataset and an incomplete record in the testing dataset. The greater the weight, the greater the similarity between those records.
3. The weights are normalized using max-difference normalization so as to generalize the method for any dataset. For more precise imputation, the records in the cluster must be highly similar. So, a threshold value λ is chosen and the records whose weights are at or above the threshold value λ are used for cluster generation. Clusters are generated on the training dataset with records of the testing dataset as centre points. Experiments are performed for different λ values, and the optimal threshold value of λ is chosen.
4. Finally, each missing attribute value is imputed with the mean value of the corresponding attribute(s) of the respective cluster. The results are validated using the average imputation error.

The pseudo code of the proposed algorithm is given below.

Generation of testing dataset

In this phase, the testing dataset is generated by grouping similar incomplete records into blocks in such a way that a block consists of records with the same set of missing attributes. A maximum of (2^n - 2) blocks (where n is the number of attributes) will be created. The proposed method performs the imputation process on these incomplete records. The records in every block are considered as centre points (reference points) to generate clusters on the training dataset.

Weight generation

Cluster generation is an important phase. The performance of the proposed method depends solely on the intra-cluster similarity among the records within the cluster. The similarity between a complete record of the training dataset and an incomplete record of the testing dataset is measured in terms of weight, the number of similar attributes. An attribute in the complete record of the training dataset is considered to be similar to the corresponding attribute in the incomplete record of the testing dataset if and only if the attribute value lies in the range (attribute value ± 0.5σ), where σ is the standard deviation of the respective attribute. The statistical reason for this consideration is given below.
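The weight described under "Weight generation" can be sketched as a simple count. The code below is a minimal illustration, not the authors' implementation: the names `attribute_stdevs` and `weight`, the use of `None` for a missing value, and the choice of the population standard deviation are all assumptions made for this example.

```python
import statistics


def attribute_stdevs(train_rows):
    """Per-attribute standard deviation of the complete (training) records.

    Assumption: population standard deviation; the paper does not say which form is used.
    """
    return [statistics.pstdev(column) for column in zip(*train_rows)]


def weight(incomplete, complete, sigmas, tol=0.5):
    """Count the observed attributes of `incomplete` that lie within tol * sigma of `complete`."""
    return sum(
        1
        for a, b, s in zip(incomplete, complete, sigmas)
        if a is not None and abs(a - b) <= tol * s
    )


sigmas = [2.0, 4.0, 10.0]
record = [5.0, None, 30.0]           # second attribute is missing
print(weight(record, [5.9, 100.0, 36.0], sigmas))
print(weight(record, [4.0, 0.0, 34.0], sigmas))
```

Missing attributes are skipped, so the weight can be at most the number of observed attributes of the incomplete record.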
The standard deviation (σ) is the statistical parameter that describes the spread of the values in a dataset about the mean value [27]. If a data distribution is normal, then about 68.2% of the values lie within one standard deviation (1σ) of the mean, about 95% lie within two standard deviations (2σ), and about 99.7% lie within three standard deviations (3σ). To group similar records more precisely, the attribute values that lie in the range (attribute value ± 0.5σ) are considered to be similar. The weight assigned to every record in the training dataset is given by

$$W = \sum_{i=1}^{n} s_i, \qquad s_i = \begin{cases} 1, & |\alpha_i - \beta_i| \le 0.5\,\sigma_i \\ 0, & \text{otherwise} \end{cases}$$

where n is the number of non-missing attributes of the record in the testing dataset, α_i is the attribute value of the incomplete record, β_i is the attribute value of the complete record in the training dataset, and σ_i is the standard deviation of the respective attribute.

λ-Cut and Cluster generation

The generated weights are normalized using max-difference normalization. The advantage of normalization is that the method can be applied to any dataset with any number of attributes. The assigned weights are normalized by dividing them by (n - |missing(a)|):

$$\text{Normalized weight} = \frac{\text{Weight of Train\_Dataset}}{n - |\mathrm{missing}(a)|}$$

where 'Weight of Train_Dataset' is the weight generated on the complete records of the training dataset, 'n' is the number of attributes in the dataset and 'missing(a)' refers to the set of missing attributes. A weight of 0.5 represents uncertainty, and records with a weight < 0.5 are less similar to the incomplete record. When the similarity among the data points is not crisp, it can be handled well by fuzzy clustering [28]. Fuzzy C-Means clustering was proposed by Dunn (1973) and modified by Bezdek (1981). Its flexibility and robustness [29] made it popular, and it is also able to produce reliable and steady clusters. Data mining research seeks to utilize these features [30].
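The normalization and the λ-cut selection can be sketched together as follows. This is a hedged illustration under stated assumptions: the helper names `normalized_weight` and `lambda_cut` are hypothetical, records are referred to by their index in the training dataset, and a record is kept when its normalized weight is greater than or equal to λ.

```python
def normalized_weight(raw_weight, n_attributes, n_missing):
    """Divide the raw weight by the number of attributes that are actually observed."""
    return raw_weight / (n_attributes - n_missing)


def lambda_cut(raw_weights, n_attributes, n_missing, lam=0.6):
    """Return the indices of training records whose normalized weight is at least `lam`."""
    return [
        i
        for i, w in enumerate(raw_weights)
        if normalized_weight(w, n_attributes, n_missing) >= lam
    ]


# Eight attributes, two of them missing in the incomplete record,
# so the raw weights below are out of a possible 6.
raw = [6, 3, 4, 5]
print(lambda_cut(raw, n_attributes=8, n_missing=2, lam=0.6))
print(lambda_cut(raw, n_attributes=8, n_missing=2, lam=1.0))
```

Raising λ shrinks the selected set: at λ = 1.0 only records that match on every observed attribute remain.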
In Fuzzy C-Means clustering, the data points are assigned membership values with respect to the cluster centre, and these values are updated iteratively. In this paper, the incomplete record is taken as the cluster centre and its similarity with a complete record is measured using weight. Uncertain weights could be resolved using Fuzzy C-Means clustering as an extension of this work, and the authors are working on these issues. The intra-cluster similarity among the records of the cluster is important for effective imputation. So, the proposed method selects those complete records of the training dataset with weight ≥ 0.6 for cluster generation and for the imputation process. The clusters are generated for each threshold value of λ, 0.6 ≤ λ ≤ 1.

Imputation of missing data

Clusters are generated with the incomplete record as the cluster centre. The values of the missing attributes are imputed by computing the mean of the respective attributes in the cluster. The accuracy of the imputation process is evaluated using the average imputation error. The clusters are generated for various threshold values of λ and the results are compared.

IV. Results and Discussion

The proposed method, λ-CUT_CLUST, is implemented in MATLAB and evaluated on the Pima Indian Type II Diabetes dataset [26]. The experiments are conducted in two phases. In the first phase, clusters are generated for various threshold values of λ, namely 0.6, 0.7, 0.8, 0.9 and 1.0, and for different sizes of missing data, namely 5%, 10%, 15%, 20%, 25%, 30% and 35%. The imputation is done and the average imputation error is estimated. The accuracy of the imputation process is then analyzed by comparing the average imputation error for the various threshold values of λ. The λ value for which the minimum imputation error is obtained is considered the optimal threshold value.
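The imputation step itself, replacing each missing attribute with the cluster mean, can be sketched as below. The function name `impute_from_cluster` and the `None` marker are assumptions for this example. The text does not say what happens when no record passes the λ-cut, so the sketch simply leaves the values missing in that case.

```python
def impute_from_cluster(incomplete, cluster_rows):
    """Replace None entries with the mean of the same attribute over the cluster.

    Assumption: if the cluster is empty the record is returned unchanged.
    """
    if not cluster_rows:
        return list(incomplete)
    size = len(cluster_rows)
    return [
        v if v is not None else sum(row[i] for row in cluster_rows) / size
        for i, v in enumerate(incomplete)
    ]


cluster = [[1.0, 4.0], [1.2, 6.0]]   # complete records that passed the λ-cut
print(impute_from_cluster([1.0, None], cluster))
```

Observed values are never overwritten; only the attributes marked as missing take the cluster mean.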
In the second phase of experiments, the imputation process is carried out for all the attributes with the optimal threshold value λ, for different percentages of missing data, namely 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75% and 80%. The proposed method λ-CUT_CLUST is evaluated with 20 random simulations to analyze the performance of the imputation process.

Average Imputation Error

The accuracy and the consistency of the estimated imputed values are validated using the average imputation error. The average imputation error (E) [31] is computed as

$$E = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{1}{m}\sum_{j=1}^{m}\frac{|O_{ij}-I_{ij}|}{\mathrm{Max}_j-\mathrm{Min}_j}\right)$$

where n is the number of imputed values, m is the number of random simulations for each missing value, O_ij is the original value of the attribute j, I_ij is the imputed value, Max_j is the maximum value of the attribute j, Min_j is the minimum value of the attribute j, and j is the corresponding attribute to which O_i and I_i belong.

Performance Analysis

Table I shows the experimental results of the imputation process for λ = 0.6, 0.7, 0.8, 0.9 and 1.0 with different sizes of missing data, namely 5%, 10%, 15%, 20%, 25%, 30% and 35%. It is observed from Table I that the imputation error is lowest for λ = 0.6 compared to the other values of λ, and that the error increases as λ increases. This is because the lower the value of λ, the more records are available for the imputation of missing data. Figure 1 shows the comparative performance of the method for different threshold values of λ. It can be seen from Figure 1 that the average imputation error is minimal for λ = 0.6 compared to the other values of λ. This value of λ is therefore taken as the optimal threshold value, and the experiment is repeated for imputing different sizes of missing data from 5% to 80%. Table II shows the computed average imputation error for all eight attributes for different percentages of missing data with the threshold value λ = 0.6.
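The average imputation error described under "Average Imputation Error" can be computed as a range-normalized absolute difference, averaged first over the simulations and then over the imputed values. The sketch below is one reading of the definitions given there; the function name and the argument layout (one list of imputed values per original value, plus that value's attribute minimum and maximum) are assumptions made for illustration.

```python
def average_imputation_error(originals, imputed_runs, attr_min, attr_max):
    """Mean range-normalized absolute error.

    originals[i]    : original value of the i-th imputed entry
    imputed_runs[i] : the m imputed values obtained for that entry, one per simulation
    attr_min[i], attr_max[i] : minimum and maximum of the attribute the entry belongs to
    """
    total = 0.0
    for original, runs, low, high in zip(originals, imputed_runs, attr_min, attr_max):
        per_entry = sum(abs(original - value) for value in runs) / len(runs)
        total += per_entry / (high - low)
    return total / len(originals)


error = average_imputation_error(
    originals=[10.0, 50.0],
    imputed_runs=[[12.0, 8.0], [40.0, 60.0]],
    attr_min=[0.0, 0.0],
    attr_max=[20.0, 100.0],
)
print(round(error, 6))
```

Dividing by the attribute range puts attributes with very different scales, such as serum insulin and diabetes pedigree function, on a comparable footing before averaging.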
The average imputation error varies significantly from attribute to attribute due to the nature of the distribution of the attributes of the dataset. The overall average imputation error varies from a minimum of 0.107036 to a maximum of 0.110099, a range of 0.00306, so the variation across different percentages of missing data is minimal. Figure 2 shows the average imputation error estimated for different sizes of missing data. It is observed from Figure 2 that, except for a small variation, the error values remain stable for different sizes of missing data. The stability of the proposed method in terms of the average imputation error is due to its ability to form clusters with high intra-cluster similarity. The method shows stability even for the maximum size of missing data. This implies that the imputed values stay consistent with the observed values irrespective of the size of the missing data. Clusters generated using Fuzzy C-Means clustering have high intra-cluster similarity [32]. If the complete records with weight 0.5 were considered by resolving their fuzzy uncertainty using Fuzzy C-Means clustering, some more homogeneous complete records might be included in the cluster.

Comparative analysis

The performance of the proposed method λ-CUT_CLUST is compared with other imputation methods, namely 10-NN, NORM, EMImpute_Columns, LSImpute_Rows and Mean Imputation. The results are shown in Table III. The performance is compared for 5% to 35% of missing data. The imputation process in all the methods uses the complete record(s) of the dataset. Normally, the accuracy of the imputed values depends on the size of the missing data and the availability of complete instances for the imputation process. Hence, missing data simulated up to about one-third of the dataset (35%) is used for the comparative performance analysis.
It is observed from Table III that the NORM method gives the highest imputation error rate and provides the least accurate estimation. Conversely, EMImpute_Columns shows the lowest imputation error rate, with stability in the imputation process for different sizes of missing data. Mean Imputation stands next to NORM but exhibits stability for different sizes of missing data. The LSImpute method performs next to Mean Imputation but results are stable. The error increases with increase in the size of missing data and loses its stability. The proposed λ-CUT_CLUST method shows better performance than NORM, Mean Imputation and LSImpute, and its average imputation error ranges from 10.8 to 11.0. The method shows stability even with increased size of missing data. The 10-NN method shows a lower error rate than the proposed λ-CUT_CLUST method, but its average imputation error increases as the size of the missing data increases. The EMImpute method shows the highest stability and consistency in terms of the average imputation error compared to the other imputation methods; however, it also shows the highest value of standard deviation. The higher the standard deviation, the wider the distribution of the imputed values; that is, the imputed values are widely dispersed when the standard deviation is higher. It is observed from Table III that the λ-CUT_CLUST method has higher stability and consistency and less variance than the other methods, which may be substantiated with the underlying statistical theory "Specifying the standard deviation is more or less useless without the additional specification of the mean value" by Dogra, Shaillay K. Figure 3 illustrates the comparative performance of the different imputation methods in terms of the standard deviation.
It is observed from Table III that λ-CUT_CLUST yields the lowest standard deviation compared to the other methods, because the imputation process is performed using the similar set of complete records formed into clusters. A lower standard deviation indicates that the imputed values are less dispersed from their mean value. As the imputation process depends completely on the clusters generated using the incomplete record as the centre point, a less dispersed imputed value around the centre point (mean value) of the cluster is a valid representation of, or replacement for, the missing attribute(s) in the incomplete records of the testing dataset.

V. Conclusion

In medical data mining, the datasets are extremely sensitive, voluminous and report non-reproducible situations. Missing data occur in clinical databases for various reasons, such as the unaffordability of costly tests owing to the personnel and expensive instrumentation involved, the potential discomfort of the patients involved, equipment malfunction, etc. In most situations, simple techniques for handling missing data (such as complete case analysis, overall mean imputation and the missing-indicator method) produce biased results. In medical data mining, where accuracy is essential, efficient methods to prevent inaccuracies or to handle incomplete data are required. In this paper, a semi-supervised clustering by threshold is proposed for the imputation of missing data. Application of the proposed method, λ-CUT_CLUST, on the Pima Indian Type II Diabetes dataset was demonstrated as a case study. The influence of the λ-Cut on clustering similar records for the imputation of missing data is discussed.
The proposed λ-CUT_CLUST method exploits the concept of semi-supervised clustering with the threshold value λ in grouping similar records into clusters, using weights as the similarity measure, and imputes the missing values arising in the process of diagnosing the disease. Normalization of weights ensures the applicability of the method to any dataset irrespective of the number of attributes. The method exhibits relatively better performance than other methods, as it produces a lower imputation error rate with stability and less variance. It is concluded that the proposed method λ-CUT_CLUST imputes the missing data with higher accuracy and consistency, and there was no significant reduction in accuracy as the size of the missing data increases.

References

1. Travers D, Mandelkehr L. The Emerging Field of Informatics. N C Med J. 2008; 69: 127-131.
3. Laurance J. Breast cancer cases rise 80% since Seventies. The Independent. [http://www.independent.co.uk/life-style/health-and-wellbeing/health-news/breast-cancer-cases-rise-80-since-seventies-417990.html] Accessed October 2006.
7. Cios K, Kurgan L. Trends in data mining and knowledge discovery. In: Pal N, Jain L, Teoderesku N (eds) Knowledge discovery in advanced information systems. New York: Springer-Verlag, 2002.
8. Zhang C, Yang Q, Liu B. Intelligent data preparation. IEEE Trans Knowl Data Eng. 2005; 17: 1163-1165.
9. Zhang C, Zhang S, Webb G. Identifying approximate itemsets of interest in large databases. Appl Intell. 2003; 18: 91-104.
10. Zhang SC, et al. "Missing is useful": Missing values in cost-sensitive decision trees. IEEE Transactions on Knowledge and Data Engineering. 2005; 17: 1689-1693.
11. Han J, Kamber M. Data Mining: Concepts and Techniques. 2nd ed. Morgan Kaufmann, 2006.
12. Qin YS, et al. Semi-parametric Optimization for Missing Data Imputation. Applied Intelligence. 2007; 27: 79-88.
13. Friedman JH, Kohavi R, Yun Y. Lazy decision trees. In: Proceedings of 13th AAAI and 8th IAAI, 1996, pp. 717-724.
14. Beaumont JF. On regression imputation in the presence of nonignorable nonresponse. In: Proceedings of the Survey Research Methods Section, ASA, 2000, pp. 580-585.
15. Abdala OT, Saeed M. Estimation of Missing Values in Clinical Laboratory Measurements of ICU Patients Using a Weighted K-Nearest Neighbors Algorithm. Computers in Cardiology. 2004; 31: 693-696.
16. Ford BL. Incomplete data in sample surveys: An Overview of Hot-deck Procedures. Academic Press, 1983.
17. Pyle D. Data preparation for data mining. San Francisco: Morgan Kaufmann, 1999.
18. Quinlan JR. C4.5: Programs for machine learning. San Mateo: Morgan Kaufmann, 1993.
19. Liu WZ, White AP, Thompson SG, Bramer MA. Techniques for dealing with missing values in classification. In: IDAL97, vol 1280 of Lecture notes, pp. 527-536.
20. White AP. Probabilistic induction by dynamic path generation in virtual trees. In: Bramer MA (ed) Research and development in expert systems III. Cambridge: Cambridge University Press, 1987, pp. 35-46.
21. Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and Regression Trees. Chapman and Hall, 1984.
22. Little RJA, Rubin DB. Statistical Analysis with Missing Data. John Wiley & Sons, Inc., USA, 1986.
23. Grossberg S. Adaptive Pattern Classification and Universal Recoding, I: Parallel Development and Coding of Neural Feature Detectors. Biological Cybernetics. 1976; 23: 121-134.
24. Bradley PPS, Fayyad U, Reina C. Scaling clustering algorithms to large Databases. Proceedings of the International Conference on Knowledge Discovery in Databases, 1998.
25. Moth'd Belal, Al-Daoud. A New Algorithm for Cluster Initialization. Transactions on Engineering, Computing and Technology. 2005; 4: 74-77.
28. Klir GJ, Yuan B. Fuzzy sets and fuzzy logic: theory and applications. Pattern Recognition. New Jersey: Prentice-Hall, 1995.
29. Bezdek JC. Cluster validity with fuzzy sets. J Cybernet. 1974; 3: 58-71.
30. Murase K, Nagayoshi M, Uenishi Y. Extraction of arterial input function for measurement of brain perfusion index with 99mTc compound using fuzzy clustering. Nucl Med Commun. 2004; 25: 299-303.
31. Giardina M, Huo Y, Azuaje F, McCullagh P, Harper R. A Missing Data Estimation Analysis in Type II Diabetes Databases. Proceedings of the 18th IEEE Symposium on Computer-Based Medical Systems (CBMS'05) 1063-7125/05 2005 IEEE.
Paper received on 03/02/2009; accepted on 04/05/2009
Correspondence: Ilango Paramasivam
This Open Access article is available at: http://ijmi.org/index.php/ijmi/article/view/y09i1a1
© 2009 Author(s); licensee Indian Journal of Medical Informatics under Creative Commons Attribution-No Derivative Works 3.0 License.