{"id":"13","title":"Test content 1","deck_type":"analysis","status":"ready","slide_count":5,"style":{"preset_id":"unknown","preset_label":"unknown","quality_mode":"unknown","layout_family":"unknown","surface_style":"unknown","reveal_theme":"unknown","visual_motifs":[],"data_sources_used":[],"mcp_components_used":[],"pdf_render_mode":"same-as-reveal"},"mcp_in_use":false,"mcp_source":"shadcn MCP stdio transport (npx shadcn@latest mcp)","registry_source":"live shadcn registry via MCP + local static fallback","static_registry_count":9,"slides":[{"section":"Overview","id":"sl-1267680-w54dk","mode":"approved","raw_type":"title","component_id":null,"component_props":null,"has_approved":true,"has_discovery":false,"raw":{"id":"sl-1267680-w54dk","mode":"approved","approved":{"id":"sl-1267680-w54dk","meta":"Generated 15 April 2026","type":"title","headline":"Test content 1","subheadline":"EngineHouse Analysis","speaker_notes":"Query: Test Content 1"},"speaker_notes":"Query: Test Content 1"}},{"section":"Overview","id":"sl-1267680-7rvs8","mode":"approved","raw_type":"statement","component_id":null,"component_props":null,"has_approved":true,"has_discovery":false,"raw":{"id":"sl-1267680-7rvs8","mode":"approved","approved":{"id":"sl-1267680-7rvs8","type":"statement","footnote":"Synthesised by EngineHouse Interface","headline":"Key Findings","statement":"PIK  Report\nNo. 126\nFOR\nPOTSDAM INSTITUTE\nCLIMATE IMPACT RESEARCH (PIK)\nCLUSTER ANALYSIS TO UNDERSTAND\nSOCIO-ECOLOGICAL SYSTEMS:\nA GUIDELINE\nPeter Janssen, Carsten Walther, Matthias Lüdeke\n\nHerausgeber:\nProf. Dr. F.-W. Gerstengarbe\nTechnische Ausführung:\nU. Werner\nPOTSDAM-INSTITUT\nFÜR KLIMAFOLGENFORSCHUNG\nTelegrafenberg\nPostfach 60 12 03, 14412 Potsdam\nGERMANY\nTel.:\n+49 (331) 288-2500\nFax:\n+49 (331) 288-2600\nE-mail-Adresse:pik@pik-potsdam.de\nAuthors:\nDipl.-Phys. Carsten Walther\nDr. Matthias Lüde…","speaker_notes":"PIK  Report\nNo. 
126\nFOR\nPOTSDAM INSTITUTE\nCLIMATE IMPACT RESEARCH (PIK)\nCLUSTER ANALYSIS TO UNDERSTAND\nSOCIO-ECOLOGICAL SYSTEMS:\nA GUIDELINE\nPeter Janssen, Carsten Walther, Matthias Lüdeke\n\nHerausgeber:\nProf. Dr. F.-W. Gerstengarbe\nTechnische Ausführung:\nU. Werner\nPOTSDAM-INSTITUT\nFÜR KLIMAFOLGENFORSCHUNG\nTelegrafenberg\nPostfach 60 12 03, 14412 Potsdam\nGERMANY\nTel.:\n+49 (331) 288-2500\nFax:\n+49 (331) 288-2600\nE-mail-Adresse:pik@pik-potsdam.de\nAuthors:\nDipl.-Phys. Carsten Walther\nDr. Matthias Lüdeke\nPotsdam Institute for Climate Impact Research\nP.O. Box 60 12 03, D-14412 Potsdam, Germany\nDr. Peter Janssen *\nNetherlands Environment Assessment Agency (PBL)\nP.O. Box 1, 3720 BA Bilthoven, The Netherlands\nE-Mail: peter.janssen@pbl.nl\n* corresponding author\nPOTSDAM, SEPTEMBER 2012\nISSN 1436-0179\nThis report is the result of a joint study between the PBL Netherlands Environmental\nAssessment Agency and PIK.\n\n3 \n \nAbstract \n \nIn coupled human-environment systems where well established and proven general \ntheories are often lacking cluster analysis provides the possibility to discover \nregularities – a first step in empirically based theory building. The aim of this report is \nto share the experiences and knowledge on cluster analysis we gained in several \napplications in this realm helping to avoid typical problems and pitfalls. In our \ndescription of issues and methods we will highlight well-known main-stream methods \nas well as promising new developments, referring to pertinent literature for further \ninformation, thus offering also some potential new insights for the more experienced.  \nThe following aspects are discussed in detail: data-selection and pre-treatment, \nselection of a distance measure in the data space, selection of clustering method,  \nperforming clustering (parameterizing the algorithm(s), determining the number of \nclusters etc.) and the interpretation and evaluation of results. 
We link our description, \nas far as tools for performing the analysis are concerned, to the R software \nenvironment and its associated cluster analysis packages. We have used this public \ndomain software, together with our own tailor-made extensions, documented in the \nappendix.\n\nContents \n \n1. Introduction \n2. Data selection and pre-treatment \n3. Selection of a distance measure in the data space \n4. Selection of clustering method \n5. How to measure the validity of a cluster? \n6. Graphical representation of the results \n7. References \n \nAppendix A: The R software environment \nAppendix B: Cluster analysis in R \nAppendix C: Data for comparing clustering methods \nAppendix D: On determining variable importance for clustering \nAppendix E: Commonly used internal validation indexes \n \nAcknowledgements \nThis report is the result of a joint study between the PBL Netherlands Environmental \nAssessment Agency and PIK, as part of a wider research effort to quantitatively \nanalyse patterns of vulnerability. The authors would like to acknowledge the feedback and \ninputs that Henk Hilderink, Marcel Kok, Paul Lucas, Diana Sietz, Indra de Soysa and \nTill Sterzel provided during the course of the study.\n\n1 Introduction \n \nCluster analysis is a general methodology for exploring datasets when little or no \nprior information is available on the data’s inherent structure. It is used to group data \ninto classes (groups or clusters) that share similar characteristics, and is widely used \nin behavioural and natural scientific research for classifying phenomena or objects \nunder study without predefined class definitions. 
In particular, in coupled human-\nenvironment systems, where well established and proven general theories are still \nlacking, cluster analysis provides the possibility to discover regularities, a first step in \nempirically based theory building. A recent example is its application to assessing \nthe vulnerability of human wellbeing to global change (Sietz et al., 2011 and \nKok et al., 2010). The aim of this report is to share the experiences and knowledge on \ncluster analysis we gained in these applications, helping to avoid typical problems and \npitfalls.  \nA broad collection of clustering methods has been proposed in areas such as statistics, data \nmining, machine learning and bioinformatics, and many textbooks and overview papers \nillustrate the variety of methods as well as the vigorous interest in this field over the \nlast decade with the growing availability of computer power for analysing extensive \ndatasets or data objects involving many attributes (i.e. finding clusters in high-\ndimensional space, where the data points can be sparse and highly skewed). There are many books on \ncluster analysis, e.g. Aldenderfer and Blashfield (1976), Jain and \nDubes (1988), Kaufman and Rousseeuw (1990), Gordon (1999), Hastie et al. (2001), \nEveritt, Landau and Leese (2001), Mirkin (2005) and Xu and Wunsch (2009). The same \nholds for overview papers, see e.g. Jain, Murty and Flynn (1999), Omran, \nEngelbrecht and Salman (2005), Xu and Wunsch (2005), Wunsch and Xu (2008).  \n \nIn this report we will highlight the major steps in the cluster analysis process, and link \nthem, as far as tools for performing the analysis are concerned, to the R software \nenvironment and its associated cluster analysis packages (see appendices A and B). We \nhave used this public domain software, together with our own tailor-made extensions, to \nperform cluster analysis for identifying patterns of vulnerability to global \nenvironmental change (Kok et al. 
2010), as part of a joint study of the PBL \nNetherlands Environmental Assessment Agency, PIK and the Norwegian University \nof Science and Technology. Examples from this study will be used as illustrative \nmaterial in the present report. \n \nBeyond this specific background, the report is set up in more general terms, and can \nbe used by novices in the field of cluster analysis, as well as by people who \nalready have some working experience with the method but want to extend their ability to \nperform cluster analyses.  \n \nIn our description of issues and methods we will highlight well-known mainstream \nmethods as well as promising new developments, referring to pertinent literature for \nfurther information, thus also offering some potential new insights for the more \nexperienced. We do not extensively consider cluster analysis methods which \nexplicitly account for spatial and/or temporal aspects of the data, but only briefly \ntouch upon them.\n\n1.1 Outline of the report \n \nOur exposition is to a large extent based on the excellent book by Everitt, Landau \nand Leese (2001) on clustering and on Han and Kamber’s book on data mining, which \ncontains a concise chapter on cluster analysis (Han and Kamber, 2006, chapter 7). In \ndiscussing cluster analysis we will divide the clustering process into a number of \nlogical steps: \n \n• Data-selection and pre-treatment: In general this concerns the selection of \ndata of interest for the problem at hand and the treatment of missing values and \noutliers. Optionally it also involves dimension-reduction by selecting variables or \nextracting relevant features from the data, the use of data transformations to bring \nthe data values to a more even scale and the standardization of data to make them \nmutually more comparable. These forms of data-processing can influence the \noutcomes of the clustering to a large extent, and should therefore be chosen with \ndue consideration. 
\n• Selection of a distance measure in the data space: In order to express the \nsimilarity or dissimilarity between data points a suitable distance measure (metric) \nshould be chosen. It forms the basis for performing the clustering to identify \ngroups which are tightly knit, but distinct (preferably) from each other \n(Kettenring, 2006). Often Euclidean distance is used as a metric, but various other \ndistance measures can be envisioned as well.  \n• Selection of clustering method: The extensive (and ever-growing) literature on \nclustering illustrates that there is no such thing as an optimal clustering method. \nWe will group the multitude of methods into a restricted number of classes, and \nwill especially focus on two commonly used classes, one which is based on \nperforming the clustering hierarchically, while the other consists of constructively \npartitioning the dataset into a number of clusters, using the k-means method. The \nother classes will be briefly discussed with due reference to literature for further \ninformation. \n• Performing clustering: This involves parameterising the selected clustering \nalgorithm(s) (e.g. choosing starting points for the partitioning method), \ndetermining the number of clusters, and computing the resulting clustering \npartition for these settings. Determining the number of \nclusters is an especially important issue, and we will highlight a general approach which we \napplied for our vulnerability assessment study. \n• Interpretation and evaluation of results: This concerns in the first place a \ndescription of the clustering in terms of cluster characteristics. Moreover, in order \nto use the clustering results, the characteristics and meaning of the various \nclusters have to be interpreted in terms of subject matter, which often involves a \nprocess of knowledge building, hypothesis setting and testing, going back and \nforth from the clustering results to the underlying knowledge base. 
\nFinally, evaluation also includes a study of the sensitivity of the clustering results \nto the various choices made during the steps of the cluster analysis, e.g. \nconcerning the data selection and pre-treatment, selection of clustering method \netc. The effects of uncertainties and errors in the data should also be addressed in \nthis step.\n\nThe various steps are described in more detail in the following chapters. In the \nappendices more detailed information is given on the R software and on some specific \nclustering issues. \n \n \nClustering in various contexts (according to Han and Kamber, 2006): \nAs a branch of statistics, cluster analysis has been extensively studied, with a focus on \ndistance-based cluster analysis. Cluster analysis tools based on k-means, k-medoids, \nhierarchical clustering and several other methods have been built into many software \npackages for statistical analysis such as S-Plus, SPSS and SAS. Also dedicated software \n(e.g. Wishart’s CLUSTAN (http://www.clustan.com/index.html), Matlab Statistics \ntoolbox) and public-domain packages abound (see the various R-packages on clustering). \nIn the machine learning context, clustering is an example of unsupervised learning, which \ndoes not rely on predefined classes and class-labeled training data. It is a form of learning \nby observation, rather than learning by examples as in supervised learning (as e.g. in data-\nclassification).  \nIn the data mining field efforts have focused on finding methods for efficient and effective \nanalysis of large databases. 
Issues such as the scalability of clustering methods, the ability to \ndeal with mixed numerical and categorical data, complex shapes and types of data, high \ndimensionality, the ability to deal with noisy data, to incorporate domain knowledge, to \neasily deal with updates of the databases, and insensitivity to the order of input records are \nimportant requirements for the clustering methods.\n\n2 Data selection and pre-treatment \n \n \nThe main theme in cluster analysis is to identify groups of individuals or objects (i.e. \n‘cases’ or ‘entities’) that are similar to each other but different from individuals or \nobjects in other groups. For this purpose data on the individuals or objects have to be \ncollected, and it is obvious that the data should be characteristic, relevant and of good \nquality to enable a useful analysis. \n \n2.1 Data-collection: Some important issues  \n \nThis means in the first place that an adequate number of objects/cases/individuals \nshould be available in the dataset to study the phenomena of interest (e.g. identifying \nsituations that show a similar reaction pattern under certain environmental stresses; \nidentifying subgroups of patients with a diagnosis of a certain disease, on the basis of a \nsymptom checklist and results from medical tests; identifying people with similar \nbuying patterns in order to successfully tailor marketing strategies etc.).  \nMoreover the researcher should choose the relevant variables/features which \ncharacterize the objects/cases/individuals and on the basis of which the groups should be \nsubdivided into homogeneous subgroups. Milligan (1996) strongly advises being on the \nparsimonious side and to ‘select only those variables that are believed to help \ndiscriminate the clustering in the data’. Adding ‘only one or two irrelevant variables \ncan dramatically interfere with cluster recovery’ (Milligan, 1996).  
\nFor further analysis one must also decide, amongst other things, whether to transform or \nstandardize the variables in some way so that they all contribute equally to the \ndistance or similarity between cases. \nFurthermore data quality will be another important issue which involves various \naspects such as accuracy, completeness, representativeness, consistency, timeliness, \nbelievability, value added, interpretability, traceability and accessibility of the data, \npresence of noise and outliers, missing values, duplicate data etc. (cf. Pipino, Funk, \nWang (2006)).  \n \n2.2 Data-collection: Type of data \n \nAn important distinction when considering the data that has been collected on the \n‘objects’ and their ‘attributes’1 (i.e. properties or characteristics of an object; e.g. eye \ncolour of a person, length, weight) is the (measurement) scale which has been used in \nexpressing these attributes: \n \n− Nominal scale: In fact this is not really a scale because numbers are simply used \nas identifiers, or names, e.g. in coding a (no, yes) response as (0,1). The numbers \nas such are mostly meaningless in any quantitative sense (e.g. ID numbers, eye \ncolour, zip codes).  \n                                                     \n1 Concerning terminology: ‘attributes’ are also referred to as variables, features, fields, characteristics. \nA collection of attributes describes an ‘object’. An object is also known as record, point, case, sample, \nentity or instance. These terms are often used interchangeably.\n\n− Ordinal scale: The numbers have meaning only in relation to one another, e.g. the \nscales (1, 2, 3), (10, 20, 30) and (1, 20, 300) are in a sense equivalent from an \nordinal viewpoint. Examples of ordinal scale attributes are rankings, grades, or \nexpressing height in {tall, medium, short}-categories. \n− Interval scale: This scale is used to express data in a (continuous) measurement \nscale where the separation between numbers has meaning. 
A unit of measurement \nexists and the interpretation of the numbers depends on this unit (compare \ntemperature in Celsius or in Fahrenheit). \n− Ratio scale: This is a measurement scale where an absolute zero and a unit of \nmeasurement exist, such that the ratio between two numbers has meaning (e.g. distance \nin meters, kilometres, miles or inches). \n \nThe first two scales refer more to qualitative variables, and the latter two to quantitative \nvariables2. In practice, the attributes characterizing an object can be of mixed type.  \n \nAnother distinction can be made between ‘discrete’ and ‘continuous’ attributes, where \nthe first category refers to variables having a finite or countably infinite set of values \n(e.g. zip-code), and can often be represented as integer variables (1, 2, 3, …). Binary \nattributes, taking on the values 0, 1, or “No”, “Yes”, are a special case of discrete \nattributes. Continuous attributes can take values over a continuous range, and have \nreal numbers as attribute values. Notice that in practice real values can only be \nmeasured and represented using a finite number of digits. \n \n2.3 Data pre-processing \n \nSince real data can be incomplete (missing attribute values), noisy (errors or outliers) \nand inconsistent (e.g. duplicates with different values), data pre-processing is an \nindispensable part of the cluster analysis. The major tasks involved in data pre-\nprocessing are: \n \n− [A] Data cleaning: Filling in missing values, smoothing noisy data, identifying or \nremoving outliers, correcting inconsistencies and resolving redundancies caused \nby integration or merging of data from various sources/databases. \n− [B] Data integration: Integration of multiple databases, files or data cubes (data \nstructures commonly used to describe time series of image data).  \n− [C] Data transformation: Putting data in form(at)s which are appropriate for \nfurther analysis. 
This includes normalization and performing summary or \naggregation operations on the data, for instance.  \n− [D] Data reduction: Obtaining reduced representation in volume of the data that \nproduce the same or similar analytical results. \n− [E] Data discretization: Especially for numerical data this denotes a specific \nform of data reduction. \n− [F] Cluster tendency: Determining whether there are clusters in the data. \n− [G] Cluster visualisation: Using graphical techniques can greatly enhance the \nanalysis of the underlying cluster/group-structure in the data. \n                                                     \n2 We restrict our attention to data which have numerical values, and don’t consider symbolic objects. \nSee e.g. Ravi and Gowda (1999) for cluster analysis of this category of objects.\n\n10 \n \n \n \nIn the sequel we will outline these activities in more detail: \n \n2.3.1 Data cleaning  \n \nVarious techniques for performing data-cleaning can be used, of which we only \nbriefly discuss the way missing data and outliers can be handled. Additional dedicated \nmethods for data cleaning originating from the data warehouse literature can e.g. be \nfound in Rahm and Do (2000). \n \n(i) Handling missing data  \nValues can be missing since information is not collected or attributes are not \napplicable in all cases (e.g. annual income for children). One obvious way of handling \nmissing data is simply eliminating the corresponding data objects, and analysing only \nthat part of the dataset which is complete (called marginalization by Wagstaff and \nLaidler, 2005). This strategy does not lead to the most efficient use of the data and is \nrecommended only in situations where the number of missing values is very small. \nAnother option (called imputation) to deal with missing data is to replace the missing \nvalues by a global constant (e.g. ‘unknown’, a new class) or by an estimate, e.g. the \nmean, median, a most probable value; cf. 
various forms of data-imputation (e.g. mean, \nprobabilistic or nearest neighbourhood imputation3, as presented in Wagstaff and \nLaidler, 2005). \n \n3 ‘Mean imputation’ involves filling the missing values with the mean of the remaining ones, while \n‘probabilistic imputation’ consists of filling them with a random value drawn from the distribution of the \nfeature. ‘Nearest neighborhood imputation’ replaces them with value(s) from the nearest neighbor. \n \nJain and Dubes (1988, pages 19-20) recommend, on the basis of experimental results of \nDixon (1979), the use of an imputation approach which redefines the distance between \ndata points x_i and x_k which contain missing values as follows: First define the distance \nd_j between the two points along the j-th feature as d_j = 0 if x_ij or x_kj is missing, and \nd_j = x_ij - x_kj otherwise; then the distance between x_i and x_k is defined as: \n \n$d_{ik} = \\frac{m}{m - m_o} \\sum_{j} d_j^2$ \n \nwhere m_o is the number of features missing in x_i or x_k or both, and m is the total \nnumber of features. d_ik as defined above is the squared Euclidean distance in case \nthere are no missing values. \n \nWagstaff and Laidler (2005) note that in some applications imputation and \nmarginalization are not suitable since the missing values are physically meaningful and \nshould not be supplemented or discarded. They implemented an algorithm, called \nKSC (K-means with soft constraints), which deals with the whole data set including \nthe partially measured objects. \nAdditional information on dealing with missing values can be found in Little & Rubin \n(1987). \n \n(ii) Smoothing noisy data \nNoisy data are caused by (random) error or variance in a measured variable, as well as \nincorrect attribute values due to faulty data collection instruments, data entry and \ntransmission problems, inconsistencies in naming convention etc. In case of noisy \ndata one can decide to filter/smooth them first in order to partially remove some of the \neffects of the noise. For example, binning, which consists of first sorting the data and then \npartitioning them into (equal frequency) bins and subsequently smoothing them by \nreplacing them by their bin means, medians or bin boundaries, is a simple way of \nfiltering the data. More advanced approaches, like using e.g. regression analysis, \ntrend-detection or noise-filtering (applying e.g. moving averages) can also be invoked \nto partially remove noise from the data. \n \n(iii) Handling outliers \nOutliers are data values that are extremely large or small relative to the rest of the \ndata. Therefore they are suspected to misrepresent the population from which they \nwere collected. Outliers may be the result of errors in measurements, model-results, \ndata-coding and transcription, but may also point to (often unexpected) true extreme \nvalues, indicating more variability in the population than was expected. Therefore, in \ntreating outliers one has to be cautious not to falsely remove outliers when they \ncharacterize important features (e.g. hotspots) of the phenomenon at hand; it is \nobvious that the decision to discard an outlier should not be based solely on a \nstatistical test but should also be taken on the basis of scientific and quality assurance \nconsiderations.  \nThe first step in handling outliers consists of the detection of outliers (see also \nRousseeuw et al. 2006). Though detecting outliers can partly be based on process-\ninformation and combined computer and human inspection of graphical \nrepresentations of the data, one often relies on statistical techniques. 
Hubert and Van \nder Veeken (2008) recently proposed a statistical technique which is especially suited \nfor detecting outliers in skew-distributed multivariate data and is also related to the \nadjusted boxplot for skew-distributed data (Hubert and Vandervieren, 2008). Though \nseveral more refined robust estimators and outlier detection methods exist which are \ntypically geared to specific classes of skewed distributions, their approach is very \nuseful when no prior information about the data distribution is available, or when an \nautomatic and fast outlier detection method is required. In the CRAN-package \n<<robustbase>>4 functionality is available for this form of outlier detection (function \n<<adjOutlyingness>>) as well as for the adjusted box-plot determination (function \n<<adjbox>>). \nThe second step involves the pre-treatment of outlier-values before performing cluster \nanalysis. Three general strategies can be applied: (a) using the outlying data \npoints in the subsequent analysis, accounting for their effects on the outcomes; (b) \ntrimming: removing the outlier data from the data set, and not incorporating them in \nthe dataset for the subsequent cluster analysis; (c) winsorising: replacing the outlying \nvalues by a truncated variant, e.g. a specific percentile (e.g. the 1st or 99th percentile) \nof the dataset, or an associated cut-off value of the skewed boxplot (Hubert and Van \nder Veeken, 2008). These truncated data points are included in the cluster analysis. \n \nThe above procedure is in fact centred around detecting outlying values with respect \nto a (supposedly) underlying distribution of the attribute-dataset, before the cluster \nanalysis takes place. There is however also the issue of detecting outliers with respect \nto the obtained partition of the objects into clusters, i.e. after the cluster analysis has \nbeen performed: \n                                                     \n4 Cf. 
http://cran.r-project.org/web/packages/robustbase/\n\n− Irigoien and Arenas (2008) recently proposed a geometrically inspired method for \ndetecting potential atypical outlying data-points.  \n− Also the Silhouette statistic proposed by Rousseeuw (1987) can be used as an \nindication of the outlyingness of a point in a cluster. It measures how well a \ncertain data point/object, say i, is matched to the other points/objects in its own \ncluster, versus how well matched it would be, if it were assigned to the next \nclosest cluster. The Silhouette of i is expressed as s(i)=[b(i)-a(i)]/max[a(i),b(i)], \nwhere a(i) denotes the average distance between the i-th point and all other points \nin its cluster, and b(i) is the average distance to the points in the “nearest” other cluster, \ni.e. the cluster for which this average distance is minimal. s(i) is a value between \n-1 and +1, and large (positive) values indicate strong clustering, while negative \nvalues indicate that clustering is bad. See e.g. Figure 1 which gives an example of \na Silhouette plot, as well as the associated 2-dimensional projection of the cluster \npoints. The Silhouette statistic can e.g. be calculated with the function \n<<silhouette>> in the CRAN-package <<cluster>>5. \n \n[Figure 1, left panel: Silhouette plot of pam(x = iris.x, k = 3); horizontal axis: silhouette width s_i; n = 150; average silhouette width 0.55; cluster 1: 50 points, average width 0.80; cluster 2: 62 points, average width 0.42; cluster 3: 38 points, average width 0.45. Right panel: CLUSPLOT(iris.x); axes: Component 1 and Component 2; these two components explain 95.81 % of the point variability.] \n \nFigure 1: An example of a Silhouette plot for a cluster analysis with three clusters. The plot \nexpresses the (ordered) silhouette values for the points in the three clusters. 
It shows that \nmost points in the first cluster have a large silhouette value, greater than 0.6, indicating that \nthe cluster is somewhat separated from neighbouring clusters. The second and third cluster \nalso contain several points with low silhouette values, indicating that those two clusters are \nnot well separated, as exemplified in the 2-dimensional cluster plot in the right frame. \n \nThe R-commands for constructing these results are: \n \n## Partitioning iris-data (data frame) into 3 clusters, \n## and displaying the silhouette plot. \n## Moreover a 2-dimensional projection of the partitioning is given. \n \nlibrary(cluster)       # Load the package cluster \ndata(iris)             # Load the famous (Fisher’s or Anderson’s) iris-dataset \n \n# Select the specific datacolumns: \n# i.e. Sepal.Length, Sepal.Width, Petal.Length, Petal.Width \niris.x <- iris[, 1:4] \n \npr3 <- pam(iris.x, 3)  # Perform the clustering by the PAM-method with 3 clusters \nsi <- silhouette(pr3)  # Compute the Silhouette information for the given clustering \n \n# Draw a silhouette plot with clusterwise coloring \nplot(si, col = c(\"red\", \"green\", \"blue\")) \n \n# Draw a 2-dimensional clustering plot for the given clustering \nclusplot(iris.x, pr3$clustering, shade = TRUE, color = TRUE, \n         col.clus = c(\"red\", \"green\", \"blue\")) \n \n5 Cf. http://cran.r-project.org/web/packages/cluster/ \n \n \nFor more information on outlier-detection and analysis we refer to section 7.11 in Han \nand Kamber (2006), who distinguish 4 different approaches to outlier analysis: statistical \ndistribution-based, distance-based, density-based local outlier detection and the \ndeviation-based approach. \n \n2.3.2 Data integration  \n \nWhen integrating multiple data-sources (databases, files or data-cubes) redundant data \ncan occur, since e.g. 
the same attribute or object may have different names in different \nsources, or one attribute may be a ‘derived’ attribute in another source (e.g. annual \nvalues, instead of monthly values). Correlation analysis can e.g. be used to point at \npotential redundancies in the data, while additional post-processing (e.g. data-\nreduction; see later) can be used to alleviate their effects.  \nIn data integration one should also be aware of potential data value conflicts which \ncan occur when attribute values from different sources are different e.g. due to \ndifferent representations or scales. These problems can be avoided by carefully \nperforming and checking the data integration. \n \n2.3.3 Data transformation  \n \nData transformation first of all includes normalization of the data to bring them into a \nform which is more amenable to the subsequent analysis. It is well known that \nthe measurement scale can have a large effect in performing cluster analyses, as \nillustrated in Figure 5 of Kaufman and Rousseeuw (1990) or in Silver (1995). Therefore \nit is considered important to bring the data into a form which is less dependent on the \nchoice of measurement/representation scale. A typical standardization (the \n“(min,max)-range standardization”) which is used for this purpose consists of \ndetermining the range of values6 and redefining the value of X(i) by:  \n(X(i)-min)/(max-min), thus obtaining values between 0 and 1, where 0 and 1 refer to \nthe extreme values (i.e. min and max7). Other statistical transformations, like the Z-\ntransform (which replaces X(i) by (X(i)-mean)/stdev, with mean being the average \nvalue, and stdev the standard deviation of all data-values X(i)), are also conceivable, \nbut are considered less apt when performing cluster-analysis (cf. Milligan and Cooper, \n1988, Kettenring, 2006).  
\n \nRemark: Though the (min,max) standardization has the function of transforming the variables \ninto a comparable format, some caution is due in using it. E.g. in situations where certain \nvariables are already measured on a commensurable scale, applying this additional \nstandardization can result in an artificial rescaling of the variables which obscures their actual \ndifferences. E.g. when the actual min-max ranges differ (e.g. the actual values for variable A \nrange from .2 to .25, while those of variable B range from .015 to .8), rescaling on basis of the \nactual min-max range will result for both variables in values running from 0 to 1, which \nrenders a very different (and erroneous) view on their difference. In this situation one could \nargue for not automatically rescaling these variables, but proceeding with the unscaled version. \nHowever, one can as easily argue against this, by stating that the use of an unscaled version \nfor these variables will result in an unfair bias towards other variables which have been \nrescaled into the complete (0, 1) range by applying the (min,max) standardization. Which choice \nis made in the end will depend on what is considered important. This situation in fact \nasks for a sensitivity analysis to study what effects the alternative standardization \noptions can have on the clustering results. \n \n6 This can e.g. be the actual range, i.e. the actual maximum minus the actual minimum of the current data, \nor the maximal feasible range one can think of (i.e. beyond the actual data-sample). \n7 The min and max can here refer to the actual minimum and maximum of the dataset at hand, but can \nalso refer to the feasible minimum and maximum which can realistically be expected, and which can be \nsmaller (for the minimum) or larger (for the maximum) than the actual ones. 
\n \nAnother issue concerns the use of non-linear transformations on the variables to bring \nthem into a form which e.g. fits more to the underlying assumptions: e.g. a right-\nskewed distribution could possibly be transformed into approximately Gaussian form \nby using logarithmic or square-root transformation, to make the data more amenable \nto statistical techniques which are based on normality assumptions. In analyzing these \ntransformed data one should however realize that re-interpretation of the obtained \nresults in terms of the original untransformed data requires due care, since means and \nvariances of the transformed data render biased estimates when transformed back to \nthe original scale. Therefore, if the nonlinear transformations of the data are expected \nto have no noticeable benefits for the analysis, it is usually better to use the original \ndata with a more appropriate statistical analysis-technique (e.g. robust regression in \ncase one wants to relate variables to each other). \n \n2.3.4 Data reduction \n \nIn situations where the dataset is very large, data reduction is in order to reduce run \ntime and storage problems in performing cluster analysis. The challenge is to obtain a \nreduced representation of the dataset that is much smaller in volume but produces the \nsame (or almost the same) analytical results. Various reduction strategies are in order \nto achieve this: \n \n(i) Aggregation: consists of combining two or more attributes (or objects) into a single \nattribute (or object), thus resulting in a reduced number of attributes or objects. One \nshould strive to find aggregations which make sense, and highlight important aspects \nof the problem at hand. This can also involve a change of scale (e.g. cities aggregated \ninto regions, states, countries; daily, weekly, monthly averages), and can render more \n‘stable’ data (less variability), however at the price of losing information on the more \ndetailed scale. 
\n \n(ii) Sampling: Instead of processing the complete dataset one can decide to process \npart of the dataset which is obtained by selecting a restricted (random) sample. In this \nprocess one has to be sure that the selected sample accurately represents the \nunderlying cluster or population structure in which one is interested. \n \n(iii) Feature selection: Feature selection consists of identifying and removing features \n(or equivalently attributes, variables) which are redundant (e.g. duplicating much of \nthe information in other features) or irrelevant (e.g. containing no information that is \nuseful for the data mining task at hand, e.g. identifiers of objects). Apart from brute \nforce approaches which try all possible feature subsets, more advanced techniques can \nbe invoked as e.g. filter and wrapper approaches to find the best subset of attributes \n(see the extensive literature on these topics in machine learning and data-mining, e.g. \nBlum and Langley, 1997, Kohavi and John, 1997; see also Xing, 2003, Guyon and \nElisseeff, 2003, Guyon et al., 2006, Handl and Knowles, 2006, Liu, Yun, 2005, Saeys \net al. 2007). This last class of techniques can be implemented in a forward (stepwise \nforward selection) or a backward (stepwise backward elimination) fashion, similar to \nstepwise regression. See also table 5 in Jain et al. (2000) where a number of feature \nselection methods are briefly discussed in the context of statistical pattern recognition.  \n \nA number of (recent) publications more specifically address feature (or variable, \nattribute) selection for cluster analysis:  \n• Friedman and Meulman (2004) proposed, in the context of hierarchical clustering \nmethods, a method to cluster objects on subsets of attributes. It is based on the \nidea that subsets of variables which contribute most to each cluster structure may \ndiffer between the clusters. 
Software is available in R to perform this analysis \n(COSA; see http://www-stat.stanford.edu/~jhf/COSA.html). Damian et al. (2007) \ndescribe applications of this algorithm in medical systems biology. \n• Raftery and Dean (2006), in the context of model-based clustering, propose a \nvariable selection method, which consistently yields more accurate estimates of \nthe number of groups and lower classification error rates, as well as more \nparsimonious clustering models and easier visualization of results. See the CRAN-\npackage <<clustvarsel>>8 for related software.  \nFor interesting further developments see the recent paper of Maugis et al. (2008, \n2009). Methods which especially focus on situations with very many variables \n(high-dimensional data), are furthermore presented in McLachlan et al. 2002, \nTadesse et al. (2005), Kim et al. (2006). See also Donoho and Jin (2008, 2009) for \nthe related case of discriminant analysis (i.e. supervised classification).  \n• Steinley and Brusco (2008b) compared various procedures for variable selection \nproposed in literature, and concluded that a novel variable weighting and selection \nprocedure proposed by Steinley and Brusco (2008a) was most effective. \n• Mahoney and Drineas (2009) recently proposed so called CUR matrix \ndecompositions, i.e., low-rank matrix decompositions that are explicitly expressed \nin terms of a small number of actual columns and/or actual rows of the original \ndata matrix as a means for improved data-analysis, which can be usefully applied \nin clustering. \n• Donoho and Jin (2008, 2009) address optimal feature selection in the context of \nclassification and discriminant analysis in case that useful features are rare and \nweak. Their idea of using a thresholding strategy for feature Z-scores can be \nextended to cluster analysis applications. \n• Fraiman et al. 
(2008) recently introduced two procedures for variable selection in \ncluster analysis and classification, where one focuses on detecting ‘noisy’ \nnon-informative variables, while the other also deals with multicollinearity and general \ndependence. The methods are designed to be used after a ‘satisfactory’ grouping \nprocedure has already been carried out, and moreover presuppose that the number \nof clusters is known and that the resulting clusters are disjoint. The main \nunderlying idea is to study what effect the blinding of subsets of variables (by \nfreezing their values to their marginal or conditional mean) has on the clustering \nresults as compared to the clustering on the full variable set. To enable analysis for \nhigh-dimensional data a heuristic forward-backward algorithm is proposed to \nconsecutively search (in a non-exhaustive way) for an appropriate variable \nselection. The performance of Fraiman’s methods in simulated and real data \nexamples is quite encouraging, and at points it also outperformed the method of \nSteinley and Brusco (2008a). \n \n8 Cf. http://cran.r-project.org/web/packages/clustvarsel/ \n \n• Krzanowski and Hand (2009) recently proposed a simple F-test like criterion to \nevaluate whether the ratio of the between-group and the within-group sum of \nsquares for each specific variable is significantly greater than what would be \nexpected in a single homogeneous population (i.e. if no clustering would be \ninvolved). On basis of this easily computable test they expect to make an \nappropriate pre-selection/reduction of the variables for clustering applications \nwith very many variables involved. This is especially the case for applications like \nthe genetic characterization of diseases by microarray techniques, where typically \nvery many gene expression levels p are involved as compared to subjects n (e.g. 
\nvalues of n are in the hundreds, while values of p are in the thousands). More \nspecialized approaches for these high dimensional situations are more \ncomputationally demanding and more specifically bound to specific cluster \nanalysis techniques like mixture model-based approaches (cf. McLachlan et al. \n2002, Tadesse et al. (2005), Kim et al. (2006)). \n \nIn appendix D, we highlight some simple alternatives related to the latter two methods \nthat can be straightforwardly used for performing this feature selection, and give some \nexamples of their use. \n \nComplementary to variable selection one can also consider the use of variable \nweighting to express the relative (ir)relevance of features or variables (Gnanadesikan, \nKettenring and Tsao, 1995). De Soete, (1986, 1988) initially has developed optimal \nschemes for ultrametric and additive tree clustering (see also Milligan, 1989), and \nMakarenkov and Legendre (2001) have extended these9 also for K-means partitioning \nmethods. For k-means type clustering Huang et al., 2005 propose a procedure that \nautomatically updates variable weights based on the importance of the variables in \nclustering. Small weights reduce the effects of insignificant or noisy variables. As a \nfurther improvement on Huang’s procedure, Tsai and Chiu (2008) recently proposed a \nweight self-adjustment (FWSA) mechanism for K-means to simultaneously minimize \nthe separations within clusters and maximize the separations between clusters. They \ndiscuss the benefits of their method on basis of synthetic and experimental results. \nGnandesikan et al. (2007) recently proposed simple methods for weighting (and also \nfor scaling) of variables. \n \n (iv) Dimension Reduction/Feature Extraction: For reducing the dimensionality of the \ndataset, various methods can be applied which use (non-linear) transformations to \ndiscover useful and novel features/attributes from the original ones (cf. Jain et al. 
\n1999, 2000, Law and Jain, 2006, Camastra, 2003, Fodor, 2002). E.g. principal \ncomponent analysis (PCA) (Jolliffe, 2002) is a classical technique to reduce the \ndimensionality of the data set by transforming to a new set of variables which \nsummarizes the main features of the data set. Though primarily defined as a linear \nfeature extraction technique, suitable non-linear variants (kernel PCA) have been \ndeveloped in the last decades (see Schölkopf et al. 1999). PCA is often used as a \npreliminary step to clustering analysis in constraining attention to a few variables. But \nits use can be problematic as illustrated by Sneath, 1980, Chang, 1983. These \nreferences show that clusters embedded in a high-dimensional data-space will not \nautomatically be properly represented by a smaller number of orthogonal components \nin a lower dimensional subspace. Yeung and Russo, 2001 also demonstrate that \nclustering with the PC’s (Principal Components) instead of the original variables does \nnot necessarily improve cluster quality, since the first few PC’s (which contain most \nof the variation in the data) do not necessarily capture most of the cluster structure.  \n \n9 For downloading this software see http://www.bio.umontreal.ca/casgrain/en/labo/ovw.html \n \nIn addition to PCA, alternative techniques can be envisioned for the task of dimension \nreduction, like factor analysis, projection pursuit, independent component analysis, \nmulti-dimensional scaling (MDS; see footnote 10), Sammon’s projection (see footnote 11), IsoMap, Support Vector \nMachines, Self-Organizing Maps etc. (cf. De Backer et al. 1998, Jain et al. 2000, \nFodor, 2000, Tenenbaum et al. (2000)). However, the same caveats as mentioned \nbefore for PCA still apply. 
Moreover one should realize that feature extraction \n- unlike feature selection - typically results in transformed variables, consisting of \n(non)linear combinations of the original features, for which the original meaning has \nbeen lost. This can be an impediment in interpreting the results of the subsequent \nclustering in terms of the original variables. \nIn R the packages <<kernlab>> and <<MASS>> (see footnote 12) deal with several of these \ncomputational techniques. \n \n(v) Mapping data to a new space \nIn order to highlight specific dynamics in the data, techniques such as Fourier \ntransforms or wavelet transforms can be used to map the data into a new space, where \nfurther analysis can take place (cf. § 2.5.3. in Han and Kamber, 2006). The underlying \nrationale is that in the new space fewer dimensions are needed to characterize the \ndataset to a sufficient extent, thus achieving data reduction. \n \n10 MDS (multidimensional scaling) represents the similarity (or dissimilarity) among pairs of objects in \nterms of distances between points in a low-dimensional (Euclidean) space, and offers a graphical view \nof the dissimilarities of the objects in terms of these distances: the more dissimilar two objects are, the \nlarger the distance between these objects in Euclidean space should be (Borg and Groenen, 1997). \n11 Sammon’s nonlinear mapping is a projection method for analysing multivariate data. The method \nattempts to preserve the inherent structure of the data when the patterns are projected from a higher-\ndimensional space to a lower-dimensional space by maintaining the distances between patterns under \nprojection. Sammon’s mapping has been designed to project high-dimensional data onto one to three \ndimensions. See Lerner et al. (2000) for information on initialising Sammon’s mapping. \n \n12 Cf. 
http://cran.r-project.org/web/packages/kernlab and http://cran.r-project.org/web/packages/MASS/\n\n18 \n \n \n \n2.3.5 Data discretisation  \n \nBy dividing the range of continuous attributes into intervals one can reduce the \nnumber of values. Reduction of data can also be established by replacing low level \nconcepts by higher level concepts (e.g. replacing numeric values for the attribute ‘age’ \nby categories as young, middle-aged or senior). Techniques like binning, histogram \nanalysis, clustering analysis, entropy-based discretisation and segmentation by natural \npartitioning can be applied for this purpose (cf. § 2.6 in Han and Kamber, 2006) \n \n2.3.6 Cluster tendency  \n \nOne difficulty of cluster algorithms is that they will group the data into clusters even \nwhen there are none. Later we will discuss the possibilities of validating the results of \na clustering but here we present a number of ways by which the user can estimate a \npriori whether data contains structure.  \n \n \nFigure 2: Artificial data set (left), image-plot (R-function) of the distance matrix of this data \nset (centre), image-plot of the data set after applying VAT-algorithm (right). \n \nIn the VAT-algorithm Bezdek, Hathaway and Huband (2002) represent each pair of \nobjects by their distance. The emerging dissimilarity matrix is subsequently ordered \nand visualized by grey levels (0 if distance is zero and 1 for the maximum distance) \n(Figure 2, right). See also Bezdek, Hathaway and Huband (2007) where a technique is \npresented for the visual assessment of clustering tendency on basis of dissimilarity \nmatrices.  \nHu and Hathaway (2008) further developed this idea beyond the pure graphical \ninterpretation of the result. They implemented several tendency curves that average \nthe distances in the dissimilarity matrix. The peak-values in the tendency curves can \nthen be used as a signal for cluster structures and for automatic detection of the \nnumber of clusters.  
\n \nFigure 3: Artificial data set with uniformly distributed values (left) – h=0.5, Artificial raster \ndata set (centre) – h=0.1, data with three artificial normally distributed clusters (right) – h=1. \n \nAnother possibility to check whether there are clusters in the data or not is the \nHopkins-Index, which is described in Runkler (2000) or Jain and Dubes (1988). The \nlatter reference proposes to use hypothesis tests of randomness for getting insight into \nthe data structure. Also tests like quadrat analysis, inter-point distance and structural \ngraphs can be employed. \n \n2.3.7 Visualizing clusters  \n \nA number of graphical techniques for visualizing and identifying clusters in one or \ntwo dimensions can be employed, such as histograms, scatter plots and kernel density \nestimators. For multivariate data with more than two dimensions one can e.g. use \nscatterplot matrices, but these only project two-dimensional marginal views and do \nnot necessarily reflect the true nature of the structure in the p-dimensional dataspace. \nAn alternative approach is to project the multivariate data into one or two dimensions \nin a way that the structure is preserved in some sense as fully as possible. A common \nway (although not necessarily the most appropriate) is principal component analysis. \nOther methods like exploratory projection pursuit, multidimensional scaling, support \nvector machines are also potential candidates for visualization of clusters. See e.g. \nchapter 2 and section 8.6 in Everitt et al. 2001, and chapter 9 in Xu and Wunsch, 2009 \nfor more information. Also graphical techniques for exploring the structure in \nmultivariate datasets, like co-plots or trellis graphics (see e.g. chapter 2 in Everitt and \nDunn, 2001) can offer useful insights for cluster analysis. R offers various \npossibilities to generate such plots. In chapter 6 some of these will be discussed. 
\n \nSummary \n \nIn this chapter we extensively highlighted what issues and decisions are involved in \nselecting and pre-processing data of interest for the problem at hand. This not only \ninvolves the treatment of missing values and outliers, but also a judicious selection of \nvariables or features of interest (e.g. removing redundancies, avoiding overly strong \ndependencies) for the subsequent cluster analysis, as well as adequate data \ntransformations to bring the data values to a more even and comparable scale. \nPreliminary checks on whether the data indeed contain clusters, and whether some \ngroup structure is visible will also render important information for the next steps in \nthe actual clustering. Finally, since data-processing can influence the outcomes of the \nclustering, it will be important at the end to study the sensitivity of the identified \nclusters for feasible alternative choices in data selection and pre-treatment. \n \nWhich datasets are ‘clusterable’? \nAckerman and Ben-David (2009) theoretically assess several notions of clusterability \ndiscussed in literature and propose a new notion which captures the robustness of the \nresulting clustering partition to perturbations of the cluster centres. They discover that \nthe more clusterable a data set is, the easier it is (computationally) to find a close-to-\noptimal clustering of that data, even showing that near-optimal clustering can be \nefficiently computed for well clusterable data. In practice it is however usually a \ncomputer-intensive problem (NP-hard) to determine the clusterability of a given dataset.\n\n20 \n \n3 Selection of a distance measure in the data space \n \nA central issue in clustering objects is knowledge on how ‘close’ these objects are to \neach other, or how far away. This reflects itself in the choice of the distance measure \nor the (dis)similarity measure on the objects. \n \nIn case that the distances between the objects are ‘directly available’, as e.g. 
in \nsurveys where people are asked to judge the similarity or dissimilarity of a set of \nobjects, the starting point of the clustering is an n-by-n proximity matrix, which stores \nthe (dis)similarities between the pairs of objects (i.e. d(i,j) is the dissimilarity between \nobjects i and j, with i, j = 1,…, n).  \n \nIf distances are not directly available, information on the objects is typically available \non their features/attributes. The typical starting point for a cluster analysis is then a \ndata-matrix in the form of a table or n-by-p matrix that represents the n objects (rows) \nwith their associated p attributes (columns). In discussing how this data-matrix can be \ntransformed into a dissimilarity matrix, we assume that after the previous step \nhighlighted in section 2 (i.e. “data pre-treatment”) the data space is in a definite form, and \ndoes not need additional normalization or weighting. This means e.g. that the \napplication of weights to individual features to express differences in relevance has \nalready been established. Moreover it presupposes that care has been exerted not to \ninclude non-informative features in our data, since they can degrade the clustering by \ndisturbing or masking the useful information in the other features/variables. \n \n3.1 The binary data case \n \nIn case that all the attributes are binary (say 0 or 1, or no/yes), the similarity between \nobjects is typically expressed in terms of the counts of matches and mismatches \nobtained when the p features of two objects are compared.  
\n \n                    Object j \n  Outcome           1      0      Total \n  Object i    1     a      b      a+b \n              0     c      d      c+d \n  Total             a+c    b+d    p \n \nTable 1: Counts of binary outcomes for two objects \n \nA number of similarity measures have been proposed, and a more extensive list can be \nfound in Gower and Legendre (1986). \n \n        Measure                               Similarity-measure \n  S1    Matching coefficient                  S(i,j)=(a+d)/(a+b+c+d) \n  S2    Jaccard coefficient (Jaccard, 1908)   S(i,j)=a/(a+b+c) \n  S3    Rogers and Tanimoto (1960)            S(i,j)=(a+d)/[a+2(b+c)+d] \n  S4    Sokal and Sneath (1963)               S(i,j)=a/[a+2(b+c)] \n  S5    Gower and Legendre (1986)             S(i,j)=(a+d)/[a+.5*(b+c)+d] \n  S6    Gower and Legendre (1986)             S(i,j)=a/[a+.5*(b+c)] \n \nTable 2: Similarity measures for binary data, cf. table 3.3 in Everitt et al. (2001) \n \nNotice that some of these similarity measures do not count zero-zero matches (i.e. d). \nIn cases where both outcomes of binary variables are equally important (e.g. as in \ngender: male/female) it is logical to include zero-zero-matches when expressing the \nsimilarity between objects. However, in more asymmetric situations where the \npresence of a feature (e.g. an illness) is considered more important than the absence, it \nis advisable to exclude the zero-zero matches (i.e. the d) when assessing the similarity \nof objects, since these could dominate the similarity between objects, especially if \nthere are many attributes absent in both objects (i.e. d is large compared to a, b and c). \nWhen co-absences are considered informative, the simple matching coefficient S1 is \nusually employed, while Jaccard’s coefficient S2 is typically used if co-absences are \nnon-informative. S3 and S5 are examples of symmetric coefficients treating positive \nand negative matches in the same way, but assigning different weights to matches and \nnon-matches. 
Sokal and Sneath (1963) argue that there are no fixed rules regarding \nthe inclusion or exclusion of negative or positive matches, and that each dataset \nshould be considered on its merits. The choice of the specific similarity measure can \ninfluence the cluster analysis, since the use of different similarity coefficients can \nresult in widely different distance values, as is e.g. the case for S1 and S2. Gower and \nLegendre (1986) show that S2, S4 and S6 are monotonically related, as are S1, S3 and S5.  \n \n3.2 The categorical data case \n \nCategorical data where the attributes have more than two levels (e.g. eye colour) \ncould be dealt with similarly as binary data, when regarding each level of an attribute \nas a single binary variable. This is however not an attractive approach since many \n‘negative’ matches (i.e. d) will inevitably be involved. A far better approach is to \nassign a score s_ijk of one or zero to each attribute k, depending on whether or not the two \nobjects i and j are the same on that attribute. These scores are then averaged over all p \nattributes to give the required similarity coefficient as: \n \n$$ s_{ij} = \\frac{1}{p} \\sum_{k=1}^{p} s_{ijk} $$ \n \nNotice that this similarity coefficient is a generalisation of the matching coefficient S1 \nfor binary data. \n \n3.3 The continuous data case \n \nWhen all the attribute values are continuous, the proximity between objects is \nexpressed in terms of a distance-measure in the dataspace. 
Often Euclidean distance is \nused: \n \n$$ d(i,j) = \\sqrt{(x_{i1}-x_{j1})^2 + (x_{i2}-x_{j2})^2 + \\cdots + (x_{ip}-x_{jp})^2} $$ \n \nbut various other distance measures can be applied as well, such as the Manhattan or \ncity-block distance: \n \n$$ d(i,j) = |x_{i1}-x_{j1}| + |x_{i2}-x_{j2}| + \\cdots + |x_{ip}-x_{jp}| $$ \n \nor the general Minkowski distance (q ≥ 1): \n \n$$ d(i,j) = \\left( |x_{i1}-x_{j1}|^q + |x_{i2}-x_{j2}|^q + \\cdots + |x_{ip}-x_{jp}|^q \\right)^{1/q} $$ \n \nWe assume that missing values have been treated, e.g. by replacing them by the mean-\nvalue over the non-missing part, or by redefining the distance measure accordingly. \n \nAlso the correlation between the p-dimensional observations of the ith and jth objects \ncan be used to quantify dissimilarities between them, as in:  \n \n$$ d(i,j) = \\frac{1 - \\rho_{ij}}{2}, \\quad \\text{where} \\quad \\rho_{ij} = \\frac{\\sum_{k=1}^{p} (x_{ik}-m_i)(x_{jk}-m_j)}{\\sqrt{\\sum_{k=1}^{p} (x_{ik}-m_i)^2 \\, \\sum_{k=1}^{p} (x_{jk}-m_j)^2}} $$ \n \nwith m_i and m_j the averages over the p attribute-values of objects i and j, respectively. This measure is \nhowever considered contentious as a measure for dissimilarity since it does not \naccount for relative differences in size between observations (e.g. x1=(1,2,3) and \nx2=(3,6,9) have correlation 1, although x2 is three times x1). Moreover the averages are \ntaken over different attribute values, which is problematic if their scales are different. \nBut in situations where attributes have been measured on the same scale, and only the \nrelative profile matters (e.g. for classifying animals or plants absolute sizes of organisms or \nparts are often considered less important than their shapes), correlation measures can \nalso be used to express dissimilarities. Further information can be found in section 3.3 \nin Everitt et al. (2001), Gower and Legendre (1986) and Calliez and Kuntz (1996). 
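\n \nFor illustration, these distance measures can be computed in R with the standard functions dist() and cor(). The sketch below assumes a small made-up data matrix x (its first two rows are the vectors x1 and x2 of the example above) and is meant as an example only: \n \nx <- rbind(c(1, 2, 3), c(3, 6, 9), c(2, 1, 4))  # three objects (rows), p = 3 attributes (columns) \n \ndist(x, method = \"euclidean\")                  # Euclidean distance \ndist(x, method = \"manhattan\")                  # Manhattan (city-block) distance \ndist(x, method = \"minkowski\", p = 3)           # Minkowski distance with q = 3 (dist() calls this exponent p) \n \nrho   <- cor(t(x))                              # correlations between the rows (objects) of x \nd.cor <- as.dist((1 - rho) / 2)                 # correlation-based dissimilarity d(i,j) = (1 - rho_ij)/2 \n \nSince the first two rows are proportional, their correlation-based dissimilarity is zero, although their Euclidean distance is not. 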
\n \n3.4 The mixed data case \n \nWhen the attribute values are mixed, i.e. containing both continuous and categorical \ndata values, a similarity measure can be constructed by weighting and averaging the \nsimilarities for the separate attribute values, as proposed by Gower (1971): \n \n$$ s_{ij} = \\frac{\\sum_{k=1}^{p} w_{ijk} \\, s_{ijk}}{\\sum_{k=1}^{p} w_{ijk}} $$ \n \nwhere s_ijk is the similarity between the ith and the jth object as measured by the kth \nfeature, and w_ijk is typically one or zero depending on whether or not the comparison \nis considered valid. E.g. w_ijk can be set to zero if the outcome of the kth feature is \nmissing for either or both of the objects i and j, or if the kth feature is binary and it is \nthought appropriate to exclude negative matches. For binary variables and for \ncategorical variables with more than two categories the component similarities, s_ijk, \ntake value one when the two objects have the same value and zero otherwise. For \ncontinuous variables the similarity measure is defined as: \n \n$$ s_{ijk} = 1 - \\frac{|x_{ik} - x_{jk}|}{R_k} $$ \n \nwhere R_k is the range of observations for the kth attribute (i.e. the city-block distance \nis used after scaling the kth variable to unit range). \n \n3.5 The proximity between groups of objects \n \nThe proximity between the individual objects can be used as a basis to construct \nexpressions for the proximity between groups of objects. Various options exist for this: \ne.g. taking the smallest dissimilarity between any two objects, one from each group, \nleads to a nearest-neighbour distance and is also the basis for the hierarchical \nclustering technique applying ‘single linkage’. \nThe opposite is to define the inter-group distance as the largest distance between two \nobjects, one from each group; this renders the furthest-neighbour distance, which is the \nbasis for the ‘complete linkage’ hierarchical clustering technique. 
An in-between \napproach is taking the average dissimilarity, which leads to a form of ‘group average’ \nclustering when applied to hierarchical clustering methods. Cf. Everitt et al. 2001, \nsection 3.5, where also alternative ways to express inter-group distances are proposed \nwhich are based on group summaries for continuous as well as for categorical data. \n \nSummary \n \nIn order to express the similarity or dissimilarity between data points a suitable \ndistance measure (metric) should be chosen. It forms the basis for performing the \nclustering to identify groups which are tightly knit, but distinct (preferably) from each \nother. Often Euclidean distance is used as a metric, but various other distance \nmeasures can be envisioned as well. \n \n4 Selection of clustering method \n  \nThe extensive (and ever-growing) literature on clustering illustrates that there is no \nsuch thing as an optimal clustering method, an observation which is further \nunderpinned by theoretical insights from Kleinberg (2002); see also Zadeh and Ben-\nDavid (2009). From the multitude of methods we will consider a number of classes of \nmethods, giving most attention to traditional methods based on performing the \nclustering hierarchically and methods that constructively partition the dataset into a \nnumber of clusters (section 4.1 and 4.2), while describing the other methods only \nbriefly (section 4.3-4.6). We will finish this chapter with a brief discussion on which \nmethod to choose (section 4.7). \n \n4.1 Hierarchical methods \n \nA hierarchical clustering method groups data objects into a tree of clusters. It does so \nin an iterative way by constructing clusters by joining (agglomerative) or dividing \n(divisive) the clusters obtained in a previous iteration. 
Agglomerative methods start \nthis iterative process from the initial situation where each data point is considered as a \nseparate cluster, and form the hierarchical composition in a bottom up fashion by \nmerging the clusters. Divisive methods start with the mega-cluster consisting of all data \npoints, and work in a top-down fashion by splitting the clusters subsequently. \nMerging or splitting is done on basis of the mutual distances between the clusters. A \nnumber of linkage-rules can be applied to express the distance between clusters. For \nexample the “single”-rule (‘single-linkage’) always takes the smallest of all possible \ndistances between the data points within two different clusters; the “complete”-rule \n(‘complete-linkage’) chooses the largest of all distances, while the “average”-rule is \nbased on the average distance (‘average-linkage’). A popular linkage-rule is \nWard’s method, which at each step merges the two clusters whose merger produces the \nsmallest increase in within-cluster variance. All the information on the process of merging can be represented in a tree \n(dendrogram) which can be cut at a selected point (number of clusters), revealing a \nsuitable cluster structure for the data. A more formal method for determining the \nnumber of clusters, based on detecting the ‘knee’ in an associated clustering \nevaluation graph, is proposed in Salvador and Chan (2004) and favourably compared \nwith two alternative methods.  \n \nHierarchical clustering methods have a large computational complexity (O(n^2)), \nwhere n is the number of data points or objects, which usually constrains their application \nto datasets of small and medium size. In building the dendrogram, non-uniqueness \nand inversions can occur due to ties in data and due to the order of the dataset, cf. \nMorgan and Ray (1995), MacCuish et al. (2001) and Spaans and Heiser (2005).  \n \nThe linkage-rule in hierarchical clustering can be tuned to the data, and thus also \nnon-spherical clusters can be identified. 
One should however be aware that applying \nhierarchical clustering can lead to very different results on the same dataset, \ndepending on the linkage rule used: the single linkage strategy tends to produce \nunbalanced and elongated clusters, especially in large data sets, since separated \nclusters with ‘noise’ points between them tend to be joined together (‘chaining’); \ncomplete linkage leads to compact clusters with equal diameters; average linkage \ntends to join clusters with small variances and is intermediate between single and \ncomplete linkage; Ward’s method assumes that the objects can be represented in \nEuclidean space, tends to find spherical clusters of similar size, and is sensitive to \noutliers. See e.g. table 4.1 in Everitt et al. (2001) and Kaufman and Rousseeuw (1990) for \nmore information on the effects of linkage rules. \n \nIn their pure form hierarchical methods suffer from the fact that it is not possible to \nadjust a merge or a split decision which was taken in a previous iteration. This rigidity \nis useful since it restricts computational costs by avoiding a combinatorial number of \ndifferent choices, but it may lead to low-quality clusters if the merge or split decisions \nturn out to be poorly chosen. To improve on this one can try to integrate hierarchical \nclustering with other clustering techniques, leading to multi-phase clustering. Three \nsuch methods are discussed in more detail in Han and Kamber (2006). The first, \ncalled BIRCH, applies tree structures to partition the objects into ‘microclusters’ and \nthen performs ‘macroclustering’ on them using another clustering method such as \niterative relocation. The second method, called ROCK, merges clusters based on their \ninterconnectedness, and is a hierarchical clustering algorithm for categorical data. The \nthird method, called Chameleon, explores dynamic modelling in hierarchical \nclustering. 
\n \nIn R hierarchical clustering can be invoked by the general function hclust(); various \nmore specific hierarchical clustering techniques have also been implemented, e.g. the \nmethods proposed in Kaufman and Rousseeuw (1990) (see the R-package \n<<cluster>>): \n• diana() (DIANA) for divisive clustering. \n• mona() (MONA) for clustering binary data, using the monothetic divisive algorithm. \n• agnes() (AGNES) for agglomerative clustering, providing six methods for the \nagglomeration process. \n \nOther R-packages with hierarchical clustering methods are <<ctc>> (function xcluster()) and \n<<amap>> (functions hcluster() and hclusterpar()). \n \n \nFigure 4: Example of hierarchical clustering: clusters are consecutively merged with the \nnearest clusters. The length of the vertical dendrogram lines reflects the nearness. \n \n4.2 Partitioning methods \n \nPartitioning algorithms divide a data set into a number of clusters, typically by \niteratively minimizing some criterion expressing the distances between the data points \nand prototypical elements of a cluster (e.g. cluster centroids). \nUsually the square error criterion is used, defined as \n$$E = \\sum_{i=1}^{k} \\sum_{x \\in C_i} \\|x - m_i\\|^2$$ \nwhere E is the sum of the square error for all objects in the data set; x is the point in \nthe space representing a given object, and m_i is the mean of cluster C_i. I.e. for each \nobject in each cluster the distance from the object to its cluster centre is squared and \nthe squared distances are summed. This criterion tries to make the resulting k clusters as \ncompact and as separate as possible. The number of clusters k is usually \npredetermined, but it can also be part of a search procedure using an explicit error \nfunction. \nWhen using the popular k-means partitioning algorithm one starts with k initial cluster \ncentroids. The data points are then assigned to the nearest centroid. 
Subsequently the \nnew centre of each cluster is determined as the average of all points assigned to it, \nand again all points are re-assigned to their nearest centroid. This procedure is \nrepeated until convergence is reached (e.g. points no longer change cluster), see \nFigure 5. \n \n \nFigure 5: Example of the iterative cluster partitioning by k-means. Starting with an initial \nguess of the centroids (a), the data points are grouped to the nearest \ncentroids (b), and the new centroids are determined as the centres of these groups. In the \nnext step (c) the points are regrouped to the nearest (new) centroid. This process is \nrepeated until the groups don’t change anymore.  \n \nk-means has a computational complexity of order O(kn) per iteration, where n is the number of \ndata points, and is therefore also suitable for large datasets (n large). Its outcomes are \nsensitive to the initialization of the iterative search process and an appropriate \ninitialization is therefore of concern. E.g. Milligan (1980) proposes an initialisation on \nthe basis of hierarchical clustering with Ward’s method on a small random subset of the \nlarge dataset; Arthur and Vassilvitskii (2001) recently proposed a smart seeding \ntechnique for initializing k-means. See Steinley & Brusco (2007) on various strategies \nfor initializing k-means. \n \nAnother shortcoming of k-means is that it does not perform well for non-spherical and \npoorly separated clusters, or for clusters of very different sizes. Moreover it is \nsensitive to noise and outlier data points since a small number of such data can \ndrastically influence the mean value/centre points.  \n \nThere are quite a few variants of the k-means method (see e.g. Steinley (2006)), which \nhave been developed to improve on its weak points. E.g. 
when clustering categorical \ndata, the means of the clusters are not suitable representatives, and k-means has been \nreplaced by the k-modes method (Chaturvedi, Green and Carroll (2001)), which uses \nnew dissimilarity measures to deal with categorical objects and a frequency-based \nmethod to update the modes of clusters. For data with mixed numeric and categorical \nvalues k-means and k-modes can be integrated. \n \nTo deal with the sensitivity to outliers Kaufman and Rousseeuw (1990) proposed \nk-medoids clustering by the PAM approach (Partitioning Around Medoids; see the \nfunction pam() in the R-package <<cluster>>). The main difference from k-means is \nthe choice of representative objects (medoids) as cluster centres instead of the arithmetic mean. \nAs above, after choosing k representative medoids the objects of the \ndata set are assigned to the nearest representative medoid. The partitioning \nis performed by minimizing the sum of the dissimilarities between each object \nand its corresponding representative point, i.e. using the absolute error criterion, which \nis less sensitive to outliers: \n$$E = \\sum_{i=1}^{k} \\sum_{x \\in C_i} \\|x - o_i\\|$$ \nwhere o_i is the representative medoid, being the most centrally located object of \ncluster C_i. Iteratively, the set of representative medoids is recalculated, \nfollowed by a new assignment of the objects, and so on. A nice feature in connection \nwith PAM is the silhouette plot (in R: silhouette() or by plotting the PAM result). \nThis plot illustrates how well an object lies within its cluster or merely at the edge of \nthe cluster (Rousseeuw (1987)).  \n \nThe computational complexity of PAM is of the order O(k(n-k)^2) per iteration, which makes \ncomputation very costly for large values of n and k. For these situations Kaufman and \nRousseeuw constructed a method called CLARA (Clustering LARge Applications). 
In \nthe first step a small portion of the dataset is chosen as a representative of the \ncomplete dataset. Using PAM on this small sample, medoids are determined, which \nare subsequently used to assign each object of the complete dataset to a specific \ncluster or medoid. CLARA draws multiple small samples from the complete dataset, \napplies PAM to each sample and returns its best clustering as the output. The \ncomputational complexity is of the order O(ks^2+k(n-k)), where s is the size of the \nsubsample. The effectiveness of CLARA depends on the sample size and, if \nthe best medoids of the selected subsamples do not include the best overall \nmedoids, CLARA will never find the best clustering. The quality and scalability of \nCLARA can be enhanced by allowing for an extra randomization in the iterative search \nfor new medoids, leading to the k-medoids algorithm CLARANS (Clustering \nLarge Applications based upon RANdomized Search) proposed by Ng and Han \n(1994), and improved by Ester, Kriegel and Xu (1995). CLARANS also enables the \ndetection of outliers and has a computational complexity of about O(n^2). Its clustering \nquality depends on the sampling method used. See also section 7.4.2 in Han and \nKamber (2006).  \n \nAnother way to generalize k-means is to explicitly consider other clustering criteria \nfor an optimal partitioning of the clusters. Chapter 5 of Everitt et al. (2001) presents some \nalternatives to minimizing the total within-cluster sum of squares \n(i.e. trace W) which underlies k-means; these alternatives are less sensitive to scale changes \nin the observed data and can also handle clusters of different (non-spherical) shapes \nand sizes. \n \nk-means can also be generalized by considering it as a special case of model-based \nclustering, which applies a mixture of normal distributions to describe the underlying \nprobability density of the dataset (see section 4.5). 
\n \nOther extensions of k-means, such as X-means (Pelleg and Moore, 1999; \nIshioka, 2005), G-means and PG-means (Hamerly and Elkan, 2003; Feng and \nHamerly, 2006) and PW-K-means (Tseng, 2007), focus especially on the automatic \nestimation of the number of clusters; the X-means variant uses the Bayesian \ninformation criterion to choose the model dimension (the number of clusters). See also Tseng (2007), who \nproposes the use of penalty terms and weighting (PW-K-means) to extend k-means \nfor clustering with scattered objects and prior information. See Bies et al. (2009) for a \nrecent comparison study of X-means, G-means and some other methods for \nestimating the number of clusters. \n \n \nTo identify non-convex clusters, extensions such as kernel k-means and spectral clustering \nhave been put forward, which enable identifying clusters that are not linearly \nseparable in input space (see e.g. Schölkopf et al., 1999; Girolami, 2002; Camastra and \nVerri, 2005; Filippone et al., 2007; Chang et al., 2008). See also section 4.6.  \n \nFinally, the sensitivity to initial conditions in k-means is a well-known problem for \nwhich many initialization strategies have been proposed (see e.g. Arthur and \nVassilvitskii, 2001; Steinley and Brusco, 2007). Barbakh and Fyfe (2008) propose a \nnew family of algorithms to solve the problem of sensitivity to initial conditions in \nk-means, by applying alternative performance functions which incorporate global \ninformation.  \n \n \n4.3 Density-based methods \n \nDensity-based clustering methods have been developed to discover clusters with \narbitrary shape. These methods typically regard clusters as dense regions of \nobjects/points in the data space that are separated by regions of low density \n(representing noise). DBSCAN grows clusters according to a density-based \nconnectivity analysis. OPTICS is an extension of DBSCAN, producing a cluster \nordering obtained from a wide range of parameter settings. 
DENCLUE clusters \nobjects based on a set of density distribution functions. It has a solid mathematical \nfoundation, allowing a compact mathematical description of arbitrarily shaped clusters \nin high-dimensional datasets. It generalizes various clustering methods, including \npartitioning and hierarchical methods, and achieves computational efficiency \nby using a tree-based access structure. However the method requires \ncareful selection of the density parameters and noise threshold, which may significantly \ninfluence the quality of the clustering results. For a concise description of these \nmethods we refer to Han and Kamber (2006). See Tan et al. (2010) for a recent proposal \nfor improvements of density-based clustering algorithms. \n \n4.4 Grid-based methods \n \nThis approach uses a multi-resolution grid data structure. For this purpose it quantizes \nthe data space into a finite number of cells, forming the grid structure. \nThe main advantage of the approach is its fast processing time, which depends only \non the number of cells in each dimension of the quantized space, and not on the \nnumber of data objects. STING, WaveCluster and CLIQUE are \nexamples of this approach and are described in sections 7.7 and 7.9 of Han and \nKamber (2006). \n \n4.5 Model-based clustering methods \n \nModel-based clustering methods hypothesize a model for each of the clusters and find \nthe best fit of the data to the given model. The clusters are determined by constructing \na density function reflecting the spatial distribution of the data points. Often the \nnumber of clusters can also be determined automatically on the basis of statistical criteria, \ntaking account of noise and outlier effects (see the textbox below).  \nIn fact the k-means method can be viewed as a special case of model-based clustering \nfor a Gaussian mixture model with equal mixture weights and equal isotropic \nvariances (see Celeux and Govaert, 1992). 
As noted before, this directly offers a \nfruitful avenue for generalizing k-means and finding more suitable forms of \nclustering for non-spherical clusters and large datasets. Celeux and Govaert (1995) \npropose a generalization of k-means which enables the clustering of non-spherical \nmodels (Biernacki et al., 2006). The MIXMOD software that they developed to \nanalyse multivariate datasets as mixtures of Gaussian populations, for clustering and \nclassification purposes, can be downloaded from \nhttp://www-math.univ-fcomte.fr/mixmod/index.php. Another popular package is the EMMIX software \nwhich was developed by McLachlan et al. (2000). Related is also the R-package \n<<mclust>> (McLachlan, Fraley and Raftery (2002), Fraley and \nRaftery (2007)). See also Samé et al. (2007) and Maugis et al. (2009), which discuss the \napplication in variable selection; see also Li (2005), Yeung (2001). \nEstablishing such a probabilistic framework for clustering also suggests the use of \nseveral information criteria to automatically determine the number of clusters, like \nAkaike’s first information criterion, Schwarz’s Bayesian information criterion, and the \nintegrated classification likelihood (see textbox below). See also Fraley and Raftery \n(1998) and Tibshirani et al. 
(2001) paper on the use of the gap statistic for estimating \nthe number of clusters (the R-package <<clusterSim>> provides functionality to \ncalculate this statistic). \n \nInformation criteria for k-means \n \nTo view k-means in a statistical context it is assumed that the underlying density for the points in the \ndata space can be expressed as a mixture of K equally weighted Gaussian distributions having means μ_k \nand common variance σ^2: \n$$P(x_j \\mid M, \\sigma^2) = \\sum_{k=1}^{K} \\frac{1}{K} \\frac{1}{\\sqrt{2\\pi\\sigma^2}} \\exp\\Big(-\\frac{1}{2\\sigma^2} \\|x_j - \\mu_k\\|^2\\Big)$$ \nHere the μ_k are the centres of the resulting clusters k=1, …, K, while σ^2 is \nthe within-cluster variance, \n$$\\sigma^2 = \\frac{1}{N} \\sum_{k=1}^{K} \\sum_{j \\in C_k} \\|x_j - \\mu_k\\|^2$$ \nwhere N is the number of data points. Under the assumption of independence, the associated likelihood of the complete dataset D={x_j} is \n$$P(D \\mid M, \\sigma^2) = \\prod_{j} P(x_j \\mid M, \\sigma^2)$$ \nBy assigning each data point x_j to the mixture component k_j having the highest probability, the \nclassification likelihood of the data point x_j is equal to: \n$$P_c(x_j \\mid M, \\sigma^2) = \\frac{1}{\\sqrt{2\\pi\\sigma^2}} \\exp\\Big(-\\frac{1}{2\\sigma^2} \\|x_j - \\mu_{k_j}\\|^2\\Big)$$ \nK-means can be viewed as an attempt to maximize the joint classification log-likelihood of \nthe data (equivalently, to minimize its negative): \n$$\\ln P_c(D \\mid M, \\sigma^2) = \\ln \\prod_{j} P_c(x_j \\mid M, \\sigma^2) = \\sum_{j} \\ln P_c(x_j \\mid M, \\sigma^2) = -\\sum_{j} \\Big[ \\frac{1}{2} \\ln(2\\pi\\sigma^2) + \\frac{\\|x_j - \\mu_{k_j}\\|^2}{2\\sigma^2} \\Big]$$ \nIn the light of this interpretation a number of information criteria can be proposed to estimate the \noptimal number of clusters (see the appendix in Goutte et al. 
2001): \n− Akaike’s first information criterion: \n$$AIC = 2 \\cdot \\ln P_c(D \\mid M, \\sigma^2) - 2 \\cdot (K \\cdot p + 1)$$ \nwhere (Kp+1) is the number of free parameters in the underlying mixture model with K components \n(i.e. the p components of each of the K means μ_k, plus the common variance σ^2). \n− Schwarz’s Bayesian information criterion: \n$$BIC = 2 \\cdot \\ln P_c(D \\mid M, \\sigma^2) - (K \\cdot p + 1) \\cdot \\ln(N)$$ \n− The integrated completed likelihood (Goutte et al., 2001): \n[ICL formula not recoverable from the extracted text; see the appendix of Goutte et al. (2001)] \nwith p being the number of attributes and N the number of data points, where N = Σ_k N_k with N_k being \nthe number of data points in cluster C_k. The number of clusters K_opt rendering the highest value of \nthe information criterion is chosen in the end as the number of clusters K. \nThe AIC is known to overestimate the number of clusters, especially if the clusters are non-spherical, \nwhile the BIC is known to asymptotically estimate the ‘true’ model structure if the \nunderlying Gaussian mixture model is an adequate model. The ICL takes into account that the \nunderlying mixture model might not be an adequate model for classifying the data points accordingly. \nSee Goutte et al. (2001) for further details and references. \n \n \nFor a good recent overview paper on finite mixture models and model-based \nclustering methods see Melnykov and Maitra (2010). We note that other approaches \ncan also be listed in the category of model-based approaches, like COBWEB, which is \na conceptual learning algorithm taking concepts as a model for clusters and \nperforming an associated probabilistic analysis. 
SOM (or self-organizing feature map; \nsee next section) is a neural-network-based algorithm that maps high-dimensional \ndata onto a 2-D or 3-D feature map, which provides a useful data visualization and can \nsubsequently be used as a basis for clustering.  \n \n4.6 Clustering methods: Miscellanea  \n \nBelow we briefly discuss various alternative methods which have been developed for \nspecific application situations. \n \nSOM \nThe self-organizing map (SOM) due to Kohonen (1982) is a well-known neural \nnetwork method for unsupervised learning and can thus be suitably applied for cluster \nanalysis. The network classifies the data points according to internally generated \nallocation rules, which it learns from the data. SOM’s goal is to represent all points in \nthe original (often high-dimensional) data space by points in a low-dimensional one \n(usually 2-D or 3-D), such that the topology (distance and proximity relations) is \npreserved as much as possible. The method is particularly useful when a nonlinear \nmapping is inherent in the data, and it is an appropriate tool for clustering and \nvisualising high-dimensional data spaces. \nSee Murtagh and Hernandez-Pajares (1995), Flexer (2000), Vesanto (1999), Vesanto \nand Alhoniemi (2000) and Bacao et al. (2005) for further information. Waller et al. \n(1998) compared SOM with two partitioning and three hierarchical methods on more \nthan 2500 datasets and showed that SOM performed similarly to or better than \nthe other methods. Moya-Anegón et al. (2005) compared SOM to multidimensional \nscaling (MDS) and Ward’s method for analysing co-citations in the context of \nscientometrics and illustrated the complementarity of the various methods. See also \nYiang and Kumar (2005) for further results on the comparison of SOM with k-means. \n \nFuzzy clustering \nAll the methods described so far have in common that an object is always fully \nassigned to one and only one cluster. 
In so-called fuzzy clustering the \nobjects/points have a degree of belonging (‘membership’, expressed as a value \nbetween 0 and 1) to the various clusters. Points on the edge of a cluster may thus be in \nthe cluster to a lesser degree than points in the centre of a cluster. For each point x we \nhave a coefficient u_k(x) giving the degree to which it belongs to the k-th cluster. Typically \nthese coefficients are normalized such that they sum to 1 for each x. k-means can \nnow be generalized into ‘fuzzy c-means’, where the centroid of a cluster is a kind of \n‘mean’ of all points, weighted by their degree of belonging to the specific cluster: \n$$center_k = \\frac{\\sum_{x} u_k(x)^{ν} \\, x}{\\sum_{x} u_k(x)^{ν}}$$ \nwith ν ≥ 1 being a coefficient called the fuzzifier. Typically ν is taken as 2. \nSee Hathaway and Bezdek (1988) for further details. See also Kaufman and \nRousseeuw (1990) with their fuzzy cluster analysis program FANNY, which is \navailable as the function fanny() in the R-package <<cluster>> (footnote 13). Mingoti and Lima \n(2005) present a comparative study between SOM, fuzzy c-means, k-means and \ntraditional hierarchical clustering, showing that especially fuzzy c-means performs very \nwell and renders robust results in the presence of outliers and \noverlapping clusters. \n \nClustering high-dimensional data \nThe curse of dimensionality plagues clustering in applications where objects \nwith a large number of features or dimensions have to be classified (e.g. text \ndocuments containing thousands of keywords as features; DNA microarray data \nproviding information on the expression levels of thousands of genes under hundreds \nof conditions). 
Many dimensions may not be relevant; moreover the data become \nincreasingly sparse when the number of dimensions increases, causing the distance \nmeasure between pairs of points to become meaningless, while the average density of \npoints in the data space is likely to be low. This requires specific clustering \nmethodologies for high-dimensional data. CLIQUE and PROCLUS are two influential \nsubspace clustering methods, searching for clusters in subspaces or subsets of \ndimensions, rather than in the entire data space. Another methodology, so-called \nfrequent pattern-based clustering, extracts patterns to group objects into meaningful \nclusters. An example of this is pCluster. See section 7.9 of Han and Kamber (2006), \nand chapter 8 in Xu and Wunsch (2009). \n \nConstraint-based clustering \nMost clustering approaches discussed so far are implemented in an automatic, \nalgorithmic fashion, with little user guidance or interaction involved. However in \nsituations where there are clear application requirements (e.g. preferences and \nconstraints), one ideally wants to use these requirements to guide the search for \nclusters.  \nThis can include e.g. information on the expected number of clusters, the minimal or \nmaximal cluster size, weights for different objects, and other desirable characteristics \nof the resulting clusters. For clustering tasks in high-dimensional spaces, user input on \nimportant dimensions or desired results can provide crucial hints or meaningful \nconstraints for effective clustering. Some examples of how constraints and semi-\nsupervised clustering tasks can be established are presented in section 7.10 of Han \nand Kamber (2006). \n \nMulti-objective clustering \nWhen clustering a dataset having different properties or when analyzing it from \nvarious user perspectives, the reliance on one sole clustering criterion is often not \nappropriate. 
In these cases it is of more interest to consider various clustering criteria \nsimultaneously, although they can be partially complementary and even conflicting to \na certain extent. The framework of multi-objective clustering allows this perspective, \nby framing clustering as a multi-objective optimization problem, see e.g. Handl and \nKnowles (2006a). They propose MOCK (Multi Objective Clustering with automatic \nK-determination) as a multi-objective extension of k-means, which uses an \nevolutionary search algorithm to obtain a set of trade-off solutions between the \nvarious (often conflicting) goals as a good approximation of the Pareto front. These \nsolutions correspond to different compromises between the considered objectives, and \nprovide a range of alternative hypotheses to the researcher. Moreover they may lead \nto additional insight into the properties of the data, and thus increase confidence in the \nresults obtained. The algorithm is shown to give robust performance for data with \ndifferent properties and to outperform traditional single-objective methods. Moreover it \nallows for automatic determination of the number of clusters. The runtime of the method \nis however high, and for data where clustering criteria are more specifically known, \nspecialized methods will generally be more efficient. In Handl and Knowles (2007) \nand Handl, Kell and Knowles (2007) alternative applications of multi-objective \noptimization are presented in the context of semi-supervised learning and feature \nselection. \n \n13 See http://cran.r-project.org/web/packages/cluster/ \n \nMining sequential data (data streams, time-series)  \nSequential data consist of a sequence of sets of objects with possibly variable length \nand other changing characteristics like dynamic behaviour and time constraints. 
\nRecognizing patterns or groups in these dynamic datasets requires specific \napproaches, which we will not discuss. We refer to chapter 8 of Han and Kamber \n(2006) and chapter 7 in Xu and Wunsch (2009) for more information on these topics. \n \nSpatial clustering \nWhen spatial dimensions are involved in the data, e.g. for objects having a location or \nhaving features which differ as a function of location, it can be beneficial to \nexplicitly account for spatial structure when looking for clusters in the data. Methods \nfor exploratory spatial data analysis can serve as a means to identify groups in the data. \nE.g. methods for identifying (local) spatial associations and correlations from the field \nof spatial statistics and GIS (see e.g. Jacquez, 2008), like Moran’s I or Geary’s c (cf. \nBao and Henry, 1996), Anselin’s LISA (Local Indicators of Spatial Association, cf. \nAnselin, 1995, 2005), or Getis and Ord’s statistics (Getis and Ord, 1996, Ord and \nGetis, 2001, Aldstadt and Getis, 2006) for identifying statistically significant hot spots \ncan be a good basis for these analyses, leading to the identification of characteristic \nspatial patterns (see e.g. Premo, 2004, Nelson and Boots, 2008). For software see the \nR-package <<spdep>> (footnote 14), which supports part of these analyses. See also the \ninformation page on spatial statistical software in R (footnote 15) for further \nsoftware, e.g. the packages <<DCluster>> and <<clustTool>> (footnote 16).  \n \n \nDiscovering clusters in networks \nThe analysis of networks and their structure and behaviour is presently an important \ntopic in studying complex systems in nature and society (e.g. Palla et al. 2005). 
\nEspecially the property of ‘community structure’, in which network nodes are \njoined together in tightly knit groups, between which there are only loose connections, \nis an important research topic, as exemplified by Girvan and Newman (2002), \nNewman (2003, 2004), Newman and Leicht (2007), Mishra et al. (2007), Handcock \net al. (2009, 2007). See also the R-package <<latentnet>> (footnote 17) which has been \ndeveloped for the analysis reported in the latter reference.  \nRemark: According to Newman (2003) network clustering is not to be confused with \ndata clustering, which detects groupings of data points in high-dimensional data \nspaces. The two problems have common features, and algorithms for the one can be \nadapted for the other, and vice versa, but, on balance, one typically finds that this \ntransposition of algorithms between fields works less well than the algorithms which have \nbeen developed directly for the problem at hand. \n \n14 http://cran.r-project.org/web/packages/spdep/index.html \n15 http://www.spatialanalysisonline.com/output/html/R-Projectspatialstatisticssoftwarepackages.html \n16 http://cran.r-project.org/web/packages/DCluster/index.html and http://cran.r-project.org/web/packages/clustTool/index.html \n \n \nBootstrapping cluster analysis  \nBy experimentally replicating the cluster analysis, using e.g. random \nrestarts/initializations or random noise simulations, one can get clues about the \nstability (robustness) of the clustering results. Kerr and Churchill (2001) elaborate on \nthis technique in an ANOVA setting, allowing for a distinction between systematic \nsources of variation and noise. They illustrate the bootstrapping technique with a \npublicly available data set and draw conclusions about the reliability of clustering \nresults in light of variation in the data; implications of replication and good design in \nmicroarray experiments are discussed. 
See also the R-package <<maanova>> (footnote 18), \nwhich builds consensus groups (for k-means methods) or consensus trees (for \nhierarchical methods) on the basis of bootstrapping. \n \nRandom Forest clustering \n‘Random Forests’ (RF) is a popular ‘ensemble-based learning’ technique, based on \nconstructing many classification trees from bootstrap samples of the data, and \nsubsequently generating a classification on the basis of the thus generated ‘forest’ of \ntrees. The procedure provides a classification with an associated estimate of the error \nrate, and moreover generates a measure of the importance of the involved (predictor) \nvariables, as well as a measure of the internal structure of the data (e.g. the proximity \nof different data points to each other). The RF technique is user-friendly and performs \nvery well compared to many other classifiers, including discriminant analysis, support \nvector machines and neural networks, and is robust against overfitting (Breiman, \n2001).  \n \nThough initially meant for supervised learning tasks like classification and \nregression, it can also be applied to unsupervised learning, like clustering. To this \nend one invokes a ‘trick’, calling the original data “class 1”, and constructing a \nsynthetic dataset, “class 2”. The synthetic dataset “class 2” can be constructed in two \nways: (1) the “class 2” data are sampled from the product of the marginal distributions \nof the variables (by independent bootstrap of each variable separately); \n(2) the “class 2” data are sampled uniformly from the hypercube containing the data \n(by sampling uniformly within the range of each variable).  \n \nSubsequently one tries to classify the combined data with the RF procedure. The idea \nis that real data points that are similar to each other will often end up in the same \nterminal node of a tree, as measured by the proximity matrix returned by the RF \ntechnique. 
This proximity matrix can thus be taken as a similarity measure, and \nclustering or multi-dimensional scaling on the basis of this similarity can be used to \ndivide the original data points into groups for visual exploration. See \nLiaw and Wiener (2002) for a worked example of how to perform such an analysis with the \n<<randomForest>> package in R (footnote 19). \n \n17 http://www.stat.washington.edu/raftery/latentnet.html \n18 http://cran.r-project.org/web/packages/maanova/index.html \n \nKernel-Based Clustering, Support Vector Clustering and Spectral clustering \nAll these approaches make it possible to identify non-spherical clusters, which is typically not \nprovided for by direct k-means oriented methods. The kernel-based method \napproaches the problem by non-linearly transforming the data into a high-dimensional \n‘feature space’. In this space it is more likely that a linear separation of the \nclusters/patterns can be obtained, applying e.g. an SVM (Support Vector Machine), which constructs an \noptimal hyperplane on the basis of a small number of support points (the “support \nvectors”). The difficulty of the curse of dimensionality in the mapping to a high-\ndimensional ‘feature space’ can be overcome by the ‘kernel trick’, i.e. applying an \ninner-product kernel which avoids the time-consuming process of explicitly \nmapping the data points non-linearly to the transformed space. Commonly used kernels include \npolynomial kernels, Gaussian radial basis function kernels and sigmoid kernels (cf. \nMuller et al. 2001). Different kernel functions usually lead to different non-linear \nseparating hyper-surfaces (and thus clusters) in the original data space. The selection \nof an appropriate kernel is still an open problem and is currently done \nempirically. In the above way kernel versions of classical clustering algorithms can be \nconstructed. See e.g. 
papers on kernel k-means and support vector clustering (Ben-Hur et al. (2001), Moguerza, Muñoz, Martin-Merino (2002) and Winters-Hilt and \nMerat (2007)).  \n \nSpectral clustering is based on regarding the data as a graph with a set of vertices and \nedges (with corresponding weights). The clustering is configured as a graph cut \nproblem where an appropriate objective function has to be optimized. The problem is \nsolved by an eigenvector algorithm involving the matrix of weights, which performs \nthe spectral decomposition. It results in an optimal sub-graph partitioning (see e.g. Shi \nand Malik, 2000, Ng et al. 2002, von Luxburg, 2008). Dhillon et al. (2004) and Filippone \net al. (2007) show that spectral clustering and kernel-based clustering are in fact \nclosely linked; see also Kulis et al. (2009a). \nTo enable analysis of large datasets, for which a full spectral decomposition is \ncomputationally prohibitive, Fowlkes et al. (2004) propose the use of the Nyström \nmethod for solving eigenfunction problems; see also Drineas and Mahoney (2005) for \nmore information on the use of this approximation in kernel-based learning. Recently \nBelabbas and Wolfe (2009a) provided two methods, one based on sampling and \nsorting, to enable the use of spectral models for very large datasets. \n \nR-software for performing spectral clustering is available in the R-package \n<<kernlab>>20. The high computational costs of the above methods (polynomial, \nof order O(n^3)) can be prohibitive, but recently proposals for alternative faster variants \nhave been put forward, see e.g. Yan et al. 2009, Kulis et al. 2009b, Belabbas and \nWolfe (2009a, 2009b). \n \nBi-clustering  \nBi-clustering (co-clustering or two-mode clustering) is a clustering method which \nattempts to simultaneously cluster both the samples and the features (i.e. 
rows and columns of the data-matrix), with the goal of finding “bi-clusters”, subsets of features \nthat seem to be closely related for a given subset of samples. It is for example used in \ngene expression analysis by clustering microarray-data (see e.g. Cheng and Church, \n2000, Madeira and Oliveira, 2004, and Tanay et al., 2002). The field shows a rapid \nexpansion of approaches and software tools, compare e.g. Wu and Kasif (2005), Kerr \net al. (2007, 2008), Li et al. (2009). See also the <<BicARE>> R-package21 for \nBiclustering Analysis and Results Exploration in the BioConductor-suite. \n \n19 http://cran.r-project.org/web/packages/randomForest/ \n20 http://cran.r-project.org/web/packages/kernlab/ \n \nConsensus clustering \nConsensus clustering, also called ‘ensemble clustering’ or ‘clustering aggregation’, \ninvolves reconciling diverse clusterings performed on the same dataset. The \nvarious clusterings come e.g. from different sources (e.g. using different clustering \nalgorithms; different selections of attributes) or from different runs of the same \nalgorithm (using other parameters; different subsamples, selections of attributes). \nWhen viewed as an optimization problem (“given a number of clusterings of some set \nof elements, find a clustering of those elements that is as close as possible to all the \ngiven clusterings”), it is known as “median partition”, and has been shown to be a \ncomputationally hard problem (NP-complete), see Goder and Filkov (2008). For \nfurther information on alternative approaches to consensus clustering we refer to \nliterature, e.g. Strehl and Ghosh (2002), Monti et al. (2003), Gionis et al. (2005). \nSee also the R-software package <<clue>>22 which provides an extensible \ncomputational environment for creating and analysing cluster ensembles. \n \n4.7 Which method to choose? 
\n \nAgainst the background of the multitude of methods (different as well as related) for \ncluster analysis, one is inevitably confronted with the question ‘which one to choose’? \nIn a certain sense clustering can be considered both as an art and as a science, as \nreflected by discussions at a recent conference on this issue \n(http://stanford.edu/~rezab/nips2009workshop).  \nThe choice of the clustering algorithm is not an application-independent issue, but \nshould always be addressed in the context of its end-use, also taking account of the \ncharacter and type of data which is available. Typically it is considered a good idea to \ntry several algorithms on the same data to study what they will disclose. This however \nleaves one with the task of deciding which methods to apply, and how to use and interpret \nthem. An important issue in using and interpreting the results from the cluster analysis \nwill be the flexibility in going back-and-forth from statistical technique to subject-content. \nThis involves combining expertise on cluster analysis with expertise on the \nspecific subject area where the cluster analysis is applied, and typically requires a \nclose cooperation between content-expert and cluster-analysts, if the analysis is not \ndone by the content-expert. \n \nThe extent to which the choice of the clustering algorithm can be explored will obviously \ndepend on the available expertise (on clustering and on the specific \nsubject), software, time, money and manpower. Requirements that should be taken into account can be \ndiverse, as exemplified e.g. 
by the list of issues like ‘scalability’, ‘ability to deal with  \n                                                     \n21 See http://www.bioconductor.org/packages/2.6/bioc/vignettes/BicARE/inst/doc/BicARE.pdf \n22 http://cran.r-project.org/web/packages/clue/index.html\n\n37 \n \n \n \ndifferent types of attributes’, ‘discovery of clusters with arbitrary shape’, ‘ease of \nusing the cluster analysis procedure’, ‘ability to deal with noisy data’, ‘treatment of \nnewly inserted data’, ‘insensitivity to the order of the input records’, ‘high \ndimensionality’ presented in chapter 7.1 Han and Kamber (2006). Moreover also \nissues related to cluster validity (see next chapter) will be of importance. \n \nThree fundamental properties for clustering  \n(according to Handl and Knowles (2005)) \n \nHandl and Knowles  (2005) distinguish three fundamental properties for clustering, which can \ngive rise to conflicting objectives, and which would argue for a multi-objective approach \ntowards clustering as exemplified e.g. in Handl and Knowles (2005).  \n \nCompactness: Generally this is implemented by keeping intra-cluster variation small. \nAlgorithms like k-means, average link-agglomerative clustering, self-organizing maps or \nmodel-based clustering fit into this category. These methods are very effective for spherical or \nwell-separated clusters, but may fail for more complicated cluster structures. \n \nConnectedness: This more local concept of clustering is based on the idea that neighboring \ndata items should share the same cluster, and methods as density based clustering and single-\nlinkage agglomerative clustering are related to this property. Detection of arbitrarily shaped \nclusters is possible, but these methods can lack robustness in case clusters are not clearly \nseparated spatially. \n \nSpatial separation: This property on its own does not give much guidance for clustering, and \ncan easily lead to trivial solutions. 
Typically it is combined with other objectives, as \ncompactness of clusters or balance of cluster sizes.  \n \n \nExamples of data sets exhibiting compactness, connectedness and spatial separation, \nrespectively. Connectedness and spatial separation are related (albeit opposite), and in \nprinciple, the cluster structure in data sets B and C can be identified by a clustering \nalgorithm based on either connectedness or on spatial separation, but not by one based \nprincipally on compactness.\n\n38 \n \nHandl and Knowles (see textbox) state that in clustering various objectives are involved, \nwhich can be conflicting. Therefore they argue that multi-objective approach to clustering is \nappropriate. \n \n \nSummary \n \nThe extensive – and ever-growing - literature on clustering illustrates that there is no \nsuch thing like an optimal clustering method. We have grouped the multitude of \nmethods into a restricted number of classes, and have especially focused on two \ncommonly used classes, one which is based on hierarchically performing the \nclustering, while the other consists of constructively partitioning the dataset into a \nnumber of clusters, using the k-means method. The other classes are briefly discussed \nwith due reference to literature for further information.\n\n39 \n \n5 How to measure the validity of a cluster? \n \n5.1 Comparing cluster solutions \nThe comparison of cluster solutions (e.g. partitions or trees) either with each other or \nwith benchmark information is an important aspect of cluster validation. For example, \ntesting whether different subsamples of the same dataset or different methods applied \nto the data generate similar results is considered as a relevant activity in evaluating the \ncluster-quality (‘robustness issue’). 
Moreover, in situations where an external \nclassification is available, one would like to check the similarity of this classification \nand the clustering results as an indication of external clustering validity.  \nBelow we briefly highlight a number of well-established techniques for comparing \ntwo partitions. See Everitt et al. 2001, section 8.4, for additional material on \ncomparing two dendrograms/trees or two proximity matrices; see also Campbell, \nLegendre and Lapointe (2009) for further information on these issues. \nE.g., when two classifications of a group of n objects are available, one can represent \nthem as a c1-by-c2 matrix N=[nij] where nij is the number of objects in group i of \npartition 1 (i=1, …, c1) and group j of partition 2 (j=1,…,c2). The labelling of the two \npartitions is arbitrary. When the partitions have the same number of clusters and \ntheir agreement is good, it is usually obvious from inspection how the labels \ncorrespond, and one partition can straightforwardly be relabelled to match the other. \nUsing simple percentage agreement or the kappa coefficient (see Cohen, 1960) the \npartitions can then be compared, after relabelling. \nRemark: One can think of various procedures to match the labels of two cluster partitions, say \n1 and 2.  \nA straightforward strategy consists of: \n(a) first determining the Euclidean distances between the cluster-centres for clustering 1 and \nclustering 2. These distances are stored in a ‘distance matrix’ with entry di,j expressing the \ndistance between the i-th cluster-centre for clustering 1 and the j-th cluster-centre for \nclustering 2; \n(b) next linking the labels for clustering 1 and 2 by repeatedly searching for the smallest \nentry in this matrix (smallest distance), matching the corresponding row and column and \neliminating them from the matrix. \nIn this way a match between the cluster-classes in clustering 1 and those in clustering 2 is \nobtained iteratively. 
This is however not the only procedure to perform this matching. One \ncan easily come up with alternatives when considering these steps: \n \nConcerning step (a): Matching can also be done by comparing the cluster-class counts in \nthe cross-table matrix N. The idea behind this matching is to find a match which renders \nthe largest number of counts (data points) in the corresponding matched cluster-classes.  \nNotice that the match proposed under (a) above is based on the underlying (average) \nfeatures of the data points, and aims to establish a match on basis of these averages. \n \nConcerning the search step (b): Instead of performing the search heuristically as \nsketched above one can perform this search exhaustively (i.e. exactly) by \nconsidering all cluster-combinations involved, and finding the one which renders the sum \nof the distances minimal23. Although the number of all cluster-combinations involved is \nequal to k! (k is the number of clusters in clustering 1), this task of finding the exact \noptimum cluster combination can be performed far more efficiently (in O(k^3) steps) by \nusing the ‘Hungarian algorithm’ proposed by Kuhn (1955) and Munkres (1957). This \nalgorithm is available in the R-package “<<clue>>24”, i.e. use the solve_LSAP function for \noptimal cluster matching/assignment. \n \nA simple example illustrates that the outcomes of both search methods in step (b) can be \ndifferent. E.g. let the cross-table for two cluster partitions (5 cluster-classes) be: \n \n17 24  1  8 15 \n23  5  7 14 16 \n25  6 13 20 22 \n10 12 19 21  3 \n11 18 25  2  9 \n \nThe heuristic search method and the optimal search method match the rows 1, 2, 3, 4 and 5 \nwith the columns 2, 5, 1, 4, 3 (heuristic) and 2, 1, 5, 4, 3 (exact) respectively, giving a total \ncount of 111 and 115 respectively, which shows the (slightly) suboptimal \nperformance of the heuristic method.  
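The difference between the two search strategies in step (b) can be reproduced with a short sketch. It matches on cluster-class counts (largest count first) rather than on centroid distances, and the exact search is a brute-force enumeration of all k! assignments, which is only an illustration for small k and not the Hungarian algorithm itself. The function names are illustrative and not taken from any package.

```python
from itertools import permutations

# Cross-table N of cluster-class counts from the example in the text
# (rows: clusters of partition 1, columns: clusters of partition 2).
CROSS_TABLE = [
    [17, 24, 1, 8, 15],
    [23, 5, 7, 14, 16],
    [25, 6, 13, 20, 22],
    [10, 12, 19, 21, 3],
    [11, 18, 25, 2, 9],
]


def greedy_match(table):
    # Heuristic search: repeatedly take the largest remaining count,
    # match its row and column, and remove both from consideration.
    k = len(table)
    free_rows, free_cols = set(range(k)), set(range(k))
    match = [None] * k
    while free_rows:
        best = None
        for i in sorted(free_rows):
            for j in sorted(free_cols):
                if best is None or table[i][j] > table[best[0]][best[1]]:
                    best = (i, j)
        match[best[0]] = best[1]
        free_rows.remove(best[0])
        free_cols.remove(best[1])
    return match


def exact_match(table):
    # Exhaustive search over all k! column orderings (small k only).
    k = len(table)
    best = max(permutations(range(k)),
               key=lambda cols: sum(table[i][cols[i]] for i in range(k)))
    return list(best)


def matched_count(table, match):
    # Total number of data points falling in the matched cluster-classes.
    return sum(table[i][j] for i, j in enumerate(match))


greedy = greedy_match(CROSS_TABLE)
exact = exact_match(CROSS_TABLE)
print([j + 1 for j in greedy], matched_count(CROSS_TABLE, greedy))
print([j + 1 for j in exact], matched_count(CROSS_TABLE, exact))
```

For realistic numbers of clusters the enumeration should be replaced by an assignment solver such as the one in <<clue>>.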
\nRemark: Cohen’s Kappa-statistic, which corrects for chance effects in comparing two cluster \npartitions, is given by (N* stands for the relabelled cross-table): \n \nP_agree,real = Trace(N*) / Sum(N*) \nP_agree,chance = (1 / Sum(N*)^2) · Sum_k [ (Sum_i N*_ik) · (Sum_j N*_kj) ] \nI_Kappa = (P_agree,real − P_agree,chance) / (1 − P_agree,chance) \n \nwhere P_agree,real refers to the relative observed agreement between clustering 1 and 2, and \nP_agree,chance refers to the hypothetical probability of agreement by chance, in case random \nclasses would have been assigned to the objects for both clustering 1 and 2. If the clusterings \nare the same, I_Kappa is 1; if there is no agreement other than the one happening by chance, \nI_Kappa <= 0. \n \n23 For the case of matching on basis of cluster-counts, one would strive to find a match which renders a \nmaximal sum of the number of counts. \n24 http://cran.r-project.org/web/packages/clue/index.html \n \nWhen the number of clusters differs between the two partitions/clusterings, one can \ntake another route towards comparing the partitions, rather than analysing the cross-tabulation \nof frequencies. The starting point is to investigate the co-occurrence of the \ngroupings of every pair of the n objects in the partitions. This can be presented in a 2 x 2 \ncontingency table: \n \n                                       | Partition 2: pair in same group | Partition 2: pair in different groups | Total \nPartition 1: pair in same group        | a                               | b                                     | a+b \nPartition 1: pair in different groups  | c                               | d                                     | c+d \nTotal                                  | a+c                             | b+d                                   | n(n−1)/2 \n \nThis contingency table can be directly derived from the cross-table N with cluster-class \ncounts, using the relationships presented in table 1 and 2 of Hubert and Arabie, \n1985.  
\nThe Rand and Jaccard indices for expressing the correspondence of these partitions are \ndefined by (a+d)/(a+b+c+d) and a/(a+b+c) respectively. To correct for the effects \nof chance in grouping points in clusters, adjustments of the Rand index have been put \nforward in literature, of which the adjusted Rand index of Hubert and Arabie (1985) is \njudged especially suitable (see also Steinley, 2004). It is defined as: \n \nadjRand = { C(n,2)·(a+d) − [(a+b)(a+c) + (c+d)(b+d)] } / { C(n,2)^2 − [(a+b)(a+c) + (c+d)(b+d)] } \n \nwhere C(n,2) = n(n−1)/2 denotes the total number of object-pairs (i.e. a+b+c+d). \nMeila (2007) recently proposed a novel criterion for comparing partitions, the \n“Variation of Information”-criterion, which accounts for the amount of information \nlost and gained when changing from clustering 1 to clustering 2. It is calculated on basis \nof information theoretic measures which can be directly evaluated in terms of the \nentries in the cross-table-matrix N with the cluster-class counts. See Meila (2007) for \ndetails. Vinh et al. (2009) argue that a correction for chance is also needed for information \ntheoretic measures, similar to the adjustment of the Rand index.  \nThe above mentioned indices that can be calculated on basis of the cross-table N of \nthe cluster-class counts are insensitive to permutations of the columns and \nrows of the cross-table. This implies that they do not depend on the cluster-label-matching \nstrategy involved in linking clustering 1 to 2. \nThe presented indices have been implemented in the CRAN-package <<mcclust>>25 \nwhere the adjusted Rand index is invoked by the function arandi() and Meila’s criterion \nby the function vi.dist(). 
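As an illustration of the pair-counting indices, the sketch below derives the counts a, b, c and d directly from two label vectors and evaluates the Rand, Jaccard and adjusted Rand index as defined in the text. The function names are illustrative only (they are not the mcclust API), and the quadratic loop over all object pairs is meant for small examples.

```python
from itertools import combinations


def pair_counts(labels1, labels2):
    # Count object pairs by co-membership in the two partitions:
    # a (same/same), b (same/different), c (different/same),
    # d (different/different).
    a = b = c = d = 0
    for i, j in combinations(range(len(labels1)), 2):
        same1 = labels1[i] == labels1[j]
        same2 = labels2[i] == labels2[j]
        if same1 and same2:
            a += 1
        elif same1:
            b += 1
        elif same2:
            c += 1
        else:
            d += 1
    return a, b, c, d


def rand_index(a, b, c, d):
    return (a + d) / (a + b + c + d)


def jaccard_index(a, b, c, d):
    return a / (a + b + c)


def adjusted_rand_index(a, b, c, d):
    # Hubert-Arabie adjustment expressed in the pair counts.
    n_pairs = a + b + c + d
    chance = (a + b) * (a + c) + (c + d) * (b + d)
    return (n_pairs * (a + d) - chance) / (n_pairs ** 2 - chance)


part1 = [0, 0, 0, 1, 1, 1]
part2 = [0, 0, 1, 1, 2, 2]
counts = pair_counts(part1, part2)
print(counts)
print(rand_index(*counts), jaccard_index(*counts), adjusted_rand_index(*counts))
```

Because only co-membership of pairs is counted, relabelling the clusters of either partition leaves all three indices unchanged, and the two partitions may have different numbers of clusters.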
\n                                                     \n25 http://cran.r-project.org/web/packages/mcclust/\n\n42 \n \nRemark: The above indices can be used to measure the influence of individual data points on \na cluster analysis: by comparing the partitioning which results from deleting specific data \npoints from the dataset, with the partitioning of the complete reference dataset one can detect \nhighly influential data points that directly impact the resulting partition. Cheng and Milligan \n(1995, 1996a,b) e.g. advocate the use of the adjusted Rand index for this purpose. See also \nsection 8.5.3 in Everitt et al. 2001. \n \n \n \n \n \n \n \n \n5.2 Validation measures \nValidation measures are intended to measure how well the clustering captures the \nunderlying structure in the data. An excellent account of different types of validation \nmeasures and their potential biases is given in Handl et al. (2005). This reference \nunderlines that there does not exist a golden standard in clustering methods nor in \nvalidation measures. It will often not be sufficient to use a single clustering algorithm \nand/or a single validation measure when the real underlying structure of the data is \nunknown. Rather one should apply a number of different clustering algorithms and \nvalidation measures that optimize different aspects of a partitioning for an appropriate \nrange of cluster sizes. Also Brun et al. (2007) address similar points, and advise to be \ncautious with automatically applying and interpreting results from calculated validity \nindices.  \nTypically three groups of validation measures are distinguished (see Figure 6): the \nfirst type is based on calculating properties of the resulting clusters, such as \ncompactness, separation, roundness, and is called internal validation, since it does not \nrequire additional information on the data.  
\nThe second approach is called relative validation and is based on comparisons of \npartitions generated by the same algorithm with different parameters (e.g. \ninitializations), or different subsets of the data. This approach in fact measures \nrobustness of the clustering results and - similar to internal validation - also doesn’t \nrequire additional information. \nAn axiomatic approach to measure cluster quality \nAckerman and Ben-David (2008) have recently initiated a systematic study of measures \nfor the quality of a given data clustering. These measures, given a data set and its \npartition into clusters, return a non-negative real number representing how ‘strong’ or \n‘conclusive’ the clustering is. They propose to use the notion of ‘cluster quality measure’ \nas a basis for developing a formal theory of clustering, which unlike Kleinberg’s \naxiomatic approach (Kleinberg, 2002) does not lead to contradictions.  \nAckerman and Ben-David have proposed quality measures for wide families of common \nclustering approaches, like center-based clustering (e.g. k-means, k-median), loss-based \nclustering (e.g.  k-means) and linkage-based clustering (e.g. hierarchical clustering), and \nanalyze their computational complexity. 
In addition, they show that using these quality \nmeasures, the clustering quality of a clustering can be computed in low polynomial time. \n \n[Figure 6 diagram. Validation is divided into: \n- Properties of the clusters (internal): determination of quality of algorithm to generate interesting partition; \n- Comparison of partitions between clusters generated by the algorithm (relative): determination of quality of algorithm to generate consistent groups; \n- Comparison of partitions between clusters and classes (external): determination of quality of algorithm to recognize existing groups.] \nFigure 6: Different approaches for cluster validation \nThe third approach, called external validation, is based on comparison of the \nclustering partition of the data with a known class partition of the data, thus \npresupposing that the class labels are known and uncontested. It is clear that this kind \nof validation will only be possible for a limited number of situations, e.g. for \nbenchmark data, or for situations where cluster labels are known beforehand. It will \nevidently depend on the application field whether (and which) explicit validation \ncriteria are feasible and useful: e.g. Datta and Datta (2006) propose two specific \nevaluation indices in the context of gene expression data-analysis with a content \nrelated meaning, namely the biological homogeneity index and the biological stability \nindex. \nIn appendix E a large number of internal validation indices are listed that use the \ninter-cluster and the intra-cluster distances to identify the best partition. They \nare appropriate when clusters are compact and well-separated, but fail when sub-clusters \nexist or when the clusters are arbitrarily shaped (and thus have no \nrepresentative centre points to assess the inter-cluster variance). 
Therefore frequently \nalternative approaches are put forward in literature, which are compared to the \nestablished ones on basis of synthetic and/or real data. These comparative studies are \nnecessarily always limited to a certain extent: their scope is given by the datasets \nwhich are analysed, and one can often find other data on which the one method \nperforms better than another candidate. Jonnalagadda and Srinivasan (2009) propose \nan approach that overcomes this limitation by not using inter-cluster distances, but \ninstead focusing on information which is lost or gained when a cluster intersects with \nanother. The proposed NIFTI-index (Net InFormation Transfer Index) was compared\n\n44 \n \nwith other ones - Dunn’s, Silhouette, Davies-Bouldin and the Gap-statistic – and it \nwas shown - on synthetic datasets as well as on real-life data - that NIFTI outperforms \nthese methods in determining the appropriate number of clusters. However, the \nproposed method has as limitation that it models clusters as hyper-spheres, which \nmake it less appropriate for clusters that do not have a spherical shape. Also Saitta et \nal. (2008) propose a new bounded index for cluster validity, the score function (SF). It \nis found to be always as good or better than four common validity indices – Dunn’s, \nSilhouette, Davies-Bouldin and the Maulik Bandyopadhyay-statistic – in the case of \nhyper-spherical clusters. It works well on multidimensional data sets and \naccommodates unique and sub-cluster cases. 
\nRelative validation indices are based on measuring the consistency of algorithms, \ncomparing the clusters obtained by the same algorithm under different conditions, or \nby different clustering algorithms. Two typical approaches are discussed \nsubsequently: \n• The use of a Figure of Merit (FOM, see Yeung, Haynor and Ruzzo, 2001) assesses \nthe ‘predictive power’ of a clustering technique and strikes a balance between the \nexternal and internal criteria: FOM neither requires prior knowledge nor relies entirely \non information from the clustering process. It can e.g. be obtained by leaving out a \nvariable, j, clustering the data (into k clusters), and then calculating the RMSE (Root \nMean Squared Error) of j relative to the cluster means: \n \nRMSE(j,k) = sqrt( (1/N) · Sum_{r=1..k} Sum_{x_i in C_r} ( x_ij − μ_Cr(j) )^2 ) \n \nwith x_ij being the measurement of the j-th variable for the i-th observational unit; \nN the number of observational units; C_r the set of observational units in the r-th \ncluster; μ_Cr(j) the mean of variable j over the observational units in the r-th \ncluster. Summing these RMSE over all p variables j renders an aggregate FOM \n(AFOM): \n \nAFOM(k) = Sum_{j=1..p} RMSE(j,k) \n \nCalculating the AFOM for each k, adjusting for cluster size, and dividing by \nthe number of variables ‘left out’ renders the adjusted AFOM: \n \nAFOM_adj(k) = AFOM(k) / ( p · sqrt((N−k)/N) ) \n \nLow values of the clustering algorithm’s AFOM indicate a high predictive power. \nBy comparing the AFOM values at each k for different clustering algorithms their \nperformance can be compared. However, Yeung et al. (2001) comment that this \nshould only be done if the similarity metrics of the compared clustering \nalgorithms are identical. Olex et al. (2007) show limitations of the FOM when the \nunderlying similarity measure is non-Euclidean. 
For similarity measures based on \nthe Pearson correlation coefficients they propose a more suitable alternative FOM. \n• The use of a stability measure expresses how the cluster-membership assignment \nis affected by small changes/alterations in the dataset (e.g. sampling different \ndata(sub)sets; adding noise to data) or by applying different parameter-settings for \nthe cluster algorithm. It provides information on the stability/robustness of the \nprevailing clustering partition for these alternative choices. The stability measure \nis typically based on the use of an explicit criterion for cluster comparison, like the \nadjusted Rand index, or Meila’s variation of information criterion, cf. Meila \n(2007). The stability-based approach can also be used to determine the appropriate \nnumber of clusters k, by studying for which k the resulting cluster partition is \nrelatively stable/robust towards (re)sampling of the data or noise in the data. This \napproach is presently very popular and was initially advocated by Dudoit and \nFridlyand (2002), Tibshirani et al. (2002), Ben-Hur et al. (2002), Bel Mufti and \nBertrand (2007). Notice that these resampling methods in fact assume that the \nemployed subset-samples are representative enough to reflect the inherent \nstructure in the whole dataset. In situations where some clusters are of small size, \nthis may be a problematic assumption. See also Lange et al. (2005), Hennig \n(2006), the <<fpc>> package26 and Volkovich et al. (2008) for related \napproaches. Kuncheva and Vetrov (2006) specifically analyse the stability of the \nk-means cluster results with respect to random initialization. 
See the next textbox \nfor some critical remarks on the appropriateness of the stability approach for the \ndetermination of the number of clusters. \n \n26 http://cran.r-project.org/web/packages/fpc/index.html \n \nIn the cluster analysis that we have set up for identifying patterns of vulnerability for \nglobal change we have implemented the above mentioned stability procedure in the \nfollowing way in order to determine an adequate number of clusters k on basis of \nrepetitively performing clustering for k=2 until a maximum value Kmax: \n \n1. Initialize k:=2; \n2. IF [k ≤ Kmax] THEN   \n{ Repeatedly (e.g. n=150) perform two clusterings by k-means, initializing \neach clustering with a random start-setting and compare these clusterings on \nbasis of a criterion which gives a value between 0 and 1 to express their \nsimilarity (values around 1 hint at high similarity of the pair of clusterings).  \nNext take the average of this criterion value S(k) over all these n repetitions \nas a measure for the stability of this resampling procedure for the specific k.} \nELSE Go to step 4 \n3. k:=k+1; Go to step 2; \n4. Plot the average values S(k) as a function of k for k=2, …, Kmax. This is a so-called \nconsistency graph, which displays the average stability/robustness of \nthe outcome of the clustering analysis for the resampling. \nFigure 7 gives a graphical overview of the procedure (from Dietz et al., 2011). Since \nwe used the counting of overlap method we had to reallocate the labelling of the \nclusters via the straightforward method of the Euclidean distance (see 5.1) to achieve \ncomparable maps. 
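The loop in steps 1 to 4 can be sketched as a small harness. This is not the authors' R-code: the clustering routine is passed in as a function (a hypothetical placeholder cluster_fn(data, k, seed) returning one label per data point), the similarity criterion is likewise pluggable, and agreement_fraction assumes that the labels of the two clusterings have already been matched. The toy_cluster function is only a deterministic stand-in that makes the sketch runnable; in practice a randomly initialised k-means would take its place.

```python
import random


def agreement_fraction(labels1, labels2):
    # Fraction of data points with identical labels; assumes the labels of
    # the two clusterings have already been matched to each other.
    same = sum(1 for u, v in zip(labels1, labels2) if u == v)
    return same / len(labels1)


def consistency_curve(data, cluster_fn, similarity_fn, k_max, n_rep=150, seed=0):
    # For k = 2..k_max: run n_rep pairs of randomly initialised clusterings
    # and average their similarity, which gives S(k).
    rng = random.Random(seed)
    curve = {}
    for k in range(2, k_max + 1):
        total = 0.0
        for _ in range(n_rep):
            first = cluster_fn(data, k, rng.randrange(2 ** 31))
            second = cluster_fn(data, k, rng.randrange(2 ** 31))
            total += similarity_fn(first, second)
        curve[k] = total / n_rep
    return curve


def toy_cluster(data, k, seed):
    # Stand-in clustering for 1-D data: k equally sized groups by rank.
    # It ignores the seed, so it is perfectly stable by construction.
    order = sorted(range(len(data)), key=lambda i: data[i])
    labels = [0] * len(data)
    for rank, i in enumerate(order):
        labels[i] = rank * k // len(data)
    return labels


values = [0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 9.8, 9.9, 10.0]
curve = consistency_curve(values, toy_cluster, agreement_fraction, k_max=4, n_rep=10)
print(curve)
```

Plotting the returned values against k gives the consistency graph of step 4.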
\n[Figure 7 diagram: (1) results of two cluster analysis runs (colour allocations arbitrary); (2) reallocation by comparison of the Euclidean distance of the cluster centroids, giving comparable maps; identification of overlap (number of identical colours for each pixel); consistency measure = number of overlaps / number of pixels.] \nFigure 7: Operational sequence for calculating the consistency measure, shown by way of example for k=4.  \nThe value of k for which this consistency measure is optimal indicates a suitable \nchoice for the number of clusters. Figure 8 shows an example from Kok et al. \n(2011). Besides the global optimum at k=3 there is an interesting relative \nmaximum for eight clusters, suggesting that this number of clusters also reflects \nthe structure of the data in case one is looking for a more differentiated partition. \n \n[Figure 8 plot: average consistency measure (vertical axis, about 0.75 to 0.95) against the number of clusters (horizontal axis, 2 to 12).] \nFigure 8: Consistency graph for determining the number of clusters. The local optimum at \nk=8 indicates that an interesting, suitable clustering may result when choosing e.g. 8 \nclusters. The number of repetitions n has been 150 in this case. \n \nAlthough the above procedure is formulated primarily for the k-means method, it can \nalso be applied to other clustering methods.  \nMoreover, our R-code offers various options for the choice of the criterion to express the \nsimilarity between the clusterings: next to using the adjusted Rand index or Meila’s \nvariation of information criterion, it is possible to explicitly calculate the fraction of \ndata points which have been clustered similarly when repeating the clustering with a \nrandom restart. In this case the average value S(k) can be viewed as the average \nfraction of data points which are clustered similarly when randomly restarting the \nclustering for this specific k. 
Typically the criterion choice does not lead to different \nchoices in the ‘optimal’ number of clusters. \n \n \n \n \n5.3  Software for cluster validation \n \nThe R-package <<clValid>> provides software for cluster validity (see Brock et al., \n2008), where the generic function cl_validity() can be used to evaluate cluster validity \nindices for partitions and hierarchies obtained by clustering. See also cluster.stats in \npackage <<fpc>> for a variety of cluster validation statistics; fclustIndex in package \n<<e1071>> for several fuzzy cluster indexes; clustIndex in package <<cclust>>; \nsilhouette in package <<cluster>>. The R-package <<clusterSim>> provides \nvarious measures to express the performance of a clustering on a dataset, including \nthe Tibshirani et al. (2001) gap statistic. \nCriticism on the stability-based approach for choosing the number of clusters \n \nBen-David and von Luxburg (2006) have recently criticized the popular stability-\nbased methods on basis of a theoretical analysis of stability issues in cluster-analysis  \nmethods that determine the clusters by globally minimizing an objective function. \nThey discovered that for large datasets the common belief (and practice) that stability \nreflects the validity or meaningfulness of the chosen number of clusters is not true. \nFor an elegant and useful exposition of the implications of these and other related \nfindings see the recent publication von Luxburg (2009). Albeit the initial critical \ntheoretical findings on the stability-based approach von Luxburg at the end draws a \n“carefully optimistic picture about model selection base on clustering stability for the \nk-means algorithm. Stability can discriminate between different values of k, and the \nvalues of k which lead to stable results have desirable properties. 
If the data set \ncontains a few well-separated clusters which can be represented by center-based \nclustering then stability has the potential to discover the correct number of clusters.” \n(von Luxburg, 2009; italics are added by us). In case of very elongated clusters or \nclusters with complicated shapes the k-means algorithm cannot find a good \nrepresentation of the dataset, regardless of the number k used, and in these situations \nstability-based model selection breaks down. Von Luxburg moreover states that these \nresults only hold true for situations where the number of clusters is relatively small \n(in the order of 10, rather than in the order of 100). For other clustering algorithms \nthat work very differently from k-means it remains an open question whether \nstability-based model selection is a suitable approach.\n \nAlternative tools for validity assessment are proposed by Bolshakova et al. (2003, \n2005a,b); these also include visualization methods for evaluating the clustering results. \n \nSummary \n \nVarious ways to evaluate clustering performance and compare different clusterings have been \npresented. A general (stability-based) approach is put forward which assesses the robustness \nof clustering results for repeated analysis of the dataset under different settings (e.g. \ninitialisations) of the cluster algorithm. It can be used for estimating the number of clusters.\n \n6 Graphical representation of the results \n \nData visualisation can greatly support the interpretation of the cluster analysis. \nVarious ways to visualize the results of the cluster analysis are possible (see also \nsection 2.3.7). In this last chapter of the guideline we do not intend to give a \ncomprehensive overview of all possibilities but to show some examples which \nproved useful to us.  \n \n \n \n \nFigure 9: Heatmap of the dataset shown in Gentleman et al. (2004). \n
See \nhttp://www2.warwick.ac.uk/fac/sci/moac/students/peter_cock/r/heatmap/ for further explanation. \n \n6.1 Hierarchical cluster analysis \nHierarchical cluster analyses are typically illustrated by dendrograms, showing clearly \nhow the groupings are established. This information can further be enhanced by using \nheat-maps which provide a sorting/structuring of the data-matrix, permuting the \ncolumns and rows of this matrix to conform with the hierarchical clustering of \nvariables and objects (see Figure 9).  \n \nThe ‘clustergram graph’ proposed by Schonlau (2002, 2004) as an alternative to \ndendrogram-graphs (e.g. by using the R-function dendrogram()) is in fact of similar \nnature as the branching diagram (see section 6.4). It examines how objects are assigned to clusters as \nthe number of clusters increases. Clustergrams are useful for non-hierarchical \nclustering algorithms such as k-means as well as hierarchical cluster algorithms when \nthe number of objects is large enough to make dendrograms impractical.  \n \nAgrafiotis et al. (2007) propose radial clustergrams to visualize the aggregate \nproperties of hierarchical clusters, which are especially apt for visualizing large trees \nwhich cannot be displayed appropriately in straightforward dendrograms. One can \nalso consider the use of the Dendroscope software from the University of Tübingen \nfor this purpose (Huson et al. 2007, see Figure 10). \n \n \n \n \nFigure 10: Seven alternative views for visualizing the same tree, implemented in the \nDendroscope software (Huson et al. 2007): Rectangular Phylogram, Rectangular Cladogram, \nSlanted Cladogram, Circular Phylogram, Circular Cladogram, Radial Phylogram and Radial \nCladogram. \n \n6.2 Partitioning cluster analysis \nPartitioning cluster analyses are often visualized by projecting the data in two-dimensional \nspace, using e.g. multidimensional scaling (MDS) or self-organized maps \n(SOM) (see Figure 11, using Clusplot as in Pison et al. \n
1999; see also Vesanto, 1999, \nEwing and Sherry, 2001).\n \n[Figure 11 plot: CLUSPLOT( iris.x ), Component 2 against Component 1; these two components explain 95.81 % of the point variability]\n \nFigure 11: Two-dimensional projection of the cluster points for the Iris dataset.  \n \n6.3 Cluster membership \n \nCluster membership is usually indicated by different colours and glyphs. The \ncharacteristics of the various clusters can e.g. be displayed by showing boxplots per \nvariable/feature for the various clusters (see Figure 12), or by showing a graph of the \ncluster centres (see the spectral plot in Figure 13).  \nIn the boxplot the cluster centre (the arithmetic mean) is indicated by the red circle, while the spread around this \ncentre is indicated by the box-boundaries denoting the lower and upper quartiles (25th \nand 75th percentile) of the data; thus the box-length indicates the interquartile \nrange, IQR. The band near the middle of the box denotes the median. Typically, \nboxplots are extended by whiskers denoting the minimum and maximum data values \nwithin 1.5 IQR of the lower and upper quartile. But, since we are specifically \ninterested in high/low end percentiles, and in highlighting potential asymmetry of the \ndistribution, we have chosen to work with alternative whiskers, and indicate them by \nthe ends of the dotted lines which show the 5th and 95th percentile. So between these \ntwo points 90% of the objects within a cluster are located. Notice that the boxplots for \nthe clusters in fact only display one-dimensional information, as projected on the \nindividual axes associated to the various variables/indicators. \n
Information on the \nspecific spatial structure of the cluster of points in the multi-dimensional data space \n(spanned by all variables/indicators considered) does not clearly show up in the \nboxplot.\n \nFigure 12: Boxplots, showing the variation in indicator values per cluster (colours indicate \nclusters; all indicator values are between 0 and 1); see Kok et al. (2010).  \nNote: the boxes present the 25th to 75th percentile range of the indicator values; the circles at the \nend of the dotted lines indicate the 5th and 95th percentiles, while the red circle indicates the \narithmetic mean; the band near the middle of the box indicates the median value. The \nnumber of points in the respective clusters is indicated at the top of the subframes. \n \nGraphs of the normalized cluster centres give information on how the average \ncharacteristics of the clusters differ (see Figure 13). They are helpful in suggesting the \n(dis)similar properties and characteristics of the various clusters.  \n \n[Figure 13 plot: normalized cluster centres (vertical axis, 0 to 1.2) for the indicators ROD, RUF, AGP, SER, POD, GDC and IMR, with one series per cluster C1 to C8]\n \nFigure 13: Cluster centres (= typical indicator values) for the 8 clusters C1 - C8; see Kok et al. \n(2010).  \n \nIn case the data have a spatial dimension, showing maps can give a clue on how \nthe clusters are geographically distributed, serving to identify and connect features \nwith similar characteristics at different geographical locations (see Figure 14).\n \n \nFigure 14: Distribution of clusters within the drylands (see Kok et al., 2010). Light grey: non-arid \nareas. Each of the 8 clusters denotes a typical constellation of the 7 indicators road \ndensity, renewable water resource, agro-potential, soil erosion, population density, GDP/cap \nand infant mortality rate, which are also displayed in the boxplots of Figure 12.  
\n \n \nFigure 15: Branching diagram, showing cluster subdivision when increasing cluster-numbers \nin k-means cluster analysis of the dataset (N=45000) consisting of the indicators for the \nforest overexploitation archetype.\n \n6.4 Branching diagrams \n \nWhen performing the cluster analysis repeatedly for consecutive numbers of clusters \nit is insightful to construct a ‘branching diagram’ (see Figure 15) which displays how \nthe clustering structure changes when using another number of clusters. This diagram \nroughly indicates which clusters are split or merged, and thus renders useful \ninformation on the potential relatedness of the clusters. \n \n \nBesides the methods presented above, Leisch (2008, 2009) provides an \noverview of various visualization possibilities for centroid-based clustering methods \n(neighbourhood graphs, convex cluster hulls, bar charts of cluster medoids etc.). The \nCRAN-package <<flexclust>> contains implementations of these visualization \nmethods. See also <<gcExplorer>>, the interactive visualization toolbox for cluster analysis in the \ncontext of gene expression data developed by Scharl and Leisch \n(2009). Additional information can be found in the literature on visualization methods for \nbioinformatics applications, like analysing gene expression microarray clusters, see \ne.g. Hibbs et al. (2005), Saraiya et al. (2005).  \n \n \nSummary \n \nA number of possibilities are given for graphically displaying different properties of clusters. \nAdequate graphical representations turned out to play a vital role in the process of \nidentifying promising further questions and next steps in a clustering-oriented research \nprocess.\n \n7 References \n \n \nAckerman, M., Ben-David, S. (2008). Measures of Clustering Quality: A Working Set \nof Axioms for Clustering. Proceedings of Neural Information Processing Systems \n(NIPS 2008). http://books.nips.cc/papers/files/nips21/NIPS2008_0383.pdf. 
\n \nAckerman, M., Ben-David, S. (2009). Which Data Sets are 'Clusterable'? – A \nTheoretical Study of Clusterability. Proceedings of the Twelfth International \nConference on Artificial Intelligence and Statistics, 2009. \nhttp://www.cs.uwaterloo.ca/~shai/publications/ability_submit.pdf. \n \nAgrafiotis, D.K., Bandyopadhyay, D., Farnam, M. (2007). Radial Clustergrams: \nVisualizing the aggregate properties of hierarchical clusters. J. Chem. Inf. \nModel., Vol. 47, 69–75. \n \nAldenderfer, M.S., Blashfield, R.K. (1976). Cluster Analysis. Sage, Beverly Hills, \nCA. \n \nAnselin, L. (1995). Local indicators of spatial association – LISA. Geographical \nAnalysis, 27, 93–115. \n \nAnselin, L. (2005). Exploring Spatial Data with GeoDa™: A Workbook. Spatial \nAnalysis Laboratory. p. 138. \nhttp://www.csiss.org/clearinghouse/GeoDa/geodaworkbook.pdf. \n \nAnselin, L., Kim, Y.-W., Syabri, I. (2004b). Web-based analytical tools for the \nexploration of spatial data. Journal of Geographical Systems, 6, 197–218. \n \nBao, S., Henry, M.S. (1996). Heterogeneity issues in local measurements of spatial \nassociation. Geographical Systems, 1996, Vol. 3, 1–13. \n \nBao, S., Martin, D. (1997). Integrating S-PLUS with ArcView in Spatial Data \nAnalysis: An Introduction to the S+ArcView Link, ESRI’s Users Conference, \nSan Diego, CA. \n \nBao, S. (1999). Literature Review of Spatial Statistics and Models. China Data \nCenter, http://141.211.136.209/cdc/docs/review.pdf. \n \nBao, S., Li, B. (2000). Spatial Statistics in Natural Resources, Environment and Social \nSciences (eds.), A Special Issue of the Journal of Geographic Information Science. \n \nBação, F., Lobo, V., Painho, M. (2005). Self-organizing Maps as Substitutes for  \nK-Means Clustering. V.S. Sunderam et al. (Eds.): ICCS 2005, LNCS 3516, 476–483, \n2005. \n \nBarbakh, W., Fyfe, C. (2008). Local vs global interactions in clustering algorithms: \nAdvances over K-means. \n
International Journal of Knowledge-based and Intelligent \nEngineering Systems, 12, 1–17.\n \n \nBelabbas, M.-A., Wolfe, P.J. (2009a). Spectral methods in machine learning and new \nstrategies for very large datasets. PNAS January 13, 2009, Vol. 106 (2), 369–374. \n \nBelabbas, M.-A., Wolfe, P.J. (2009b). On landmark selection and sampling in high-dimensional \ndata analysis, in Philosophical Transactions, Series A, of the Royal \nSociety 367 (2009), 4295–4312. \n \nBen-Hur, A., Elisseeff, A., Guyon, I. (2002). A stability based method for discovering \nstructure in clustered data. Pac Symp Biocomput. 2002, 6–17. \n \nBen-Hur, A., Horn, D., Siegelmann, H., Vapnik, V. (2001). Support vector clustering. \nJ. Mach. Learn. Res., 2, 125–137. \n \nBezdek, J.C., Hathaway, R.J. (2002). VAT: A Tool for Visual Assessment of \n(Cluster) Tendency, Proc. IJCNN 2002, IEEE Press, Piscataway, N.J., 2225–2230. \n \nBezdek, J.C., Hathaway, R.J., Huband, J.M. (2007). Visual Assessment of Clustering \nTendency for Rectangular Dissimilarity Matrices. IEEE Trans. on Fuzzy Systems, \nVol. 15 (5), 890–903.  \n \nBezdek, J.C., Pal, N.R. (1998). Some new indexes of cluster validity. IEEE Trans Syst \nMan Cybern B Cybern 1998, 28 (3), 301–315. \n \nBies, B., Dabbs, K., Zou, H. (2009). On Determining The Number Of Clusters – \nA Comparative Study. Paper during 2009 IMA Interdisciplinary Research Experience \nfor Undergraduates, June 28 to July 31. \nhttp://www.ima.umn.edu/~iwen/REU/paper4.pdf. \n \nBlum, A.L., Langley, P. (1997). Selection of relevant features and examples in \nmachine learning. Artificial Intelligence, Vol. 97, 245–271. \n \nBolshakova, N., Azuaje, F. (2003). Cluster validation techniques for genome \nexpression data. Signal Processing 2003, 83, 825–833. \n \nBolshakova, N., Azuaje, F., Cunningham, P. (2005a). A knowledge-driven approach \nto cluster validity assessment. Bioinformatics 2005, 21, 2546–2547. \n \nBolshakova, N., Azuaje, F., Cunningham, P. (2005b). \n
An integrated tool for \nmicroarray data clustering and cluster validity assessment. Bioinformatics, \nVol. 21 (4), 451–455. \n \nBreiman, L. (2001). Random forests. Machine Learning, 45 (1), 5–32. \n \nBrock, G., Pihur, V., Datta, S., Datta, S. (2008). clValid, an R package for cluster \nvalidation. http://louisville.edu/~g0broc01/. \n \nBrys, G. (2006). Finding groups in a diagnostic plot. In: COMPSTAT 2006, \nProceedings in Computational Statistics.\n \nCai, W., Chen, S., Zhang, D. (2009). A simultaneous learning framework for \nclustering and classification. Pattern Recognition, Vol. 42 (7), 1248–1259.  \n \nCamastra, F. (2003). Data Dimensionality Estimation Methods: A Survey. Pattern \nRecognition, Vol. 36 (12), 2945–2954, Elsevier Science, Amsterdam, (2003). \n \nCamastra, F., Verri, A. (2005). A novel kernel method for clustering. IEEE \nTransactions on PAMI, Vol. 27, 801–805. \n \nCampbell, V., Legendre, P., Lapointe, F.-J. (2009). Assessing Congruence Among \nUltrametric Distance Matrices. Journal of Classification, Vol. 26, 103–117.  \n \nCeleux, G., Govaert, G. (1995). Gaussian parsimonious clustering models, Pattern \nRecognition, 28, 781–793. \n \nChang, W.-C. (1983). On Using Principal Components Before Separating a Mixture \nof two Multivariate Normal Distributions. Applied Statistics, 32, 267–275. \n \nCheng, R., Milligan, G.W. (1995). Mapping Influence Regions in Hierarchical \nClustering. Multivariate Behavioral Research, Vol. 30. \n \nCheng, R., Milligan, G.W. (1996a). Measuring the influence of individual data points \nin a cluster analysis. Journal of Classification. Vol. 13 (2), 1432–1343.  \n \nCheng, R., Milligan, G.W. (1996b). K-means clustering with influence detection. \nEducational and Psychological Measurement, Vol. 56, 833–838. \n \nCheng, Y., Church, G.M. (2000). Biclustering of expression data. Proceedings of the \n8th International Conference on Intelligent Systems for Molecular Biology, 93–103. 
\n \nCristianini, N., Shawe-Taylor, J., Kandola, J. (2002). Spectral kernel methods for \nclustering. In NIPS 14, 2002. \n \nDamian, D., Orešič, M., Verheij, E., Meulman, J., Friedman, J., Adourian, A., \nMorel, N., Smilde, A., van der Greef, J. (2007). Applications of a new subspace \nclustering algorithm (COSA) in medical systems biology, Metabolomics 3.  \n \nDe Soete, G. (1986). Optimal variable weighting for ultrametric and additive tree \nclustering. Quality & Quantity, 20, 169–180.  \n \nDe Soete, G. (1988). OVWTRE: A program for optimal variable weighting for \nultrametric and additive tree fitting. Journal of Classification, 5, 101–104.  \n \nDixon, J.K. (1979). Pattern recognition with partly missing data. IEEE Transactions \non Systems, Man and Cybernetics SMC 9, 617–621. \n \nDonoho, D., Jin, J. (2008). Higher criticism thresholding: Optimal feature selection \nwhen useful features are rare and weak. Proceedings of the National Academy of \nSciences, Vol. 105, 14790–14795.\n \nDonoho, D., Jin, J. (2009). Feature selection by higher criticism thresholding achieves \nthe optimal phase diagram. Phil Trans R Soc A, 367, 4449–4470. \n \nDrineas, P., Mahoney, M.W. (2005). On the Nyström Method for Approximating a \nGram Matrix for Improved Kernel-Based Learning. Journal of Machine Learning \nResearch, Vol. 6, 2153–2175.  \n \nDudoit, S., Fridlyand, J. (2002a). A prediction-based resampling method for estimating \nthe number of clusters in a dataset. Genome Biology, Vol. 3 (7). \n \nDudoit, S., Fridlyand, J., Speed, T.P. (2002b). Comparison of discriminant methods \nfor the classification of tumors using gene expression data. J. Am. Stat. Assoc., 97, \n77–87.  \n \nEveritt, B.S., Dunn, G. (2001). Applied Multivariate Data Analysis. (Second Edition). \nHodder Education.  \n \nEveritt, B.S., Landau, S., Leese, M. (2001). Cluster Analysis. Fourth edition. Arnold \nPublishers. \n \nEwing, R.M., Sherry, J.M. (2001). \n
Visualization of expression clusters using \nSammon’s non-linear mapping. Bioinformatics, Vol. 17, 658–659. \n \nFilippone, M., Camastra, F., Masulli, F., Rovetta, S. (2007). A survey of kernel and \nspectral methods for clustering. Pattern Recognition. Vol. 41 (1), 176–190.  \n \nFlexer, A. (2001). On the use of self-organizing maps for clustering and visualization. \nIntelligent Data Analysis, 5, 373–384. \n \nFodor, I.K. (2002). A survey of dimension reduction techniques.  \nUS Department of Energy. https://e-reports-ext.llnl.gov/pdf/240921.pdf. \n \nFowlkes, E.B., Gnanadesikan, R., Kettenring, J.R. (1988). Variable selection in \nclustering. Journal of Classification, 5, 205–228. \n \nFraiman, R., Justel, A., Svarc, M. (2008). Selection of Variables for Cluster Analysis \nand Classification Rules. Journal of the American Statistical Association, September \n2008, Vol. 103 (483), 1294–1303. \n  \nFraley, C., Raftery, A.E. (1998). How many clusters? Which clustering method? - \nAnswers via Model-Based Cluster Analysis. Computer Journal, 41, 578–588. \n \nFraley, C., Raftery, A.E. (1999). MCLUST: Software for model-based clustering. \nJournal of Classification, 16, 297–306. \n \nFraley, C., Raftery, A.E. (2002). Model-Based Clustering, Discriminant Analysis, and \nDensity Estimation. Journal of the American Statistical Association, 97, 611–612. \n \nFraley, C., Raftery, A.E. (2003). Enhanced model-based clustering, density estimation \nand discriminant analysis software: MCLUST. Journal of Classification, 20, 263–296.\n \n \nFriedman, J.H., Meulman, J.J. (2004). Clustering objects on subsets of attributes (with \ndiscussion). Journal of the Royal Statistical Society Series B (Statistical \nMethodology), 66 (4), 815–849. \n \nGat-Viks, I., Sharan, R., Shamir, R. (2003). Scoring clustering solutions by their \nbiological relevance. Bioinformatics, Vol. 19 (18), 2381–2389. \n \nGeary, R. (1954). The contiguity ratio and statistical mapping. \n
Incorporated \nStatistician, Vol. 5, 115–145.  \n \nGentleman, R.C., et al. (2004). Bioconductor: open software development for \ncomputational biology and bioinformatics, Genome Biology, 2004, 5:R80. \n \nGetis, A., Ord, J.K. (1992). The analysis of spatial association by use of distance \nstatistics. Geographical Analysis, Vol. 24 (3), 189–206. \n \nGetis, A., Ord, J.K. (1996). Local spatial statistics: an overview. In: P. Longley and \nM. Batty (eds.) “Spatial analysis: modeling in a GIS environment” (Cambridge: \nGeoinformation International), 261–277. \n \nGionis, A., Mannila, H., Tsaparas, P. (2005). Clustering Aggregation. 21st \nInternational Conference on Data Engineering (ICDE 2005). \n \nGirolami, M. (2002). Mercer Kernel-Based Clustering in Feature Space. IEEE \nTransactions on Neural Networks, 13 (3), 780–784. \n \nGirvan, M., Newman, M.E.J. (2002). Community structure in social and biological \nnetworks. PNAS June 11, 2002, Vol. 99 (12), 7821–7826. \n \nGnanadesikan, R., Kettenring, J.R., Tsao, S.L. (1995). Weighting and Selection of \nVariables for Cluster Analysis. Journal of Classification, 12, 113–136. \n \nGnanadesikan, R., Kettenring, J.R., Maloor, S. (2007). Better alternatives to current \nmethods of scaling and weighting data for cluster analysis. Journal of Statistical \nPlanning and Inference, Vol. 137, 3483–3496. \n \nGoder, A., Filkov, V. (2008). Consensus Clustering Algorithms: Comparison and \nRefinement. Proceedings of the Ninth Workshop on Algorithm Engineering and \nExperiments (ALENEX), San Francisco, January 19, 2008. Society for Industrial \nand Applied Mathematics. \n \nGordon, A.D. (1999). Classification. (2nd edition). Chapman & Hall/CRC, Boca \nRaton, FL. \n \nGuyon, I., Elisseeff, A. (2003). An introduction to variable and feature selection. \nJournal of Machine Learning Research, Vol. 3, 1157–1182. \n \nGünter, S., Bunke, H. (2003). Validation indices for graph clustering. Pattern \nRecognition Letters, Vol. \n
24, 1107–1113.\n \n \nGuyon, I., Gunn, S., Nikravesh, M., Zadeh, L. (eds.) (2006). Feature Extraction, \nFoundations and Applications. Series Studies in Fuzziness and Soft Computing, \nPhysica-Verlag, Springer, 2006. \n \nHalkidi, M., Batistakis, Y., Vazirgiannis, M. (2001). On clustering validation \ntechniques. Journal of Intelligent Information Systems, 17, 107–145. \n \nHan, J., Kamber, M. (2006). Data mining: concepts and techniques. Second Edition. \nMorgan Kaufmann, 2006. \n \nHandcock, M.S., Raftery, A.E., Tantrum, J. (2007). Model-based clustering for social \nnetworks (with Discussion). Journal of the Royal Statistical Society, Series A, 170, \n301–354. \n \nHandcock, M.S., Hunter, D.R., Butts, C.T., Goodreau, S.M., Morris, M. (2009). \nstatnet: Software Tools for the Representation, Visualization, Analysis and Simulation \nof Network Data. J Stat Softw. 2008, 24 (1), 1548–7660. \n \nHandl, J., Knowles, J. (2004). Multiobjective clustering with automatic determination \nof the number of clusters. Technical Report TR-COMPSYSBIO-2004-02. UMIST, \nManchester, UK.  \n \nHandl, J., Knowles, J. (2005a). Multiobjective clustering around medoids. \nProceedings of the Congress on Evolutionary Computation (CEC 2005). Vol. 1,  \n632–639. Copyright IEEE Press.  \n \nHandl, J., Knowles, J. (2005b). Improving the scalability of multiobjective clustering. \nProceedings of the Congress on Evolutionary Computation (CEC 2005). Vol. 3, \n2372–2379. Copyright IEEE Press.  \n \nHandl, J., Knowles, J. (2005c). Exploiting the trade-off – the benefits of multiple \nobjectives in data clustering. Proceedings of the Third International Conference on \nEvolutionary Multi-Criterion Optimization (EMO 2005), 547–560, LNCS 3410.  \n \nHandl, J., Knowles, J. (2006a). Multiobjective clustering and cluster validation. In \nMultiobjective machine learning edited by Yaochu Jin. Springer Series on \nComputational Intelligence 16, 21–47. \n \nHandl, J., Knowles, J. (2006b). \n
Feature subset selection in unsupervised learning via \nmultiobjective optimization. International Journal of Computational Intelligence \nResearch, 2 (3), 217–238. \n \nHandl, J., Knowles, J. (2007). An evolutionary approach to multiobjective clustering. \nIEEE Transactions on Evolutionary Computation, 11 (1), 56–76. \n \nHandl, J., Knowles, J., Kell, D.B. (2005). Computational cluster validation in post-genomic \ndata analysis. Bioinformatics, Vol. 21 (15), 3201–3212.\n \nHandl, J., Knowles, J., Kell, D.B. (2007). Multiobjective optimization in \nbioinformatics and computational biology. IEEE/ACM Transactions on \nComputational Biology, Vol. 4, 279–292. \n \nHastie, T., Tibshirani, R., Friedman, J. (2001). Elements of Statistical Learning: Data \nMining, Inference and Prediction. Springer-Verlag, New York. \n \nHinton, G.E., Salakhutdinov, R.R. (2006). Reducing the dimensionality of data with \nneural networks. Science, 313, 504–507.  \n \nHothorn, T., Hornik, K., Zeileis, A. (1996). party: A Laboratory for Recursive \nPart(y)itioning. [http://CRAN.R-project.org/package=party]. R package version \n0.9-96.  \n \nHothorn, T., Hornik, K., Zeileis, A. (2006). Unbiased Recursive Partitioning: A \nConditional Inference Framework. Journal of Computational and Graphical Statistics \n2006, 15 (3), 651–674. \n \nHubert, L., Arabie, P. (1985). Comparing Partitions. Journal of Classification, Vol. 2, \n193–218. \n \nHu, Y., Hathaway, R.J. (2008). An Algorithm for Clustering Tendency Assessment. \nWSEAS TRANSACTIONS on MATHEMATICS, Vol. 7 (7), 441–450, 2008. \n \nHuang, J.Z., Ng, M.K., Rong, H., Li, Z. (2005). Automated Variable weighting in \nk-Means type clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence,  \nVol. 27 (5), May 2005, 657–668. \n \nHubert, L., Schultz, J. (1976). Quadratic assignment as a general data-analysis \nstrategy. British Journal of Mathematical and Statistical Psychology, 29, 190–241. 
\nhttp://machaon.karanagai.com/validation_algorithms.html. \n \nHubert, M., Vandervieren, E. (2008). An adjusted boxplot for skewed distributions, \nComputational Statistics and Data Analysis, 52, 5186–5201. \n \nHubert, M., Van der Veeken, S. (2008). Outlier detection for skewed data, Journal of \nChemometrics, 22, 235–246. \n \nHurley, C.B. (2004). Clustering Visualizations of Multidimensional Data, Journal of \nComputational & Graphical Statistics, Vol. 13 (4), 788–806. \n \nHuson, D.H., Richter, D.C., Rausch, C., Dezulian, T., Franz, M., Rupp, R. (2007). \nDendroscope: An interactive viewer for large phylogenetic trees. BMC \nBioinformatics, 8, 460. \n \nIrigoien, I., Arenas, C. (2008). INCA: New statistic for estimating the number. Statist. \nMed., 27, 2948–2973.\n \n \nJain, A.K., Dubes, R.C. (1988). Algorithms for Clustering Data. Prentice Hall, \nEnglewood Cliffs, 1988. \n \nJacquez, G.M., Jacquez, J.A. (1999). Disease clustering for uncertain locations. In: \nDisease mapping and risk assessment for public health. A.B. Lawson, A. Biggeri, D. \nBohning, E. Lesaffre, J.-F. Viel, and R. Bertollini, eds. New York: John Wiley & \nSons. \n \nJacquez, G.M. (2008). Spatial Cluster Analysis. Chapter 22 In “The Handbook of \nGeographic Information Science”, S. Fotheringham and J. Wilson (Eds.). Blackwell \nPublishing, 395–416. \n \nJohn, G.H., Kohavi, R., Pfleger, K. (1994). Irrelevant features and the subset selection \nproblem. Volume 129. New Brunswick, NJ, USA, Morgan Kaufmann; 1994. \n \nJolliffe, I.T. (2002). Principal Component Analysis. 2nd-edition. New York: Springer-\nVerlag. \n \nJonnalagadda, S., Srinivasan, R. (2009). NIFTI: An Evolutionary Approach for \nFinding Number of Clusters in Microarray Data. BMC Bioinformatics, Vol. 10, p 40. \n \nKannan, R., Vempala, S., Vetta, A. (2004). On clusterings: Good, bad and spectral. 
\nJournal of the ACM, 51 (3), 497–515. \n \nKaufman, L., Rousseeuw, P.J. (1990). Finding Groups in Data. John Wiley and Sons. \nNew York. \n \nKemp, C., Tenenbaum, J.B. (2008). The discovery of structural form. Proc. Natl. \nAcad. Sci. USA 2008, 105, 10687–10692. \n \nKerr, M.K., Churchill, G.A. (2001). Bootstrapping cluster analysis: Assessing the \nreliability of conclusions from microarray experiments. PNAS, Vol. 98 (16),  \n8961–8965. \n \nKerr, G., Ruskin, H.J., Crane, M. (2007). Pattern Discovery in Gene Expression Data. \nIntelligent Data Analysis: Developing New Methodologies Through Pattern \nDiscovery and Recovery. \n \nKerr, G., Ruskin, H.J., Crane, M., Doolan, P. (2008). Techniques for Clustering Gene \nExpression Data. Computers In Biology And Medicine, 38 (3), 283–293. \n \nKettenring, J.R. (2006). The practice of cluster analysis, J. Classif., 23, 3–30. \n \nKleinberg, J. (2002). An Impossibility Theorem for Clustering. Advances in Neural \nInformation Processing Systems (NIPS) 15, 2002. \n \nKohavi, R., John, G.H. (1998). Wrappers for Feature Subset Selection. Artificial \nIntelligence, Vol. 97 (1–2), 273–324.\n \nKohavi, R., John, G.H. (1998). The Wrapper Approach. In “Feature Extraction, \nConstruction and Selection: a data mining perspective”, eds. Liu, H., Motoda, H. \n \nKohonen, T. (1982). Self-organized formation of topologically correct feature maps. \nBiological Cybernetics, 43, 59–69. \n \nKok, M.T.J., Lüdeke, M.K.B., Sterzel, T., Lucas, P.L., Walther, C., Janssen, P., de \nSoysa, I. (2010). Quantitative analysis of patterns of vulnerability to global \nenvironmental change. Den Haag: Netherlands Environmental Assessment Agency \n(PBL) 90 p. \n \nKrzanowski, W.J., Hand, D.J. (2009). A simple method for screening variables before \nclustering microarray data. Computational Statistics and Data Analysis, Vol. 43, \n2747–2753.  \n \nKulis, B., Basu, S., Dhillon, I.S., Mooney, R.J. (2009a). \n
Semi-Supervised Graph \nClustering: A Kernel Approach. Machine Learning, Vol. 74 (1), 1–22, January 2009. \n \nKulis, B., Sustik, M.A., Dhillon, I.S. (2009b). Low-Rank Kernel Learning with \nBregman Matrix Divergences. Journal of Machine Learning Research, Vol. 10,  \n341–376. \n \nLaw, M.H.C., Jain, A.K. (2006). Incremental Nonlinear Dimensionality Reduction By \nManifold Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence. Vol. 28, \n377–391. \n \nLeisch, F. (2006). A toolbox for k-centroids cluster analysis. Comput. Stat. Data \nAnal., 51 (2), 526–544. \n \nLeisch, F. (2008). Visualizing cluster analysis and finite mixture models. In: Chen, C., \nHärdle, W., Unwin, A. (eds.) Handbook of Data Visualization. Springer Handbooks \nof Computational Statistics. Springer, Berlin (2008). ISBN 978-3-540-33036-3. \n \nLeisch, F. (2009). Neighborhood graphs, stripes and shadow plots for cluster \nvisualization. Statistics and Computing, 2009. to appear. \n \nLerner, B., Guterman, H., Aladjem, M., Dinstein, I. (2000). On the Initialisation of \nSammon’s Nonlinear Mapping. Pattern Analysis & Applications, Vol. 3, 61–68. \n \nLi, G., Ma, Q., Tang, H., Paterson, A. H., Xu, Y. (2009). QUBIC: a qualitative \nbiclustering algorithm for analyses of gene expression data. Nucleic Acids Res., \nAugust 1, 2009; 37(15): e101 - e101. \n \nLiaw, A., Wiener, M. (2002). Classification and Regression by randomForest.  \nR News, 2(3), 18–22. URL http://CRAN.R-project.org/doc/Rnews/. \n \nLittle, R.J.A., Rubin, D.B. (1987). Statistical analysis with missing data. John \nWiley and Sons.\n \nLiu, H., Yu, L. (2005). Towards integrating feature selection algorithms for \nclassification and clustering. IEEE Transactions on Knowledge and Data Engineering, \n17 (3), 1–12. \n \nLuxburg, U. von. (2007). A tutorial on spectral clustering. Statistics and Computing, \n17 (4), 395–416. \n \nLuxburg, U. von. (2010). Clustering stability: an overview. \n
Foundations and Trends in \nMachine Learning, Vol. 2 (3), 235–274,  \nURL (30-08-2010): http://arxiv.org/abs/1007.1075. \n \nLuxburg, U. von., Belkin, M., Bousquet, O. (2008). Consistency of Spectral \nClustering. Annals of Statistics 36 (2), 555–586. \n \nMacCuish, J., Nicolaou, C., MacCuish, N.E. (2001). Ties in proximity and clustering \ncompounds. J. Chem. Inf. Comput. Sci., 41, 134–146. \n \nMadeira, S.C., Oliveira, A.L. (2004). Biclustering Algorithms for Biological Data \nAnalysis: A Survey. IEEE Transactions on Computational Biology and \nBioinformatics, 1 (1), 24–45. \n \nMahoney, M.W., Drineas, P. (2009). CUR matrix decompositions for improved data \nanalysis. PNAS January 20, 2009, Vol. 106 (3), 697–702. \n \nMakarenkov, V., Legendre, P. (2001). Optimal Variable Weighting for Ultrametric \nand Additive Trees and K-means Partitioning: Methods and Software. Journal of \nClassification, 18, 245–271.  \n \nMaruca, S.L., Jacquez, G.M. (2002). Area-based tests for association \nbetween spatial patterns. Journal of Geographic Systems, 4 (1), 69–83. \n \nMcCullagh, M. J. (2006). Detecting Hotspots in Time and Space. ISG06. \n  \nMcLachlan, G., Peel, D., Basford, K.E., Adams, P. (2000). The EMMIX software for \nthe fitting of mixtures of normal and t-components. Journal of Statistical Software, 4, \n1–14. \n \nMcQuitty, L.L. (1966). Similarity Analysis by Reciprocal Pairs for Discrete and \nContinuous Data. Educational and Psychological Measurement, 26, 825–831. \n \nMeila, M. (2007). Comparing clusterings – an information based distance. Journal of \nMultivariate Analysis, 98, 873–895. \n \nMelnykov, V., Maitra, R. (2010). Finite mixture models and model-based clustering. \nStatistics Surveys, 2010, Vol. 4, 80–116. \n \nMilligan, G.W. (1980). An examination of the effect of six types of error perturbation \non fifteen clustering algorithms. Psychometrika, 45, 325–342.\n \nMilligan, G.W., Cooper, M.C. (1985). \n
An examination of procedures for determining \nthe number of clusters in a data set. Psychometrika, 50, 159–179. \n \nMilligan, G.W., Cooper, M.C. (1987). Methodology Review: Clustering Methods. \nApplied Psychological Measurement, Vol. 11 (4), 329–354.  \n \nMilligan, G.W., Mahajan, V. (1980). A note on procedures for testing the quality of a \nclustering of a set of objects. Decision Sciences, 11, 669–677.  \n \nMilligan, G.W. (1989). A validation study of a variable weighting algorithm for \ncluster analysis. Journal of Classification, 6 (1), 53–71. \n \nMilligan, G.W. (1996). Clustering validation: results and implications for applied \nanalyses. In P. Arabie, L. J. Hubert, and G. De Soete, editors, Clustering and \nClassification, pages 341–375. World Scientific Publishing, River Edge, NJ, 1996. \n \nMingoti, S.A., Lima, J.O. (2006). Comparing SOM neural network with Fuzzy  \nc-means, K-means and traditional hierarchical clustering algorithms. European \nJournal of Operational Research, 174, 1742–1759. \n \nMirkin, B. (2005). Cluster Analysis for Data Mining: A Data Recovery Approach. \nCRC Press. \n \nMishra, N., Schreiber, R., Stanton, I., Tarjan, R.E. (2007). Clustering Social Networks. \nIn: A. Bonato and F.R.K. Chung (Eds.): WAW 2007, LNCS 4863, pp. 56–67, 2007. \n \nMoguerza, J.M., Muñoz, A., Martin-Merino, M. (2002). Detecting the number of \nclusters using a support vector machine approach. Proc. International Conference on \nArtificial Neural Networks. Lecture Notes in Comput. Sci. 2415, 763–768. Springer, \nBerlin. \n \nMonti, S., Tamayo, P., Mesirov, J., Golub, T. (2003). Consensus clustering: A \nresampling-based method for class discovery and visualization of gene expression \nmicroarray data. Mach. Learn., 52, 91–118. \n \nMoran, P.A.P. (1948). The interpretation of statistical maps. Journal of the Royal \nStatistical Society, Series B., Vol. 10, 243–251. \n \nMorgan, B.J.T., Ray, A.P.G. (1995). Non-uniqueness and inversions in cluster \nanalysis. 
Applied Statistics, 44, 117–134. \n \nMoya-Anegón, F., Herrero-Solana, V., Jiménez-Contreras, E. (2006). A connectionist \nand multivariate approach to science maps: the SOM, clustering and MDS applied to \nlibrary and information science research. Journal of Information Science, 32 (1), \n63–77. \n \nMurtagh, F., Hernández-Pajares, M. (1995). The Kohonen self-organizing map \nmethod: An assessment. Journal of Classification, Vol. 12 (2), 165–190. \n \nNardo, M., Saisana, M., Saltelli, A., Tarantola, S., Hoffman, A., Giovannini, E. \n(2005). Handbook on Constructing Composite Indicators: Methodology and User \nGuide. OECD Statistics Working Papers.  \n \nNelson, T.A., Boots, B. (2008). Detecting spatial hot spots in landscape ecology. \nEcography, Vol. 31 (5), 556–566. \n \nNewman, M.E.J. (2003). The structure and function of complex networks. SIAM \nReview, 45 (2), 167–256. \n \nNewman, M.E.J. (2004). Fast algorithm for detecting community structure in \nnetworks. Phys. Rev. E 69, 066133. \n \nNewman, M.E.J., Leicht, E.A. (2007). Mixture models and exploratory analysis in \nnetworks. PNAS, Vol. 104 (23), 9564–9569. \n \nNg, A.Y., Jordan, M., Weiss, Y. (2002). On spectral clustering: Analysis and an \nalgorithm. In Advances in Neural Information Processing Systems (NIPS), 2002. \n \nBorg, I., Groenen, P. (1997). Modern multidimensional scaling: theory and \napplications. New York: Springer Verlag. \n \nOlex, A.L., John, D.J., et al. (2007). Additional limitations of the clustering validation \nmethod figure of merit. 45th ACM Southeast Annual Conference, Winston-Salem, \nNC. \n \nOrd, J.K., Getis, A. (2001). Testing for local spatial autocorrelation in the presence of \nglobal autocorrelation. Journal of Regional Science, Vol. 41 (3), 411–432. \n \nPalla, G., Derényi, I., Farkas, I., Vicsek, T. (2005). Uncovering the overlapping \ncommunity structure of complex networks in nature and society. Nature, 435,  \n814–818.  \n \nPipino, L.L., Funk, J.D., Wang, R.Y. 
(2006). Journey to Data Quality. MIT Press Ltd, \n2006. \n \nPison, G., Struyf, A., Rousseeuw, P.J. (1999). Displaying a clustering with \nCLUSPLOT. Comput. Stat. Data Anal., 30, 381–392. \nftp://ftp.win.ua.ac.be/pub/preprints/99/Disclu99.pdf. \n \nPremo, L.S. (2004). Local spatial autocorrelation statistics quantify multi-scale \npatterns in distributional data: an example from the Maya Lowlands. Journal of \nArchaeological Science, Vol. 31, 855–866. \n \nRaftery, A.E., Dean, N. (2006). Variable Selection for Model-Based Clustering. \nJournal of the American Statistical Association, Vol. 101 (473), 168–178. \n \nRahm, E., Do, H.H. (2000). Data cleaning: Problems and current approaches. IEEE \nData Engineering Bulletin, 23 (4), 3–13. \n \nRoth, V., Lange, T., Braun, M., Buhmann, J. (2002). A Resampling Approach to \nCluster Validation. \nhttp://informatik.unibas.ch/personen/roth_volker/PUB/compstat02.pdf \n \nRousseeuw, P.J. (1987). Silhouettes: a graphical aid to the interpretation and \nvalidation of cluster analysis. J. Comput. Appl. Math, 20, 53–65. \n \nRousseeuw, P.J., Ruts, I., Tukey, J.W. (1999). The bagplot: A bivariate boxplot. Am. \nStat., 53 (4), 382–387. \n \nRousseeuw, P.J., Debruyne, M., Engelen, S., Hubert, M. (2006). Robustness and \noutlier detection in chemometrics. Critical Reviews in Analytical Chemistry, 36,  \n221–242. \n \nRunkler, T.A. (2000). Information Mining - Methoden, Algorithmen und \nAnwendungen intelligenter Datenanalyse. Vieweg, Wiesbaden, 2000.  \n \nSaeys, Y., Inza, I., Larrañaga, P. (2007). A review of feature selection techniques in \nbioinformatics. Bioinformatics, Vol. 23 (19), 2507–17. \n \nSaitta, S., Raphael, B., Smith, I.F.C. (2007). A Bounded Index for Cluster Validity. \nIn: P. Perner (Ed.), Machine Learning and Data Mining in Pattern Recognition, LNAI \n4571, Springer Verlag, Heidelberg, pp. 174–187, 2007. \n \nSaitta, S., Raphael, B., Smith, I.F.C. (2008). A Comprehensive Validity Index for \nClustering. 
Accepted for publication in the Journal of Intelligent Data Analysis, 2008. \n \nSalvador, S., Chan, P. (2004). Determining the Number of Clusters/Segments in \nHierarchical Clustering/Segmentation Algorithms. Proc. 16th IEEE Intl. Conf. on \nTools with AI, 576–584, 2004. \n \nSaraiya, P., North, C., Duca, K. (2005). An insight-based methodology for evaluating \nbioinformatics visualizations. IEEE Trans. on Visualization and Computer Graphics, \nVol. 11, 443–456. \n \nScharl, T., Leisch, F. (2009). gcExplorer: Interactive Exploration of Gene Clusters. \nBioinformatics, Vol. 25 (8), 1089–1090. \n \nSchölkopf, B., Smola, A., Müller, K.-R. (1999). Kernel Principal Component \nAnalysis. In: Bernhard Schölkopf, Christopher J. C. Burges, Alexander J. Smola \n(Eds.), Advances in Kernel Methods-Support Vector Learning, 1999, MIT Press \nCambridge, MA, USA, 327–352. ISBN 0-262-19416-3.  \n \nScholz, M., Kaplan, F., Guy, C.L., Kopka, J., Selbig, J. (2005). Non-linear PCA:  \na missing data approach. Bioinformatics, 21, 3887–3895.  \n \nSchonlau, M. (2002). The clustergram: a graph for visualizing hierarchical and \nnon-hierarchical cluster analyses. The Stata Journal, 2002, 2 (4), 391–402. \n \nSchonlau, M. (2004). Visualizing Hierarchical and Non-Hierarchical Cluster Analyses \nwith Clustergrams. Computational Statistics, 2004, 19 (1), 95–111. \n \nShi, J., Malik, J. (2000). Normalized cuts and image segmentation. IEEE Trans. \nPAMI, 22 (8), 888–905. \n \nShi, T., et al. (2005). Tumor classification by tissue microarray profiling: random \nforest clustering applied to renal cell carcinoma. Modern Pathology, 18, 547–557. \n \nShi, T., Horvath, S. (2006). Unsupervised Learning with Random Forest Predictors. \nJournal of Computational and Graphical Statistics, Vol. 15 (1), 118–138. \n \nSietz, D., Lüdeke, M.K.B., Walther, C. (2011). Categorisation of typical vulnerability \npatterns in global drylands. Global Environmental Change, 21, 431–440. \n \nSilver, M. (1995). 
Scales of Measurement and Cluster Analysis: An Application \nConcerning Market Segments in the Babyfood market. The Statistician, Vol. 44 (1), \n101–112. \n \nSmyth, C.W., Coomans, D.H. (2006). Parsimonious Ensembles for Regression. The \n38th Symposium on the Interface of Statistics, Computing Science and Applications: \nMassive Data Sets and Streams. Interface Foundation of North America, Pasadena, \nCalifornia, 54–54. \n \nSmyth, C.W., Coomans, D.H., Everingham, Y.L. (2006a). Clustering noisy data in a \nreduced dimension space via multivariate regression trees. Pattern Recognition,  \nVol. 39, 424–431. \n \nSmyth, C.W., Coomans, D.H., Everingham, Y.L., Hancock, T.P. (2006b). \nAuto-associative Multivariate Regression Trees for Cluster Analysis. Chemometrics and \nIntelligent Laboratory Systems, Vol. 80, 120–129. \n \nSmyth, C.W., Coomans, D.H. (2007). Predictive weighting for cluster ensembles. \nJournal of Chemometrics, Vol. 21, 364–375. \n \nSpaans, M., Heiser, W.J. (2005). Instability of hierarchical cluster analysis due to \ninput order of the data: The PermuCLUSTER solution. Psychological Methods,  \n10 (4), 468–476. \n \nSteinley, D. (2004). Properties of the Hubert-Arabie adjusted Rand index. \nPsychological Methods, 9, 386–396. \n \nSteinley, D. (2006). K-means clustering: A half-century synthesis. British Journal of \nMathematical and Statistical Psychology, 59, 1–34. \n \nSteinley, D. (2008). Stability analysis in K-means clustering. British Journal of \nMathematical and Statistical Psychology, 61, 255–273. \n \nSteinley, D., Brusco, M.J. (2007). Initializing K-means batch clustering: A critical \nevaluation of several techniques. Journal of Classification, 24, 99–121. \n \nSteinley, D., Brusco, M.J. (2008a). A new variable weighting and selection procedure \nfor K-means cluster analysis. Multivariate Behavioral Research, Vol. 43, 77–108. \n \nSteinley, D., Brusco, M.J. (2008b). 
Selection of Variables in Cluster Analysis: An \nEmpirical Comparison of Eight Procedures. Psychometrika, Vol. 73 (1), 125–144. \n \nStrehl, A., Ghosh, J. (2002). Cluster ensembles – a knowledge reuse framework for \ncombining multiple partitions. Journal of Machine Learning Research, 3, 583–617. \n \nStrobl, C., Boulesteix, A.L., Kneib, T., Augustin, T., Zeileis, A. (2008). Conditional \nvariable importance for random forests. BMC Bioinformatics 2008, 9, 307. \n \nStrobl, C., Boulesteix, A.L., Zeileis, A., Hothorn, T. (2007). Bias in Random Forest \nVariable Importance Measures: Illustrations, Sources and a Solution. BMC \nBioinformatics 2007, 8, 25. \n \nSvetnik, V., Liaw, A., Tong, C., Culberson, J.C., Sheridan, R.P., Feuston, B.P. (2003). \nRandom Forest: A Classification and Regression Tool for Compound Classification \nand QSAR Modeling. Journal of Chemical Information and Computer Sciences, 43, \n1947–1958. \n \nTan, J., Zhang, J., Li, W. (2010). An Improved Clustering Algorithm Based on \nDensity Distribution Function. Computer and Information Science, Vol. 3 (3), August \n2010, 23–29. URL (31-08-2010): \nhttp://ccsenet.org/journal/index.php/cis/article/viewFile/6891/5426. \n \nTanay, A., Sharan, R., Shamir, R. (2002). Discovering statistically significant \nbiclusters in gene expression data. Bioinformatics, Vol. 18 (9), S136–S144. \n \nTenenbaum, J.B., de Silva, V., Langford, J.C. (2000). A global geometric framework \nfor nonlinear dimensionality reduction. Science, Vol. 290, 2319–2323. \n \nTian, T., James, G., Wilcox, R. (2009). A Multivariate Adaptive Stochastic Search \nMethod for Dimensionality Reduction in Classification. Annals of Applied Statistics, \n4, 339–364. \n \nTibshirani, R., Walther, G., Hastie, T. (2001). Estimating the number of clusters in a \ndataset via the gap statistic. Journal of the Royal Statistical Society B 2001, 63, 411–423. \n \nTibshirani, R., Walther, G. (2005). Cluster Validation by Prediction \nStrength. 
Journal of Computational & Graphical Statistics, 14, 511–528. \n \nTsai, C.Y., Chiu, C.C. (2008). Developing a feature weight self-adjustment \nmechanism for a K-means clustering algorithm. Computational Statistics & Data \nAnalysis, Vol. 52 (10), 4658–4672. \n \nvan der Laan, M. (2006). Statistical Inference for Variable Importance. International \nJournal of Biostatistics, 2 (1), 1008–1008.\n\n70 \n \nVarshavsky, R., Gottlieb, A., Linial, M., Horn, D. (2006). Novel unsupervised feature \nfiltering of biological data. Bioinformatics, Vol. 22, e507–e513. \n \nVarshavsky, R., Gottlieb, A., Horn, D., Linial, M. (2007). Unsupervised feature \nselection under perturbations: meeting the challenges of biological data. \nBioinformatics, Vol. 23 (24), 3343–3349. \n \nVesanto, J. (1999). SOM-based data visualization methods, Intelligent Data Analysis, \n3 (2), 111–126. \n \nVesanto, J., Alhoniemi, E. (2000). Clustering of the Self-Organizing Map. \nIEEE Trans. On Neural Networks, Vol. 11, 586–600. \n \nVinh, N.X., Epps, J., Bailey, J. (2009). Information Theoretic Measures for \nClusterings Comparison: Is a Correction for Chance Necessary? Proceedings of the \n26th International Conference on Machine Learning, 2009. \n \nXing, E.P. (2003). Feature Selection in Microarray Analysis, in D.P. Berrar, W. \nDubitzky and M. Granzow (Eds.), A Practical Approach to Microarray Data Analysis, \nKluwer Academic Publishers, 2003. \n \nXu, R., Wunsch, D. (2008). Clustering. IEEE Press Series on Computational \nIntelligence. John Wiley and Sons. \n \nYan, D., Huang, L., Jordan, M.I. (2009). Fast approximate spectral clustering. \nInternational Conference on Knowledge Discovery and Data Mining \nProceedings of the 15th ACM SIGKDD international conference on Knowledge \ndiscovery and data Paris, France Pages 907–916.  \n \nYeung, K.Y., Haynor, D.R., Ruzzo, W.L. (2001). Validating clustering for analysis \nfor clustering gene expression data. Bioinformatics, Vol. 17 (4), 309–318. 
\n \nYeung, K.Y., Ruzzo, W.L. (2001). Principal component analysis for clustering gene \nexpression data. Bioinformatics, Vol. 17 (9), 763–774. \n \nYiang, M.K.A., Kumar, A. (2005). A comparative analysis of an extended SOM \nnetwork and K-means analysis. Journal International Journal of Knowledge-Based \nand Intelligent Engineering Systems, Vol. 8 (1), 9–15. \n \nYu, L. (2007). Feature Selection for Genomic Data Analysis. In H. Liu, editor, \nComputational Methods for Feature Selection, Chapman and Hall/CRC Press, 2007. \n \nWagstaff, K.L., Laidler, V. (2005). Making the Most of Missing Values: Object \nClustering with Partial Data in Astronomy. Astronomical Data Analysis Software and \nSystems XIV; ASP Conference Series 2005, P 2.1.25. \n \nWaller, N.G., Kaiser, H.A., Illian, J.B., Manry, M. (1998). Cluster analysis with \nKohonen neural networks. Psychometrika, Vol. 63, 5–22.\n\n71 \n \nWinters-Hilt, S., Yelundur, A., McChesney, C., Landry, M. (2006). Support Vector \nMachine Implementations for Classification & Clustering. BMC Bioinformatics 2006, \n7(Suppl 2):S4 doi:10.1186/1471-2105-7-S2-S4. \n \nWinters-Hilt, S., Merat, S. (2007). SVM clustering. BMC Bioinformatics 2007,  \n8 (Suppl 7):S18 doi:10.1186/1471-2105-8-S7-S18. \n \nWu, C.-J., Kasif, S. (2005). GEMS: a web server for biclustering analysis of \nexpression data. Nucleic Acids Research 2005 33(Web Server Issue):W596-W599.  \n \nWu, K.L., Yang, M.S. (2005). A cluster validity index for fuzzy clustering, Pattern \nRecognition Lett., Vol. 26, 1275–1291. \n \nWu, K.L., Yang, M.S., Hsieh, J.N. (2009). Robust cluster validity indexes. Pattern \nRecognition. Vol. 42 (11), 2541–2550.  \n \nZadeh, R.B., Ben-David, S. (2009). A Uniqueness Theorem for Clustering. \nProceedings of UAI 2009.\n\n72 \n \nAppendix A: The R software environment \n \nR is a software environment for data manipulation, calculation, and graphical display, and \nserves both as an environment and a programming language. 
R is available as Free Software \nin source code form under the terms of the Free Software Foundation’s GNU General Public \nLicense. R runs on a wide variety of platforms (Unix, Linux, Windows, MacOS, FreeBSD). \nSources and binaries of R can be downloaded at http://www.r-project.org. Installation of R is \nvery simple and a variety of packages can be added directly from the web site (e.g. Brock et \nal., 2008). R has a very active development community and many resources can be found, \nincluding user guides, manuals, script samples, newsgroups, and mailing lists (e.g. Venables \net al., 2002). In addition, numerous introductory texts exist, such as Paradis (2002) and \nMaindonald (2008). R is a command line application. Its integrated object oriented language \nallows for efficient data manipulation. Although using R requires programming, scripts can be \ndeveloped and reused to automate analyses and provide additional functionality. Graphical user \ninterfaces (GUIs) have been developed for certain applications to avoid user programming \n(see, for example, R Commander). \nR offers a wide variety of functions for cluster analysis, as illustrated on the web page \nhttp://cran.r-project.org/web/views/Cluster.html. In this background document we will present \na number of examples implemented in R. See also Appendix B, which illustrates \nsome of the functionality of R for performing cluster analysis. \n \nCiting R: \nR Development Core Team (2005). R: A language and environment for statistical computing. R \nFoundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL: \nhttp://www.R-project.org. \n \nBrock, Pihur, Datta Su., Datta So., clValid: An R Package for Cluster Validation, Journal of Statistical \nSoftware, Volume 25, Issue 4, 2008 \n \nMaindonald, Using R for Data Analysis and Graphics - Introduction, Code and Commentary, Centre \nfor Mathematics and Its Applications, Australian National University. 
2008 \n \nParadis, E., R for Beginners, Montpellier, 2002 \n \nVenables, Smith and the R Development Core Team An Introduction to R, Network Theory Limited, \nBristol, 2002\n\n73 \n \nAppendix B: Cluster analysis in R27 \n \nR has an amazing variety of functions for performing cluster analysis. In this appendix three \nof the many approaches will be described: hierarchical agglomerative, partitioning, and model \nbased. While there are no best solutions for the problem of determining the number of clusters \nto extract, several approaches are given below.  \n \nData preparation  \nPrior to clustering data, you may want to remove or estimate missing data and rescale \nvariables for comparability. \n \n# Prepare Data \nmydata <- na.omit(mydata) # listwise deletion of missing \nmydata <- scale(mydata) # standardize variables  \n \nPartitioning \nK-means clustering is the most popular partitioning method. It requires the analyst to specify \nthe number of clusters to extract. A plot of the within groups sum of squares by number of \nclusters extracted can help determine the appropriate number of clusters. The analyst looks for \na bend in the plot similar to a scree test in factor analysis. See Everitt & Hothorn (pg. 251).  \n \nDetermine number of clusters \n \n# Determine number of clusters \nwss <- (nrow(mydata)-1)*sum(apply(mydata,2,var)) \nfor (i in 2:15) wss[i] <- sum(kmeans(mydata,  \n   centers=i)$withinss) \nplot(1:15, wss, type=\"b\", xlab=\"Number of Clusters\", \n  ylab=\"Within groups sum of squares\")  \n \n \n \n \n \n                                                     \n27 This appendix is taken from the information about QuickR (see \nhttp://www.statmethods.net/advstats/cluster.html). 
See also \nhttp://inference.us/SolutionPlatform/Documents/R/Cluster%20Analysis.pdf \n \nK-Means cluster analysis \n \n# 5 cluster solution \nfit <- kmeans(mydata, 5) # 5 cluster solution \n \n# get cluster means  \naggregate(mydata,by=list(fit$cluster),FUN=mean) \n \n# append cluster assignment \nmydata <- data.frame(mydata, fit$cluster)  \n \nA robust version of K-means based on medoids can be invoked by using pam( ) from the \ncluster package instead of kmeans( ). The function pamk( ) in the fpc package is a wrapper \nfor pam that also prints the suggested number of clusters based on optimum average \nsilhouette width.  \n \nHierarchical agglomerative \n \nThere is a wide range of hierarchical clustering approaches; Ward's method, described \nbelow, is a popular one.  \n \nWard hierarchical clustering \n \n# Ward Hierarchical Clustering \n \n# distance matrix \nd <- dist(mydata, method = \"euclidean\")  \n \n# \"ward\" was renamed \"ward.D\" in R 3.1.0; \"ward.D2\" implements Ward's criterion \nfit <- hclust(d, method=\"ward.D2\")  \n \n# display dendrogram  \nplot(fit)  \n \n# cut tree into 5 clusters \ngroups <- cutree(fit, k=5)  \n \n# draw dendrogram with red borders around the 5 clusters  \nrect.hclust(fit, k=5, border=\"red\")  \n \n \nThe pvclust( ) function in the pvclust package provides p-values for hierarchical clustering \nbased on multiscale bootstrap resampling. Clusters that are highly supported by the data will \nhave large p values. Interpretation details are provided by Suzuki (see footnote 28). Be aware \nthat pvclust clusters columns, not rows. Transpose your data before using.  
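The remark about rows and columns can be checked with base R alone. The following is a minimal sketch under stated assumptions: the matrix toy is made-up data, and the method name 'ward.D2' requires R 3.1.0 or later; neither comes from the QuickR text. The pvclust call is shown only as a comment because it needs the add-on package.

```r
# Made-up data for illustration: 30 objects (rows) described by 4 variables (columns)
set.seed(1)
toy <- scale(matrix(rnorm(120), nrow = 30, ncol = 4))

# dist() measures distances between rows, so this tree groups the 30 objects
fit_rows <- hclust(dist(toy), method = 'ward.D2')

# After transposing, the same call groups the 4 variables instead
fit_cols <- hclust(dist(t(toy)), method = 'ward.D2')

length(fit_rows$order)   # number of leaves in the object tree
length(fit_cols$order)   # number of leaves in the variable tree

# pvclust() treats the columns of its input as the items to cluster, so bootstrap
# p-values for the object tree would require the transposed matrix:
# fit_pv <- pvclust::pvclust(t(toy), method.hclust = 'ward.D2', method.dist = 'euclidean')
```

Comparing the two leaf counts makes it easy to confirm which dimension of the data matrix is actually being clustered before running the slower bootstrap.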
\n \n \nWard hierarchical clustering with bootstrapped p values \n \n# Ward Hierarchical Clustering with Bootstrapped p values \n \n# pvclust clusters columns; pass t(mydata) to cluster the rows (objects) \nlibrary(pvclust) \nfit <- pvclust(mydata, method.hclust=\"ward\", \n   method.dist=\"euclidean\") \n \n# dendrogram with p values \n \nplot(fit)  \n \n# add rectangles around groups highly supported by the data \npvrect(fit, alpha=.95)  \n \n \n \n \nModel based approaches \n \nModel based approaches assume a variety of data models and apply maximum likelihood \nestimation and Bayes criteria to identify the most likely model and number of clusters. \nSpecifically, the Mclust( ) function in the mclust package fits parameterized Gaussian \nmixture models by EM, initialized by hierarchical clustering, and selects the optimal model \naccording to BIC. One chooses the model and number of clusters with the largest BIC. \nSee help(mclustModelNames) for details on the model chosen as best.  \n \n \n28 See http://www.is.titech.ac.jp/~shimo/prog/pvclust/ \n \nModel based clustering \n \n# Model Based Clustering \nlibrary(mclust) \nfit <- Mclust(mydata) \n \n# plot results (recent mclust versions take the data from the fitted object) \nplot(fit)  \n \n# display the best model  \nsummary(fit) \n \nPlotting cluster solutions  \n \nIt is always a good idea to look at the cluster results. 
\n \nK-Means clustering with 5 clusters \n \n# K-Means Clustering with 5 clusters \nfit <- kmeans(mydata, 5) \n \nCluster plot against 1st 2 principal components \n \n# Cluster Plot against 1st 2 principal components \n \n# vary parameters for most readable graph \nlibrary(cluster)  \nclusplot(mydata, fit$cluster, color=TRUE, shade=TRUE,  \n   labels=2, lines=0) \n \n \n \nCentroid plot against 1st 2 discriminant functions \n \n# Centroid Plot against 1st 2 discriminant functions \nlibrary(fpc) \nplotcluster(mydata, fit$cluster)  \n \n \n \nValidating cluster solutions \n \nThe function cluster.stats() in the fpc package provides a mechanism for comparing the \nsimilarity of two cluster solutions using a variety of validation criteria (Hubert's gamma \ncoefficient, the Dunn index and the corrected Rand index).  \n \ncomparing 2 cluster solutions \n \n# comparing 2 cluster solutions \nlibrary(fpc) \ncluster.stats(d, fit1$cluster, fit2$cluster)  \n \nwhere d is a distance matrix among objects, and fit1$cluster and fit2$cluster are integer \nvectors containing classification results from two different clusterings of the same data. \n \nExample R-script for clustering \n \nThe following R script is divided into four functions: “consistency”, “sPIKcentres”, \n“initial” and “clus_graphs”. The first function runs the outer loop that repeats the \npairwise dissimilarity calculation and the loop over the number of clusters, and it performs \nthe two clusterings of each pair. The second function carries out the dissimilarity calculation \nitself. The “initial” function initializes kmeans with the result of hclust. The \nlast function produces graphical representations of the cluster result.  \nThe user settings are defined in the last part of the script. The script can be used to \ncalculate the consistency measure and, subsequently, to cluster the data with the best number \nof clusters. 
\nThe format of data has to be: rows represent the objects and columns represent the features of \nthe objects. \n \n## ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ \nconsistency <- function(Xdata,NmaxCluster,master,cm)  {  \n \n \n  Ndata <- dim(Xdata)[1]              # total number of datapoints \n  Clust_cl <- matrix(0,Ndata,2)     # storing the cluster-class results for the two trial-clusterings \n  X0 <- matrix(0,NmaxCluster,master)           # storing info on every run for consist.meas. \n  C0 <- matrix(0,NmaxCluster,1)                    # Init.matrix with average consist.meas. \n  G0 <- matrix(0,Ndata,1)  \n \n# Matrix for best cluster result \n  ResMat <- list(MeanC=C0,SpecR=X0,Gold=G0) # global list for returning after calculation \n  ifelse(cm,whl<-2,whl<-1) \n \n  ifelse(cm,NminCluster<-2,NminCluster<-NmaxCluster) \n \n  for (iOuter in 1:master) {    \n# outer-loop for comparing pairs of clusterings  \n    for (iClus in NminCluster:NmaxCluster) {   \n# Number of Clusters to be analysed \n      for (iInner in 1:whl) {  \n        N_sel <- max(NmaxCluster,round(Ndata/200,0)) \n        ss  <- sample(1:Ndata)     \n  \n# random permutation of data set            \n        sss <- ss[1:N_sel]      \n \n# First N_sel indices of random permutation \n        Xdata_sel <- Xdata[sss,]   \n        while( length(unique(rowSums(Xdata_sel)))<iClus ) { \n          ss  <- sample(1:Ndata)   ;  sss <- ss[1:N_sel]   ;   Xdata_sel <- Xdata[sss,] }   \n        centro    <- initial(Xdata_sel,iClus)      \n        indRand <- sample(1:Ndata)  \n \n# reshuffling  \n        Xdata_shuffle <- Xdata[indRand,]           \n# shuffled data \n        cl_kmeans <- kmeans(Xdata_shuffle,centro,iter.max=50)  # clustering with centro initializatin        \n        Clust_cl[indRand,iInner] <- cl_kmeans$cluster  # assign classes as indexed by non-shuffled data \n      }    \n      ifelse(cm , {  \n \n# Evaluate dissimilarities for the clustering-pairs \n        
ResMat$SpecR[iClus,iOuter] <- sPIKcentres(Xdata,Clust_cl[,1],Clust_cl[,2],Iheur=1)  \n        } , { \n        for (j in 1:iClus) { \n \n# withinclustersum ~~~~~~~~~~~~~ \n          clu_diff <- 0  \n          clu_diff <- Xdata_shuffle[which(cl_kmeans$cluster==j),]-(matrix(1,cl_kmeans$size[j],1)  \n \n%*%colMeans(Xdata_shuffle[which(cl_kmeans$cluster==j),]))  \n          ResMat$SpecR[iClus,iOuter] <- ResMat$SpecR[iClus,iOuter] + sum(clu_diff*clu_diff) } \n        ifelse(ResMat$SpecR[iClus,iOuter]==min(ResMat$SpecR[iClus,1:iOuter]) , gold <- Clust_cl[,1]         \n        } \n)   }   }  \n        ifelse (cm , { for (iClus in 2:NmaxCluster) { ResMat$MeanC[iClus,] <-  \n \nwith(ResMat,mean(SpecR[iClus,1:master])) } },# average-value for consistency measure     \n \n \n{ ResMat$Gold <- gold } ) \n \nreturn(ResMat) \n}     # end function consistency \n \n## ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ \nsPIKcentres <-   function(dataCl,clust1,clust2,Iheur=1)  {\n\n81 \n \n             \n  Ncl1 <- max(clust1)      # maximum number of clusterclasses   \n  Ncl2 <- max(clust2)      # maximum number of clusterclasses \n  Nclmin <- min(Ncl1,Ncl2) # minimum of Ncl1 and Ncl2 \n  \n  ## Determine cluster centers       --> # matrix of cluster-centres for clustering 1 and 2 \n  cent1=rbind() ; for (i in 1:Ncl1) { ifelse(length(which(clust1==i))<2 , cent1 <-  \nrbind(cent1,dataCl[which(clust1==i),]) , cent1 <-   \nrbind(cent1,colMeans(dataCl[which(clust1==i),])) )}     \n  cent2=rbind() ; for (i in 1:Ncl2) { ifelse(length(which(clust2==i))<2 , cent2 <-  \nrbind(cent2,dataCl[which(clust2==i),]) , cent2 <-  \nrbind(cent2,colMeans(dataCl[which(clust2==i),])) )}     \n                  \n  ## Determine the distance matrix  of cluster-centers \n  Distmat <- matrix(0,Ncl1,Ncl2) \n  Distmat <- as.matrix(dist(rbind(cent1,cent2)))[1:Ncl1,(1:Ncl2)+Ncl1]  \n  ## Determination of association on basis of distances between clusters \n  match.listb <- 
array(0,length<-Ncl2)     # initialising list for renaming clusters \n  xft_tmp <- Distmat                       # storing Distmat in intermediate matrix \n  xft_max <- max(xft_tmp)+1        # setting an upperlimit to values of xft_tmp \n  for (d2 in 1:Nclmin) { \n    cc <- which(xft_tmp==min(xft_tmp),arr.ind=T)[1,2]    # in which column is minimum (ref to clu1) \n    rr <- which(xft_tmp==min(xft_tmp),arr.ind=T)[1,1]    # in which row is minimum (ref to clu2) \n    match.listb[cc] <- rr   ## the cc-th cluster of clus.2  corresponds the to rr-th  cluster of the clus.1  \n    xft_tmp[rr,] <- xft_max ; xft_tmp[,cc] <- xft_max }   \n  match.listb[which(match.listb==0)] <-  max.col(-t(Distmat[,which(match.listb==0)])) \n  clust2A <-  match.listb[clust2]    # second clustering in terms of its association with the first clus.          \n  res <- length(which(clust2A==clust1))/(length(clust1))  # count of fraction of replicates                  \n  return(res) } \n \n## ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ \ninitial <- function(Xdata_sel,k) { # function for initializing Kmeans \n \n  geo_dist <- (dist(Xdata_sel))               \n# distance matrix of part of data set \n  cl_hcl   <- hclust(geo_dist,method=\"ward\") # hclust with method: ward \n  ser      <- as.vector(cutree(cl_hcl,k))     \n# cut the tree into k clusters \n  cluster  <- list()                          \n# initializing to empty list \n  for (i in 1:k) { cluster[[i]] <- which(ser==i) } \n  centro <- matrix(ncol=ncol(Xdata_sel),nrow=k)   # storing cluster-centers \n  for (i in 1:length(cluster)){ \n    for (j in 1:ncol(Xdata_sel)){ \n      centro[i,j] <- mean(Xdata_sel[cluster[[i]],j]) \n    }  }  \n   return(centro)  } \n \n## ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ \nclus_graphs <- function(gold,clu,clu_dim) { \n \n \n  ## worldmap \n  world <- matrix(scan(\"~/AT_CLUSTERUNG/R-Script+Data/geo_maske.dat\"),ncol=1) ## land mask \n  for (z in 1:clu_dim[1]) 
{world[clus_dat[z,1]] <- gold[z]}              \n  x11(11,8) ; par(mar=c(2,2,2,1)) \n  is.na(world)<-which(world==0,arr.ind=T)       ## all zeros out \n  z.a <- matrix(world,720,360)[,360:1] \n  for (i in 0:(clu-1)) z.a[(i*20):(i*20+20),51:70]<-i+1 \n  farb<-c(rgb(0,0,0),rgb(1,0.6,0),rgb(1,1,0.3),rgb(0.5,0.5,0.5),rgb(0,1,0),rgb(0.5,0,0.5)  \n \n,rgb(1,0,0.3),rgb(0,0,1),rgb(0.2,1,1),rgb(1,0.5,0)) \n  farb <- farb[c(9,4,7,3,8,6,1,5,2,10)] \n  image(1:720,1:360,z.a,col=c(grey(0.9),farb[1:clu]),xlim=c(0,720),ylim=c(50,360), \nmain=paste(\"run.ident: \",round(min(clus_res$SpecR[clu,]),4),sep=\"\"))\n\n82 \n \n  x11(11,4);par(mfrow=c(1,clu),mar=c(2.5,1.8,1.4,0.3))  \n  size <- array(0,clu) \n  for (j in 1:clu) {size[j]<-length(which(gold==j))} \n  for (k in 1:clu) { \n    bpdata <- as.data.frame(clus_dat[,feat]) \n    bpdata[which(gold!=k,arr.ind=T),]<-NA \n    bpdata <- na.omit(bpdata) \n    boxplot(bpdata, whisklty=0, staplelty=0, col=farb[k], outline=F, main=paste(\"C\",k,\": \", \nsize[k]))->boxinfo \n    for (dd in 1:ncol(bpdata)) { \n      cen <- quantile(bpdata[,dd],  probs=c(5,95)/100) \n      segments(dd,boxinfo$stats[4,dd],dd,as.numeric(cen[2]),col=\"black\",lwd=1,lty=3) \n      segments(dd,boxinfo$stats[2,dd],dd,as.numeric(cen[1]),col=\"black\",lwd=1,lty=3) \n      points(dd,as.numeric(cen[1]),col=\"black\",pch=1) \n      points(dd,as.numeric(cen[2]),col=\"black\",pch=1) \n    } \n     mean.cl <- c(colMeans(bpdata,na.rm=T)) \n     points(c(1:length(feat)),mean.cl,pch=1,col=9,cex=1.4)   \n  } } \n############################################################# \n## PARAMETERS THAT HAVE TO BE DEFINED BY USER ~~~~~~~~ \n## \n  namIndicat <- \"choose name\"                     \n  namIndDir =   \"choose directory\" \n  colIndFile <- 9 \n  featurenames <- c(\"choose list of feature names\") \n  feat <- c(3:9)    \n## feature columns - for clustering ! 
\n  NmaxCluster <- 8            ## upper boundary for the consistency measure calculation, \n                              ## or the number of clusters of the best cluster result \n  cm = TRUE                   ## TRUE: consistency measure calculation; FALSE: only clustering with the best cluster number \n## \n########################## \n \nnamIndFile = paste(namIndDir,namIndicat,sep=\"\")    ## reading data ~~~~~~~~~~~~~~~~~ \nclus_dat <- matrix(scan(namIndFile,sep=\"\"),ncol=colIndFile,byrow=TRUE) \nclu_dim  <- dim(clus_dat)   \n \nis.na(clus_dat) <- which(clus_dat==-9999,arr.ind=TRUE) ## erase missing values ~~~~~~~~~~~ \nclus_dat <- na.omit(clus_dat) \n \nx11(7,4);par(mar=c(2.1,4,2.3,0.5),mfrow=c(3,3))  \n## Histogram of Cluster Data ~~~~~ \nfor (i in feat)  hist(clus_dat[,i],main=featurenames[i]) \n \nmaster <- if (cm) 200 else 50 \nclus_res <- consistency(clus_dat[,feat],NmaxCluster,master,cm) ## Clustering ~~~~~~~~~~ \n \n## Plotting of Result ~~~~~~~~~~~~~~ \nif(cm) {x11(6,4);plot(c(2:NmaxCluster),clus_res$MeanC[2:NmaxCluster],cex.main=0.9,xlab=\"# Cluster\",ylab=paste(master,\"-Loops\"),panel.first=grid())} else { \n \nclus_graphs(clus_res$Gold,NmaxCluster,clu_dim)} \n \n \nAppendix C: Data for comparing clustering methods \n \n(see http://www.ima.umn.edu/~iwen/REU/REU_cluster.html#code) \nMatlab code for generating random datasets  \n \n• An example `.m' file that creates a 2D dataset with 3 clusters. It can also be \nmodified to generate other artificial data (with different numbers of clusters, \ndimensions, and underlying distributions).  \n• The following matlab package contains a file called \"generate_samples.m\" for \ngenerating hybrid linear models. It is part of the larger GPCA package. In order to \navoid intersection of subspaces (so that standard clustering could be applied) one \nneeds to set the parameter avoidIntersection = TRUE (and also have affine \nsubspaces instead of linear).  
\n \n \nOther data and data repositories  \n \n• Clustering datasets at UCI Repository  \n• Complete UCI Machine Learning Repository  \n• Yale Face Database B  \n• Some processed face datasets saved as Matlab data can be found here. Two \nmatrices, X and Y, are included. If you plot Y(1:3,:) you will see three clearly \nseparated clusters. The first 64 points are in one cluster, the next 64 points in \nanother cluster, etc. The original files are on the Yale Face Database B webpage \n(above). The folder names are yaleB5_P00, yaleB8_P00, yaleB10_P00. They \nhave been processed following the steps described in Section 4.2.2 of the \nfollowing paper. The matlab code used for processing them is here.  \n• Here is an example of spectral clustering data. It contains points from 2 noisy \ncircles: after loading the `.mat' file type \"plot(X(:,1),X(:,2),'LineStyle','.');\" to see \nthem. You can embed them into 2D space for clustering with EmbedCircles.m. \nNote that changing sigma in this file will lead to different problems. \n• See also http://dbkgroup.org/handl/generators/ \n \n \nAppendix D: On determining variable importance for \nclustering \n \nA plethora of methods has been proposed to select informative subsets of variables/features in \nthe context of clustering analysis, as illustrated by recent literature on feature/variable \nselection (cf. Saeys et al., 2007, Steinley and Brusco, 2008b, Varshavsky et al. 2006, 2006). \n \nBelow we discuss three straightforward (univariate) methods which can be applied easily to \nexpress variable importance in a clustering context. In presenting the methods, we restrict \nourselves to continuous variables. \nWe note beforehand that the proposed techniques are univariate and consider each variable \nseparately, thereby ignoring variable dependencies. This may lead to worse clustering \nperformance when compared to other more advanced feature selection techniques (see e.g. \nSaeys et al. 2007).  \n \n \n \nA. 
ANOVA-based method (for complete cluster-partitioning) \n \nThis method is based on comparing what a specific variable/feature contributes to the within-cluster \nvariability as compared to the between-cluster variability. The resulting importance index \nis expressed as the ratio BSS(j)/WSS(j) (see also Dudoit et al., 2002), defined by equation (1), \nwhere BSS refers to the between-cluster sum of squares and WSS to the within-cluster sum of \nsquares. The ratio is used as an indication of the contribution of the variable j to the \noverall clustering.  \nHere j refers to the features/variables, k to the clusters, and i to the $N_k$ objects within the k-th \ncluster. $x_{k,i}(j)$ refers to the value of the j-th variable (feature/component) of object i in cluster \nk; $x_{..}(j)$ refers to the j-th component of the overall mean (population mean), while $x_{k.}(j)$ \nrefers to the j-th component of the cluster-mean of the k-th cluster.  \n \nVariables with the highest BSS(j)/WSS(j) are considered to have the largest ‘explanatory \nperformance’ relative to the ‘unexplained’ part, and are therefore labeled as more important. \nSee also the following textbox, which urges some caution in using this kind of indicator. 
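\n \nAs a small worked illustration of this importance index (the numbers are invented for this sketch and are not taken from the report's data), assume n=2 clusters with N_1=N_2=2 objects and two variables. Variable j=1 takes the values 0, 2 in cluster 1 and 8, 10 in cluster 2; variable j=2 takes the values 1, 3 in both clusters. With the definitions of BSS(j) and WSS(j) in equation (1): \n$$x_{1.}(1)=1, \\quad x_{2.}(1)=9, \\quad x_{..}(1)=5, \\quad BSS(1) = 2(1-5)^2 + 2(9-5)^2 = 64, \\quad WSS(1) = 1+1+1+1 = 4, \\quad BSS(1)/WSS(1) = 16$$\n$$x_{1.}(2)=x_{2.}(2)=x_{..}(2)=2, \\quad BSS(2) = 0, \\quad WSS(2) = 1+1+1+1 = 4, \\quad BSS(2)/WSS(2) = 0$$\nIn this toy case variable 1 would be labeled as the important one, whereas variable 2 does not separate the two clusters at all. 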
\n \n$$\\frac{BSS(j)}{WSS(j)} = \\frac{\\sum_{k=1}^{n} \\sum_{i=1}^{N_k} \\left( x_{k.}(j) - x_{..}(j) \\right)^2}{\\sum_{k=1}^{n} \\sum_{i=1}^{N_k} \\left( x_{k,i}(j) - x_{k.}(j) \\right)^2} \\qquad (1)$$\n \n \nRemark: On the relation with ANOVA: \n (a) Note that the total sum of squares can be written as the sum of the sums of squares of all \nvariables/components, and be split into a within- and between-cluster part: \n$$TSS = \\sum_{j=1}^{p} TSS(j) = \\sum_{j=1}^{p} \\sum_{k=1}^{n} \\sum_{i=1}^{N_k} \\left( x_{k,i}(j) - x_{..}(j) \\right)^2$$\n$$= \\sum_{j=1}^{p} \\sum_{k=1}^{n} \\sum_{i=1}^{N_k} \\left( x_{k,i}(j) - x_{k.}(j) + x_{k.}(j) - x_{..}(j) \\right)^2$$\n$$= \\sum_{j=1}^{p} \\sum_{k=1}^{n} \\sum_{i=1}^{N_k} \\left( x_{k,i}(j) - x_{k.}(j) \\right)^2 + \\sum_{j=1}^{p} \\sum_{k=1}^{n} \\sum_{i=1}^{N_k} \\left( x_{k.}(j) - x_{..}(j) \\right)^2$$\n$$= \\sum_{j=1}^{p} \\left[ WSS(j) + BSS(j) \\right] = \\sum_{j=1}^{p} WSS(j) + \\sum_{j=1}^{p} BSS(j)$$\n(the cross terms vanish because $\\sum_{i=1}^{N_k} ( x_{k,i}(j) - x_{k.}(j) ) = 0$ for every cluster k), \nwhere BSS(j) refers to the explained part and WSS(j) to the unexplained part of the sums of squares. \nThe k-means method is intended to minimize the total within-sum of squares WSS (= Σj WSS(j)) \n(unexplained) and thus in fact maximizes the in-between differences BSS (= Σj BSS(j)) (explained). \nThis however does not imply that the various components WSS(j) are minimized individually (or, \nequivalently, that the BSS(j) are maximized individually), since trade-offs between the various WSS(j) can \nbe involved in minimizing their sum. \n \n(b) The ratio BSS(j)/WSS(j) is in fact directly related to the F-ratio in the context of an ANOVA for the \nspecific j-th variable $x_{k,i}(j)$. 
The F-ratio is $MSS_{between}(j) / MSS_{within}(j)$, where the various mean sums of squares are defined as \n$MSS_{between}(j) = BSS(j)/(n-1)$ and $MSS_{within}(j) = WSS(j)/(N-n)$, with $N = \\sum_{k=1}^{n} N_k$. \nThe F-ratio test is applied to test whether the underlying cluster-means $\\mu_{k.}(j)$ of $x_{k,i}(j)$ are all equal \nfor k=1, …, n, in which case F should be nearly equal to 1. Notice that BSS(j)/WSS(j)=(n-1)/(N-n) × F. \n \n(c) One should however be careful not to interpret this ratio completely in terms of ANOVA, since the \nunderlying assumptions for ANOVA (concerning independence, normality and equal variance) are \ntypically not valid in a clustering context where the clusters have been determined deliberately so as to \nminimize the within sum-of-squares (cf. Milligan and Mahajan (1980), Milligan and Cooper (1987)). \nCompare also Hartigan (1975) and Aldenderfer and Blashfield (1976) who illustrate the statistical \ninappropriateness of the use of (M)ANOVAs for indicating existence of clusters. \n \n \nB. t-test based method (cluster-wise) \n \nAnother way to express the variable importance of the j-th variable in a specific cluster is by \nusing the t-statistic, in fact checking to what extent the mean-value of the specific variable, \nwhen constrained to this cluster, differs from the overall mean-value. The corresponding 
The corresponding \nimportance index can be expressed as29:  \n \n                                                     \n29 As implemented in the TwoStep cluster method in SPSS.\n\n86 \n \n)\n2\n(\n)\n(\nˆ\n)\n(\n)\n(\n)\n(\n..\n.\nj\ns\nj\nx\nj\nx\nj\nt\nk\nk\nk\n−\n=\n \n \n \nwhere \n)\n(\nˆ\nj\nsk\n is the standard deviation, defined as: \n \n)\n3\n(\n)1\n(\n))\n(\n)\n(\n(\n)\n(\nˆ\n1\n2\n.\n,\n−\n−\n= \n=\nk\nN\ni\nk\ni\nk\nk\nN\nj\nx\nj\nx\nj\ns\nk\n \n \nThe idea is that the importance of a variable for a cluster can be measured by the absolute \nvalue of this t-statistic, where variables with larger absolute t-statistics are considered as more \nimportant then variables for which the t-statistic is smaller. This measure is therefore initially \nrelated to a specific cluster (cluster-wise). A measure for the overall importance of the j-th \nvariables for all clusters can e.g. be obtained by summing the absolute value |tk(j)| for all \nclusters k=1, …, n. Another possibility is to consider the maximum-value of the |tk(j)| over all \nclusters k=1, …, n., as a measure for the variable importance. See also Gat-Viks et al. (2003) \nwho apply an ANOVA based test of equality of means amongst the cluster members. \n \n \nC. ‘Fraiman’ index (for complete cluster-partitioning) \n \nFraiman et al. (2008) propose to ‘blind’ (subset of) variables, by fixing them at their mean-\nvalue, and to repeat the clustering analysis subsequently. Then the pairwise agreement (e.g. \nby means of the adjusted Rand index introduced by Hubert and Arabie (1985)) is determined \nbetween the partition thus obtained and the original partition with all variables fully included. \nThis index serves as an indication for the importance of the blinded variable(s). The adjusted \nRand index is a value between 0 and 1, where large values (near 1) mean that there is a large \nagreement between the partitions with and without blinding the specific variable. 
To identify \nthe most important variables one therefore should look for variables with small Fraiman-indices. \n \nFraiman-measure to identify the importance of the different variables for the total cluster \npartition (low values indicate high importance). \n \nFraiman et al. (2008) show that this univariate procedure will falter if there are strong \ncorrelations between variables, since the effects of omitting one variable will be compensated \nby the other (non-blinded) related variable. This will typically result in a large agreement of \nthe clustering partitions in the blinded and non-blinded case. \n \nTherefore, in case of dependencies Fraiman et al. (2008) propose an alternative measure, \nwhere the blinded variable is not replaced by its marginal mean, but by its conditional mean \nover the set of other (non-blinded) variables.  \n \n \nIntermezzo: Promising alternatives \n \n“Ensemble learning” methods that generate many classifiers and aggregate their results have \nbeen proposed during the last decade as efficient methods for analyzing the structure in data. \nEspecially the procedure of random forests (RF), which uses a multitude of regression trees \non different bootstrap samples of the data (cf. Breiman (2001)) is a popular and user-friendly \nmethod. This method renders a measure for the variable importance of the involved \n(predictor) variables, and gives also a measure of the internal structure of the data (proximity \nof different data points to one another).  \nAlthough this method was first established for classification and regression problems (i.e. \nforms of supervised learning) the random-forest idea can also be applied for clustering \npurposes (unsupervised learning). The trick for this is to distinguish two datasets: the original \ndataset is called “class 1”, while a synthetic dataset, using information on the marginal \ndistributions of the original data, is constructed which is called “class 2”. 
Next one uses the \nrandom-forest machinery to classify the combined data. The underlying \nidea is that real data points that are similar to one another will tend to be classified in the same \nterminal node of the tree, as measured by the proximity matrix that can be returned using the \nRF-technique. Thus the proximity matrix can be taken as a similarity measure [30], which can be \napplied for dividing the original matrix into groups for visual exploration on basis of \nclustering or multi-dimensional scaling. See Liaw and Wiener (2002) for a worked example of \nhow to perform this analysis with the randomForest package in R.  \nAlong similar lines this method has been further applied and analysed by Horvath and Shi in a \nseries of papers (Shi et al. 2005, 2006). They underline the attractiveness of the method since \nit enables handling mixed variable types, is invariant to monotonic transformation of the input \nvariables and is robust to outlying observations. Moreover the RF-based dissimilarity easily \ndeals with a large number of variables. \n \nThe above reframing of clustering in terms of the random forest procedure offers a link to recent \ninteresting literature (Strobl et al. 2007, 2008) on measuring the importance of variables in a \nrandom forest context explicitly accounting for the (conditional) effects of correlated \nvariables. These results suggest ways to do this also for clustering, but this will not be worked \nout here. See also R-software like part(y)itioning (Hothorn et al. 2006) which can be applied \nin this context. \n \nAnother interesting related approach which deserves further exploration is offered by Questier \net al. (2005) and Smyth et al. (2006a), who put forward an extension of classification and \nregression trees, namely multivariate regression trees [31], for (supervised and unsupervised) \nfeature selection as well as for cluster analysis. 
The idea is to use the original data (x) as \nexplanatory variables (x) and also as response variables (y=x), giving rise to so-called Auto-Associative \nMultivariate Regression Trees. The suitability of this approach for clustering is \nfurther explored in Smyth et al. (2006b), while in Smyth et al. (2007) proposals are given to \nenhance the performance of the method by weighting the resulting cluster ensemble \nappropriately on basis of the prediction quality of the individual model. Also suggestions are \ngiven for determining the variable importance and the number of clusters. For R-software on \nmultivariate regression trees see the CRAN package mvpart [32].  \n \n[30] Concerning this similarity measure provided by the random forest method, one should realize that \nideally the choice of the (dis)similarity measure should be determined by the kind of patterns \none hopes to find, so that there are situations where other dissimilarities are preferable. \n[31] R-software has been developed for multivariate regression trees, namely MVPART. \n \n \n \nAppendix E: Commonly used internal validation \nindexes \n \n \nIn the sequel we present various internal validation indices (see also Günter and Bunke, \n2003): \n \n• \nSilhouette index: this composite index reflects the compactness and separation of clusters. \nA larger Silhouette index indicates a better overall quality of the clustering result \n(Kaufman & Rousseeuw, 1990). \nThe Silhouette index (SI) calculates for each point a width depending on its membership \nin any cluster. 
This silhouette index is then the average of the silhouette widths of all \npoints/objects: \n$$SI = \\frac{1}{N} \\sum_{i=1}^{N} \\frac{b_i - a_i}{\\max(a_i, b_i)}$$\nwhere $b_i$ is the minimum of the average distances between the specific point i and the \npoints in the other clusters, and $a_i$ is the average distance between the point i and all other \npoints in the cluster of which i is a member. The values s(i)=[b(i)-a(i)]/max[a(i),b(i)] vary \nbetween -1 and 1, where values close to -1 mean that the point is on average closer to \nanother cluster than the one it belongs to, in fact indicating that the object i is \n‘misclassified’. Values close to 1 mean that the average distance to its own cluster is \nsignificantly smaller than to any other cluster, indicating that object i is ‘well classified’. \nWhen the width is near zero it is not clear whether the object should have been assigned \nto its current cluster or to the neighbouring cluster. The higher the silhouette index, the \nmore compact and separated are the clusters. Kaufman and Rousseeuw (1990) give \nguidance for the desirable size of the silhouette width; they consider a reasonable \nclassification to be characterized by an average silhouette width above 0.5. An average \nsilhouette width below 0.2 should be interpreted as a lack of substantial cluster structure. \n• \nDavies-Bouldin index: This measure tries to maximize the between-cluster distance while \nminimizing the distance between the cluster centroid and the other points. It expresses the \naverage similarity between each cluster and its most similar one. Small values correspond \nto clusters that are compact and have well-separated centres. Therefore its minimum value \ndetermines the optimal number of clusters. 
\n• \nCalinski-Harabasz index: This index measures the between-cluster isolation and the \nwithin-cluster compactness, in terms of: \n$$CH(K) = \\frac{Trace(S_B)/(K-1)}{Trace(S_W)/(N-K)}$$\nwith N being the number of objects and $S_B$ and $S_W$ being the between- and within-class \nscatter matrices \n$$S_W = \\sum_{i=1}^{K} \\sum_{j=1}^{N} \\gamma_{ij} (x_j - m_i)(x_j - m_i)^T ; \\qquad S_B = \\sum_{i=1}^{K} N_i (m_i - m)(m_i - m)^T$$\nwhere Г={γij} is a partition matrix, with γij =1 if xj belongs to cluster i and 0 otherwise, \nwhere moreover $\\sum_{i=1}^{K} \\gamma_{ij} = 1$ for all j. M=[m1, …, mK] is the cluster prototype or centroid \nmatrix, m is the overall mean of the data, and $m_i = \\frac{1}{N_i} \\sum_{j=1}^{N} \\gamma_{ij} x_j$ is the mean for the i-th cluster with Ni objects. The optimal \nnumber of clusters is determined by maximizing the CH-index. \n(Footnote 32: http://cran.nedmirror.nl/web/packages/mvpart/index.html) \n• \nDunn index: this index is defined as the ratio between the minimum distance between two \nclusters and the size of the largest cluster. Depending on the choice of the distance \nmeasure and the size of the cluster, various Dunn indices can be defined. Maximizing this \nindex reflects to a certain extent the maximization of the inter-cluster distances while \nsimultaneously minimizing the intra-cluster distances.  \n• \nRMSSTD index (Root Mean Square Standard Deviation): This index is designed for \nhierarchical clustering, but can equally well be used for any clustering algorithm, and \nmeasures the homogeneity of the formed clusters (or the variance of clusters) at each step \nof the hierarchical clustering algorithm. A lower RMSSTD value indicates better \nclustering. \n• \nC index: This index (Hubert and Schultz, 1976) is defined as follows: \n$$C = \\frac{S - S_{min}}{S_{max} - S_{min}}$$\nwhere S is the sum of distances over all pairs of objects from the same cluster. 
Let r be \nthe number of those pairs. Then Smin is the sum of the r smallest distances if all pairs of \nobjects are considered (i.e. also objects that can belong to different clusters). Similarly \nSmax is the sum of the r largest distances out of all pairs. Hence a small value of C \nindicates a good clustering. \n• \nMaulik-Bandyopadhyay index: This index is a combination of three terms \n$$MB_k = \\left( \\frac{1}{k} \\cdot \\frac{E_1}{E_k} \\cdot D_k \\right)^p$$\nwhere the intra-cluster distance is defined by $E_k = \\sum_{i=1}^{k} \\sum_{x \\in c_i} \\lVert x - z_i \\rVert$ (with $E_1$ its value for k=1) and the inter-cluster \ndistance by $D_k = \\max_{i,j=1,\\ldots,k} \\lVert z_i - z_j \\rVert$, where zi is the centre of cluster ci. p is chosen to be \ntwo and the number of clusters k is determined by maximizing MBk. \n• \nThe Cophenetic correlation coefficient (CPCC) is an index to validate hierarchical \nclustering structures, and is based on the proximity matrix P={pij} of the data X. It \nmeasures the degree of similarity between P and the cophenetic matrix Q={qij}, the \nelements of which express the proximity level where pairs of data points are grouped in \nthe same cluster. \nCPCC is defined as: \n$$CPCC = \\frac{\\frac{1}{M} \\sum_{i=1}^{N-1} \\sum_{j=i+1}^{N} p_{ij} q_{ij} - \\mu_P \\mu_Q}{\\sqrt{\\left( \\frac{1}{M} \\sum_{i=1}^{N-1} \\sum_{j=i+1}^{N} p_{ij}^2 - \\mu_P^2 \\right) \\left( \\frac{1}{M} \\sum_{i=1}^{N-1} \\sum_{j=i+1}^{N} q_{ij}^2 - \\mu_Q^2 \\right)}}$$\nwhere μP and μQ are the means of P and Q: \n$$\\mu_P = \\frac{1}{M} \\sum_{i=1}^{N-1} \\sum_{j=i+1}^{N} p_{ij} ; \\qquad \\mu_Q = \\frac{1}{M} \\sum_{i=1}^{N-1} \\sum_{j=i+1}^{N} q_{ij}$$\nwith M=N(N-1)/2. The value of CPCC lies in the range of [-1,1] with an index value \nclose to 1 indicating a significant similarity between P and Q. However for group average \nlinkage (UPGMA) even large CPCC values (such as 0.9) cannot assure sufficient \nsimilarity between the two matrices. 
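\n \nAs a small worked illustration of the Silhouette and Calinski-Harabasz indices (a sketch with invented one-dimensional data, absolute differences as distances, and the standard forms of both indices; none of the numbers come from the report), take N=4 points in K=2 clusters, {0, 2} and {8, 10}. For the point 0: a = |0-2| = 2 and b = (8+10)/2 = 9; for the point 2: a = 2 and b = (6+8)/2 = 7; the points 10 and 8 mirror these values. The cluster means are 1 and 9 and the overall mean is 5. \n$$s(0) = s(10) = \\frac{9-2}{9} = \\frac{7}{9}, \\quad s(2) = s(8) = \\frac{7-2}{7} = \\frac{5}{7}, \\quad SI = \\frac{1}{4} \\left( 2 \\cdot \\frac{7}{9} + 2 \\cdot \\frac{5}{7} \\right) = \\frac{47}{63} \\approx 0.75$$\n$$Trace(S_B) = 2(1-5)^2 + 2(9-5)^2 = 64, \\quad Trace(S_W) = 1+1+1+1 = 4, \\quad CH(2) = \\frac{64/(2-1)}{4/(4-2)} = 32$$\nThe average silhouette width of about 0.75 lies above the 0.5 guidance value of Kaufman and Rousseeuw (1990) quoted under the Silhouette index. 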
\nRemark: Also for fuzzy clustering internal validation indices have been proposed, such as \nthe partition coefficient (PC) and partition entropy (PE), the (extended) Xie-Beni index and \nthe Fukuyama-Sugeno index, cf. Pal and Bezdek (1995), Hammah and Curran (2000), Wu and \nYang (2005); cf. section 10.4.3 in Xue and Wunsch (2008). Wang and Zhang (2007) \nperformed an extensive evaluation of the fuzzy clustering indices, while Zhang et al. (2008) \ntested a newly proposed index. They conclude that cluster validation is a very difficult task \nand that ‘no matter how good your index is, there is a dataset out there waiting to trick it (and \nyou)’ (Pal and Bezdek (1997)). Wu et al. (2009) recently analysed the robustness of the cluster \nindices for noise and outliers, and proposed ways to robustify them.\n\n
97 \nA New Projection Method for the Zero Froude Number Shallow Water Equations \nS. Vater (Juni 2005) \nNo. 98 \nTable of EMICs - Earth System Models of Intermediate Complexity \nM. Claussen (ed.) (Juli 2005) \nNo. 99 \nKLARA - Klimawandel - Auswirkungen, Risiken, Anpassung \nM. Stock (Hrsg.) (Juli 2005) \nNo. 100 \nKatalog der Großwetterlagen Europas (1881-2004) nach Paul Hess und Helmut Brezowsky \n6., verbesserte und ergänzte Auflage \nF.-W. Gerstengarbe, P. C. Werner (September 2005) \nNo. 101 \nAn Asymptotic, Nonlinear Model for Anisotropic, Large-Scale Flows in the Tropics \nS. Dolaptchiev (September 2005) \nNo. 102 \nA Long-Term Model of the German Economy: lagomd_sim \nC. C. Jaeger (Oktober 2005) \nNo. 103 \nStructuring Distributed Relation-Based Computations with SCDRC \nN. Botta, C. Ionescu, C. Linstead, R. Klein (Oktober 2006) \nNo. 104 \nDevelopment of Functional Irrigation Types for Improved Global Crop Modelling \nJ. Rohwer, D. Gerten, W. Lucht (März 2007) \nNo. 105 \nIntra-Regional Migration in Formerly Industrialised Regions: Qualitative Modelling of Household \nLocation Decisions as an Input to Policy and Plan Making in Leipzig/Germany and \nWirral/Liverpool/UK \nD. Reckien (April 2007) \nNo. 106 \nPerspektiven der Klimaänderung bis 2050 für den Weinbau in Deutschland (Klima 2050) - \nSchlußbericht zum FDW-Vorhaben: Klima 2050 \nM. Stock, F. Badeck, F.-W. Gerstengarbe, D. Hoppmann, T. Kartschall, H. Österle, P. C. Werner, \nM. Wodinski (Juni 2007) \nNo. 107 \nClimate Policy in the Coming Phases of the Kyoto Process: Targets, Instruments, and the Role \nof Cap and Trade Schemes - Proceedings of the International Symposium, February 20-21, \n2006, Brussels \nM. Welp, L. Wicke, C. C. Jaeger (eds.) (Juli 2007) \nNo. 108 \nCorrelation Analysis of Climate Variables and Wheat Yield Data on Various Aggregation Levels \nin Germany and the EU-15 Using GIS and Statistical Methods, with a Focus on Heat Wave \nYears \nT. Sterzel (Juli 2007) \nNo. 
109 \nMOLOCH - Ein Strömungsverfahren für inkompressible Strömungen - Technische Referenz 1.0 \nM. Münch (Januar 2008) \nNo. 110 \nRationing & Bayesian Expectations with Application to the Labour Market \nH. Förster (Februar 2008) \nNo. 111 \nFinding a Pareto-Optimal Solution for Multi-Region Models Subject to Capital Trade and  \nSpillover Externalities \nM. Leimbach, K. Eisenack (November 2008) \nNo. 112 \nDie Ertragsfähigkeit ostdeutscher Ackerflächen unter Klimawandel \nF. Wechsung, F.-W. Gerstengarbe, P. Lasch, A. Lüttger (Hrsg.) (Dezember 2008) \nNo. 113 \nKlimawandel und Kulturlandschaft Berlin \nH. Lotze-Campen, L. Claussen, A. Dosch, S. Noleppa, J. Rock, J. Schuler, G. Uckert  \n(Juni 2009) \nNo. 114 \nDie landwirtschaftliche Bewässerung in Ostdeutschland seit 1949 - Eine historische Analyse vor \ndem Hintergrund des Klimawandels \nM. Simon (September 2009)\n\nNo. 115 \nContinents under Climate Change - Conference on the Occasion of the 200th Anniversary of the \nHumboldt-Universität zu Berlin, Abstracts of Lectures and Posters of the Conference, \nApril 21-23, 2010, Berlin \nW. Endlicher, F.-W. Gerstengarbe (eds.) (April 2010) \nNo. 116 \nNach Kopenhagen: Neue Strategie zur Realisierung des 2°max-Klimazieles \nL. Wicke, H. J. Schellnhuber, D. Klingenfeld (April 2010) \nNo. 117 \nEvaluating Global Climate Policy - Taking Stock and Charting a New Way Forward \nD. Klingenfeld (April 2010) \nNo. 118 \nUntersuchungen zu anthropogenen Beeinträchtigungen der Wasserstände am Pegel \nMagdeburg-Strombrücke \nM. Simon (September 2010) \nNo. 119 \nKatalog der Großwetterlagen Europas (1881-2009) nach Paul Hess und Helmut Brezowsky \n7., verbesserte und ergänzte Auflage \nP. C. Werner, F.-W. Gerstengarbe (Oktober 2010) \nNo. 120 \nEnergy taxes, resource taxes and quantity rationing for climate protection \nK. Eisenack, O. Edenhofer, M. Kalkuhl (November 2010) \nNo. 121 \nKlimawandel in der Region Havelland-Fläming \nA. Lüttger, F.-W. Gerstengarbe, M. Gutsch, F. 
Hattermann, P. Lasch, A. Murawski, \nJ. Petraschek, F. Suckow, P. C. Werner (Januar 2011) \nNo. 122 \nAdaptation to Climate Change in the Transport Sector: A Review \nK. Eisenack, R. Stecker, D. Reckien, E. Hoffmann (Mai 2011) \nNo. 123 \nSpatial-temporal changes of meteorological parameters in selected circulation patterns \nP. C. Werner, F.-W. Gerstengarbe (November 2011) \nNo. 124 \nAssessment of Trade-off Decisions for Sustainable Bioenergy Development in the Philippines: \nAn Application of Conjoint Analysis  \nL. A. Acosta, D. B. Magcale-Macandog, W. Lucht, K. G. Engay, M. N. Q. Herrera, \nO. B. S. Nicopior, M. I. V. Sumilang, V. Espaldon (November 2011) \nNo. 125 \nHistorisch vereinbarte minimale mittlere Monatsabflüsse der Elbe im tschechisch-deutschen \nGrenzprofil bei Hřensko/Schöna – Eine Analyse der Niedrigwasseraufhöhung im Grenzprofil \ninfolge des Talsperrenbaus im tschechischen Einzugsgebiet der Elbe \nM. Simon, J. Böhme (März 2012) \nNo. 126 \nCluster Analysis to Understand Socio-Ecological Systems: A Guideline \nP. Janssen, C. Walther, M. Lüdeke (September 2012)"},"speaker_notes":"PIK  Report\nNo. 126\nFOR\nPOTSDAM INSTITUTE\nCLIMATE IMPACT RESEARCH (PIK)\nCLUSTER ANALYSIS TO UNDERSTAND\nSOCIO-ECOLOGICAL SYSTEMS:\nA GUIDELINE\nPeter Janssen, Carsten Walther, Matthias Lüdeke\n\nHerausgeber:\nProf. Dr. F.-W. Gerstengarbe\nTechnische Ausführung:\nU. Werner\nPOTSDAM-INSTITUT\nFÜR KLIMAFOLGENFORSCHUNG\nTelegrafenberg\nPostfach 60 12 03, 14412 Potsdam\nGERMANY\nTel.:\n+49 (331) 288-2500\nFax:\n+49 (331) 288-2600\nE-mail-Adresse:pik@pik-potsdam.de\nAuthors:\nDipl.-Phys. Carsten Walther\nDr. Matthias Lüdeke\nPotsdam Institute for Climate Impact Research\nP.O. Box 60 12 03, D-14412 Potsdam, Germany\nDr. Peter Janssen *\nNetherlands Environment Assessment Agency (PBL)\nP.O. 
Box 1, 3720 BA Bilthoven, The Netherlands\nE-Mail: peter.janssen@pbl.nl\n* corresponding author\nPOTSDAM, SEPTEMBER 2012\nISSN 1436-0179\nThis report is the result of a joint study between the PBL Netherlands Environmental\nAssessment Agency and PIK.\n\nAbstract \n \nIn coupled human-environment systems, where well established and proven general \ntheories are often lacking, cluster analysis provides the possibility to discover \nregularities, a first step in empirically based theory building. The aim of this report is \nto share the experience and knowledge on cluster analysis that we gained in several \napplications in this realm, and so to help avoid typical problems and pitfalls. In our \ndescription of issues and methods we highlight well-known mainstream methods \nas well as promising new developments, referring to pertinent literature for further \ninformation, and thus also offer some potential new insights for the more experienced.  \nThe following aspects are discussed in detail: data selection and pre-treatment, \nselection of a distance measure in the data space, selection of a clustering method,  \nperforming the clustering (parameterizing the algorithm(s), determining the number of \nclusters etc.) and the interpretation and evaluation of results. As far as tools for \nperforming the analysis are concerned, we link our description to the R software \nenvironment and its associated cluster analysis packages. We have used this public \ndomain software, together with our own tailor-made extensions, which are documented in the \nappendix.\n\nContents \n \n1. Introduction \n2. Data selection and pre-treatment \n3. Selection of a distance measure in the data space \n4. Selection of clustering method \n5. How to measure the validity of a cluster? \n6. Graphical representation of the results \n7. 
References \n \nAppendix A: The R software environment \nAppendix B: Cluster analysis in R \nAppendix C: Data for comparing clustering methods \nAppendix D: On determining variable importance for clustering \nAppendix E: Commonly used internal validation indexes \n \nAcknowledgements \nThis report is the result of a joint study between the PBL Netherlands Environmental \nAssessment Agency and PIK, as part of a wider research effort to quantitatively \nanalyse patterns of vulnerability. The authors would like to acknowledge the feedback and \ninputs that Henk Hilderink, Marcel Kok, Paul Lucas, Diana Sietz, Indra de Soysa and \nTill Sterzel provided during the course of the study.\n\n1 Introduction \n \nCluster analysis is a general methodology for exploring datasets when little or no \nprior information is available on the data’s inherent structure. It is used to group data \ninto classes (groups or clusters) that share similar characteristics, and is widely used \nin behavioural and natural scientific research for classifying phenomena or objects \nunder study without predefined class definitions. In particular, in coupled human-environment \nsystems, where well established and proven general theories are still \nlacking, cluster analysis provides the possibility to discover regularities, a first step in \nempirically based theory building. A recent example is its application in assessing \nthe vulnerability of human wellbeing to global change (Sietz et al., 2011; \nKok et al., 2010). The aim of this report is to share the experience and knowledge on \ncluster analysis that we gained in these applications, and so to help avoid typical problems and \npitfalls.  
\nA broad collection of clustering methods has been proposed in areas such as statistics, data \nmining, machine learning and bioinformatics. Many textbooks and overview papers \nillustrate the variety of methods as well as the vigorous interest in this field over the \nlast decade, with the growing availability of computer power for analysing extensive \ndatasets or data objects involving many attributes (i.e. finding clusters in high-dimensional \nspace, where the data points can be sparse and highly skewed). There are many books on \ncluster analysis, e.g. Aldenderfer and Blashfield (1976), Jain and \nDubes (1988), Kaufman and Rousseeuw (1990), Gordon (1999), Hastie et al. (2001), \nEveritt, Landau and Leese (2001), Mirkin (2005) and Xu and Wunsch (2009). The same \nholds for overview papers, see e.g. Jain, Murty and Flynn (1999), Omran, \nEngelbrecht and Salman (2005), Xu and Wunsch (2005) and Wunsch and Xu (2008).  \n \nIn this report we highlight the major steps in the cluster analysis process and, \nas far as tools for performing the analysis are concerned, link them to the R software \nenvironment and its associated cluster analysis packages (see appendices A and B). We \nhave used this public domain software, together with our own tailor-made extensions, to \nperform cluster analysis for identifying patterns of vulnerability to global \nenvironmental change (Kok et al., 2010), as part of a joint study of the PBL \nNetherlands Environmental Assessment Agency, PIK and the Norwegian University \nof Science and Technology. Examples from this study will be used as illustrative \nmaterial in the present report. \n \nBeyond this specific background, the report is set up in more general terms, and can \nbe used by novices in the field of cluster analysis, as well as by people who already \nhave some working experience with the method but want to extend their ability to \nperform cluster analyses.  
\n \nIn our description of issues and methods we highlight well-known mainstream \nmethods as well as promising new developments, referring to pertinent literature for \nfurther information, and thus also offer some potential new insights for the more \nexperienced. We do not extensively consider cluster analysis methods which \nexplicitly account for spatial and/or temporal aspects of the data, but only briefly \ntouch upon them.\n\n1.1 Outline of the report \n \nOur exposition is based to a large extent on the excellent book by Everitt, Landau \nand Leese (2001) on clustering and on Han and Kamber’s book on data mining, which \ncontains a concise chapter on cluster analysis (Han and Kamber, 2006, chapter 7). In \ndiscussing cluster analysis we divide the clustering process into a number of \nlogical steps: \n \n• Data selection and pre-treatment: In general this concerns the selection of \ndata of interest for the problem at hand and the treatment of missing values and \noutliers. Optionally it also involves dimension reduction by selecting variables or \nextracting relevant features from the data, the use of data transformations to bring \nthe data values to a more even scale and the standardization of data to make them \nmutually more comparable. These forms of data processing can influence the \noutcomes of the clustering to a large extent, and should therefore be chosen with \ndue consideration. \n• Selection of a distance measure in the data space: In order to express the \nsimilarity or dissimilarity between data points, a suitable distance measure (metric) \nshould be chosen. It forms the basis for performing the clustering to identify \ngroups which are tightly knit, but (preferably) distinct from each other \n(Kettenring, 2006). Often Euclidean distance is used as a metric, but various other \ndistance measures can be envisioned as well.  
\n• Selection of clustering method: The extensive and ever-growing literature on \nclustering illustrates that there is no such thing as an optimal clustering method. \nWe will group the multitude of methods into a restricted number of classes, and \nwill especially focus on two commonly used classes: one is based on \nperforming the clustering hierarchically, while the other consists of constructively \npartitioning the dataset into a number of clusters, using the k-means method. The \nother classes will be briefly discussed with due reference to literature for further \ninformation. \n• Performing clustering: This involves parameterising the selected clustering \nalgorithm(s) (e.g. choosing starting points for the partitioning method), \ndetermining the number of clusters, and computing the resulting clustering \npartition for these settings. Determining the number of \nclusters is an especially important issue, and we will highlight a general approach which we \napplied in our vulnerability assessment study. \n• Interpretation and evaluation of results: This concerns in the first place a \ndescription of the clustering in terms of cluster characteristics. Moreover, in order \nto use the clustering results, the characteristics and meaning of the various \nclusters have to be interpreted in terms of subject matter, which often involves a \nprocess of knowledge building, hypothesis setting and testing, going back and \nforth between the clustering results and the underlying knowledge base. \nFinally, evaluation also includes a study of the sensitivity of the clustering results \nto the choices made during the various steps of the cluster analysis, e.g. \nconcerning the data selection and pre-treatment, selection of clustering method \netc. The effects of uncertainties and errors in the data should also be addressed in \nthis step.\n\nThe various steps are described in more detail in the following chapters. 
In the \nappendices more detailed information is given on the R software and on some specific \nclustering issues. \n \n \nClustering in various contexts (according to Han and Kamber, 2006): \nAs a branch of statistics, cluster analysis has been extensively studied, with a focus on \ndistance-based cluster analysis. Cluster analysis tools based on k-means, k-medoids, \nhierarchical clustering and several other methods have been built into many software \npackages for statistical analysis such as S-Plus, SPSS and SAS. Dedicated software \n(e.g. Wishart’s CLUSTAN (http://www.clustan.com/index.html), Matlab Statistics \ntoolbox) and public-domain packages also abound (see the various R-packages on clustering). \nIn the machine learning context, clustering is an example of unsupervised learning, which \ndoes not rely on predefined classes and class-labeled training data. It is a form of learning \nby observation, rather than learning by examples as in supervised learning (as e.g. in data \nclassification).  \nIn the data mining field, efforts have focused on finding methods for efficient and effective \nanalysis of large databases. Issues such as the scalability of clustering methods, the ability to \ndeal with mixed numerical and categorical data, complex shapes and types of data, high \ndimensionality, the ability to deal with noisy data, to incorporate domain knowledge, to \neasily deal with updates of the databases, and insensitivity to the order of input records are \nimportant requirements for the clustering methods.\n\n2 Data selection and pre-treatment \n \n \nThe main theme in cluster analysis is to identify groups of individuals or objects (i.e. \n‘cases’ or ‘entities’) that are similar to each other but different from individuals or \nobjects in other groups. For this purpose data on the individuals or objects have to be \ncollected, and it is obvious that the data should be characteristic, relevant and of good \nquality to enable a useful analysis. 
\n \n2.1 Data-collection: Some important issues  \n \nThis means in the first place that an adequate number of objects/cases/individuals \nshould be available in the dataset to study the phenomena of interest (e.g. identifying \nsituations that show a similar reaction pattern under certain environmental stresses; \nidentifying subgroups of patients with a diagnosis of a certain disease, on the basis of a \nsymptom checklist and results from medical tests; identifying people with similar \nbuying patterns in order to successfully tailor marketing strategies etc.).  \nMoreover, the researcher should choose the relevant variables/features which \ncharacterize the objects/cases/individuals and on the basis of which the groups should be \nsubdivided into homogeneous subgroups. Milligan (1996) strongly advises being on the \nparsimonious side and to ‘select only those variables that are believed to help \ndiscriminate the clustering in the data’. Adding ‘only one or two irrelevant variables \ncan dramatically interfere with cluster recovery’ (Milligan, 1996).  \nFor further analysis one must also decide, amongst other things, whether to transform or \nstandardize the variables in some way so that they all contribute equally to the \ndistance or similarity between cases. \nFurthermore, data quality is another important issue, which involves various \naspects such as accuracy, completeness, representativeness, consistency, timeliness, \nbelievability, value added, interpretability, traceability and accessibility of the data, \npresence of noise and outliers, missing values, duplicate data etc. (cf. Pipino, Funk and \nWang, 2006).  \n \n2.2 Data-collection: Type of data \n \nAn important distinction when considering the data that has been collected on the \n‘objects’ and their ‘attributes’ (footnote 1) (i.e. properties or characteristics of an object; e.g. 
eye \ncolour of a person, length, weight) is the (measurement) scale which has been used in \nexpressing these attributes: \n \n− Nominal scale: In fact this is not really a scale because numbers are simply used \nas identifiers, or names, e.g. in coding a (no, yes) response as (0,1). The numbers \nas such are mostly meaningless in any quantitative sense (e.g. ID numbers, eye \ncolour, zip codes).  \n− Ordinal scale: The numbers have meaning only in relation to one another, e.g. the \nscales (1, 2, 3), (10, 20, 30) and (1, 20, 300) are in a sense equivalent from an \nordinal viewpoint. Examples of ordinal scale attributes are rankings, grades, or \nexpressing height in {tall, medium, short}-categories. \n− Interval scale: This scale is used to express data in a (continuous) measurement \nscale where the separation between numbers has meaning. A unit of measurement \nexists and the interpretation of the numbers depends on this unit (compare \ntemperature in Celsius or in Fahrenheit). \n− Ratio scale: This is a measurement scale where both an absolute zero and a unit of \nmeasurement exist, such that the ratio between two numbers has meaning (e.g. distance \nin meters, kilometres, miles or inches). \n \nThe first two scales refer more to qualitative variables, and the latter two to quantitative \nvariables (footnote 2). In practice, the attributes characterizing an object can be of mixed type.  \n \nFootnote 1: Concerning terminology: ‘attributes’ are also referred to as variables, features, fields, characteristics. \nA collection of attributes describes an ‘object’. An object is also known as record, point, case, sample, \nentity or instance. These terms are often used interchangeably. \n \nAnother distinction can be made between ‘discrete’ and ‘continuous’ attributes, where \nthe first category refers to variables having a finite or countably infinite set of values \n(e.g. 
zip-code), and can often be represented as integer variables (1, 2, 3, …). Binary \nattributes, taking on the values 0, 1, or “No”, “Yes”, are a special case of discrete \nattributes. Continuous attributes can take values over a continuous range, and have \nreal numbers as attribute values. Notice that in practice real values can only be \nmeasured and represented using a finite number of digits. \n \n2.3 Data pre-processing \n \nSince real data can be incomplete (missing attribute values), noisy (errors or outliers) \nand inconsistent (e.g. duplicates with different values), data pre-processing is an \nindispensable part of the cluster analysis. The major tasks involved in data pre-processing \nare: \n \n− [A] Data cleaning: Filling in missing values, smoothing noisy data, identifying or \nremoving outliers, correcting inconsistencies and resolving redundancies caused \nby integration or merging of data from various sources/databases. \n− [B] Data integration: Integration of multiple databases, files or data cubes (data \nstructures commonly used to describe time series of image data).  \n− [C] Data transformation: Putting data in form(at)s which are appropriate for \nfurther analysis. This includes, for instance, normalization and performing summary or \naggregation operations on the data.  \n− [D] Data reduction: Obtaining a representation of the data that is reduced in volume but \nproduces the same or similar analytical results. \n− [E] Data discretization: Especially for numerical data this denotes a specific \nform of data reduction. \n− [F] Cluster tendency: Determining whether there are clusters in the data. \n− [G] Cluster visualisation: Using graphical techniques can greatly enhance the \nanalysis of the underlying cluster/group-structure in the data. \n \nFootnote 2: We restrict our attention to data which have numerical values, and do not consider symbolic objects. \nSee e.g. 
Ravi and Gowda (1999) for cluster analysis of this category of objects.\n\nIn the sequel we will outline these activities in more detail: \n \n2.3.1 Data cleaning  \n \nVarious techniques for performing data cleaning can be used, of which we only \nbriefly discuss the way missing data and outliers can be handled. Additional dedicated \nmethods for data cleaning originating from the data warehouse literature can e.g. be \nfound in Rahm and Do (2000). \n \n(i) Handling missing data  \nValues can be missing because information was not collected or because attributes are not \napplicable in all cases (e.g. annual income for children). One obvious way of handling \nmissing data is simply eliminating the corresponding data objects, and analysing only \nthat part of the dataset which is complete (called marginalization by Wagstaff and \nLaidler, 2005). This strategy does not lead to the most efficient use of the data and is \nrecommended only in situations where the number of missing values is very small. \nAnother option (called imputation) to deal with missing data is to replace the missing \nvalues by a global constant (e.g. ‘unknown’, a new class) or by an estimate, e.g. the \nmean, median, a most probable value; cf. various forms of data imputation (e.g. mean, \nprobabilistic or nearest neighbourhood imputation (footnote 3), as presented in Wagstaff and \nLaidler, 2005). \n \nFootnote 3: ‘Mean imputation’ involves filling the missing values with the mean of the remaining ones, while \n‘probabilistic imputation’ consists of filling them with a random value drawn from the distribution of the \nfeature. ‘Nearest neighbourhood imputation’ replaces them with value(s) from the nearest neighbour. \n \nJain and Dubes (1988, pages 19-20) recommend, on the basis of experimental results of \nDixon (1979), using an imputation approach which redefines the distance between \ndata points $x_i$ and $x_k$ which contain missing values as follows. First define the distance \n$d_j$ between the two points along the j-th feature as $d_j = 0$ if $x_{ij}$ or $x_{kj}$ is missing, and \n$d_j = x_{ij} - x_{kj}$ otherwise. The distance between $x_i$ and $x_k$ is then defined as: \n \n$$d_{ik} = \\frac{m}{m - m_o} \\sum_{j=1}^{m} d_j^2$$ \n \nwhere $m_o$ is the number of features missing in $x_i$ or $x_k$ or both, and $m$ is the total \nnumber of features. 
$d_{ik}$ as defined above is the squared Euclidean distance in case \nthere are no missing values. \n \nWagstaff and Laidler (2005) note that in some applications imputation and \nmarginalization are not suitable, since the missing values are physically meaningful and \nshould not be supplemented or discarded. They implemented an algorithm, called \nKSC (K-means with soft constraints), that deals with the whole data set including \nthe partially measured objects. \nAdditional information on dealing with missing values can be found in Little & Rubin \n(1987). \n \n(ii) Smoothing noisy data \nNoisy data are caused by (random) error or variance in a measured variable, as well as \nincorrect attribute values due to faulty data collection instruments, data entry and \ntransmission problems, inconsistencies in naming convention etc. In case of noisy \ndata one can decide to filter/smooth them first in order to partially remove some of the \neffects of the noise. For example, binning, which consists of first sorting the data and then \npartitioning them into (equal frequency) bins and subsequently smoothing them by \nreplacing them by their bin means, medians or bin boundaries, is a simple way of \nfiltering the data. More advanced approaches, such as regression analysis, \ntrend detection or noise filtering (applying e.g. moving averages), can also be invoked \nto partially remove noise from the data. \n \n(iii) Handling outliers \nOutliers are data values that are extremely large or small relative to the rest of the \ndata. 
Therefore they are suspected to misrepresent the population from which they \nwere collected. Outliers may be the result of errors in measurements, model results, \ndata coding and transcription, but may also point to (often unexpected) true extreme \nvalues, indicating more variability in the population than was expected. Therefore, in \ntreating outliers one has to be cautious not to falsely remove outliers when they \ncharacterize important features (e.g. hotspots) of the phenomenon at hand; it is \nobvious that the decision to discard an outlier should not be based solely on a \nstatistical test but should also be taken on the basis of scientific and quality assurance \nconsiderations.  \nThe first step in handling outliers consists of the detection of outliers (see also \nRousseeuw et al., 2006). Though detecting outliers can partly be based on process \ninformation and combined computer and human inspection of graphical \nrepresentations of the data, one often relies on statistical techniques. Hubert and Van \nder Veeken (2008) recently proposed a statistical technique which is especially suited \nfor detecting outliers in skew-distributed multivariate data and is also related to the \nadjusted boxplot for skew-distributed data (Hubert and Vandervieren, 2008). Though \nseveral more refined robust estimators and outlier detection methods exist which are \ntypically geared to specific classes of skewed distributions, their approach is very \nuseful when no prior information about the data distribution is available, or when an \nautomatic and fast outlier detection method is required. In the CRAN-package \n<<robustbase>> (footnote 4) functionality is available for this form of outlier detection (function \n<<adjOutlyingness>>) as well as for the adjusted boxplot determination (function \n<<adjbox>>). \nThe second step involves the pre-treatment of outlier values before performing cluster \nanalysis. 
In general three strategies can be applied: (a) using the outlying data \npoints in the subsequent analysis, accounting for their effects on the outcomes; (b) \ntrimming: removing the outlier data from the data set, and not incorporating them in \nthe dataset for the subsequent cluster analysis; (c) winsorising: replacing the outlying \nvalues by a truncated variant, e.g. a specific percentile (e.g. the 1st or 99th percentile) \nof the dataset, or an associated cut off-value of the skewed boxplot (Hubert and Van \nder Veeken, 2008). These truncated data points are included in the cluster analysis. \n \n4 Cf. http://cran.r-project.org/web/packages/robustbase/ \n \nThe above procedure is in fact centred around detecting outlying values with respect \nto a (supposedly) underlying distribution of the attribute-dataset, before the cluster \nanalysis takes place. There is however also the issue of detecting outliers with respect \nto the obtained partition of the objects into clusters, i.e. after the cluster analysis has \nbeen performed: \n \n− Irigoien and Arenas (2008) recently proposed a geometrically inspired method for \ndetecting potential atypical outlying data-points.  \n− Also the Silhouette statistic proposed by Rousseeuw (1987) can be used as an \nindication of the outlyingness of a point in a cluster. It measures how well a \ncertain data point/object, say i, is matched to the other points/objects in its own \ncluster, versus how well matched it would be, if it were assigned to the next \nclosest cluster. The Silhouette of i is expressed as s(i)=[b(i)-a(i)]/max[a(i),b(i)], \nwhere a(i) denotes the average distance between the i-th point and all other points \nin its cluster, and b(i) is the average distance to the points in the “nearest” cluster, \nwith nearest being defined as the cluster that minimizes this average distance. 
s(i) is a value between -1 and +1, and large (positive) values indicate strong clustering, while negative \nvalues indicate that clustering is bad. See e.g. Figure 1 which gives an example of \na Silhouette plot, as well as the associated 2-dimensional projection of the cluster \npoints. The Silhouette statistic can e.g. be calculated with the function \n<<silhouette>> in the CRAN-package <<cluster>>5. \n \n[Figure 1, left frame: Silhouette plot of pam(x = iris.x, k = 3); horizontal axis: silhouette width si; n = 150, 3 clusters Cj, average silhouette width 0.55; cluster sizes nj and average silhouette widths per cluster: 1: 50 | 0.80, 2: 62 | 0.42, 3: 38 | 0.45. \nRight frame: CLUSPLOT(iris.x), Component 1 versus Component 2; these two components explain 95.81 % of the point variability.] \n \nFigure 1: An example of a Silhouette plot for a cluster analysis with three clusters. The plot \nexpresses the (ordered) silhouette values for the points in the three clusters. It shows that \nmost points in the first cluster have a large silhouette value, greater than 0.6, indicating that \nthe cluster is somewhat separated from neighbouring clusters. The second and third clusters \nalso contain several points with low silhouette values, indicating that those two clusters are \nnot well separated, as exemplified in the 2-dimensional cluster plot in the right frame. \n \nThe R-commands for constructing these results are: \n \n## Partitioning iris-data (data frame) into 3 clusters, \n## and displaying the silhouette plot. \n## Moreover a 2-dimensional projection of the partitioning is given. \n \nlibrary(cluster)       # Load the package cluster \ndata(iris)             # Load the famous (Fisher’s or Anderson’s) iris-dataset \niris.x <- iris[, 1:4]  # Select the data columns Sepal.Length, Sepal.Width, \n                       # Petal.Length and Petal.Width \n \npr3 <- pam(iris.x, 3)  # Perform the clustering by the PAM-method with 3 clusters \nsi <- silhouette(pr3)  # Compute the Silhouette information for the given clustering \n \n# Draw a silhouette plot with clusterwise coloring \nplot(si, col = c(\"red\", \"green\", \"blue\")) \n \n# Draw a 2-dimensional clustering plot for the given clustering \nclusplot(iris.x, pr3$clustering, shade = TRUE, color = TRUE, \n         col.clus = c(\"red\", \"green\", \"blue\")) \n \n5 Cf. http://cran.r-project.org/web/packages/cluster/ \n \nFor more information on outlier-detection and analysis we refer to section 7.11 in Han \nand Kamber, 2006, who distinguish four different approaches to outlier analysis: statistical \ndistribution-based, distance-based, density-based local outlier detection and the \ndeviation-based approach. \n \n2.3.2 Data integration  \n \nWhen integrating multiple data-sources (databases, files or data-cubes) redundant data \ncan occur, since e.g. the same attribute or object may have different names in different \nsources, or one attribute may be a ‘derived’ attribute in another source (e.g. annual \nvalues, instead of monthly values). Correlation analysis can e.g. be used to point at \npotential redundancies in the data, while additional post-processing (e.g. data-reduction; \nsee later) can be used to alleviate their effects.  \nIn data integration one should also be aware of potential data value conflicts which \ncan occur when attribute values from different sources are different e.g. due to \ndifferent representations or scales. These problems can be avoided by carefully \nperforming and checking the data integration. \n \n2.3.3 Data transformation  \n \nData transformation first of all includes normalization of the data to bring them into a \nform which is more amenable for the subsequent analysis. 
It is well-known that the \nmeasurement scale can have a large effect on the outcome of cluster analyses, as \nillustrated in Figure 5 of Kaufman and Rousseeuw, 1990 or in Silver, 1995. Therefore \nit is considered important to bring the data into a form which is less dependent on the \nchoice of measurement/representation scale. A typical standardization (the \n“(min,max)-range standardization”) which is used for this purpose consists of \ndetermining the range of values6 and redefining the value of X(i) by:  \n(X(i)-min)/(max-min), thus obtaining values between 0 and 1, where 0 and 1 refer to \nthe extreme values (i.e. min and max7). Other statistical transformations, like the \nZ-transform - which replaces X(i) by (X(i)-mean)/stdev, with mean being the average \nvalue, and stdev the standard deviation of all data-values X(i) - are also conceivable, \nbut are considered less apt when performing cluster-analysis (cf. Milligan and Cooper, \n1988, Kettenring, 2006).  \n \nRemark: Though the (min,max) standardization has the function of transforming the variables \ninto a comparable format, some caution is due in using it. E.g. in situations where certain \nvariables are already measured in a commensurable scale, applying this additional \nstandardization can result in an artificial rescaling of the variables which obscures their actual \ndifferences. E.g. when the actual min-max ranges differ (e.g. the actual values for variable A \nrange from .2 to .25, while those of variable B range from .015 to .8), rescaling on basis of the \nactual min-max range will result for both variables in values running from 0 to 1, which \nrenders a very different (and erroneous) view on their difference. In this situation one could \nargue for not automatically rescaling these variables, but proceeding with the unscaled version. \nHowever, one can as easily argue against this, by stating that the use of an unscaled version \nfor these variables will result in an unfair bias towards other variables which have been \nre-scaled into the complete (0, 1) range by applying the (min,max) standardization. What choices \nwill be made in the end will depend on what is considered important. This situation in fact \nasks for a sensitivity analysis to study what effects the applied alternative standardization \noptions can possibly have on the clustering results. \n \n6 This can e.g. be the actual range, i.e. the difference between the actual maximum and minimum of the current data, \nor the maximal feasible range one can think of (i.e. beyond the actual data-sample). \n7 The min and max can here refer to the actual minimum and maximum of the dataset at hand, but can \nalso refer to the feasible minimum and maximum which can realistically be expected, and which can be \nsmaller (for the minimum) or larger (for the maximum) than the actual ones. \n \nAnother issue concerns the use of non-linear transformations on the variables to bring \nthem into a form which e.g. fits more to the underlying assumptions: e.g. a right-skewed \ndistribution could possibly be transformed into approximately Gaussian form \nby using a logarithmic or square-root transformation, to make the data more amenable \nto statistical techniques which are based on normality assumptions. In analyzing these \ntransformed data one should however realize that re-interpretation of the obtained \nresults in terms of the original untransformed data requires due care, since means and \nvariances of the transformed data render biased estimates when transformed back to \nthe original scale. 
Therefore, if the nonlinear transformations of the data are expected \nto have no noticeable benefits for the analysis, it is usually better to use the original \ndata with a more appropriate statistical analysis-technique (e.g. robust regression in \ncase one wants to relate variables to each other). \n \n2.3.4 Data reduction \n \nIn situations where the dataset is very large, data reduction is in order to reduce run \ntime and storage problems in performing cluster analysis. The challenge is to obtain a \nreduced representation of the dataset that is much smaller in volume but produces the \nsame (or almost the same) analytical results. Various reduction strategies are in order \nto achieve this: \n \n(i) Aggregation: consists of combining two or more attributes (or objects) into a single \nattribute (or object), thus resulting in a reduced number of attributes or objects. One \nshould strive to find aggregations which make sense, and highlight important aspects \nof the problem at hand. This can also involve a change of scale (e.g. cities aggregated \ninto regions, states, countries; daily, weekly, monthly averages), and can render more \n‘stable’ data (less variability), however at the price of losing information on the more \ndetailed scale. \n \n(ii) Sampling: Instead of processing the complete dataset one can decide to process \npart of the dataset which is obtained by selecting a restricted (random) sample. In this \nprocess one has to be sure that the selected sample accurately represents the \nunderlying cluster- or populations structure in which one is interested. \n \n(iii) Feature selection: Feature Selection consists of identifying and removing features \n(or equivalently attributes, variables) which are redundant (e.g. duplicating much of \nthe information in other features) or irrelevant (e.g. containing no information that is \nuseful for the data mining task at hand, e.g. identifiers of objects). 
Apart from brute \nforce approaches which try all possible feature subsets, more advanced techniques can \nbe invoked as e.g. filter and wrapper approaches to find the best subset of attributes\n\n15 \n \n(see the extensive literature on these topics in machine learning and data-mining, e.g. \nBlum and Langley, 1997, Kohavi and John, 1997; see also Xing, 2003, Guyon and \nElisseeff, 2003, Guyon et al., 2006, Handl and Knowles, 2006, Liu, Yun, 2005, Saeys \net al. 2007). This last class of techniques can be implemented in a forward (stepwise \nforward selection) or a backward (stepwise backward elimination) fashion, similar to \nstepwise regression. See also table 5 in Jain et al. (2000) where a number of feature \nselection methods are briefly discussed in the context of statistical pattern recognition.  \n \nA number of (recent) publications more specifically address feature (or variable, \nattribute) selection for cluster analysis:  \n• Friedman and Meulman (2004) proposed, in the context of hierarchical clustering \nmethods, a method to cluster objects on subsets of attributes. It is based on the \nidea that subsets of variables which contribute most to each cluster structure may \ndiffer between the clusters. Software is available in R to perform this analysis \n(COSA; see http://www-stat.stanford.edu/~jhf/COSA.html). Damian et al. (2007) \ndescribe applications of this algorithm in medical systems biology. \n• Raftery and Dean (2006), in the context of model-based clustering, propose a \nvariable selection method, which consistently yields more accurate estimates of \nthe number of groups and lower classification error rates, as well as more \nparsimonious clustering models and easier visualization of results. See the CRAN-\npackage <<clustvarsel>>8 for related software.  \nFor interesting further developments see the recent paper of Maugis et al. (2008, \n2009). 
Methods which especially focus on situations with very many variables \n(high-dimensional data), are furthermore presented in McLachlan et al. 2002, \nTadesse et al. (2005), Kim et al. (2006). See also Donoho and Jin (2008, 2009) for \nthe related case of discriminant analysis (i.e. supervised classification).  \n• Steinley and Brusco (2008b) compared various procedures for variable selection \nproposed in literature, and concluded that a novel variable weighting and selection \nprocedure proposed by Steinley and Brusco (2008a) was most effective. \n• Mahoney and Drineas (2009) recently proposed so called CUR matrix \ndecompositions, i.e., low-rank matrix decompositions that are explicitly expressed \nin terms of a small number of actual columns and/or actual rows of the original \ndata matrix as a means for improved data-analysis, which can be usefully applied \nin clustering. \n• Donoho and Jin (2008, 2009) address optimal feature selection in the context of \nclassification and discriminant analysis in case that useful features are rare and \nweak. Their idea of using a thresholding strategy for feature Z-scores can be \nextended to cluster analysis applications. \n• Fraiman et al. (2008) recently introduced two procedures for variable selection in \ncluster analysis and classification, where one focuses on detecting ‘noisy’ non-\ninformative variables, while the other also deals with multi-colinearity and general \ndependence. The methods are designed to be used after a ´satisfactory´ grouping \nprocedure has already been carried out, and moreover presuppose that the number \nof clusters is known and that the resulting clusters are disjoint. The main \nunderlying idea is to study which effect the blinding of subsets of variables (by \nfreezing their values to their marginal or conditional mean) has on the clustering \nresults as compared to the clustering the full variable set. 
To enable analysis for \nhigh-dimensional data a heuristic forward-backward algorithm is proposed to \n                                                     \n8 Cf. http://cran.r-project.org/web/packages/clustvarsel/\n\n16 \n \nconsecutively search (in a non-exhaustive way) for an appropriate variable \nselection. The performance of Fraiman’s methods in simulated and real data \nexamples is quite encouraging, and at points it also outperformed Steinley and \nBrusco (2008a) method. \n• Krzanowski and Hand (2009) recently proposed a simple F-test like criterion to \nevaluate whether the ratio of the between-group and the within-group sum of \nsquares for each specific variable is significantly greater than what would be \nexpected in a single homogeneous population (i.e. if no clustering would be \ninvolved). On basis of this easily computable test they expect to make an \nappropriate pre-selection/reduction of the variables for clustering applications \nwith very many variables involved. This is especially the case for applications like \nthe genetic characterization of diseases by microarray techniques, where typically \nvery many gene expression levels p are involved as compared to subjects n (e.g. \nvalues of n are in the hundreds, while values of p are in the thousands). More \nspecialized approaches for these high dimensional situations are more \ncomputationally demanding and more specifically bound to specific cluster \nanalysis techniques like mixture model-based approaches (cf. McLachlan et al. \n2002, Tadesse et al. (2005), Kim et al. (2006)). \n \nIn appendix D, we highlight some simple alternatives related to the latter two methods \nthat can be straightforwardly used for performing this feature selection, and give some \nexamples of their use. \n \nComplementary to variable selection one can also consider the use of variable \nweighting to express the relative (ir)relevance of features or variables (Gnanadesikan, \nKettenring and Tsao, 1995). 
De Soete, (1986, 1988) initially has developed optimal \nschemes for ultrametric and additive tree clustering (see also Milligan, 1989), and \nMakarenkov and Legendre (2001) have extended these9 also for K-means partitioning \nmethods. For k-means type clustering Huang et al., 2005 propose a procedure that \nautomatically updates variable weights based on the importance of the variables in \nclustering. Small weights reduce the effects of insignificant or noisy variables. As a \nfurther improvement on Huang’s procedure, Tsai and Chiu (2008) recently proposed a \nweight self-adjustment (FWSA) mechanism for K-means to simultaneously minimize \nthe separations within clusters and maximize the separations between clusters. They \ndiscuss the benefits of their method on basis of synthetic and experimental results. \nGnandesikan et al. (2007) recently proposed simple methods for weighting (and also \nfor scaling) of variables. \n \n (iv) Dimension Reduction/Feature Extraction: For reducing the dimensionality of the \ndataset, various methods can be applied which use (non-linear) transformations to \ndiscover useful and novel features/attributes from the original ones (cf. Jain et al. \n1999, 2000, Law and Jain, 2006, Camastra, 2003, Fodor, 2002). E.g. principal \ncomponent analysis (PCA) (Jolliffe, 2002) is a classical technique to reduce the \ndimensionality of the data set by transforming to a new set of variables which \nsummarizes the main features of the data set. Though primarily defined as a linear \nfeature extraction technique, suitable non-linear variants (kernel PCA) have been \ndeveloped in the last decades (see Schölkopf et al. 1999). PCA is often used as a \npreliminary step to clustering analysis in constraining attention to a few variables. 
But \n                                                     \n9 For downloading this software see http://www.bio.umontreal.ca/casgrain/en/labo/ovw.html\n\n17 \n \nits use can be problematic as illustrated by Sneath, 1980, Chang, 1983. These \nreferences show that clusters embedded in a high-dimensional data-space will not \nautomatically be properly represented by a smaller number of orthogonal components \nin a lower dimensional subspace. Yeung and Russo, 2001 also demonstrate that \nclustering with the PC’s (Principal Components) instead of the original variables does \nnot necessarily improve cluster quality, since the first few PC’s (which contain most \nof the variation in the data) do not necessarily capture most of the cluster structure.  \nIn addition to PCA, alternative techniques can be envisioned for the task of dimension \nreduction, like factor analysis, projection pursuit, independent component analysis, \nmulti-dimensional scaling (MDS10), Sammon’s projection11, IsoMap, Support Vector \nMachines, Self-Organizing Maps etc. (cf. De Backer et al. 1998, Jain et al. 2000, \nFodor, 2000, Tenenbaum et al. (2000)). However, the same caveats as mentioned \nbefore for the PCA remain active. Moreover one should realize that feature extraction \n- unlike feature selection - typically results in transformed variables, consisting of \n(non)linear combinations of the original features, for which the original meaning has \nbeen lost. This can be an impediment in interpreting the results of the subsequent \nclustering in terms of the original variables. \nIn R the packages12 <<kernlab>> and <<MASS>> deal with several of these \ncomputational techniques. \n \n(v) Mapping data to a new space \nIn order to highlight specific dynamics in the data, techniques like using Fourier \ntransforms or wavelet transforms can be used to map the data into a new space, where \nfurther analysis can take place (cf. § 2.5.3. in Han and Kamber, 2006). 
The underlying \nrationale is that in the new space fewer dimensions are needed to characterize the \ndataset to a sufficient extent, thus achieving data reduction. \n \n10 MDS (multidimensional scaling) represents the similarity (or dissimilarity) among pairs of objects in \nterms of distances between points in a low-dimensional (Euclidean) space, and offers a graphical view \nof the dissimilarities of the objects in terms of these distances: the more dissimilar two objects are, the \nlarger the distance between these objects in Euclidean space should be (Borg and Groenen, 1997). \n11 Sammon’s nonlinear mapping is a projection method for analysing multivariate data. The method \nattempts to preserve the inherent structure of the data when the patterns are projected from a \nhigher-dimensional space to a lower-dimensional space by maintaining the distances between patterns under \nprojection. Sammon’s mapping has been designed to project high-dimensional data onto one to three \ndimensions. See Lerner et al. (2000) for information on initialising Sammon’s mapping. \n12 Cf. http://cran.r-project.org/web/packages/kernlab and http://cran.r-project.org/web/packages/MASS/ \n \n2.3.5 Data discretisation  \n \nBy dividing the range of continuous attributes into intervals one can reduce the \nnumber of values. Reduction of data can also be established by replacing low level \nconcepts by higher level concepts (e.g. replacing numeric values for the attribute ‘age’ \nby categories such as young, middle-aged or senior). Techniques like binning, histogram \nanalysis, clustering analysis, entropy-based discretisation and segmentation by natural \npartitioning can be applied for this purpose (cf. § 2.6 in Han and Kamber, 2006). \n \n2.3.6 Cluster tendency  \n \nOne difficulty of cluster algorithms is that they will group the data into clusters even \nwhen there are none. 
Later we will discuss the possibilities of validating the results of \na clustering, but here we present a number of ways by which the user can estimate a \npriori whether the data contain structure.  \n \n \nFigure 2: Artificial data set (left), image-plot (R-function) of the distance matrix of this data \nset (centre), image-plot of the data set after applying VAT-algorithm (right). \n \nIn the VAT-algorithm Bezdek, Hathaway and Huband (2002) represent each pair of \nobjects by their distance. The emerging dissimilarity matrix is subsequently ordered \nand visualized by grey levels (0 if distance is zero and 1 for the maximum distance) \n(Figure 2, right). See also Bezdek, Hathaway and Huband (2007) where a technique is \npresented for the visual assessment of clustering tendency on basis of dissimilarity \nmatrices.  \nHu and Hathaway (2008) further developed this idea beyond the pure graphical \ninterpretation of the result. They implemented several tendency curves that average \nthe distances in the dissimilarity matrix. The peak-values in the tendency curves can \nthen be used as a signal for cluster structures and for automatic detection of the \nnumber of clusters.  \n \n \nFigure 3: Artificial data set with uniformly distributed values (left), h=0.5; artificial raster \ndata set (centre), h=0.1; data with three artificial normally distributed clusters (right), h=1. \n \nAnother possibility to check whether there are clusters in the data or not is the \nHopkins-Index, which is described in Runkler (2000) or Jain & Dubes (1988). The \nlatter reference proposes to use hypothesis tests of randomness for getting insight into \nthe data structure. Also tests like quadrat analysis, inter-point distance and structural \ngraphs can be employed. 
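
As a rough illustration of the Hopkins-Index, the following base-R sketch compares the nearest-neighbour distances u of uniformly generated points with the nearest-neighbour distances w of sampled data points, and returns h = sum(u)/(sum(u)+sum(w)). The function name hopkins_index, the use of plain (unpowered) distances, the bounding box of the data as sampling window and the 10 % sampling fraction are assumptions made here for illustration; Runkler (2000) and Jain & Dubes (1988) should be consulted for the exact definition. Under this convention values near 0.5 point to data without cluster tendency, and values close to 1 point to clustered data.

```r
# Hopkins-type index (illustrative sketch, see the assumptions stated above)
hopkins_index <- function(x, m = ceiling(0.1 * nrow(x))) {
  x <- as.matrix(x)
  n <- nrow(x)
  p <- ncol(x)
  stopifnot(m >= 1, m < n)

  # Euclidean distance from point 'pt' to the nearest row of 'data'
  nn_dist <- function(pt, data) {
    diffs <- data - matrix(pt, nrow(data), p, byrow = TRUE)
    sqrt(min(rowSums(diffs^2)))
  }

  # w: distances from m sampled data points to their nearest other data point
  idx <- sample.int(n, m)
  w <- vapply(idx, function(i) nn_dist(x[i, ], x[-i, , drop = FALSE]), numeric(1))

  # u: distances from m uniform random points in the bounding box to the nearest data point
  lo <- apply(x, 2, min)
  hi <- apply(x, 2, max)
  u <- vapply(seq_len(m), function(j) nn_dist(runif(p, lo, hi), x), numeric(1))

  sum(u) / (sum(u) + sum(w))
}

set.seed(42)
# Two tight, well separated artificial clusters (100 points each)
clustered <- rbind(matrix(rnorm(200, 0, 0.05), ncol = 2),
                   matrix(rnorm(200, 5, 0.05), ncol = 2))
# 200 uniformly distributed points in the unit square
uniform <- matrix(runif(400), ncol = 2)

h_clustered <- hopkins_index(clustered)
h_uniform <- hopkins_index(uniform)
```

Since the index depends on the random sample, it is advisable to repeat the calculation a number of times and to inspect the spread of the resulting values.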
\n \n \n \n \n \n \n \n \n \n \n \n2.3.7 Visualizing clusters  \n \nA number of graphical techniques for visualizing and identifying clusters in one or \ntwo dimensions can be employed, such as histograms, scatter plots and kernel density \nestimators. For multivariate data with more than two dimensions one can e.g. use \nscatterplot matrices, but these only project two-dimensional marginal views and do \nnot necessarily reflect the true nature of the structure in the p-dimensional dataspace. \nAn alternative approach is to project the multivariate data into one or two dimensions \nin a way that the structure is preserved in some sense as fully as possible. A common \nway (although not necessarily the most appropriate) is principal component analysis. \nOther methods like exploratory projection pursuit, multidimensional scaling, support \nvector machines are also potential candidates for visualization of clusters. See e.g. \nchapter 2 and section 8.6 in Everitt et al. 2001, and chapter 9 in Xu and Wunsch, 2009 \nfor more information. Also graphical techniques for exploring the structure in \nmultivariate datasets, like co-plots or trellis graphics (see e.g. chapter 2 in Everitt and \nDunn, 2001) can offer useful insights for cluster analysis. R offers various \npossibilities to generate such plots. In chapter 6 some of these will be discussed. \n \nSummary \n \nIn this chapter we extensively highlighted what issues and decisions are involved in \nselecting and pre-processing data of interest for the problem at hand. This not only \ninvolves the treatment of missing values and outliers, but also a judicious selection of \nvariables or features of interest (e.g. removing redundancies, avoiding overly strong \ndependencies) for the subsequent cluster analysis, as well as adequate data \ntransformations to bring the data values to a more even and comparable scale. 
\nPreliminary checks on whether the data indeed contain clusters, and whether some \ngroup structure is visible will also render important information for the next steps in \nthe actual clustering. Finally, since data-processing can influence the outcomes of the \nclustering, it will be important at the end to study the sensitivity of the identified \nclusters for feasible alternative choices in data selection and pre-treatment. \n \nWhich datasets are ‘clusterable’? \nAckerman and Ben-David (2009) theoretically assess several notions of clusterability \ndiscussed in literature and propose a new notion which captures the robustness of the \nresulting clustering partition to perturbations of the cluster centres. They discover that \nthe more clusterable a data set is, the easier it is (computationally) to find a close-to-\noptimal clustering of that data, even showing that near-optimal clustering can be \nefficiently computed for well clusterable data. In practice it is however usually a \ncomputer-intensive problem (NP-hard) to determine the clusterability of a given dataset.\n\n20 \n \n3 Selection of a distance measure in the data space \n \nA central issue in clustering objects is knowledge on how ‘close’ these objects are to \neach other, or how far away. This reflects itself in the choice of the distance measure \nor the (dis)similarity measure on the objects. \n \nIn case that the distances between the objects are ‘directly available’, as e.g. in \nsurveys where people are asked to judge the similarity or dissimilarity of a set of \nobjects, the starting point of the clustering is a n-by-n proximity matrix, which stores \nthe (dis)similarities between the pairs of objects (i.e. d(i,j) is the dissimilarity between \nobjects i and j, with i, j = 1,…, n).  \n \nIf distances are not directly available, information on the objects is typically available \non their features/attributes. 
The typical starting point for a cluster analysis is then a \ndata-matrix in the form of a table or n-by-p matrix that represents the n objects (rows) \nwith their associated p attributes (columns). In discussing how this data-matrix can be \ntransformed into a dissimilarity matrix, we assume that after the previous step \nhighlighted in section 2 (i.e. “data pre-treatment”) the data space is in a definite form, and \ndoes not need additional normalization or weighting. This means e.g. that the \napplication of weights to individual features to express differences in relevance has \nalready been established. Moreover it presupposes that care has been exerted not to \ninclude non-informative features in our data, since they can trash the clustering by \ndisturbing or masking the useful information in the other features/variables. \n \n3.1 The binary data case \n \nIn case that all the attributes are binary (say 0 or 1, or no/yes), the similarity between \nobjects is typically expressed in terms of the counts of matches and mismatches \nobtained when the p features of two objects are compared.  \n \n                    Object j \n Outcome          1       0       Total \n Object i   1     a       b       a+b \n            0     c       d       c+d \n Total            a+c     b+d     p \n \nTable 1: Counts of binary outcomes for two objects \n \nA number of similarity measures have been proposed, and a more extensive list can be \nfound in Gower and Legendre (1986). \n \n      Measure                               Similarity-measure \n S1   Matching coefficient                  S(i,j)=(a+d)/(a+b+c+d) \n S2   Jaccard coefficient (Jaccard, 1908)   S(i,j)=a/(a+b+c) \n S3   Rogers and Tanimoto (1960)            S(i,j)=(a+d)/[a+2(b+c)+d] \n S4   Sokal and Sneath (1963)               S(i,j)=a/[a+2(b+c)] \n S5   Gower and Legendre (1986)             S(i,j)=(a+d)/[a+.5*(b+c)+d] \n S6   Gower and Legendre (1986)             S(i,j)=a/[a+.5*(b+c)] \n \nTable 2: Similarity measures for binary data, cf. table 3.3 in Everitt et al. (2001) \n \nNotice that some of these similarity measures do not count zero-zero matches (i.e. d). 
\nIn cases where both outcomes of binary variables are equally important (e.g. as in \ngender: male/female) it is logical to include zero-zero matches when expressing the \nsimilarity between objects. However, in more asymmetric situations where the \npresence of a feature (e.g. an illness) is considered more important than the absence, it \nis advisable to exclude the zero-zero matches (i.e. the d) when assessing the similarity \nof objects, since these could dominate the similarity between objects, especially if \nthere are many attributes absent in both objects (i.e. d is large compared to a, b and c). \nWhen co-absences are considered informative, the simple matching coefficient S1 is \nusually employed, while Jaccard’s coefficient S2 is typically used if co-absences are \nnon-informative. S3 and S5 are examples of symmetric coefficients treating positive \nand negative matches in the same way, but assigning different weights to matches and \nnon-matches. Sokal and Sneath (1963) argue that there are no fixed rules regarding \nthe inclusion or exclusion of negative or positive matches, and that each dataset \nshould be considered on its merits. The choice of the specific similarity measure can \ninfluence the cluster analysis, since the use of different similarity coefficients can \nresult in widely different (dis)similarity values, as is e.g. the case for S1 and S2: for \na=1, b=1, c=1 and d=7 one obtains S1=0.8 but S2=1/3. Gower and \nLegendre (1986) show that S2, S4 and S6 are monotonically related, as are S1, S3 and S5.  \n \n3.2 The categorical data case \n \nCategorical data where the attributes have more than two levels (e.g. eye colour) \ncould be dealt with in the same way as binary data, by regarding each level of an attribute \nas a single binary variable. This is however not an attractive approach since many \n‘negative’ matches (i.e. d) will inevitably be involved. 
A far better approach is to assign a score s_ijk of zero or one to each attribute k, depending on whether the two objects i and j are the same on that attribute. These scores are then averaged over all p attributes to give the required similarity coefficient: \n \n$$s_{ij} = \\frac{1}{p} \\sum_{k=1}^{p} s_{ijk}$$ \n \nNotice that this similarity coefficient is a generalisation of the matching coefficient S1 for binary data. \n \n3.3 The continuous data case \n \nWhen all the attribute values are continuous, the proximity between objects is expressed in terms of a distance measure in the data space. Often Euclidean distance is used: \n \n$$d(i,j) = \\sqrt{(x_{i1}-x_{j1})^2 + (x_{i2}-x_{j2})^2 + \\cdots + (x_{ip}-x_{jp})^2}$$ \n \nbut various other distance measures can be applied as well, such as the Manhattan or city-block distance: \n \n$$d(i,j) = |x_{i1}-x_{j1}| + |x_{i2}-x_{j2}| + \\cdots + |x_{ip}-x_{jp}|$$ \n \nor the general Minkowski distance (q ≥ 1): \n \n$$d(i,j) = \\left( |x_{i1}-x_{j1}|^q + |x_{i2}-x_{j2}|^q + \\cdots + |x_{ip}-x_{jp}|^q \\right)^{1/q}$$ \n \nWe assume that missing values have been treated, e.g. by replacing them by the mean value over the non-missing part, or by redefining the distance measure accordingly. \n \nThe correlation between the p-dimensional observations of the ith and jth objects can also be used to quantify dissimilarities between them, as in: \n \n$$d(i,j) = \\frac{1-\\rho_{ij}}{2}, \\quad \\text{where} \\quad \\rho_{ij} = \\frac{\\sum_{k=1}^{p} (x_{ik}-m_i)(x_{jk}-m_j)}{\\sqrt{\\sum_{k=1}^{p} (x_{ik}-m_i)^2 \\, \\sum_{k=1}^{p} (x_{jk}-m_j)^2}}$$ \n \nwith m_i and m_j the corresponding averages over the p attribute values. This measure is however considered contentious as a measure of dissimilarity since it does not account for relative differences in size between observations (e.g. 
x1=(1,2,3) and x2=(3,6,9) have correlation 1, although x2 is three times x1). Moreover the averages are taken over different attribute values, which is problematic if their scales are different. But in situations where attributes have been measured on the same scale and refer to relative profiles (e.g. for classifying animals or plants, absolute sizes of organisms or their parts are often considered less important than their shapes), correlation measures can also be used to express dissimilarities. Further information can be found in section 3.3 in Everitt et al. (2001), Gower and Legendre (1986) and Calliez and Kuntz (1996). \n \n3.4 The mixed data case \n \nWhen the attribute values are mixed, i.e. containing both continuous and categorical data values, a similarity measure can be constructed by weighting and averaging the similarities for the separate attributes, as proposed by Gower (1971): \n \n$$s_{ij} = \\frac{\\sum_{k=1}^{p} w_{ijk} \\, s_{ijk}}{\\sum_{k=1}^{p} w_{ijk}}$$ \n \nwhere s_ijk is the similarity between the ith and the jth object as measured by the kth feature, and w_ijk is typically one or zero depending on whether or not the comparison is considered valid. E.g. w_ijk can be set to zero if the outcome of the kth feature is missing for either or both of the objects i and j, or if the kth feature is binary and it is thought appropriate to exclude negative matches. For binary variables and for categorical variables with more than two categories the component similarities s_ijk take value one when the two objects have the same value and zero otherwise. For continuous variables the similarity measure is defined as: \n \n$$s_{ijk} = 1 - \\frac{|x_{ik}-x_{jk}|}{R_k}$$ \n \nwhere R_k is the range of observations for the kth attribute (i.e. the city-block distance is used after scaling the kth variable to unit range). 
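As an illustration of the mixed-data similarity of Gower (1971) described above, the following is a minimal sketch in plain Python. The function name gower_similarity, the ranges argument (the range R_k for a continuous attribute, None for a categorical or binary one) and the asymmetric_binary argument (indices of binary attributes whose zero-zero matches are excluded) are hypothetical names chosen for this example; they are not taken from the report or from any library. Missing values are marked with None and receive weight zero.

```python
def gower_similarity(xi, xj, ranges, asymmetric_binary=frozenset()):
    '''Similarity s_ij = sum_k(w_ijk * s_ijk) / sum_k(w_ijk) for two objects.

    ranges[k] is the (positive) range R_k of continuous attribute k,
    or None if attribute k is categorical or binary.
    '''
    num = 0.0
    den = 0.0
    for k, (a, b, r) in enumerate(zip(xi, xj, ranges)):
        if a is None or b is None:
            continue  # missing outcome: w_ijk = 0
        if k in asymmetric_binary and a == 0 and b == 0:
            continue  # excluded negative match: w_ijk = 0
        if r is None:
            s = 1.0 if a == b else 0.0   # categorical or binary attribute
        else:
            s = 1.0 - abs(a - b) / r     # continuous attribute, scaled by R_k
        num += s                         # w_ijk = 1
        den += 1.0
    if den == 0.0:
        raise ValueError('no valid attribute comparisons for this pair')
    return num / den


# Attributes: height (range 50), eye colour, illness present (0/1), weight (range 40)
obj_i = [170.0, 'blue', 0, None]
obj_j = [180.0, 'brown', 0, 70.0]
ranges = [50.0, None, None, 40.0]

s_all = gower_similarity(obj_i, obj_j, ranges)
s_asym = gower_similarity(obj_i, obj_j, ranges, asymmetric_binary={2})
print(s_all, s_asym)
```

In this example the weight attribute is skipped because it is missing for the first object. Treating the illness attribute as asymmetric removes the shared absence from both numerator and denominator, which lowers the similarity compared with counting it as a match.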
\n \n3.5 The proximity between groups of objects \n \nThe proximity between individual objects can be used as a basis to construct expressions for the proximity between groups of objects. Various options exist for this: e.g. taking the smallest dissimilarity between any two objects, one from each group, leads to a nearest-neighbour distance and is the basis for the hierarchical clustering technique applying ‘single linkage’. \nThe opposite is to define the inter-group distance as the largest distance between two objects, one from each group; this yields the furthest-neighbour distance, which is the basis for the ‘complete linkage’ hierarchical clustering technique. An in-between approach is to take the average dissimilarity, which leads to a form of ‘group average’ clustering when applied in hierarchical clustering methods. Cf. Everitt et al. (2001), section 3.5, where alternative ways to express inter-group distances are also proposed, based on group summaries for continuous as well as categorical data. \n \nSummary \n \nIn order to express the similarity or dissimilarity between data points a suitable distance measure (metric) should be chosen. It forms the basis for performing the clustering to identify groups which are tightly knit but (preferably) distinct from each other. Often Euclidean distance is used as a metric, but various other distance measures can be used as well. \n \n4 Selection of clustering method \n \nThe extensive (and ever-growing) literature on clustering illustrates that there is no such thing as an optimal clustering method, an observation which is further underpinned by theoretical insights from Kleinberg (2002); see also Zadeh and Ben-David (2009). 
From the multitude of methods we will consider a number of classes of methods, giving most attention to traditional methods based on performing the clustering hierarchically and methods that constructively partition the dataset into a number of clusters (sections 4.1 and 4.2), while describing the other methods only briefly (sections 4.3-4.6). We will finish this chapter with a brief discussion on which method to choose (section 4.7). \n \n4.1 Hierarchical methods \n \nA hierarchical clustering method groups data objects into a tree of clusters. It does so iteratively, constructing clusters by joining (agglomerative) or dividing (divisive) the clusters obtained in a previous iteration. Agglomerative methods start this iterative process from the initial situation where each data point is considered a separate cluster, and form the hierarchical composition in a bottom-up fashion by merging clusters. Divisive methods start with the mega-cluster consisting of all data points, and work in a top-down fashion by successively splitting the clusters. Merging or splitting is done on the basis of the mutual distances between the clusters. A number of linkage rules can be applied to express the distance between clusters. For example the “single” rule (‘single linkage’) always takes the smallest of all possible distances between the data points within two different clusters; the “complete” rule (‘complete linkage’) chooses the largest of all distances, while the “average” rule is based on the average distance (‘average linkage’). A popular linkage rule is Ward’s method, which merges the clusters whose fusion produces the smallest increase in within-cluster variance. All the information on the process of merging can be represented in a tree (dendrogram) which can be cut at a selected point (number of clusters), revealing a suitable cluster structure for the data. 
A more formal method for determining the number of clusters, based on detecting the ‘knee’ in an associated clustering evaluation graph, is proposed in Salvador and Chan (2004) and favourably compared with two alternative methods.  \n \nHierarchical clustering methods have a large computational complexity (O(n^2)), where n is the number of data points or objects, which usually constrains their application to small and medium-sized datasets. In building the dendrogram, non-uniqueness and inversions can occur due to ties in the data and due to the order of the dataset, cf. Morgan and Ray (1995), MacCuish et al. (2001) and Spaans and Heiser (2005).  \n \nThe linkage rule in hierarchical clustering can be tuned to the data, so that non-spherical clusters can also be identified. One should however be aware that applying hierarchical clustering can lead to very different results on the same dataset, depending on the linkage rule used: the single linkage strategy tends to produce unbalanced and elongated clusters, especially in large datasets, since separated clusters with ‘noise’ points between them tend to be joined together (‘chaining’); complete linkage leads to compact clusters with equal diameters; average linkage tends to join clusters with small variances and is intermediate between single and complete linkage; Ward’s method assumes that the objects can be represented in Euclidean space and tends to find spherical clusters of similar size, and it is sensitive to outliers. See e.g. table 4.1 in Everitt et al. (2001) and Kaufman and Rousseeuw (1990) for more information on the effects of linkage rules. \n \nIn their pure form hierarchical methods suffer from the fact that it is not possible to adjust a merge or split decision which was taken in a previous iteration. 
This rigidity is useful since it restricts computational costs by preventing a combinatorial number of different choices, but it may lead to low-quality clusters if the merge or split decisions turn out not to be well chosen. To improve on this one can try to integrate hierarchical clustering with other clustering techniques, leading to multi-phase clustering. Three such methods are discussed in more detail in Han and Kamber (2006). The first, called BIRCH, applies tree structures to partition the objects into ‘microclusters’ and then performs ‘macroclustering’ on them using another clustering method such as iterative relocation. The second method, called ROCK, merges clusters based on their interconnectedness, and is a hierarchical clustering algorithm for categorical data. The third method, called Chameleon, explores dynamic modelling in hierarchical clustering. \n \nIn R hierarchical clustering can be invoked by the general function hclust(); various more specific hierarchical clustering techniques have also been implemented, e.g. the methods proposed in Kaufman and Rousseeuw (1990) (see the R-package <<cluster>>): \n• diana() (DIANA) for divisive clustering \n• mona() (MONA) for clustering binary data, using the monothetic divisive algorithm \n• agnes() (AGNES) for agglomerative clustering, providing six methods for the agglomeration process \n \nOther R-packages with hierarchical clustering methods are <<ctc>> (function “xcluster()”) and <<amap>> (functions “hcluster()” and “hclusterpar()”). \n \nFigure 4: Example of hierarchical clustering: clusters are consecutively merged with the most nearby clusters. The length of the vertical dendrogram lines reflects the nearness. \n \n4.2 Partitioning methods \n \nPartitioning algorithms divide a data set into a number of clusters, typically by iteratively minimizing some criterion expressing the distances between the data points and prototypical elements of a cluster (e.g. cluster centroids). 
\nUsually the square error criterion is used, defined as \n \n$$E = \\sum_{i=1}^{k} \\sum_{x \\in C_i} \\|x - m_i\\|^2$$ \n \nwhere E is the sum of the square error for all objects in the data set; x is the point in the space representing a given object, and m_i is the mean of cluster C_i. I.e. for each object in each cluster the distance from the object to its cluster centre is squared, and these squared distances are summed. This criterion tries to make the resulting k clusters as compact and as separate as possible. The number of clusters k is usually predetermined, but it can also be part of a search procedure using an explicit error function. \nWhen using the popular k-means partitioning algorithm one starts with k initial cluster centroids. The data points are then assigned to the nearest centroid. Subsequently the new centre of each cluster is determined as the average of all points within the cluster thus obtained, and again all points are re-assigned to their nearest centroid. This procedure is repeated until convergence is reached (e.g. points no longer change cluster), see Figure 5. \n \nFigure 5: Example of the iterative cluster-partitioning by k-means. Starting with an initial guess of the centroids (a), the data points are grouped to the nearest centroids (b), and the new centroids are determined as the centres of these groups. In the next step (c) the points are regrouped to the nearest (new) centroid. This process is repeated until the groups no longer change. \n \nk-means has a computational complexity of order O(kn), where n is the number of data points, and is therefore also suitable for large datasets (n large). Its outcomes are sensitive to the initialization of the iterative search process, and an appropriate initialization is therefore of concern. E.g. 
Milligan (1980) proposes an initialisation on the basis of hierarchical clustering with Ward’s method on a small random subset of the large dataset; Arthur and Vassilvitskii (2001) recently proposed a smart seeding technique for initializing k-means. See Steinley & Brusco (2007) on various strategies for initializing k-means. \n \nAnother shortcoming of k-means is that it does not perform well for non-spherical or poorly separated clusters, or for clusters of very different sizes. Moreover it is sensitive to noise and outlying data points, since a small number of such points can drastically influence the mean values (cluster centres).  \n \nThere are quite a few variants of the k-means method (see e.g. Steinley (2006)), which have been developed to remedy its weak points. E.g. when clustering categorical data, the means of the clusters are not suitable representatives, and k-means has been replaced by the k-modes method (Chaturvedi, Green and Carroll (2001)), which uses new dissimilarity measures to deal with categorical objects and a frequency-based method to update the modes of clusters. For data with mixed numeric and categorical values k-means and k-modes can be integrated. \n \nTo deal with the sensitivity to outliers Kaufman and Rousseeuw (1990) proposed k-medoids clustering by the PAM approach (Partitioning Around Medoids; see the function pam() in the R-package <<cluster>>). The main difference from k-means is the choice of representative objects as cluster centres instead of the arithmetic mean. In the same way as above, after choosing k representative medoids the objects of the data set are assigned to the nearest representative medoid. In fact the partitioning is performed by minimizing the sum of the dissimilarities between each object and its corresponding representative point, i.e. 
using the absolute error criterion, which is less sensitive to outliers: \n \n$$E = \\sum_{i=1}^{k} \\sum_{x \\in C_i} \\|x - o_i\\|$$ \n \nwhere o_i is the representative medoid, being the most centrally located object of its cluster. In an iterative way the set of representative medoids is calculated, followed by a new assignment of the objects, and so on. A nice feature in connection with PAM is the silhouette plot (in R: silhouette(), or by plotting the PAM result). This plot illustrates how well an object lies within a cluster or merely at the edge of the cluster (Rousseeuw (1987)).  \n \nThe computational complexity of PAM is of the order O(k(n-k)^2), which makes computation very costly for large values of n and k. For these situations Kaufman and Rousseeuw constructed a method called CLARA (Clustering LARge Applications). In the first step a small portion of the dataset is chosen as a representative of the complete dataset. Using PAM on this small sample, medoids are determined, which are subsequently used to assign each object of the complete dataset to a specific cluster or medoid. CLARA draws multiple small samples from the complete dataset, applies PAM to each sample and returns its best clustering as the output. The computational complexity is of the order O(ks^2+k(n-k)), where s is the size of the subsample. The effectiveness of CLARA depends on the sample sizes and, in case the best medoids of the selected subsamples do not cover the best overall medoids, CLARA will never find the best clustering. The quality and scalability of CLARA can be enhanced by allowing for an extra randomization in the iterative search for new medoids, leading to the so-called k-medoids algorithm CLARANS (Clustering Large Applications based upon RANdomized Search) proposed by Ng and Han (1994), and improved by Ester, Kriegel and Xu (1995). 
CLARANS also enables the detection of outliers and has a computational complexity of about O(n^2). Its clustering quality depends on the sampling method used. See also section 7.4.2 in Han and Kamber (2006).  \n \nAnother way to generalize k-means is to explicitly consider other clustering criteria for an optimal partitioning of the clusters. Chapter 5 of Everitt et al. (2001) presents some alternatives to minimizing the total within-cluster sum of squares (i.e. trace W), which underlies k-means; these alternatives are less sensitive to scale changes in the observed data and can also handle clusters of different (non-spherical) shapes and sizes. \n \nk-means can also be generalized by considering it as a special case of model-based clustering, which applies a mixture of normal distributions to describe the underlying probability density of the dataset (see section 4.5). \n \nOther extensions of k-means, such as X-means (Pelleg and Moore, 1999; Ishioka, 2005), G-means and PG-means (Hamerly and Elklan, 2003; Feng and Hamerly, 2006) and PW-K-means (Tseng, 2007), focus especially on the automatic estimation of the number of clusters, where the X-means variant applies the Bayesian information criterion to choose the model dimension. See also Tseng (2007), who proposes the use of penalty terms and weighting (PW-K-means) to extend k-means for clustering with scattered objects and prior information. See Bies et al. (2009) for a recent comparison study of X-means, G-means and some other methods for estimating the number of clusters. \n \nTo identify non-convex clusters, extensions such as kernel k-means and spectral clustering have been put forward, which enable identifying clusters that are non-linearly separable in input space (see e.g. Schölkopf et al., 1999; Girolami, 2002; Camastra and Verri, 2005; Filipone et al., 2007; Chang et al., 2008). See also section 4.6.  
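To make the basic k-means iteration described earlier in this section concrete (assign every point to its nearest centroid, recompute each centroid as the mean of its cluster, repeat until the assignment no longer changes), the following is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the implementation used in the report: the names kmeans and squared_distance, the seed and max_iter arguments, and the naive initialisation by k randomly drawn data points are all choices made for this example.

```python
import random


def squared_distance(x, m):
    '''Squared Euclidean distance between point x and centre m.'''
    return sum((a - b) ** 2 for a, b in zip(x, m))


def kmeans(points, k, seed=0, max_iter=100):
    '''Return (labels, centroids, E), with E the square error criterion.'''
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # naive initialisation (an assumption)
    labels = [-1] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(x, centroids[c]))
            for x in points
        ]
        if new_labels == labels:
            break  # converged: no point changed cluster
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [x for x, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    error = sum(squared_distance(x, centroids[lab]) for x, lab in zip(points, labels))
    return labels, centroids, error


# Two well-separated groups of three points each
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids, error = kmeans(points, k=2)
print(labels, centroids, error)
```

Because the result can depend on the initial centroids, in practice one would rerun such a procedure from several starting configurations, or use one of the seeding strategies cited in this section, and keep the partition with the smallest square error.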
\n \nFinally, the sensitivity to initial conditions in k-means is a well-known problem for which many initialization strategies have been proposed (see e.g. Arthur and Vassilvitskii, 2001; Steinley and Brusco, 2007). Barbakh and Fyfe (2008) propose a new family of algorithms to solve the problem of sensitivity to initial conditions in k-means, by applying alternative performance functions which incorporate global information.  \n \n4.3 Density-based methods \n \nDensity-based clustering methods have been developed to discover clusters of arbitrary shape. These methods typically regard clusters as dense regions of objects/points in the data space that are separated by regions of low density (representing noise). DBSCAN grows clusters according to a density-based connectivity analysis. OPTICS is an extension of DBSCAN, producing a cluster ordering obtained from a wide range of parameter settings. DENCLUE clusters objects based on a set of density distribution functions. It has a solid mathematical foundation, allowing a compact mathematical description of arbitrarily shaped clusters in high-dimensional datasets. It generalizes various clustering methods, including partitioning and hierarchical methods, and achieves computationally efficient calculation by using a tree-based access structure. However the method requires careful selection of the density parameters and noise threshold, which may significantly influence the quality of the clustering results. For a concise description of these methods we refer to Han and Kamber (2006). See Tan et al. (2010) for a recent proposal for improvements of density-based clustering algorithms. \n \n4.4 Grid-based methods \n \nThis approach uses a multi-resolution grid data structure. For this purpose it quantizes the data space into a finite number of cells, forming the grid structure. 
\nThe main advantage of the approach is its fast processing time, which depends only on the number of cells in each dimension of the quantized space, and not on the number of data objects. STING, WaveCluster and CLIQUE are examples of this approach; they are described in sections 7.7 and 7.9 of Han and Kamber (2006). \n \n4.5 Model-based clustering methods \n \nModel-based clustering methods hypothesize a model for each of the clusters and find the best fit of the data to the given model. The clusters are determined by constructing a density function reflecting the spatial distribution of the data points. Often the number of clusters can also be determined automatically on the basis of statistical criteria, taking account of noise and outlier effects (see the textbox below).  \nIn fact the k-means method can be viewed as a special case of model-based clustering for a Gaussian mixture model with equal mixture weights and equal isotropic variances (see Celeux and Govaert, 1992). As noticed before, this directly offers a fruitful avenue for generalizing k-means and finding more suitable ways of clustering non-spherical clusters and large datasets. Celeux and Govaert (1995) propose a generalization of k-means which enables the clustering of non-spherical models (Biernacki et al., 2006). The MIXMOD software that they developed to analyse multivariate datasets as mixtures of Gaussian populations, for clustering and classification purposes, can be downloaded from http://www-math.univ-fcomte.fr/mixmod/index.php. Another popular package is the EMMIX software, which was developed by McLachlan et al. (2000). Related is also the R-package <<mclust>> developed by McLachlan, Fraley and Raftery (2002), Fraley and Raftery (2007). See also Samé et al. (2007) and Maugis et al. (2009), which discuss the application in variable selection; see also Li (2005), Yeung (2001). 
\nEstablishing such a probabilistic framework for clustering also suggests the use of several information criteria to automatically determine the number of clusters, like Akaike’s first information criterion, Schwarz’s Bayesian information criterion, and the integrated classification likelihood (see textbox below). See also Fraley and Raftery (1998) and the paper by Tibshirani et al. (2001) on the use of the gap statistic for estimating the number of clusters (the R-package <<clusterSim>> provides functionality to calculate this statistic). \n \nInformation criteria for k-means \n \nTo view k-means in a statistical context it is assumed that the underlying density for the points in the data space can be expressed as a mixture of K equally weighted Gaussian distributions having means μ_k and common variance σ^2: \n \n$$P(x_j \\mid M, \\sigma^2) = \\sum_{k=1}^{K} \\frac{1}{K} \\, \\frac{1}{\\sqrt{2\\pi\\sigma^2}} \\exp\\left( -\\frac{1}{2} \\frac{\\|x_j - \\mu_k\\|^2}{\\sigma^2} \\right)$$ \n \nIn fact the μ_k refer to the centres of the resulting clusters k=1, …, K, while the variance σ^2 refers to the within-cluster variance, \n \n$$\\sigma^2 = \\frac{1}{N} \\sum_{k=1}^{K} \\sum_{j \\in C_k} \\|x_j - \\mu_k\\|^2$$ \n \nwhere N is the number of data points. 
The associated likelihood of the complete dataset D={x_j} is, under the assumption of independence, equal to: \n \n$$P(D \\mid M, \\sigma^2) = \\prod_{j} P(x_j \\mid M, \\sigma^2)$$ \n \nBy assigning each data point x_j to the mixture component k_j having highest probability, the classification likelihood of the data point x_j is equal to: \n \n$$P_c(x_j \\mid M, \\sigma^2) = \\frac{1}{\\sqrt{2\\pi\\sigma^2}} \\exp\\left( -\\frac{1}{2} \\frac{\\|x_j - \\mu_{k_j}\\|^2}{\\sigma^2} \\right)$$ \n \nK-means can be viewed as an attempt to maximize the joint classification log-likelihood of the data (i.e. to minimize its negative): \n \n$$\\ln P_c(D \\mid M, \\sigma^2) = \\ln \\prod_{j} P_c(x_j \\mid M, \\sigma^2) = \\sum_{j} \\ln P_c(x_j \\mid M, \\sigma^2) = -\\sum_{j} \\left( \\frac{1}{2} \\ln(2\\pi\\sigma^2) + \\frac{\\|x_j - \\mu_{k_j}\\|^2}{2\\sigma^2} \\right)$$ \n \nIn the light of this interpretation a number of information criteria can be proposed to estimate the optimal number of clusters (see the appendix in Goutte et al. 2001): \n− Akaike’s first information criterion: \n \n$$AIC = 2 \\ln P_c(D \\mid M, \\sigma^2) - 2 (K \\cdot p + 1)$$ \n \nwhere (Kp+1) is the number of free parameters in the underlying mixture model with K components (i.e. K times the number of parameters in the mean μ_k, plus the variance σ^2) \n− Schwarz’s Bayesian information criterion: \n \n$$BIC = 2 \\ln P_c(D \\mid M, \\sigma^2) - (K \\cdot p + 1) \\ln(N)$$ \n \n− The integrated completed likelihood (Goutte et al., 2001), which extends the BIC expression 2 ln P_c(D | M, σ^2) − (K·p+1) ln(N) with further terms involving sums over the clusters k = 1, …, K, the points j = 1, …, N_k within each cluster and i = 1, …, N; for the full expression see the appendix in Goutte et al. (2001). \n \nHere p is the number of attributes and N the number of data points, where N = Σ_k N_k with N_k being the number of data points in cluster C_k. The number of clusters K_opt rendering the highest value of the information criterion is chosen in the end as the number of clusters K.  
\nThe AIC is known to overestimate the number of clusters, especially if the clusters are non-spherical, while the BIC is known to asymptotically estimate the ‘true’ model structure in case the underlying Gaussian mixture model is an adequate model. The ICL takes into account that the underlying mixture model might not be an adequate model for classifying the data points. See Goutte et al. (2001) for further details and references. \n \nFor a good recent overview paper on finite mixture models and model-based clustering methods see Melnikov and Raita (2010). We notice that other approaches can also be listed in the category of model-based approaches, like COBWEB, which is a conceptual learning algorithm taking concepts as a model for clusters and performing an associated probabilistic analysis. SOM (or self-organized feature map; see next section) is a neural-network-based algorithm that maps high-dimensional data into a 2-D or 3-D feature map, which provides a useful data visualization and can be used subsequently as a basis for clustering.  \n \n4.6 Clustering methods: Miscellanea  \n \nBelow we briefly discuss various alternative methods which have been developed for specific application situations. \n \nSOM \nThe self-organizing map (SOM) due to Kohonen (1982) is a well-known neural network method for unsupervised learning and can thus be suitably applied for cluster analysis. The network classifies the data points according to internally generated allocation rules, which it learns from the data. SOM’s goal is to represent all points in the original (often high-dimensional) data space by points in a low-dimensional one (usually 2-D or 3-D), such that the topology (distance and proximity relations) is preserved as much as possible. 
The method is particularly useful when a nonlinear mapping is inherent in the data, and it is an appropriate tool for clustering and data visualisation of high-dimensional data spaces. \nSee Murtagh and Hernandez-Pejaras (1995), Flexer (2000), Vesanto (1999), Vesanto and Alhoniemi (2000) and Bacao et al. (2005) for further information. Waller et al. (1998) compared SOM with two partitioning and three hierarchical methods for more than 2500 datasets and showed that SOM was similar to or better in performance than the other methods. Moya-Anegón et al. (2005) compared SOM to Multi Dimensional Scaling (MDS) and Ward’s method for analysing co-citations in the context of scientometrics and illustrated the complementarity of the various methods. See also Yiang and Kumar (2005) for further results on the comparison of SOM with k-means. \n \nFuzzy clustering \nAll the methods described so far have in common that an object is always fully assigned to one and only one cluster. In so-called fuzzy clustering the objects/points have a degree of belonging (‘membership’, expressed as a value between 0 and 1) to the various clusters. Points on the edge of a cluster may thus be in the cluster to a lesser degree than points in the centre of a cluster. For each point x we have a coefficient u_k(x) giving the degree to which it is in the k-th cluster. Typically these coefficients are normalized such that they sum up to 1 for each x. k-means can now be generalized into ‘fuzzy c-means’, where the centroid of the cluster is a kind of ‘mean’ of all points, weighted by their degree of belonging to the specific cluster: \n \n$$\\mathrm{center}_k = \\frac{\\sum_{x} u_k(x)^{\\nu} \\, x}{\\sum_{x} u_k(x)^{\\nu}}$$ \n \nwith ν ≥ 1 being a coefficient which is called the fuzzifier. Typically ν is taken as 2. See Hathaway and Bezdek (1988) for further details. 
See also Kaufman and Rousseeuw (1990) with their fuzzy cluster analysis program FANNY, which is available as function fanny() in the R-package <<cluster>> [13]. Mingoti and Lima (2005) present a comparative study between SOM, fuzzy c-means, k-means and traditional hierarchical clustering, showing that especially fuzzy c-means has a very good performance and renders robust results in the presence of outliers and overlapping clusters. \n \nClustering high-dimensional data \nThe curse of dimensionality plagues clustering in applications where objects that contain a large number of features or dimensions have to be classified (e.g. text documents containing thousands of keywords as features; DNA microarray data providing information on the expression levels of thousands of genes under hundreds of conditions). Many dimensions may not be relevant. Moreover, the data become increasingly sparse when the number of dimensions increases, causing the distance measure between pairs of points to become meaningless, while the average density of points in the data space is likely to be low. This requires specific clustering methodologies for high-dimensional data. CLIQUE and PROCLUS are two influential subspace clustering methods, searching for clusters in subspaces or subsets of dimensions, rather than in the entire data space. Another methodology, so-called frequent pattern-based clustering, extracts patterns to group objects into meaningful clusters. An example of this is pCluster. See section 7.9 of Han and Kamber (2006), and chapter 8 in Xu and Wunsch (2009). \n \nConstraint-based clustering \nMost clustering approaches discussed so far are implemented in an automatic, algorithmic fashion, with little user guidance or interaction involved. However in situations where there are clear application requirements (e.g. preferences and constraints), one ideally wants to use these requirements to guide the search for clusters.  
\nThis can include e.g. information on the expected number of clusters, the minimal or maximal cluster size, weights for different objects, and other desirable characteristics of the resulting clusters. For clustering tasks in high-dimensional spaces, user input on important dimensions or desired results can provide crucial hints or meaningful constraints for effective clustering. Some examples of how constraints and semi-supervised clustering tasks can be established are presented in section 7.10 of Han and Kamber (2006). \n \n[13] See http://cran.r-project.org/web/packages/cluster/ \n \nMulti-objective clustering \nWhen clustering a dataset having different properties or when analyzing it from various user perspectives, reliance on a single clustering criterion is often not appropriate. In these cases it is of more interest to consider various clustering criteria simultaneously, although they can be partially complementary and even conflicting to a certain extent. The framework of multi-objective clustering allows this perspective by framing clustering as a multi-objective optimization problem, see e.g. Handl and Knowles (2006a). They propose MOCK (Multi Objective Clustering with automatic K-determination) as a multi-objective extension of k-means, which uses an evolutionary search algorithm to obtain a set of trade-off solutions between the various (often conflicting) goals as a good approximation of the Pareto front. These solutions correspond to different compromises between the considered objectives, and provide a range of alternative hypotheses to the researcher. Moreover they may lead to additional insight into the properties of the data, and thus increase confidence in the results obtained. The algorithm is shown to give robust performance for data with different properties and outperforms traditional single-objective methods. 
Moreover it \nallows for automatic determination of the number of clusters. The runtime of the method \nis however high, and for data where the clustering criteria are more specifically known, \nspecialized methods will generally be more efficient. In Handl and Knowles (2007) \nand Handl, Kell and Knowles (2007) alternative applications of multi-objective \noptimization are presented in the context of semi-supervised learning and feature \nselection. \n \nMining sequential data (data streams, time-series)  \nSequential data consist of a sequence of sets of objects with possibly variable length \nand other changing characteristics like dynamic behaviour and time constraints. \nRecognizing patterns or groups in these dynamic datasets requires specific \napproaches, which we will not discuss. We refer to chapter 8 of Han and Kamber \n(2006) and chapter 7 in Xu and Wunsch (2009) for more information on these topics. \n \nSpatial clustering \nWhen spatial dimensions are involved in the data, e.g. for objects having a location or \nhaving features which differ as a function of location, it can be beneficial to \nexplicitly account for spatial structure when looking for clusters in the data. Methods \nfor exploratory spatial data analysis can serve as a means to identify groups in the data. \nMethods for identifying (local) spatial associations and correlations from the field \nof spatial statistics and GIS (see e.g. Jacquez, 2008), like Moran’s I or Geary’s c (cf. \nBao and Henry, 1996), Anselin’s LISA (Local Indicators of Spatial Association, cf. \nAnselin, 1995, 2005), or Getis and Ord’s statistics (Getis and Ord, 1996, Ord and \nGetis, 2001, Aldstadt and Getis, 2006) for identifying statistically significant hot spots, \ncan be a good basis for these analyses, leading to the identification of characteristic \nspatial patterns (see e.g. Premo, 2004, Nelson and Boots, 2008). For software see the \nR-package <<spdep>>14 which supports part of these analyses. 
See also the \ninformation page on spatial statistical software in R15 for further \nsoftware, e.g. the packages <<DCluster>> and <<clustTool>>16.  \n                                                     \n14 http://cran.r-project.org/web/packages/spdep/index.html \n15 http://www.spatialanalysisonline.com/output/html/R-Projectspatialstatisticssoftwarepackages.html \n16 http://cran.r-project.org/web/packages/DCluster/index.html and http://cran.r-project.org/web/packages/clustTool/index.html \n \n \nDiscovering clusters in networks \nThe analysis of networks and their structure and behaviour is presently an important \ntopic in studying complex systems in nature and society (e.g. Palla et al. 2005). \nEspecially the property of ‘community structure’, in which network nodes are \njoined together in tightly knit groups, between which there are only loose connections, \nis an important research topic, as exemplified by Girvan and Newman (2002), \nNewman (2003, 2004), Newman and Leicht (2007), Mishra et al. (2007), Handcock \net al. (2009, 2007). See also the R-package <<latentnet>>17 which has been \ndeveloped for the analysis reported in the latter reference.  \nRemark: According to Newman (2003) network clustering is not to be confused with \ndata clustering, which detects groupings of data points in high-dimensional data \nspaces. The two problems have common features, and algorithms for the one can be \nadapted for the other, and vice versa, but, on balance, one typically finds that \ntransposed algorithms work less well than algorithms which have \nbeen developed directly for the problem at hand. \n \n \nBootstrapping cluster analysis  \nBy experimentally replicating the cluster analysis, using e.g. random \nrestarts/initializations or random noise simulations, one can get clues about the \nstability (robustness) of the clustering results. 
Kerr and Churchill (2001) elaborate on \nthis technique in an ANOVA setting, allowing for a distinction between systematic \nsources of variation and noise. They illustrate the bootstrapping technique with a \npublicly available data set and draw conclusions about the reliability of clustering \nresults in light of variation in the data; implications of replication and good design in \nmicroarray experiments are discussed. See also the R-package <<maanova>>18, \nwhich builds consensus groups (for k-means methods) or consensus trees (for \nhierarchical methods) on the basis of the bootstrap. \n \nRandom Forest clustering \n‘Random Forests’ (RF) is a popular ‘ensemble-based learning’ technique, based on \nconstructing many classification trees from bootstrap samples of the data, and \nsubsequently generating a classification on the basis of the thus generated ‘forest’ of \ntrees. The procedure provides a classification with an associated estimate of the error \nrate, and moreover generates a measure of the importance of the involved (predictor) \nvariables, as well as a measure of the internal structure of the data (e.g. the proximity \nof different data points to each other). The RF-technique is user-friendly and performs \nvery well compared to many other classifiers, including discriminant analysis, support \nvector machines and neural networks, and is robust against overfitting (Breiman, \n2001).  \n \nThough initially meant for supervised learning activities like classification and \nregression, it can also be applied to unsupervised learning, like clustering. To this \nend one invokes a ‘trick’, calling the original data “class 1”, and constructing a \nsynthetic dataset, “class 2”. 
The synthetic dataset “class 2” can be constructed in two \nways: (1) the “class 2” data are sampled from the product of the marginal distributions \nof the variables (by independent bootstrap of each variable separately); \n(2) the “class 2” data are sampled uniformly from the hypercube containing the data \n(by sampling uniformly within the range of each variable).  \n \nSubsequently one tries to classify the combined data with the RF-procedure. The idea \nis that real data points that are similar to each other will often end up in the same \nterminal node of a tree, as measured by the proximity matrix returned by the RF-technique. \nThis proximity matrix can thus be taken as a similarity measure, and \nclustering or multi-dimensional scaling on the basis of this similarity can be used to \ndivide the original data points into groups for visual exploration. See \nLiaw and Wiener (2002) for a worked example of how to perform such an analysis with the \n<<randomForest>> package in R19. \n                                                     \n17 http://www.stat.washington.edu/raftery/latentnet.html \n18 http://cran.r-project.org/web/packages/maanova/index.html \n \nKernel-Based Clustering, Support Vector Clustering and Spectral clustering \nAll these approaches allow the identification of non-spherical clusters, which is typically not \nprovided for by direct k-means oriented methods. The kernel-based method \napproaches the problem by non-linearly transforming the data into a high dimensional \n‘feature space’. In this space it is more likely to obtain a linear separation of these \nclusters/patterns, applying e.g. a SVM (Support Vector Machine) which constructs an \noptimal hyper-plane on the basis of a small number of support points (the “support \nvectors”). The difficulty of the curse of dimensionality in the mapping to a high-dimensional \n‘feature space’ can be overcome by the ‘kernel trick’, i.e. 
applying an \ninner-product kernel which avoids the time-consuming process of explicitly nonlinear \nmapping the data-points to the transformed space. Commonly used kernels include \npolynomial kernels, Gaussian radial basis function kernels and sigmoid kernels (cf. \nMuller et al. 2001). Different kernel functions usually lead to different non-linear \nseparating hyper-surfaces (and thus clusters) in the original data-space. The selection \nof an appropriate kernel is still an open problem and is currently determined \nempirically. In the above way kernel versions of classical clustering algorithms can be \nconstructed. See e.g. papers on kernel k-means and support vector clustering (Ben-\nHur et al. (2001), Moguerza, Munoiz, Martin-Merino (2002) and Winters-Hilt and \nMerat (2007).  \n \nSpectral clustering is based on regarding the data as a graph with a set of vertices and \nedges (with corresponding weights). The clustering is configured as a graph cut \nproblem where an appropriate objective function has to be optimized. The problem is \nsolved by an eigenvector algorithm involving the matrix of weights, which performs \nthe spectral decomposition. It results in an optimal sub graph-partitioning (see e.g. Shi \nand Malik, 200, Ng et al. 2002, von Luxburg, 2008). Dhillon et al. (2004), Filippone \net al. (2007) show that spectral clustering and kernel-based clustering are in fact \nclosely linked; see also Kulis et al. (2009a). \nTo enable analysis of large datasets - for which a full spectral decomposition is \ncomputationally prohibitive – Fowkles et al. (2004) propose the use of the Nyström \nmethod for solving eigenfunction problems; see also Drineas and Mahoney (2005) for \nmore information on the use of this approximation in kernel-based learning. Recently \nBelabbas and Wolfe (2009a) provide two methods, one based on sampling and \nsorting, to enable the use of spectral models for very large datasets. 
\n \nR-software for performing spectral clustering is available in the R-package \n<<kernlab>>20. The high computational costs of the above methods (polynomial, \nof order O(n^3)) can be prohibitive, but recently proposals for alternative faster variants \nhave been put forward, see e.g. Yan et al. (2009), Kulis et al. (2009b), Belabbas and \nWolfe (2009a, 2009b). \n                                                     \n19 http://cran.r-project.org/web/packages/randomForest/ \n20 http://cran.r-project.org/web/packages/kernlab/ \n \nBi-clustering  \nBi-clustering (co-clustering or two-mode clustering) is a clustering method which \nattempts to simultaneously cluster both the samples and the features (i.e. rows and \ncolumns of the data-matrix), with the goal of finding “bi-clusters”, subsets of features \nthat seem to be closely related for a given subset of samples. It is for example used in \ngene expression analysis by clustering microarray-data (see e.g. Cheng and Church, \n2000, Madeira and Oliveira, 2004, and Tanay et al., 2002). The field shows a rapid \nexpansion of approaches and software tools, compare e.g. Wu and Kasif (2005), Kerr \net al. (2007, 2008), Li et al. (2009). See also the <<BicARE>> R-package21 for \nBiclustering Analysis and Results Exploration in the BioConductor-suite. \n \nConsensus clustering \nConsensus clustering, also called ‘ensemble clustering’ or ‘clustering aggregation’, \ninvolves reconciling diverse clusterings performed on the same dataset. The \nvarious clusterings come e.g. from different sources (e.g. using different clustering \nalgorithms; different selections of attributes) or from different runs of the same \nalgorithm (using other parameters; different subsamples, selections of attributes). 
\nWhen viewed as an optimization problem (“given a number of clusterings of some set \nof elements, find a clustering of those elements that is as close as possible to all the \ngiven clusterings”), it is known as “median partition”, and has been shown to be a \ncomputationally hard problem (NP-complete), see Goder and Gilkov (2008). For \nfurther information on alternative approaches to consensus clustering we refer to \nliterature, e.g. Strehl and Ghosh (2002), Monti et al. (2003), Gionis et al. (2005). \nSee also the R-software package <<clue>>22 which provides an extensible \ncomputational environment for creating and analysing cluster ensembles. \n \n4.7 Which method to choose? \n \nAgainst the background of the multitude of methods (different, as well related) for \ncluster analysis, one is confronted inevitably with the question ‘which one to choose’? \nIn a certain sense clustering can be considered both as an art and as a science, as \nreflected by discussions on a recent conference on this issue \n(http://stanford.edu/~rezab/nips2009workshop).  \nThe choice of the clustering algorithm is not an application-independent issue, but \nshould always be addressed in the context of its end-use, taking also account of the \ncharacter and type of data which is available. Typically it is considered a good idea to \ntry several algorithms on the same data to study what they will disclose. This however \nleaves one with the task to decide what methods to apply, and how to use and interpret \nthem. An important issue in using and interpreting the results from the cluster analysis \nwill be the flexibility in going back-and-forth from statistical technique to subject-\ncontent. This involves combining expertise on cluster analysis with expertise on the \nspecific subject area where the cluster analysis is applied, and typically requires a \nclose cooperation between content-expert and cluster-analysts, if the analysis is not \ndone by the content-expert. 
\n \nObviously it will depend on the available expertise (on clustering and on the specific \nsubject), software, time, money and manpower to what extent the choice of the \nclustering algorithm can be explored. Requirements which one should take into account can be \ndiverse, as exemplified e.g. by the list of issues like ‘scalability’, ‘ability to deal with \ndifferent types of attributes’, ‘discovery of clusters with arbitrary shape’, ‘ease of \nusing the cluster analysis procedure’, ‘ability to deal with noisy data’, ‘treatment of \nnewly inserted data’, ‘insensitivity to the order of the input records’, ‘high \ndimensionality’ presented in section 7.1 of Han and Kamber (2006). Moreover, \nissues related to cluster validity (see next chapter) will also be of importance. \n                                                     \n21 See http://www.bioconductor.org/packages/2.6/bioc/vignettes/BicARE/inst/doc/BicARE.pdf \n22 http://cran.r-project.org/web/packages/clue/index.html \n \nThree fundamental properties for clustering  \n(according to Handl and Knowles (2005)) \n \nHandl and Knowles (2005) distinguish three fundamental properties for clustering, which can \ngive rise to conflicting objectives, and which would argue for a multi-objective approach \ntowards clustering as exemplified e.g. in Handl and Knowles (2005).  \n \nCompactness: Generally this is implemented by keeping intra-cluster variation small. \nAlgorithms like k-means, average link-agglomerative clustering, self-organizing maps or \nmodel-based clustering fit into this category. These methods are very effective for spherical or \nwell-separated clusters, but may fail for more complicated cluster structures. \n \nConnectedness: This more local concept of clustering is based on the idea that neighboring \ndata items should share the same cluster, and methods such as density-based clustering and \nsingle-linkage agglomerative clustering are related to this property. 
Detection of arbitrarily shaped \nclusters is possible, but these methods can lack robustness in case clusters are not clearly \nseparated spatially. \n \nSpatial separation: This property on its own does not give much guidance for clustering, and \ncan easily lead to trivial solutions. Typically it is combined with other objectives, such as \ncompactness of clusters or balance of cluster sizes.  \n \n \nExamples of data sets exhibiting compactness, connectedness and spatial separation, \nrespectively. Connectedness and spatial separation are related (albeit opposite), and in \nprinciple, the cluster structure in data sets B and C can be identified by a clustering \nalgorithm based on either connectedness or on spatial separation, but not by one based \nprincipally on compactness. \n \nHandl and Knowles (see textbox) state that in clustering various objectives are involved, \nwhich can be conflicting. Therefore they argue that a multi-objective approach to clustering is \nappropriate. \n \n \nSummary \n \nThe extensive (and ever-growing) literature on clustering illustrates that there is no \nsuch thing as an optimal clustering method. We have grouped the multitude of \nmethods into a restricted number of classes, and have especially focused on two \ncommonly used classes, one which is based on hierarchically performing the \nclustering, while the other consists of constructively partitioning the dataset into a \nnumber of clusters, using the k-means method. The other classes are briefly discussed \nwith due reference to literature for further information. \n \n5 How to measure the validity of a cluster? \n \n5.1 Comparing cluster solutions \nThe comparison of cluster solutions (e.g. partitions or trees) either with each other or \nwith benchmark information is an important aspect of cluster validation. 
For example, \ntesting whether different subsamples of the same dataset or different methods applied \nto the data generate similar results is considered a relevant activity in evaluating the \ncluster-quality (‘robustness issue’). Moreover, in situations where an external \nclassification is available, one would like to check the similarity of this classification \nand the clustering results as an indication of external clustering validity.  \nBelow we briefly highlight a number of well-established techniques for comparing \ntwo partitions. See Everitt et al. (2001), section 8.4, for additional material on \ncomparing two dendrograms/trees or two proximity matrices; see also Campbell, \nLegendre and Lapointe (2009) for further information on these issues. \nWhen two classifications of a group of n objects are available, one can represent \nthem as a c1-by-c2 matrix N=[nij] where nij is the number of objects in group i of \npartition 1 (i=1, …, c1) and group j of partition 2 (j=1, …, c2). The labelling of the two \npartitions is arbitrary. When the partitions have the same number of clusters and \ntheir agreement is good, it is usually obvious from inspection how the labels \ncorrespond, and one partition can straightforwardly be relabelled to match the other. \nUsing simple percentage agreement or the kappa coefficient (see Cohen, 1960) the \npartitions can then be compared, after relabelling. \nRemark: One can think of various procedures to match the labels of two cluster partitions, say \n1 and 2.  \nA straightforward strategy consists of: \n(a) first determining the Euclidean distances between the cluster-centres for clustering 1 and \nclustering 2. 
These distances are stored in a ‘distance matrix’ with entry di,j expressing the \ndistance between the i-th cluster-centre for clustering 1 and the j-th cluster-centre for \nclustering 2; \n(b) next linking the labels for clustering 1 and 2 by repeatedly searching for the smallest \nentry in this matrix (smallest distance), matching the corresponding row and column and \neliminating them from the matrix. \nIn this way a match between the cluster-classes in clustering 1 and those in clustering 2 is \nobtained iteratively. This is however not the only procedure to perform this matching. One \ncan easily come up with alternatives when considering these steps: \n \nConcerning step (a): Matching can also be done by comparing the cluster-class counts in \nthe cross table-matrix N. The idea behind this matching is to find a match which renders \nthe largest number of counts (data points) in the corresponding matched cluster-classes.  \nNotice that the match proposed under (a) above is based on the underlying (average) \nfeatures of the data points, and aims to establish a match on the basis of these averages. \n \nConcerning the search step (b): Instead of performing the search heuristically as \nsketched above one can perform this search exhaustively (i.e. exactly) by \nconsidering all cluster-combinations involved, and finding the one which renders the sum \nof the distances minimal23. Although the number of all cluster-combinations involved is \nequal to k! (k is the number of clusters in clustering 1), this task of finding the exact \noptimum cluster combination can be performed far more efficiently (in O(k^3) steps) by \nusing the ‘Hungarian algorithm’ proposed by Kuhn (1955) and Munkres (1957). This \nalgorithm is available in the R-package <<clue>>24, via the function solve_LSAP() for \noptimal cluster matching/assignment. \n \nA simple example illustrates that the outcomes of both search methods (in step (b)) can be \ndifferent. 
E.g. let the cross-table for two cluster partitions (5 cluster-classes) be: \n \n17 24  1  8 15 \n23  5  7 14 16 \n25  6 13 20 22 \n10 12 19 21  3 \n11 18 25  2  9 \n \nThe heuristic search method and the optimal search method match the rows 1, 2, 3, 4 and 5 \nwith the columns 2, 5, 1, 4, 3 (heuristic) and 2, 1, 5, 4, 3 (exact) respectively, giving a total \ncount of 111 (= 24+16+25+21+25) and 115 (= 24+23+22+21+25) respectively, which shows the (slightly) suboptimal \nperformance of the heuristic method.  \nRemark: Cohen’s Kappa-statistic, which corrects for chance effects in comparing two cluster \npartitions, is given by (N* stands for the relabelled cross-table): \n \nP_{agree,real} = Trace(N*) / Sum(N*) \nP_{agree,chance} = [ Σ_i (Σ_k N*_{ik}) ⋅ (Σ_j N*_{ji}) ] / Sum(N*)^2 \nI_{Kappa} = (P_{agree,real} − P_{agree,chance}) / (1 − P_{agree,chance}) \n \nwhere P_{agree,real} refers to the relative observed agreement between clustering 1 and 2, and \nP_{agree,chance} refers to the hypothetical probability of agreement by chance, in case random \nclasses would have been assigned to the objects for both clustering 1 and 2. If the clusterings \nare the same, I_{Kappa} is 1; if there is no agreement other than the one happening by chance, \nI_{Kappa} ≤ 0. \nWhen the number of clusters differs between the two partitions/clusterings, one can \ntake another route towards comparing the partitions rather than analysing the cross-tabulation \nof frequencies. The starting point is to investigate the co-occurrence of the \ngroupings of every pair of the n objects in the partitions. This can be presented in a 2 x 2 \ncontingency table: \n \n                                                     \n23 For the case of matching on basis of cluster-counts, one would strive to find a match which renders a \nmaximal sum of the number of counts. 
\n24 http://cran.r-project.org/web/packages/clue/index.html \n \n                                          Partition 2: pair in   Partition 2: pair in \n                                          same group             different groups       Total \nPartition 1: pair in same group           a                      b                      a+b \nPartition 1: pair in different groups     c                      d                      c+d \nTotal                                     a+c                    b+d                    C(n,2) = n(n−1)/2 \n \nThis contingency table can be directly derived from the cross-table N with cluster-class \ncounts, using the relationships presented in table 1 and 2 of Hubert and Arabie \n(1985).  \nThe Rand and Jaccard index for expressing the correspondence of these partitions are \ndefined by (a+d)/(a+b+c+d) and a/(a+b+c) respectively. Correcting for the effects \nof chance in grouping points in clusters, adjustments of the Rand index have been put \nforward in literature, of which the adjusted Rand index of Hubert and Arabie (1985) is \njudged an especially suitable one (see also Steinley, 2004). It is defined as: \n \nadjRand = [ C(n,2) ⋅ (a+d) − ((a+b)(a+c) + (c+d)(b+d)) ] / [ C(n,2)^2 − ((a+b)(a+c) + (c+d)(b+d)) ] \n \nwhere C(n,2) = n(n−1)/2 denotes the total number of object-pairs (i.e. a+b+c+d). \nMeila (2007) recently proposed a novel criterion for comparing partitions, the \n“Variation of Information”-criterion, which accounts for the amount of information \nloss and gain when changing from clustering 1 to clustering 2. It is calculated on the basis \nof information theoretic measures which can be directly evaluated in terms of the \nentries in the cross-table-matrix N with the cluster-class counts. See Meila (2007) for \ndetails. Vinh et al. (2009) recently argued that also for information theoretic measures a \ncorrection for chance is needed, similar to the adjustment of the Rand index.  
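
The pair-counting indices defined above can be illustrated with a minimal Python sketch. It is not taken from the report: the function names (pair_counts, rand_index, jaccard_index, adjusted_rand_index) are hypothetical, the two partitions are assumed to be given as label lists of equal length, and the adjusted Rand index follows the pair-count form stated above.

```python
from itertools import combinations


def pair_counts(labels1, labels2):
    # Classify every object pair by co-membership in both partitions:
    # a = same group in both, b = same in 1 only,
    # c = same in 2 only, d = different groups in both.
    a = b = c = d = 0
    for i, j in combinations(range(len(labels1)), 2):
        same1 = labels1[i] == labels1[j]
        same2 = labels2[i] == labels2[j]
        if same1 and same2:
            a += 1
        elif same1:
            b += 1
        elif same2:
            c += 1
        else:
            d += 1
    return a, b, c, d


def rand_index(a, b, c, d):
    return (a + d) / (a + b + c + d)


def jaccard_index(a, b, c, d):
    # Undefined (division by zero) if no pair shares a group in either partition.
    return a / (a + b + c)


def adjusted_rand_index(a, b, c, d):
    # Undefined if both partitions are trivial (denominator becomes zero).
    n_pairs = a + b + c + d
    chance = (a + b) * (a + c) + (c + d) * (b + d)
    return (n_pairs * (a + d) - chance) / (n_pairs ** 2 - chance)


# Two partitions of four objects that cut the data in unrelated ways.
counts = pair_counts([0, 0, 1, 1], [0, 1, 0, 1])
print(counts, rand_index(*counts), jaccard_index(*counts),
      adjusted_rand_index(*counts))
```

Because only co-membership of pairs is counted, permuting the cluster labels of either partition leaves all three values unchanged, so no label matching is needed before calling these functions.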
\nThe above mentioned indices that can be calculated on the basis of the cross-table N of \nthe cluster-class counts are insensitive to permutations of the columns and \nrows of the cross-table. This implies that they do not depend on the cluster-label-matching \nstrategy involved in linking clustering 1 to 2. \nThe presented indices have been implemented in the CRAN-package <<mcclust>>25 \nwhere the adjusted Rand index is invoked by the function arandi() and Meila’s criterion \nby the function vi.dist(). \n                                                     \n25 http://cran.r-project.org/web/packages/mcclust/ \n \nRemark: The above indices can be used to measure the influence of individual data points on \na cluster analysis: by comparing the partitioning which results from deleting specific data \npoints from the dataset with the partitioning of the complete reference dataset one can detect \nhighly influential data points that directly impact the resulting partition. Cheng and Milligan \n(1995, 1996a,b) e.g. advocate the use of the adjusted Rand index for this purpose. See also \nsection 8.5.3 in Everitt et al. (2001). \n \n5.2 Validation measures \nValidation measures are intended to measure how well the clustering captures the \nunderlying structure in the data. An excellent account of different types of validation \nmeasures and their potential biases is given in Handl et al. (2005). This reference \nunderlines that there is no gold standard in clustering methods nor in \nvalidation measures. It will often not be sufficient to use a single clustering algorithm \nand/or a single validation measure when the real underlying structure of the data is \nunknown. Rather one should apply a number of different clustering algorithms and \nvalidation measures that optimize different aspects of a partitioning for an appropriate \nrange of cluster sizes. Also Brun et al. 
(2007) address similar points, and advise to be \ncautious with automatically applying and interpreting results from calculated validity \nindices.  \nTypically three groups of validation measures are distinguished (see Figure 6): the \nfirst type is based on calculating properties of the resulting clusters, such as \ncompactness, separation, roundness, and is called internal validation, since it does not \nrequire additional information on the data.  \nThe second approach is called relative validation and is based on comparisons of \npartitions generated by the same algorithm with different parameters (e.g. \ninitializations), or different subsets of the data. This approach in fact measures \nrobustness of the clustering results and - similar to internal validation - also doesn’t \nrequire additional information. \nAn axiomatic approach to measure cluster quality \nAckerman and Ben-David (2008) have recently initiated a systematic study of measures \nfor the quality of a given data clustering. These measures, given a data set and its \npartition into clusters, return a non-negative real number representing how ‘strong’ or \n‘conclusive’ the clustering is. They propose to use the notion of ‘cluster quality measure’ \nas a basis for developing a formal theory of clustering, which unlike Kleinberg’s \naxiomatic approach (Kleinberg, 2002) does not lead to contradictions.  \nAckerman and Ben-David have proposed quality measures for wide families of common \nclustering approaches, like center-based clustering (e.g. k-means, k-median), loss-based \nclustering (e.g.  k-means) and linkage-based clustering (e.g. hierarchical clustering), and \nanalyze their computational complexity. 
In addition, they show that using these quality \nmeasures, the clustering quality of a clustering can be computed in low polynomial time. \n \n[Figure 6 (diagram). Validation is divided into: properties of the clusters (internal), determining the quality of the algorithm to generate an interesting partition; and comparison of partitions, either between clusters generated by the algorithm (relative), determining the quality of the algorithm to generate consistent groups, or between clusters and classes (external), determining the quality of the algorithm to recognize existing groups.] \nFigure 6: Different approaches for cluster validation \nThe third approach, called external validation, is based on comparison of the \nclustering partition of the data with a known class partition of the data, thus \npresupposing that the class labels are known and uncontested. It is clear that this kind \nof validation will only be possible for a limited number of situations, e.g. for \nbenchmark data, or for situations where cluster labels are known beforehand. It will \nevidently depend on the application field whether (and which) explicit validation \ncriteria are feasible and useful: e.g. Datta and Datta (2006) propose two specific \nevaluation indices in the context of gene expression data-analysis with a content \nrelated meaning, namely the biological homogeneity index and the biological stability \nindex. \nIn appendix E a large number of internal validation indices are listed that use the \ninter-cluster and the intra-cluster distances to identify the best partition. They \nare appropriate when clusters are compact and well-separated, but fail when sub-clusters \nexist or when the clusters are arbitrarily shaped (and thus have no \nrepresentative centre points to assess the inter-cluster variance). 
Therefore alternative approaches \nare frequently put forward in literature, which are compared to the \nestablished ones on the basis of synthetic and/or real data. These comparative studies are \nnecessarily always limited to a certain extent: their scope is given by the datasets \nwhich are analysed, and one can often find other data on which the one method \nperforms better than another candidate. Jonnalagadda and Srinivasan (2009) propose \nan approach that overcomes this limitation by not using inter-cluster distances, but \ninstead focusing on information which is lost or gained when a cluster intersects with \nanother. The proposed NIFTI-index (Net InFormation Transfer Index) was compared \nwith other ones (Dunn’s, Silhouette, Davies-Bouldin and the Gap-statistic) and it \nwas shown, on synthetic datasets as well as on real-life data, that NIFTI outperforms \nthese methods in determining the appropriate number of clusters. However, the \nproposed method has the limitation that it models clusters as hyper-spheres, which \nmakes it less appropriate for clusters that do not have a spherical shape. Also Saitta et \nal. (2008) propose a new bounded index for cluster validity, the score function (SF). It \nis found to be always as good or better than four common validity indices (Dunn’s, \nSilhouette, Davies-Bouldin and the Maulik Bandyopadhyay-statistic) in the case of \nhyper-spherical clusters. It works well on multidimensional data sets and \naccommodates unique and sub-cluster cases. 
\nRelative validation indices are based on measuring the consistency of algorithms, \ncomparing the clusters obtained by the same algorithm under different conditions, or \nby different clustering algorithms. Two typical approaches are discussed \nsubsequently: \n• The use of a Figure of Merit (FOM, see Yeung, Haynor and Russo, 2001) assesses \nthe ‘predictive power’ of a clustering technique and strikes a balance between the \nexternal and internal criteria: FOM neither requires prior knowledge nor relies entirely \non information from the clustering process. It can e.g. be obtained by leaving out a \nvariable j, clustering the data (into k clusters), and then calculating the RMSE (Root \nMean Squared Error) of j relative to the cluster means: \n \nRMSE(j,k) = √[ (1/N) ⋅ Σ_{r=1..k} Σ_{x_i ∈ C_r} (x_ij − μ_{C_r}(j))^2 ] \n \nwith x_ij being the measurement of the j-th variable for the i-th observational unit; \nN the number of observational units; C_r the set of observational units in the r-th \ncluster; μ_{C_r}(j) the mean of variable j over the observational units in the r-th \ncluster. Summing these RMSE over all p variables j renders an aggregate FOM \n(AFOM): \n \nAFOM(k) = Σ_{j=1..p} RMSE(j,k) \n \nCalculating the AFOM for each k, adjusting for cluster size, and dividing by \nthe number of variables ‘left out’ renders the adjusted AFOM: \n \nAFOM_adj(k) = AFOM(k) / ( p ⋅ √[(N − k)/N] ) \n \nLow values of the clustering algorithm’s AFOM indicate a high predictive power. \nBy comparing the AFOM values at each k for different clustering algorithms their \nperformance can be compared. However, Yeung et al. (2001) comment that this \nshould only be done if the similarity metrics of the compared clustering \nalgorithms are identical. Olex et al. (2007) show limitations of the FOM when the \nunderlying similarity measure is non-Euclidean. 
For similarity measures based on \nthe Pearson correlation coefficient they propose a more suitable alternative FOM. \n• The use of a stability measure expresses how the cluster-membership assignment \nis affected by small changes in the dataset (e.g. sampling different \ndata (sub)sets; adding noise to the data) or by applying different parameter settings for \nthe cluster algorithm. It provides information on the stability/robustness of the \nprevailing clustering partition under these alternative choices. The stability measure \nis typically based on an explicit criterion for cluster comparison, like the \nadjusted Rand index or Meila’s variation of information criterion, cf. Meila \n(2007). The stability-based approach can also be used to determine the appropriate \nnumber of clusters k, by studying for which k the resulting cluster partition is \nrelatively stable/robust towards (re)sampling of the data or noise in the data. This \napproach is presently very popular and was initially advocated by Dudoit and \nFridlyand (2002), Tibshirani et al. (2002), Ben-Hur et al. (2002) and Bel Mufti and \nBertrand (2007). Notice that these resampling methods in fact assume that the \nemployed subset-samples are representative enough to reflect the inherent \nstructure in the whole dataset. In situations where some clusters are of small size, \nthis may be a problematic assumption. See also Lange et al. (2005), Hennig \n(2006), the <<fpc>> package (see footnote 26) and Volkovich et al. (2008) for related \napproaches. Kuncheva and Vetrov (2006) specifically analyse the stability of the \nk-means cluster results with respect to random initialization. 
See the next textbox \nfor critical remarks on the appropriateness of the stability approach for the \ndetermination of the number of clusters. \n \nIn the cluster analysis that we have set up for identifying patterns of vulnerability to \nglobal change we have implemented the above-mentioned stability procedure in the \nfollowing way, in order to determine an adequate number of clusters k by \nrepeatedly performing the clustering for k=2 up to a maximum value Kmax: \n \n1. Initialize k:=2; \n2. IF [k ≤ Kmax] THEN   \n{ Repeatedly (e.g. n=150 times) perform two clusterings by k-means, initializing \neach clustering with a random start-setting, and compare these two clusterings on \nthe basis of a criterion which gives a value between 0 and 1 to express their \nsimilarity (values around 1 hint at high similarity of the pair of clusterings).  \nNext take the average S(k) of this criterion value over all these n repetitions \nas a measure for the stability of this resampling procedure for the specific k.} \nELSE Go to step 4; \n3. k:=k+1; Go to step 2; \n4. Plot the average values S(k) as a function of k for k=2, …, Kmax. This is a \nso-called consistency graph, which displays the average stability/robustness of \nthe outcome of the clustering analysis for the resampling. \n \nFigure 7 gives a graphical overview of the procedure (from Dietz et al., 2011). Since \nwe used the overlap-counting method, we had to re-match the cluster labels of the two \nruns by comparing the Euclidean distances between their cluster centroids (see 5.1) in \norder to obtain comparable maps. \n \nFootnote 26: http://cran.r-project.org/web/packages/fpc/index.html \n 
\n[Figure 7 (diagram): 1. results of two cluster analysis runs (colour allocations arbitrary); reallocation by comparison of the Euclidean distance of the cluster centroids gives comparable maps. 2. identification of overlap (number of identical colours for each pixel); consistency measure = number of overlaps / number of pixels.] \n \nFigure 7: Operational sequence for calculating the consistency measure, shown as an example for k=4.  \n \nThe value of k for which this consistency measure is optimal indicates a suitable \nchoice for the number of clusters. Figure 8 shows an example from Kok et al. \n(2011). Besides the global optimum at k=3 there is an interesting relative \nmaximum for eight clusters, suggesting that this number of clusters also reflects \nthe structure of the data in case one is looking for a more differentiated partition. \n \n[Figure 8 (plot): average consistency measure (vertical axis, ticks 0.75 to 0.95) against the number of clusters (horizontal axis, ticks 2 to 12).] \n \nFigure 8: Consistency graph for determining the number of clusters. The local optimum at \nk=8 indicates that choosing e.g. 8 clusters may also yield an interesting and suitable \nclustering. The number of repetitions n was 150 in this case. \n \nAlthough the above procedure is formulated primarily for the k-means method, it can \nalso be applied to other clustering methods.  \nMoreover, our R-code offers various options for the choice of the criterion to express the \nsimilarity between the clusterings: next to using the adjusted Rand index or Meila’s \nvariation of information criterion, it is possible to explicitly calculate the fraction of \ndata points which have been clustered similarly when repeating the clustering with a \nrandom restart. In this case the average value S(k) can be viewed as the average \nfraction of data points which are clustered similarly when randomly restarting the \nclustering for this specific k. 
Typically the choice of criterion does not lead to different \nchoices of the ‘optimal’ number of clusters. \n \n5.3  Software for cluster validation \n \nThe R-package <<clValid>> provides software for cluster validation (see Brock et al., \n2008). The generic function cl_validity() (from the <<clue>> package) can be used to \nevaluate cluster validity indices for partitions and hierarchies obtained by clustering. \nSee also cluster.stats in package <<fpc>> for a variety of cluster validation statistics; \nfclustIndex in package <<e1071>> for several fuzzy cluster indexes; clustIndex in \npackage <<cclust>>; and silhouette in package <<cluster>>. The R-package \n<<clusterSim>> provides various measures to express the performance of a clustering \non a dataset, including the Tibshirani et al. (2001) gap statistic. \n \nTextbox: Criticism of the stability-based approach for choosing the number of clusters \n \nBen-David and von Luxburg (2006) have criticized the popular stability-based \nmethods on the basis of a theoretical analysis of stability issues in cluster-analysis  \nmethods that determine the clusters by globally minimizing an objective function. \nThey found that, for large datasets, the common belief (and practice) that stability \nreflects the validity or meaningfulness of the chosen number of clusters does not hold. \nFor an elegant and useful exposition of the implications of these and other related \nfindings see the publication by von Luxburg (2009). Despite the initially critical \ntheoretical findings on the stability-based approach, von Luxburg in the end draws a \n“carefully optimistic picture about model selection based on clustering stability for the \nk-means algorithm. Stability can discriminate between different values of k, and the \nvalues of k which lead to stable results have desirable properties. 
If the data set \ncontains a few well-separated clusters which can be represented by center-based \nclustering then stability has the potential to discover the correct number of clusters.” \n(von Luxburg, 2009; italics are added by us). In case of very elongated clusters or \nclusters with complicated shapes the k-means algorithm cannot find a good \nrepresentation of the dataset, regardless of the number k used, and in these situations \nstability-based model selection breaks down. Von Luxburg moreover states that these \nresults only hold true for situations where the number of clusters is relatively small \n(in the order of 10, rather than in the order of 100). For other clustering algorithms \nthat work very differently from k-means it remains an open question whether \nstability-based model selection is a suitable approach. \n \nAlternative tools for validity assessment are proposed by Bolshakova et al. (2003, \n2005a,b); these also include visualization methods for evaluating the clustering results. \n \nSummary \n \nVarious ways to evaluate clustering performance and compare different clusterings have been \npresented. A general (stability-based) approach is put forward which assesses the robustness \nof clustering results for repeated analysis of the dataset under different settings (e.g. \ninitialisations) of the cluster algorithm. It can be used for estimating the number of clusters. \n \n6 Graphical representation of the results \n \nData visualisation can greatly support the interpretation of the cluster analysis. \nVarious ways to visualize the results of the cluster analysis are possible (see also \nsection 2.3.7). In the last chapter of this guideline we do not intend to give a \ncomprehensive overview of all possibilities but to show some examples which \nproved useful to us.  \n \nFigure 9: Heatmap of the dataset shown in Gentleman et al. (2004). 
See \nhttp://www2.warwick.ac.uk/fac/sci/moac/students/peter_cock/r/heatmap/ for further explanation. \n \n6.1 Hierarchical cluster analysis \nHierarchical cluster analyses are typically illustrated by dendrograms, showing clearly \nhow the groupings are established. This information can be further enhanced by using \nheat-maps which provide a sorting/structuring of the data-matrix, permuting the \ncolumns and rows of this matrix to conform with the hierarchical clustering of \nvariables and objects (see Figure 9).  \n \nThe ‘clustergram graph’ proposed by Schonlau (2002, 2004) as an alternative to \ndendrogram-graphs (e.g. by using the R-function dendrogram()) is in fact of a similar \nnature as the branching diagram. It examines how objects are assigned to clusters as \nthe number of clusters increases. Clustergrams are useful for non-hierarchical \nclustering algorithms such as k-means, as well as for hierarchical cluster algorithms when \nthe number of objects is large enough to make dendrograms impractical.  \n \nAgrafiotis et al. (2007) propose radial clustergrams to visualize the aggregate \nproperties of hierarchical clusters, which are especially suited for visualizing large trees \nwhich cannot be displayed appropriately in straightforward dendrograms. One can \nalso consider the use of the Dendroscope software from the University of Tübingen \nfor this purpose (Huson et al. 2007, see Figure 10). \n \nFigure 10: Seven alternative views for visualizing the same tree, implemented in the \nDendroscope software (Huson et al. 2007): Rectangular Phylogram, Rectangular Cladogram, \nSlanted Cladogram, Circular Phylogram, Circular Cladogram, Radial Phylogram and Radial \nCladogram. \n \n6.2 Partitioning cluster analysis \nPartitioning cluster analyses are often visualized by projecting the data in \ntwo-dimensional space, using e.g. multidimensional scaling (MDS) or self-organizing maps \n(SOM) (see Figure 11, using Clusplot as in Pison et al. 
1999; see also Vesanto, 1999, \nEwing and Sherry, 2001). \n \n[Figure 11 (plot): CLUSPLOT( iris.x ); horizontal axis: Component 1; vertical axis: Component 2. These two components explain 95.81 % of the point variability.] \n \nFigure 11: Two-dimensional projection of the cluster points for the Iris dataset.  \n \n6.3 Cluster membership \n \nCluster membership is usually indicated by different colours and glyphs. The \ncharacteristics of the various clusters can e.g. be displayed by showing boxplots per \nvariable/feature for the various clusters (see Figure 12), or by showing a graph of the \ncluster centres (see the spectral plot in Figure 13).  \nIn the boxplot the cluster centre is indicated by the circle, while the spread around this \ncentre is indicated by the box-boundaries denoting the lower and upper quartiles (25th \nand 75th percentile) of the data; thus the box-length indicates the interquartile \nrange, IQR. The band near the middle of the box denotes the median. Typically, \nboxplots are extended by whiskers denoting the minimum and maximum data values \nwithin 1.5 IQR of the lower and upper quartile. But, since we are specifically \ninterested in high/low end percentiles, and in highlighting potential asymmetry of the \ndistribution, we have chosen to work with alternative whiskers, and indicate them by \nthe ends of the dotted lines which show the 5th and 95th percentile. So 90% of the \nobjects within a cluster are located between these two points. Notice that the boxplots for \nthe clusters in fact only display one-dimensional information, as projected on the \nindividual axes associated with the various variables/indicators. 
Information on the \nspecific spatial structure of the cluster of points in the multi-dimensional data space \n(spanned by all variables/indicators considered) does not clearly show up in the \nboxplot. \n \nFigure 12: Boxplots, showing the variation in indicator values per cluster (colours indicate \nclusters; all indicator values are between 0 and 1); see Kok et al. (2010).  \nNote: the boxes present the 25-75 percentile range of the indicator values; the circles at the \nend of the dotted lines indicate the 5- and 95-percentile, while the red circle indicates the \narithmetic mean; the band near the middle of the box indicates the median value. The \nnumber of points in the respective clusters is indicated at the top of the sub-frames. \n \nGraphs of the normalized cluster centres give information on how the average \ncharacteristics of the clusters differ (see Figure 13). They are helpful in suggesting the \n(dis)similar properties and characteristics of the various clusters.  \n \n[Figure 13 (plot): normalized cluster centres (vertical axis, ticks 0 to 1.2) for the indicators ROD, RUF, AGP, SER, POD, GDC and IMR; legend: clusters C1 to C8.] \n \nFigure 13: Cluster centres (= typical indicator values) for the 8 clusters C1 - C8; see Kok et al. \n(2010).  \n \nIf the data have a spatial dimension, maps can give a clue on how \nthe clusters are geographically distributed, serving to identify and connect features \nwith similar characteristics at different geographical locations (see Figure 14). \n \nFigure 14: Distribution of clusters within the drylands (see Kok et al., 2010). Light grey: \nnon-arid areas. Each of the 8 clusters denotes a typical constellation of the 7 indicators road \ndensity, renewable water resource, agro-potential, soil erosion, population density, GDP/cap \nand infant mortality rate, which are also displayed in the boxplots of Figure 12.  
\n \n \nFigure 15: Branching diagram, showing cluster subdivision when increasing cluster-numbers \nin k-means cluster analysis of the dataset (N=45000) consisting of the indicators for the \nforest overexploitation archetype. \n \n6.4 Branching diagrams \n \nWhen performing the cluster analysis repeatedly for a consecutive number of clusters \nit is insightful to construct a ‘branching diagram’ (see Figure 15) which displays how \nthe clustering structure changes when using another number of clusters. This diagram \nroughly indicates which clusters are split or merged, and thus renders useful \ninformation on the potential relatedness of the clusters. \n \n \nBesides the methods presented above, Leisch (2008, 2009) provides an \noverview of various visualization possibilities for centroid-based clustering methods \n(neighbourhood graphs, convex cluster hulls, bar charts of cluster medoids etc.). The \nCRAN-package <<flexclust>> contains implementations of these visualization \nmethods. See also the interactive visualization toolbox for cluster analysis in the \ncontext of gene expression data, <<gcExplorer>>, developed by Scharl and Leisch \n(2009). Additional information can be found in the literature on visualization methods for \nbioinformatics applications, like analysing gene expression microarray clusters, see \ne.g. Hibbs et al. (2005), Saraiya et al. (2005).  \n \n \nSummary \n \nA number of possibilities are given for graphically displaying different properties of clusters. It \nturned out that adequate graphical representations play a vital role in the process of \nidentifying promising further questions and next steps in a clustering-oriented research \nprocess. \n \n7 References \n \n \nAckerman, M., Ben-David, S. (2008). Measures of Clustering Quality: A Working Set \nof Axioms for Clustering. Proceedings of Neural Information Processing Systems \n(NIPS 2008). http://books.nips.cc/papers/files/nips21/NIPS2008_0383.pdf. 
\n \nAckerman, M., Ben-David, S. (2009). Which Data Sets are 'Clusterable'? – A \nTheoretical Study of Clusterability. Proceedings of the Twelfth International \nConference on Artificial Intelligence and Statistics, 2009. \nhttp://www.cs.uwaterloo.ca/~shai/publications/ability_submit.pdf. \n \nAgrafiotis, D.K., Bandyopadhyay, D., Farnam, M. (2007). Radial Clustergrams: \nVisualizing the aggregate properties of hierarchical clusters. J. Chem. Inf. \nModel., Vol. 47, 69–75. \n \nAldenderfer, M.S., Blashfield, R.K. (1976). Cluster Analysis. Sage, Beverly Hills, \nCA. \n \nAnselin, L. (1995). \"Local indicators of spatial association – LISA\". Geographical \nAnalysis, 27, 93–115. \n \nAnselin, L. (2005). \"Exploring Spatial Data with GeoDa™: A Workbook\". Spatial \nAnalysis Laboratory. p. 138. \nhttp://www.csiss.org/clearinghouse/GeoDa/geodaworkbook.pdf. \n \nAnselin, L., Kim, Y.-W., Syabri, I. (2004b). Web-based analytical tools for the \nexploration of spatial data. Journal of Geographical Systems, 6, 197–218. \n \nBao, S., Henry, M.S. (1996). \"Heterogeneity issues in local measurements of spatial \nassociation.\" Geographical Systems, 1996, Vol. 3, 1–13. \n \nBao, S., Martin, D. (1997). Integrating S-PLUS with ArcView in Spatial Data \nAnalysis: An Introduction to the S+ArcView Link, ESRI’s Users Conference, \nSan Diego, CA. \n \nBao, S. (1999). Literature Review of Spatial Statistics and Models. China Data \nCenter, http://141.211.136.209/cdc/docs/review.pdf. \n \nBao, S., Li, B. (2000). Spatial Statistics in Natural Resources, Environment and Social \nSciences (eds.), A Special Issue of the Journal of Geographic Information Science. \n \nBação, F., Lobo, V., Painho, M. (2005). Self-organizing Maps as Substitutes for  \nK-Means Clustering. V.S. Sunderam et al. (Eds.): ICCS 2005, LNCS 3516, 476–483, \n2005. \n \nBarbakh, W., Fyfe, C. (2008). Local vs global interactions in clustering algorithms: \nAdvances over K-means. 
International Journal of Knowledge-based and Intelligent \nEngineering Systems, 12, 1–17. \n \nBelabbas, M.-A., Wolfe, P.J. (2009a). Spectral methods in machine learning and new \nstrategies for very large datasets. PNAS January 13, 2009, Vol. 106 (2), 369–374. \n \nBelabbas, M.-A., Wolfe, P.J. (2009b). On landmark selection and sampling in \nhigh-dimensional data analysis, in Philosophical Transactions, Series A, of the Royal \nSociety 367 (2009), 4295–4312. \n \nBen-Hur, A., Elisseeff, A., Guyon, I. (2002). A stability based method for discovering \nstructure in clustered data. Pac Symp Biocomput. 2002, 6–17. \n \nBen-Hur, A., Horn, D., Siegelmann, H., Vapnik, V. (2001). Support vector clustering. \nJ. Mach. Learn. Res., 2, 125–137. \n \nBezdek, J.C., Hathaway, R.J. (2002). VAT: A Tool for Visual Assessment of \n(Cluster) Tendency, Proc. IJCNN 2002, IEEE Press, Piscataway, N.J., 2225–2230. \n \nBezdek, J.C., Hathaway, R.J., Huband, J.M. (2007). Visual Assessment of Clustering \nTendency for Rectangular Dissimilarity Matrices. IEEE Trans. On Fuzzy Systems, \nVol. 15 (5), 890–903.  \n \nBezdek, J.C., Pal, N.R. (1998). Some new indexes of cluster validity. IEEE Trans Syst \nMan Cybern B Cybern 1998, 28 (3), 301–315. \n \nBies, B., Dabbs, K., Zou, H. (2009). On Determining The Number Of Clusters – \nA Comparative Study. Paper during 2009 IMA Interdisciplinary Research Experience \nfor Undergraduates, June 28 to July 31. \nhttp://www.ima.umn.edu/~iwen/REU/paper4.pdf. \n \nBlum, A.L., Langley, P. (1997). Selection of relevant features and examples in \nmachine learning. Artificial Intelligence, Vol. 97, 245–271. \n \nBolshakova, N., Azuaje, F. (2003). Cluster validation techniques for genome \nexpression data. Signal Processing 2003, 83, 825–833. \n \nBolshakova, N., Azuaje, F., Cunningham, P. (2005a). A knowledge-driven approach \nto cluster validity assessment. Bioinformatics 2005, 21, 2546–2547. \n \nBolshakova, N., Azuaje, F., Cunningham, P. (2005b). 
An integrated tool for \nmicroarray data clustering and cluster validity assessment. Bioinformatics.  \nVol. 21 (4), 451–455. \n \nBreiman, L. (2001). Random forests. Machine Learning, 45 (1), 5–32. \n \nBrock, G., Pihur, V., Datta, S., Datta, S. (2008). clValid, an R package for cluster \nvalidation. http://louisville.edu/~g0broc01/. \n \nBrys, G. (2006). Finding groups in a diagnostic plot. In: COMPSTAT 2006, \nProceedings in Computational Statistics. \n \nCai, W., Chen, S., Zhang, D. (2009). A simultaneous learning framework for \nclustering and classification. Pattern Recognition, Vol. 42 (7), 1248–1259.  \n \nCamastra, F. (2003). Data Dimensionality Estimation Methods: A Survey. Pattern \nRecognition, Vol. 36 (12), 2945–2954, Elsevier Science, Amsterdam, (2003). \n \nCamastra, F., Verri, A. (2005). A novel kernel method for clustering. IEEE \nTransaction on PAMI, Vol. 27, 801–805. \n \nCampbell, V., Legendre, P., Lapointe, F.-J. (2009). Assessing Congruence Among \nUltrametric Distance Matrices. Journal of Classification, Vol. 26, 103–117.  \n \nCeleux, G., Govaert, G. (1995). Gaussian parsimonious clustering models, Pattern \nRecognition, 28, 781–793. \n \nChang, W.-C. (1983). On Using Principal Components Before Separating a Mixture \nof two Multivariate Normal Distributions. Applied Statistics, 32, 267–275. \n \nCheng, R., Milligan, G.W. (1995). Mapping Influence Regions in Hierarchical \nClustering. Multivariate Behavioral Research, Vol. 30. \n \nCheng, R., Milligan, G.W. (1996a). Measuring the influence of individual data points \nin a cluster analysis. Journal of Classification. Vol. 13 (2), 1432–1343.  \n \nCheng, R., Milligan, G.W. (1996b). K-means clustering with influence detection. \nEducational and Psychological Measurement, Vol. 56, 833–838. \n \nCheng, Y., Church, G.M. (2000). Biclustering of expression data. Proceedings of the \n8th International Conference on Intelligent Systems for Molecular Biology, 93–103. 
\n \nCristianini, N., Shawe-Taylor, J., Kandola, J. (2002). Spectral kernel methods for \nclustering. In NIPS 14, 2002. \n \nDamian, D., Orešič, M., Verheij, E., Meulman, J., Friedman, J., Adourian, A., \nMorel, N., Smilde, A., van der Greef, J. (2007). Applications of a new subspace \nclustering algorithm (COSA) in medical systems biology, Metabolomics 3.  \n \nDe Soete, G. (1986). Optimal variable weighting for ultrametric and additive tree \nclustering. Quality&Quantity, 20, 169–180.  \n \nDe Soete, G. (1988). OVWTRE: A program for optimal variable weighting for \nultrametric and additive tree fitting. Journal of Classification, 5, 101–104.  \n \nDixon, J.K. (1979). Pattern recognition with partly missing data. IEEE Transactions \non Systems, Man and Cybernetics SMC 9, 617–621. \n \nDonoho, D., Jin, J. (2008). Higher criticism thresholding: Optimal feature selection \nwhen useful features are rare and weak. Proceedings of the National Academy of \nSciences, Vol. 105, 14790–14795. \n \nDonoho, D., Jin, J. (2009). Feature selection by higher criticism thresholding achieves \nthe optimal phase diagram. Phil Trans R Soc A, 367, 4449–4470. \n \nDrineas, P., Mahoney, M.W. (2005). On the Nyström Method for Approximating a \nGram Matrix for Improved Kernel-Based Learning. Journal of Machine Learning \nResearch, Vol. 6, 2153–2175.  \n \nDudoit, S., Fridlyand, J. (2002a). A prediction-based resampling method for estimating \nthe number of clusters in a dataset. Genome Biology, Vol. 3 (7). \n \nDudoit, S., Fridlyand, J., Speed, T.P. (2002b). Comparison of discriminant methods \nfor the classification of tumors using gene expression data. J. Am. Stat. Assoc., 97, \n77–87.  \n \nEveritt, B.S., Dunn, G. (2001). Applied Multivariate Data Analysis. (Second Edition). \nHodder Education.  \n \nEveritt, B.S., Landau, S., Leese, M. (2001). Cluster Analysis. Fourth edition. Arnold \nPublishers. \n \nEwing, R.M., Sherry, J.M. (2001). 
Visualization of expression clusters using \nSammon’s non-linear mapping. Bioinformatics, Vol. 17, 658–659. \n \nFilippone, M., Camastra, F., Masulli, F., Rovetta, S. (2007). A survey of kernel and \nspectral methods for clustering. Pattern Recognition. Vol. 41 (1), 176–190.  \n \nFlexer, A. (2001). On the use of self-organizing maps for clustering and visualization. \nIntelligent Data Analysis, 5, 373–384. \n \nFodor, I.K. (2002). A survey of dimension reduction techniques. (pdf file)  \nUS-department of Energy. https://e-reports-ext.llnl.gov/pdf/240921.pdf. \n \nFowlkes, E.B., Gnanadesikan, R., Kettenring, J.R. (1988). Variable selection in \nclustering. Journal of Classification, 5, 205–228. \n \nFraiman, R., Justel, A., Svarc, M. (2008). Selection of Variables for Cluster Analysis \nand Classification Rules. Journal of the American Statistical Association September \n2008, Vol. 103 (483), 1294–1303. \n  \nFraley, C., Raftery, A.E. (1998). How many clusters? Which clustering method? - \nAnswers via Model-Based Cluster Analysis. Computer Journal, 41, 578–588. \n \nFraley, C., Raftery, A.E. (1999). MCLUST: Software for model-based clustering. \nJournal of Classification, 16, 297–306. \n \nFraley, C., Raftery, A.E. (2002). Model-Based Clustering, Discriminant Analysis, and \nDensity Estimation. Journal of the American Statistical Association, 97, 611–612. \n \nFraley, C., Raftery, A.E. (2003). Enhanced model-based clustering, density estimation \nand discriminant analysis software: MCLUST. Journal of Classification, 20, 263–296. \n \nFriedman, J.H., Meulman, J.J. (2004). Clustering objects on subsets of attributes (with \ndiscussion). Journal of the Royal Statistical Society Series B (Statistical \nMethodology), 66 (4), 815–849. \n \nGat-Viks, I., Sharan, R., Shamir, R. (2003). Scoring clustering solutions by their \nbiological relevance. Bioinformatics, Vol. 19 (18), 2381–2389. \n \nGeary, R. (1954). The contiguity ratio and statistical mapping. 
Incorporated \nStatistician, Vol. 5, 115–145.  \n \nGentleman, R.C., et al. (2004). Bioconductor: open software development for \ncomputational biology and bioinformatics, Genome Biology, 2004, 5:R80. \n \nGetis, A., Ord, J.K. (1992). The analysis of spatial association by use of distance \nstatistics. Geographical Analysis, Vol. 24 (3), 189–206. \n \nGetis, A., Ord, J.K. (1996). Local spatial statistics: an overview. In: P. Longley and \nM. Batty (eds.) “Spatial analysis: modeling in a GIS environment” (Cambridge: \nGeoinformation International), 261–277. \n \nGionis, A., Mannila, H., Tsaparas, P. (2005). Clustering Aggregation. 21st \nInternational Conference on Data Engineering (ICDE 2005). \n \nGirolami, M. (2002). Mercer Kernel-Based Clustering in Feature Space. IEEE \nTransactions on Neural Networks, 13 (3), 780–784. \n \nGirvan, M., Newman, M.E.J. (2002). Community structure in social and biological \nnetworks. PNAS June 11, 2002, Vol. 99 (12), 7821–7826. \n \nGnanadesikan, R., Kettenring, J.R., Tsao, S.L. (1995). Weighting and Selection of \nVariables for Cluster Analysis. Journal of Classification, 12, 113–136. \n \nGnanadesikan, R., Kettenring, J.R., Maloor, S. (2007). Better alternatives to current \nmethods of scaling and weighting data for cluster analysis. Journal of Statistical \nplanning and Inference, Vol. 137, 3483–3496. \n \nGoder, A., Filkov, V. (2008). Consensus Clustering Algorithms: Comparison and \nRefinement. Proceedings of the Ninth Workshop on Algorithm Engineering and \nExperiments (ALENEX) — San Francisco, January 19, 2008. Society for Industrial \nand Applied Mathematics. \n \nGordon, A.D. (1999). Classification. (2nd edition). Chapman & Hall/CRC, Boca \nRaton. Fl. \n \nGuyon, I., Elisseeff, A. (2003). An introduction to variable and feature selection. \nJournal of Machine Learning Research, Vol. 3, 1157–1182. \n \nGünter, S, Bunke, H. (2003). Validation indices for graph clustering. Pattern \nRecognition Letters, Vol. 
24, 1107–1113. \n \nGuyon, I., Gunn, S., Nikravesh, M., Zadeh, L. (eds.) (2006). Feature Extraction, \nFoundations and Applications. Series Studies in Fuzziness and Soft Computing, \nPhysica-Verlag, Springer, 2006. \n \nHalkidi, M., Batistakis, Y., Vazirgiannis, M. (2001). On clustering validation \ntechniques. Journal of Intelligent Information Systems, 17, 107–145. \n \nHan, J., Kamber, M. (2006). Data mining: concepts and techniques. Second Edition. \nMorgan Kaufmann, 2006. \n \nHandcock, M.S., Raftery, A.E., Tantrum, J. (2007). Model-based clustering for social \nnetworks (with Discussion). Journal of the Royal Statistical Society, Series A, 170, \n301–354. \n \nHandcock, M.S., Hunter, D.R., Butts, C.T., Goodreau, S.M., Morris, M. (2009). \nstatnet: Software Tools for the Representation, Visualization, Analysis and Simulation \nof Network Data. J Stat Softw. 2008, 24 (1), 1548–7660. \n \nHandl, J., Knowles, J. (2004). Multiobjective clustering with automatic determination \nof the number of clusters. Technical Report TR-COMPSYSBIO-2004-02. UMIST, \nManchester, UK.  \n \nHandl, J., Knowles, J. (2005a). Multiobjective clustering around medoids. \nProceedings of the Congress on Evolutionary Computation (CEC 2005). Vol. 1,  \n632–639. Copyright IEEE Press.  \n \nHandl, J., Knowles, J. (2005b). Improving the scalability of multiobjective clustering. \nProceedings of the Congress on Evolutionary Computation (CEC 2005). Vol. 3, \n2372–2379. Copyright IEEE Press.  \n \nHandl, J., Knowles, J. (2005c). Exploiting the trade-off – the benefits of multiple \nobjectives in data clustering. Proceedings of the Third International Conference on \nEvolutionary Multi-Criterion Optimization (EMO 2005), 547–560, LNCS 3410.  \n \nHandl, J., Knowles, J. (2006a). Multiobjective clustering and cluster validation. In \nMultiobjective machine learning edited by Yaochu Jin. Springer Series on \nComputational Intelligence 16, 21–47. \n \nHandl, J., Knowles, J. (2006b). 
Feature subset selection in unsupervised learning via \nmultiobjective optimization. International Journal of Computational Intelligence \nResearch, 2 (3), 217–238. \n \nHandl, J., Knowles, J. (2007). An evolutionary approach to multiobjective clustering. \nIEEE Transactions on Evolutionary Computation, 11 (1), 56–76. \n \nHandl, J., Knowles, J., Kell, D.B. (2005). Computational cluster validation in \npost-genomic data analysis. Bioinformatics, Vol. 21 (15), 3201–3212. \n \nHandl, J., Knowles, J., Kell, D.B. (2007). Multiobjective optimization in \nbioinformatics and computational biology. IEEE/ACM Transactions on \nComputational Biology, Vol. 4, 279–292. \n \nHastie, T., Tibshirani, R., Friedman, J. (2001). Elements of Statistical Learning: Data \nMining, Inference and Prediction. Springer-Verlag, New York. \n \nHinton, G.E., Salakhutdinov, R.R. (2006). Reducing the dimensionality of data with \nneural networks. Science, 313, 504–507.  \n \nHothorn, T., Hornik, K., Zeileis, A. (1996). party: A Laboratory for Recursive \nPart(y)itioning. [http://CRAN.R-project.org/package=party]. R package version \n0.9-96.  \n \nHothorn, T., Hornik, K., Zeileis, A. (2006). Unbiased Recursive Partitioning: A \nConditional Inference Framework. Journal of Computational and Graphical Statistics \n2006, 15 (3), 651–674. \n \nHubert, L., Arabie, P. (1985). Comparing Partitions. Journal of Classification, Vol. 2, \n193–218. \n \nHu, Y., Hathaway, R.J. (2008). An Algorithm for Clustering Tendency Assessment. \nWSEAS TRANSACTIONS on MATHEMATICS, Vol. 7 (7), 441–450, 2008. \n \nHuang, J.Z., Ng, M.K., Rong, H., Li, Z. (2005). Automated Variable weighting in \nk-Means type clustering. IEEE T-on Pattern Analysis and Machine Intelligence,  \nVol. 27 (5), May 2005, 657–668. \n \nHubert, L., Schultz, J. (1976). Quadratic assignment as a general data-analysis \nstrategy. British Journal of Mathematical and Statistical Psychology, 29, 190–241. 
\nhttp://machaon.karanagai.com/validation_algorithms.html. \n \nHubert, M., Vandervieren, E. (2008). An adjusted boxplot for skewed distributions, \nComputational Statistics and Data Analysis, 52, 5186–5201. \n \nHubert, M., Van der Veeken, S. (2008). Outlier detection for skewed data, Journal of \nChemometrics, 22, 235–246. \n \nHurley, C.B. (2004). Clustering Visualizations of Multidimensional Data, Journal of \nComputational & Graphical Statistics, Vol. 13 (4), 788–806. \n \nHuson, D.H., Richter, D.C., Rausch, C., Dezulian, T., Franz, M., Rupp, R. (2007). \nDendroscope: An interactive viewer for large phylogenetic trees. BMC \nBioinformatics, 8, 460. \n \nIrigoien, I., Arenas, C. (2008). INCA: New statistic for estimating the number. Statist. \nMed., 27, 2948–2973. \n \nJain, A.K., Dubes, R.C. (1988). Algorithms for Clustering Data. Prentice Hall, \nEnglewood Cliffs, 1988. \n \nJacquez, G.M., Jacquez, J.A. (1999). Disease clustering for uncertain locations. In: \nDisease mapping and risk assessment for public health. A.B. Lawson, A. Biggeri, D. \nBohning, E. Lesaffre, J.-F. Viel, and R. Bertollini, eds. New York: John Wiley & \nSons. \n \nJacquez, G.M. (2008). Spatial Cluster Analysis. Chapter 22 In “The Handbook of \nGeographic Information Science”, S. Fotheringham and J. Wilson (Eds.). Blackwell \nPublishing, 395–416. \n \nJohn, G.H., Kohavi, R., Pfleger, K. (1994). Irrelevant features and the subset selection \nproblem. Volume 129. New Brunswick, NJ, USA, Morgan Kaufmann; 1994. \n \nJolliffe, I.T. (2002). Principal Component Analysis. 2nd-edition. New York: Springer-\nVerlag. \n \nJonnalagadda, S., Srinivasan, R. (2009). NIFTI: An Evolutionary Approach for \nFinding Number of Clusters in Microarray Data. BMC Bioinformatics, Vol. 10, p 40. \n \nKannan, R., Vempala, S., Vetta, A. (2004). On clusterings: Good, bad and spectral. 
\nJournal of the ACM, 51 (3), 497–515. \n \nKaufman, L., Rousseeuw, P.J. (1990). Finding Groups in Data. John Wiley and Sons. \nNew York. \n \nKemp, C., Tenenbaum, J.B. (2008). The discovery of structural form. Proc. Natl. \nAcad. Sci. USA 2008, 105, 10687–10692. \n \nKerr, M.K., Churchill, G.A. (2001). Bootstrapping cluster analysis: Assessing the \nreliability of conclusions from microarray experiments. PNAS, Vol. 98 (16),  \n8961–8965. \n \nKerr, G., Ruskin, H.J., Crane, M. (2007). Pattern Discovery in Gene Expression Data. \nIntelligent Data Analysis: Developing New Methodologies Through Pattern \nDiscovery and Recovery. \n \nKerr, G., Ruskin, H.J., Crane, M., Doolan, P. (2008). Techniques for Clustering Gene \nExpression Data. Computers In Biology And Medicine, 38 (3), 283–293. \n \nKettenring, J.R. (2006). The practice of cluster analysis, J. Classif., 23, 3–30. \n \nKleinberg, J. (2002). An Impossibility Theorem for Clustering. Advances in Neural \nInformation Processing Systems (NIPS) 15, 2002. \n \nKohavi, R., John, G.H. (1998). Wrappers for Feature Subset Selection. Artificial \nIntelligence, Vol. 97 (1–2), 273–324. \n \nKohavi, R., John, G.H. (1998). The Wrapper Approach. In “Feature Extraction, \nConstruction and Selection: a data mining perspective”, eds. Liu, H., Motoda, H. \n \nKohonen, T. (1982). Self-organized formation of topologically correct feature maps. \nBiological Cybernetics, 43, 59–69. \n \nKok, M.T.J., Lüdeke, M.K.B., Sterzel, T., Lucas, P.L., Walther, C., Janssen, P., de \nSoysa, I. (2010). Quantitative analysis of patterns of vulnerability to global \nenvironmental change. Den Haag: Netherlands Environmental Assessment Agency \n(PBL) 90 p. \n \nKrzanowski, W.J., Hand, D.J. (2009). A simple method for screening variables before \nclustering microarray data. Computational Statistics and Data Analysis, Vol. 43, \n2747–2753.  \n \nKulis, B., Basu, S., Dhillon, I.S., Mooney, R.J. (2009a). 
Semi-Supervised Graph \nClustering: A Kernel Approach. Machine Learning, Vol. 74 (1), 1–22, January 2009. \n \nKulis, B., Sustik, M.A., Dhillon, I.S. (2009b). Low-Rank Kernel Learning with \nBregman Matrix Divergences. Journal of Machine Learning Research, Vol. 10,  \n341–376. \n \nLaw, M.H.C., Jain, A.K. (2006). Incremental Nonlinear Dimensionality Reduction By \nManifold Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 28, \n377–391. \n \nLeisch, F. (2006). A toolbox for k-centroids cluster analysis. Comput. Stat. Data \nAnal., 51 (2), 526–544. \n \nLeisch, F. (2008). Visualizing cluster analysis and finite mixture models. In: Chen, C., \nHärdle, W., Unwin, A. (eds.) Handbook of Data Visualization. Springer Handbooks \nof Computational Statistics. Springer, Berlin (2008). ISBN 978-3-540-33036-3. \n \nLeisch, F. (2009). Neighborhood graphs, stripes and shadow plots for cluster \nvisualization. Statistics and Computing, 2009, to appear. \n \nLerner, B., Guterman, H., Aladjem, M., Dinstein, I. (2000). On the Initialisation of \nSammon’s Nonlinear Mapping. Pattern Analysis & Applications, Vol. 3, 61–68. \n \nLi, G., Ma, Q., Tang, H., Paterson, A. H., Xu, Y. (2009). QUBIC: a qualitative \nbiclustering algorithm for analyses of gene expression data. Nucleic Acids Res., \nAugust 1, 2009; 37(15): e101. \n \nLiaw, A., Wiener, M. (2002). Classification and Regression by randomForest.  \nR News, 2(3), 18–22. URL http://CRAN.R-project.org/doc/Rnews/. \n \nLittle, R.J.A., Rubin, D.B. (1987). Statistical analysis with missing data. John \nWiley and Sons. \n \nLiu, H., Yu, L. (2005). Towards integrating feature selection algorithms for \nclassification and clustering. IEEE Transactions on Knowledge and Data Engineering, \n17 (3), 1–12. \n \nLuxburg, U. von. (2007). A tutorial on spectral clustering. Statistics and Computing, \n17 (4), 395–416. \n \nLuxburg, U. von. (2010). Clustering stability: an overview. 
Foundations and Trends in \nMachine Learning, Vol. 2 (3), 235–274,  \nURL (30-08-2010): http://arxiv.org/abs/1007.1075. \n \nLuxburg, U. von., Belkin, M., Bousquet, O. (2008). Consistency of Spectral \nClustering. Annals of Statistics 36 (2), 555–586. \n \nMacCuish, J., Nicolaou, C., MacCuish, N.E. (2001). Ties in proximity and clustering \ncompounds. J. Chem. Inf. Comput. Sci., 41, 134–146. \n \nMadeira, S.C., Oliveira, A.L. (2004). Biclustering Algorithms for Biological Data \nAnalysis: A Survey. IEEE Transactions on Computational Biology and \nBioinformatics, 1 (1), 24–45. \n \nMahoney, M.W., Drineas, P. (2009). CUR matrix decompositions for improved data \nanalysis. PNAS January 20, 2009, Vol. 106 (3), 697–702. \n \nMakarenkov, V., Legendre, P. (2001). Optimal Variable Weighting for Ultrametric \nand Additive Trees and K-means Partitioning: Methods and Software. Journal of \nClassification, 18, 245–271.  \n \nMaruca, S.L., Jacquez, G.M. (2002). Area-based tests for association \nbetween spatial patterns. Journal of Geographic Systems, 4 (1), 69–83. \n \nMcCullagh, M. J. (2006). Detecting Hotspots in Time and Space. ISG06. \n  \nMcLachlan, G., Peel, D., Basford, K.E., Adams, P. (2000). The EMMIX software for \nthe fitting of mixtures of normal and t-components. Journal of Statistical Software, 4, \n1–14. \n \nMcQuitty, L.L. (1966). Similarity Analysis by Reciprocal Pairs for Discrete and \nContinuous Data. Educational and Psychological Measurement, 26, 825–831. \n \nMeila, M. (2007). Comparing clusterings – an information based distance. Journal of \nMultivariate Analysis, 98, 873–895. \n \nMelnykov, V., Maitra, R. (2010). Finite mixture models and model-based clustering. \nStatistics Surveys, 2010, Vol. 4, 80–116. \n \nMilligan, G.W. (1980). An examination of the effect of six types of error perturbation \non fifteen clustering algorithms. Psychometrika, 45, 325–342. \n \nMilligan, G.W., Cooper, M.C. (1985). 
An examination of procedures for determining \nthe number of clusters in a data set. Psychometrika, 50, 159–179. \n \nMilligan, G.W., Cooper, M.C. (1987). Methodology Review: Clustering Methods. \nApplied Psychological Measurement, Vol. 11 (4), 329–354.  \n \nMilligan, G.W., Mahajan, V. (1980). A note on procedures for testing the quality of a \nclustering of a set of objects. Decision Sciences, 11, 669–677.  \n \nMilligan, G.W. (1989). A validation study of a variable weighting algorithm for \ncluster analysis. Journal of Classification, 6 (1), 53–71. \n \nMilligan, G.W. (1996). Clustering validation: results and implications for applied \nanalyses. In P. Arabie, L. J. Hubert, and G. D. Soete, editors, Clustering and \nClassification, pages 341–375. World Scientific Publishing, River Edge, NJ, 1996. \n \nMingoti, S.A., Lima, J.O. (2006). Comparing SOM neural network with Fuzzy  \nc-means, K-means and traditional hierarchical clustering algorithms. European \nJournal of Operational Research, 174, 1742–1759. \n \nMirkin, B. (2005). Cluster Analysis for Data Mining: A Data Recovery Approach. \nCRC Press. \n \nMishra, N., Schreiber, R., Stanton, I., Tarjan, R.E. (2007). Clustering Social Networks. \nIn: A. Bonato and F.R.K. Chung (Eds.): WAW 2007, LNCS 4863, pp. 56–67, 2007. \n \nMoguerza, J.M., Muñoz, A., Martin-Merino, M. (2002). Detecting the number of \nclusters using a support vector machine approach. Proc. International Conference on \nArtificial Neural Networks. Lecture Notes in Comput. Sci. 2415, 763–768. Springer, \nBerlin. \n \nMonti, S., Tamayo, P., Mesirov, J., Golub, T. (2003). Consensus clustering: A \nresampling-based method for class discovery and visualization of gene expression \nmicroarray data. Mach. Learn., 52, 91–118. \n \nMoran, P.A.P. (1948). The interpretation of statistical maps. Journal of the Royal \nStatistical Society, Series B., Vol. 10, 243–251. \n \nMorgan, B.J.T., Ray, A.P.G. (1995). Non-uniqueness and inversions in cluster \nanalysis. 
Applied Statistics, 44, 117–34. \n \nMoya-Anegón, F., Herrero-Solana, V., Jiménez-Contreras, E. (2006). A connectionist \nand multivariate approach to science maps: the SOM, clustering and MDS applied to \nlibrary and information science research. Journal of Information Science, 32 (1) 2006, \n63–77. \n \nMurtagh, F., Hernández-Pajares, M. (1995). The Kohonen self-organizing map \nmethod: An assessment. Journal of Classification. Vol. 12 (2), 165–190. \n \nNardo, M., Saisana, M., Saltelli, A., Tarantola, S., Hoffman A., Giovannini, E., \n(2005). Handbook on Constructing Composite Indicators: Methodology and User \nGuide. OECD Statistics Working Papers.  \n \nNelson, T.A., Boots, B. (2008). Detecting spatial hot spots in landscape ecology. \nEcography, Vol. 31 (5), 556–566. \n \nNewman, M.E.J. (2003). The structure and function of complex networks. SIAM \nReview, 2003. \n \nNewman, M.E.J. (2004). Fast algorithm for detecting community structure in \nnetworks. Phys. Rev. E 69, 066133. \n \nNewman, M.E.J., Leicht, E.A. (2007). Mixture models and exploratory analysis in \nnetworks. PNAS, Vol. 104 (23), 9564–9569. \n \nNg, A.Y., Jordan, M., Weiss, Y. (2002). On spectral clustering: Analysis and an \nalgorithm. In Advances in Neural Information Processing Systems (NIPS), 2002. \n \nBorg, I., Groenen, P. (1997). Modern multidimensional scaling theory and \napplications. New York: Springer Verlag. \n \nOlex, A.L., John, D.J., et al. (2007). Additional limitations of the clustering validation \nmethod figure of merit. 45th ACM Southeast Annual Conference, Winston-Salem, \nNC. \n \nOrd, J.K., Getis, A. (2001). Testing for local spatial autocorrelation in the presence of \nglobal autocorrelation. Journal of Regional Science, Vol. 41 (3), 411–432. \n \nPalla, G., Derényi, I., Farkas, I., Vicsek, T. (2005). Uncovering the overlapping \ncommunity structure of complex networks in nature and society. Nature, 435,  \n814–818.  \n \nPipino, L.L., Funk, J.D., Wang, R.Y. 
(2006). Journey to Data Quality. MIT Press Ltd, \n2006. \n \nPison, G., Struyf, A., Rousseeuw, P.J. (1999). Displaying a clustering with \nCLUSPLOT. Comput. Stat. Data Anal., 30, 381–392. \nftp://ftp.win.ua.ac.be/pub/preprints/99/Disclu99.pdf. \n \nPremo, L.S. (2004). Local spatial autocorrelation statistics quantify multi-scale \npatterns in distributional data: an example from the Maya Lowlands. Journal of \nArchaeological Science, Vol. 31, 855–866. \n \nRaftery, A.E., Dean, N. (2006). Variable Selection for Model-Based Clustering. \nJournal of the American Statistical Association, Vol. 101 (473), 168–178. \n \nRahm, E., Do, H.H. (2000). Data cleaning: Problems and current approaches. IEEE \nData Engineering Bulletin, 23 (4), 3–13. \n \nRoth, V., Lange, T., Braun, M., Buhmann, J. (2002). A Resampling Approach to \nCluster Validation. \nhttp://informatik.unibas.ch/personen/roth_volker/PUB/compstat02.pdf \n \nRousseeuw, P.J. (1987). Silhouettes: a graphical aid to the interpretation and \nvalidation of cluster analysis. J. Comput. Appl. Math, 20, 53–65. \n \nRousseeuw, P.J., Ruts, I., Tukey, J.W. (1999). The bagplot: A bivariate boxplot. Am. \nStat., 53 (4), 382–387. \n \nRousseeuw, P.J., Debruyne, M., Engelen, S., Hubert, M. (2006). Robustness and \noutlier detection in chemometrics, Critical Reviews in Analytical Chemistry, 36,  \n221–242. \n \nRunkler, T.A. (2000). Information Mining - Methoden, Algorithmen und \nAnwendungen intelligenter Datenanalyse. Vieweg, Wiesbaden, 2000.  \n \nSaeys, Y., Inza, I., Larrañaga, P. (2007). A review of feature selection techniques in \nbioinformatics. Bioinformatics. Vol. 23 (19), 2507–17. \n \nSaitta, S., Raphael, B., Smith, I.F.C. (2007). A Bounded Index for Cluster Validity. \nIn: P. Perner (Ed.), Machine Learning and Data Mining in Pattern Recognition, LNAI \n4571, Springer Verlag, Heidelberg, pp. 174–187, 2007. \n \nSaitta, S., Raphael, B., Smith, I.F.C. (2008). A Comprehensive Validity Index for \nClustering. 
Accepted for publication in the Journal of Intelligent Data Analysis, 2008. \n \nSalvador, S., Chan, P. (2004). Determining the Number of Clusters/Segments in \nHierarchical Clustering/Segmentation Algorithms. Proc. 16th IEEE Intl. Conf. on \nTools with AI, 576–584, 2004. \n \nSaraiya, P., North, C., Duca, K. (2005). An insight-based methodology for evaluating \nbioinformatics visualizations. IEEE Trans. on Visualization and Computer Graphics, \nVol. 11, 443–456. \n \nScharl, T., Leisch, F. (2009). gcExplorer: Interactive Exploration of Gene Clusters. \nBioinformatics, Vol. 25 (8), 1089–1090. \n \nSchölkopf, B., Smola, A., Müller, K.-R. (1999). Kernel Principal Component \nAnalysis. In: Bernhard Schölkopf, Christopher J. C. Burges, Alexander J. Smola \n(Eds.), Advances in Kernel Methods-Support Vector Learning, 1999, MIT Press \nCambridge, MA, USA, 327–352. ISBN 0-262-19416-3.  \n \nScholz, M., Kaplan, F., Guy, C.L., Kopka, J., Selbig, J. (2005). Non-linear PCA:  \na missing data approach. Bioinformatics, 21, 3887–3895.  \n \nSchonlau, M. (2002). The clustergram: a graph for visualizing hierarchical and non-\nhierarchical cluster analyses. The Stata Journal, 2002, 2 (4), 391–402. \n \nSchonlau, M. (2004). Visualizing Hierarchical and Non-Hierarchical Cluster Analyses \nwith Clustergrams. Computational Statistics: 2004, 19 (1), 95–111. \n \nShi, J., Malik, J. (2000). Normalized cuts and image segmentation. IEEE Trans. \nPAMI, 22 (8), 888–905. \n \nShi, T., et al. (2005). Tumor classification by tissue microarray profiling: random \nforest clustering applied to renal cell carcinoma. Modern Pathology, 18, 547–557. \n \nShi, T., Horvath, S. (2006). Unsupervised Learning with Random Forest Predictors. \nJournal of Computational and Graphical Statistics. Vol. 15 (1), 118–138. \n \nSietz, D., Lüdeke, M.K.B., Walther, C. (2011). Categorisation of typical vulnerability \npatterns in global drylands. Global Environmental Change, 21, 431–440. \n \nSilver, M. (1995). 
Scales of Measurement and Cluster Analysis: An Application \nConcerning Market Segments in the Babyfood market. The Statistician, Vol. 44 (1), \n101–112.  \n \nSmyth, C.W., Coomans, D.H. (2006). Parsimonious Ensembles for Regression. The \n38th Symposium on the Interface of Statistics, Computing Science and Applications: \nMassive Data Sets and Streams Interface Foundation of North America, Pasadena, \nCalifornia 54 – 54.  \n \nSmyth, C.W., Coomans, D.H., Everingham, Y.L. (2006a). Clustering noisy data in a \nreduced dimension space via multivariate regression trees. Pattern Recognition,  \nVol. 39, 424–431.  \n \nSmyth, C.W., Coomans, D.H., Everingham, Y.L., Hancock, T.P. (2006b). Auto-\nassociative Multivariate Regression Trees for Cluster Analysis. Chemometrics and \nIntelligent Laboratory Systems, Vol. 80, 120–129.  \n \nSmyth, C.W., Coomans, D.H. (2007). Predictive weighting for cluster ensembles. \nJournal of Chemometrics, Vol. 21, 364–375. \n \nSpaans, M., Heiser, W.J. (2005). Instability of hierarchical cluster analysis due to \ninput order of the data: The PermuCLUSTER solution. Psychological Methods,  \n10 (4), 468–476. \n \nSteinley, D. (2004). Properties of the Hubert-Arabie adjusted Rand index. \nPsychological Methods, 9, 386–396. \n \nSteinley, D. (2006). K-means clustering: A half-century synthesis. British Journal of \nMathematical and Statistical Psychology, 59, 1–34. \n \nSteinley, D. (2008). Stability analysis in K-means clustering. British Journal of \nMathematical and Statistical Psychology, 61, 255–273. \n \nSteinley, D., Brusco, M.J. (2007). Initializing K-means batch clustering: A critical \nevaluation of several techniques. Journal of Classification, 24, 99–121. \n \nSteinley, D., Brusco, M.J. (2008a). A new variable weighting and selection procedure \nfor K-means cluster analysis. Multivariate Behavioral Research, Vol. 43, 77–108. \n \nSteinley, D., Brusco, M.J. (2008b). 
Selection of Variables in Cluster Analysis: An \nEmpirical Comparison of Eight Procedures. Psychometrika, Vol. 73 (1), 125–144. \n \nStrehl, A., Ghosh, J. (2002). Cluster ensembles – a knowledge reuse framework for \ncombining multiple partitions. Journal of Machine Learning Research, 3, 583–617. \n \nStrobl, C., Boulesteix, A.L., Kneib, T., Augustin, T, Zeileis, A. (2008). Conditional \nvariable importance for random forests. BMC Bioinformatics 2008, 9, 307. \n \nStrobl, C., Boulesteix, A.L., Zeileis, A., Hothorn, T. (2007). Bias in Random Forest \nVariable Importance Measures: Illustrations, Sources and a Solution. BMC \nBioinformatics 2007, 8, 25. \n \nSvetnik, V., Liaw, A., Tong, C., Culberson, J.C., Sheridan, R.P., Feuston, B.P. (2003). \nRandom Forest: A Classification and Regression Tool for Compound Classification \nand QSAR Modeling. Journal of Chemical Information and Computer Sciences, 43, \n1947–1958. \n \nTan, J., Zhang, J., Li, W. (2010). An Improved Clustering Algorithm Based on \nDensity Distribution Function. Computer and Information Science, Vol. 3 (3), August \n2010, 23–29. URL (31-08-2010): \nhttp://ccsenet.org/journal/index.php/cis/article/viewFile/6891/5426. \n \nTanay, R., Sharan, R., Shamir, R. (2002). Discovering statistically significant \nbiclusters in gene expression data. Bioinformatics Vol. 18 (9), S136–S144. \n \nTenenbaum, J.B., de Silva, V., Langford, J.C. (2000). A global geometric framework \nfor nonlinear dimensionality reduction. Science, Vol. 290, 2319–2323. \n \nTian, T., James, G., Wilcox, R. (2009). A Multivariate Adaptive Stochastic Search \nMethod for Dimensionality Reduction in Classification. Annals of Applied Statistics, \n4, 339–364. \n \nTibshirani, R., Walther, G., Hastie, T. (2001). Estimating the number of clusters in a \ndataset via gap statistic. Journal of Royal Statistical Society B 2001, 63, 411–423. \n \nTibshirani, R., Walther, G. (2005). Cluster Validation by Prediction \nStrength. 
Journal of Computational & Graphical Statistics, 14, 511–528. \n \nTsai, C.Y., Chiu, C.C. (2008). Developing a feature weight self-adjustment \nmechanism for a K-means clustering algorithm. Computational Statistics & Data \nAnalysis, Vol. 52 (10), 4658–4672. \n \nvan der Laan, M. (2006). Statistical Inference for Variable Importance. International \nJournal of Biostatistics, 2 (1), 1008–1008. \n \nVarshavsky, R., Gottlieb, A., Linial, M., Horn, D. (2006). Novel unsupervised feature \nfiltering of biological data. Bioinformatics, Vol. 22, e507–e513. \n \nVarshavsky, R., Gottlieb, A., Horn, D., Linial, M. (2007). Unsupervised feature \nselection under perturbations: meeting the challenges of biological data. \nBioinformatics, Vol. 23 (24), 3343–3349. \n \nVesanto, J. (1999). SOM-based data visualization methods, Intelligent Data Analysis, \n3 (2), 111–126. \n \nVesanto, J., Alhoniemi, E. (2000). Clustering of the Self-Organizing Map. \nIEEE Trans. On Neural Networks, Vol. 11, 586–600. \n \nVinh, N.X., Epps, J., Bailey, J. (2009). Information Theoretic Measures for \nClusterings Comparison: Is a Correction for Chance Necessary? Proceedings of the \n26th International Conference on Machine Learning, 2009. \n \nXing, E.P. (2003). Feature Selection in Microarray Analysis, in D.P. Berrar, W. \nDubitzky and M. Granzow (Eds.), A Practical Approach to Microarray Data Analysis, \nKluwer Academic Publishers, 2003. \n \nXu, R., Wunsch, D. (2008). Clustering. IEEE Press Series on Computational \nIntelligence. John Wiley and Sons. \n \nYan, D., Huang, L., Jordan, M.I. (2009). Fast approximate spectral clustering. \nProceedings of the 15th ACM SIGKDD International Conference on Knowledge \nDiscovery and Data Mining, Paris, France, 907–916.  \n \nYeung, K.Y., Haynor, D.R., Ruzzo, W.L. (2001). Validating clustering for gene \nexpression data. Bioinformatics, Vol. 17 (4), 309–318. 
\n \nYeung, K.Y., Ruzzo, W.L. (2001). Principal component analysis for clustering gene \nexpression data. Bioinformatics, Vol. 17 (9), 763–774. \n \nYiang, M.K.A., Kumar, A. (2005). A comparative analysis of an extended SOM \nnetwork and K-means analysis. International Journal of Knowledge-Based \nand Intelligent Engineering Systems, Vol. 8 (1), 9–15. \n \nYu, L. (2007). Feature Selection for Genomic Data Analysis. In H. Liu, editor, \nComputational Methods for Feature Selection, Chapman and Hall/CRC Press, 2007. \n \nWagstaff, K.L., Laidler, V. (2005). Making the Most of Missing Values: Object \nClustering with Partial Data in Astronomy. Astronomical Data Analysis Software and \nSystems XIV; ASP Conference Series 2005, P 2.1.25. \n \nWaller, N.G., Kaiser, H.A., Illian, J.B., Manry, M. (1998). Cluster analysis with \nKohonen neural networks. Psychometrika, Vol. 63, 5–22. \n \nWinters-Hilt, S., Yelundur, A., McChesney, C., Landry, M. (2006). Support Vector \nMachine Implementations for Classification & Clustering. BMC Bioinformatics 2006, \n7(Suppl 2):S4 doi:10.1186/1471-2105-7-S2-S4. \n \nWinters-Hilt, S., Merat, S. (2007). SVM clustering. BMC Bioinformatics 2007,  \n8 (Suppl 7):S18 doi:10.1186/1471-2105-8-S7-S18. \n \nWu, C.-J., Kasif, S. (2005). GEMS: a web server for biclustering analysis of \nexpression data. Nucleic Acids Research 2005 33(Web Server Issue):W596-W599.  \n \nWu, K.L., Yang, M.S. (2005). A cluster validity index for fuzzy clustering, Pattern \nRecognition Lett., Vol. 26, 1275–1291. \n \nWu, K.L., Yang, M.S., Hsieh, J.N. (2009). Robust cluster validity indexes. Pattern \nRecognition. Vol. 42 (11), 2541–2550.  \n \nZadeh, R.B., Ben-David, S. (2009). A Uniqueness Theorem for Clustering. \nProceedings of UAI 2009. \n \nAppendix A: The R software environment \n \nR is a software environment for data manipulation, calculation, and graphical display, and \nserves both as an environment and a programming language. 
R is available as Free Software \nin source code form under the terms of the Free Software Foundation’s GNU General Public \nLicense. R runs on a wide variety of platforms (Unix, Linux, Windows, MacOS, FreeBSD). \nSources and binaries of R can be downloaded at http://www.r-project.org. Installation of R is \nvery simple and a variety of packages can be added directly from the web site (e.g. Brock et \nal., 2008). R has a very active development community and many resources can be found \nincluding user guides, manuals, script samples, newsgroups, and mailing lists (e.g. Venables \net al., 2002). Furthermore, an extensive body of publications exists, such as Paradis (2002) or Maindonald \n(2008). R is a command line application. Its integrated object oriented language allows \nfor efficient data manipulation. Although use of R does require programming, scripts can be \ndeveloped and used to automate analyses and provide additional functionality. Graphical user \ninterfaces (GUIs) have been developed for certain applications to avoid user programming \n(see, for example, Rcommander). \nR has an amazing variety of functions for cluster analysis, as illustrated at the web-page \nhttp://cran.r-project.org/web/views/Cluster.html. In this background document we will present \na number of examples implemented in R. See also appendix B, which illustratively highlights \nsome functionality of R for performing cluster analysis. \n \nCiting R: \nR Development Core Team (2005). R: A language and environment for statistical computing. R \nFoundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL: http://www.R-\nproject.org. \n \nBrock, Pihur, Datta Su., Datta So, clValid: An R Package for Cluster Validation, Journal of Statistical \nSoftware, Volume 25, Issue 4, 2008 \n \nMaindonald, Using R for Data Analysis and Graphics - Introduction, Code and Commentary, Centre \nfor Mathematics and Its Applications, Australian National University. 
2008 \n \nParadis, E., R for Beginners, Montpellier, 2002 \n \nVenables, Smith and the R Development Core Team, An Introduction to R, Network Theory Limited, \nBristol, 2002 \n \nAppendix B: Cluster analysis in R (see footnote 27) \n \nR has an amazing variety of functions for performing cluster analysis. In this appendix three \nof the many approaches will be described: hierarchical agglomerative, partitioning, and model \nbased. While there is no single best solution to the problem of determining the number of clusters \nto extract, several approaches are given below.  \n \nData preparation  \nPrior to clustering data, you may want to remove or estimate missing data and rescale \nvariables for comparability. \n \n# Prepare Data \nmydata <- na.omit(mydata) # listwise deletion of missing values \nmydata <- scale(mydata) # standardize variables  \n \nPartitioning \nK-means clustering is the most popular partitioning method. It requires the analyst to specify \nthe number of clusters to extract. A plot of the within groups sum of squares by number of \nclusters extracted can help determine the appropriate number of clusters. The analyst looks for \na bend in the plot similar to a scree test in factor analysis. See Everitt & Hothorn (pg. 251).  \n \nDetermine number of clusters \n \n# Determine number of clusters \nwss <- (nrow(mydata)-1)*sum(apply(mydata,2,var)) \nfor (i in 2:15) wss[i] <- sum(kmeans(mydata,  \n   centers=i)$withinss) \nplot(1:15, wss, type=\"b\", xlab=\"Number of Clusters\", \n  ylab=\"Within groups sum of squares\")  \n \n \n                                                     \n27 This appendix is taken from the information about QuickR (see \nhttp://www.statmethods.net/advstats/cluster.html). 
See also \nhttp://inference.us/SolutionPlatform/Documents/R/Cluster%20Analysis.pdf \n \nK-Means cluster analysis \n \n# K-Means Cluster Analysis \nfit <- kmeans(mydata, 5) # 5 cluster solution \n \n# get cluster means  \naggregate(mydata,by=list(fit$cluster),FUN=mean) \n \n# append cluster assignment \nmydata <- data.frame(mydata, fit$cluster)  \n \nA robust version of K-means based on medoids can be invoked by using pam( ) instead of \nkmeans( ). The function pamk( ) in the fpc package is a wrapper for pam that also prints the \nsuggested number of clusters based on optimum average silhouette width.  \n \nHierarchical agglomerative \n \nThere is a wide range of hierarchical clustering approaches, and Ward's method described \nbelow is a popular one.  \n \nWard hierarchical clustering \n \n# Ward Hierarchical Clustering \n \n# distance matrix \nd <- dist(mydata, method = \"euclidean\")  \n \nfit <- hclust(d, method=\"ward.D\") # \"ward\" was renamed \"ward.D\" in R >= 3.1.0 \n \n# display dendrogram  \nplot(fit)  \n \n# cut tree into 5 clusters \ngroups <- cutree(fit, k=5)  \n \n# draw dendrogram with red borders around the 5 clusters  \nrect.hclust(fit, k=5, border=\"red\")  \n \n \nThe pvclust( ) function in the pvclust package provides p-values for hierarchical clustering \nbased on multiscale bootstrap resampling. Clusters that are highly supported by the data will \nhave large p values. Interpretation details are provided by Suzuki (see footnote 28). Be aware that pvclust \nclusters columns, not rows. Transpose your data before using.  
\n \n \nWard hierarchical clustering with bootstrapped p values \n \n# Ward Hierarchical Clustering with Bootstrapped p values \n \nlibrary(pvclust) \nfit <- pvclust(mydata, method.hclust=\"ward\", \n   method.dist=\"euclidean\") \n \n# dendrogram with p values \n \nplot(fit)  \n \n# add rectangles around groups highly supported by the data \npvrect(fit, alpha=.95)  \n \n \n \n \nModel based approaches \n \nModel based approaches assume a variety of data models and apply maximum likelihood \nestimation and Bayes criteria to identify the most likely model and number of clusters. \nSpecifically, the Mclust( ) function in the mclust package selects the optimal model \naccording to BIC for EM initialized by hierarchical clustering for parameterized Gaussian \nmixture models (phew!). One chooses the model and number of clusters with the largest BIC. \nSee help(mclustModelNames) for details on the model chosen as best.  \n \n \n                                                     \n28 See http://www.is.titech.ac.jp/~shimo/prog/pvclust/ \n \nModel based clustering \n \n# Model Based Clustering \nlibrary(mclust) \nfit <- Mclust(mydata) \n \n# plot results  \nplot(fit)  \n \n# display the best model  \nprint(fit) \n \nPlotting cluster solutions  \n \nIt is always a good idea to look at the cluster results. 
\n \nK-Means clustering with 5 clusters \n \n# K-Means Clustering with 5 clusters \nfit <- kmeans(mydata, 5) \n \nCluster plot against 1st 2 principal components \n \n# Cluster Plot against 1st 2 principal components \n \n# vary parameters for most readable graph \nlibrary(cluster)  \nclusplot(mydata, fit$cluster, color=TRUE, shade=TRUE,  \n   labels=2, lines=0) \n \nCentroid plot against 1st 2 discriminant functions \n \n# Centroid Plot against 1st 2 discriminant functions \nlibrary(fpc) \nplotcluster(mydata, fit$cluster)  \n \n \n \nValidating cluster solutions \n \nThe function cluster.stats() in the fpc package provides a mechanism for comparing the \nsimilarity of two cluster solutions using a variety of validation criteria (Hubert's gamma \ncoefficient, the Dunn index and the corrected Rand index).  \n \nComparing 2 cluster solutions \n \n# comparing 2 cluster solutions \nlibrary(fpc) \ncluster.stats(d, fit1$cluster, fit2$cluster)  \n \nwhere d is a distance matrix among objects, and fit1$cluster and fit2$cluster are integer \nvectors containing classification results from two different clusterings of the same data. \n \nExample R-script for clustering \n \nThe following R-Script is divided into four functions, called “consistency”, “sPIKcentres”, \n“initial” and “clus_graphs”. The first function runs the loop for the overall repetition of the \npairwise dissimilarity calculation and the loop over the size of the clustered partition. Further it \nperforms the two clusterings. The second function performs the dissimilarity calculation \nitself. The “initial” function is responsible for the initialization of kmeans with hclust. The \nlast function delivers graphical representations of the cluster result.  \nIn the last part of the script the user settings have to be chosen. The script can be used for \ncalculation of the consistency measure and for the clustering of the subsequent best number of \nclusters. 
\nThe data must be formatted such that rows represent the objects and columns represent the features of \nthe objects. \n \n## ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ \nconsistency <- function(Xdata,NmaxCluster,master,cm)  {  \n \n  Ndata <- dim(Xdata)[1]                       # total number of datapoints \n  Clust_cl <- matrix(0,Ndata,2)                # storing the cluster-class results for the two trial-clusterings \n  X0 <- matrix(0,NmaxCluster,master)           # storing info on every run for the consistency measure \n  C0 <- matrix(0,NmaxCluster,1)                # initial matrix with average consistency measure \n  G0 <- matrix(0,Ndata,1)                      # matrix for best cluster result \n  ResMat <- list(MeanC=C0,SpecR=X0,Gold=G0)    # list returned after the calculation \n  ifelse(cm,whl<-2,whl<-1) \n  ifelse(cm,NminCluster<-2,NminCluster<-NmaxCluster) \n \n  for (iOuter in 1:master) {                   # outer loop for comparing pairs of clusterings \n    for (iClus in NminCluster:NmaxCluster) {   # number of clusters to be analysed \n      for (iInner in 1:whl) {  \n        N_sel <- max(NmaxCluster,round(Ndata/200,0)) \n        ss  <- sample(1:Ndata)                 # random permutation of data set \n        sss <- ss[1:N_sel]                     # first N_sel indices of random permutation \n        Xdata_sel <- Xdata[sss,]   \n        while( length(unique(rowSums(Xdata_sel)))<iClus ) { \n          ss  <- sample(1:Ndata)   ;  sss <- ss[1:N_sel]   ;   Xdata_sel <- Xdata[sss,] }   \n        centro    <- initial(Xdata_sel,iClus)      \n        indRand <- sample(1:Ndata)             # reshuffling \n        Xdata_shuffle <- Xdata[indRand,]       # shuffled data \n        cl_kmeans <- kmeans(Xdata_shuffle,centro,iter.max=50)  # clustering with centro initialization \n        Clust_cl[indRand,iInner] <- cl_kmeans$cluster  # assign classes as indexed by non-shuffled data \n      }    \n      ifelse(cm , {                            # evaluate dissimilarities for the clustering-pairs \n        
ResMat$SpecR[iClus,iOuter] <- sPIKcentres(Xdata,Clust_cl[,1],Clust_cl[,2],Iheur=1)  \n        } , { \n        for (j in 1:iClus) {                     # within-cluster sum of squares \n          idx <- which(cl_kmeans$cluster==j) \n          clu_diff <- Xdata_shuffle[idx,] - (matrix(1,cl_kmeans$size[j],1) %*% colMeans(Xdata_shuffle[idx,])) \n          ResMat$SpecR[iClus,iOuter] <- ResMat$SpecR[iClus,iOuter] + sum(clu_diff*clu_diff) } \n        # keep the partition with the smallest within-cluster sum of squares found so far; \n        # the final NA argument repairs the truncated original call and its value is not used \n        ifelse(ResMat$SpecR[iClus,iOuter]==min(ResMat$SpecR[iClus,1:iOuter]) , gold <- Clust_cl[,1] , NA) \n        } \n)   }   }  \n  if (cm) {                                      # average value of the consistency measure \n    for (iClus in 2:NmaxCluster) { ResMat$MeanC[iClus,] <- with(ResMat,mean(SpecR[iClus,1:master])) } \n  } else { \n    ResMat$Gold <- gold \n  } \n \n  return(ResMat) \n}     # end function consistency \n \n## ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ \nsPIKcentres <-   function(dataCl,clust1,clust2,Iheur=1)  { \n \n  Ncl1 <- max(clust1)      # number of cluster classes in clustering 1 \n  Ncl2 <- max(clust2)      # number of cluster classes in clustering 2 \n  Nclmin <- min(Ncl1,Ncl2) # minimum of Ncl1 and Ncl2 \n  \n  ## Determine cluster centers --> matrix of cluster-centres for clustering 1 and 2 \n  cent1 <- rbind() \n  for (i in 1:Ncl1) { \n    idx <- which(clust1==i) \n    if (length(idx)<2) cent1 <- rbind(cent1,dataCl[idx,]) else cent1 <- rbind(cent1,colMeans(dataCl[idx,])) } \n  cent2 <- rbind() \n  for (i in 1:Ncl2) { \n    idx <- which(clust2==i) \n    if (length(idx)<2) cent2 <- rbind(cent2,dataCl[idx,]) else cent2 <- rbind(cent2,colMeans(dataCl[idx,])) } \n                  \n  ## Determine the distance matrix of cluster-centers \n  Distmat <- matrix(0,Ncl1,Ncl2) \n  Distmat <- as.matrix(dist(rbind(cent1,cent2)))[1:Ncl1,(1:Ncl2)+Ncl1]  \n  ## Determination of association on basis of distances between clusters \n  match.listb <- 
array(0,Ncl2)                              # initialising list for renaming clusters \n  xft_tmp <- Distmat                       # storing Distmat in intermediate matrix \n  xft_max <- max(xft_tmp)+1                # setting an upper limit to values of xft_tmp \n  for (d2 in 1:Nclmin) { \n    cc <- which(xft_tmp==min(xft_tmp),arr.ind=T)[1,2]    # column of the minimum (cluster of clustering 2) \n    rr <- which(xft_tmp==min(xft_tmp),arr.ind=T)[1,1]    # row of the minimum (cluster of clustering 1) \n    match.listb[cc] <- rr   ## the cc-th cluster of clustering 2 corresponds to the rr-th cluster of clustering 1 \n    xft_tmp[rr,] <- xft_max ; xft_tmp[,cc] <- xft_max }   \n  match.listb[which(match.listb==0)] <-  max.col(-t(Distmat[,which(match.listb==0)])) \n  clust2A <-  match.listb[clust2]    # second clustering in terms of its association with the first clustering \n  res <- length(which(clust2A==clust1))/(length(clust1))  # fraction of objects with matching cluster class \n  return(res) } \n \n## ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ \ninitial <- function(Xdata_sel,k) { # function for initializing Kmeans \n \n  geo_dist <- (dist(Xdata_sel))                   # distance matrix of part of data set \n  cl_hcl   <- hclust(geo_dist,method=\"ward.D\")    # hclust with Ward's method (\"ward\" was renamed \"ward.D\" in R 3.1.0) \n  ser      <- as.vector(cutree(cl_hcl,k))         # cut the tree into k clusters \n  cluster  <- list()                              # initializing to empty list \n  for (i in 1:k) { cluster[[i]] <- which(ser==i) } \n  centro <- matrix(ncol=ncol(Xdata_sel),nrow=k)   # storing cluster-centers \n  for (i in 1:length(cluster)){ \n    for (j in 1:ncol(Xdata_sel)){ \n      centro[i,j] <- mean(Xdata_sel[cluster[[i]],j]) \n    }  }  \n   return(centro)  } \n \n## ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ \nclus_graphs <- function(gold,clu,clu_dim) { \n \n  ## worldmap \n  world <- matrix(scan(\"~/AT_CLUSTERUNG/R-Script+Data/geo_maske.dat\"),ncol=1) ## land mask \n  for (z in 1:clu_dim[1]) 
{world[clus_dat[z,1]] <- gold[z]}              \n  x11(11,8) ; par(mar=c(2,2,2,1)) \n  is.na(world)<-which(world==0,arr.ind=T)       ## all zeros out \n  z.a <- matrix(world,720,360)[,360:1] \n  for (i in 0:(clu-1)) z.a[(i*20):(i*20+20),51:70]<-i+1 \n  farb<-c(rgb(0,0,0),rgb(1,0.6,0),rgb(1,1,0.3),rgb(0.5,0.5,0.5),rgb(0,1,0),rgb(0.5,0,0.5), \n          rgb(1,0,0.3),rgb(0,0,1),rgb(0.2,1,1),rgb(1,0.5,0)) \n  farb <- farb[c(9,4,7,3,8,6,1,5,2,10)] \n  image(1:720,1:360,z.a,col=c(grey(0.9),farb[1:clu]),xlim=c(0,720),ylim=c(50,360), \n        main=paste(\"run.ident: \",round(min(clus_res$SpecR[clu,]),4),sep=\"\")) \n \n  x11(11,4);par(mfrow=c(1,clu),mar=c(2.5,1.8,1.4,0.3))  \n  size <- array(0,clu) \n  for (j in 1:clu) {size[j]<-length(which(gold==j))} \n  for (k in 1:clu) { \n    bpdata <- as.data.frame(clus_dat[,feat]) \n    bpdata[which(gold!=k,arr.ind=T),]<-NA \n    bpdata <- na.omit(bpdata) \n    boxinfo <- boxplot(bpdata, whisklty=0, staplelty=0, col=farb[k], outline=F, \n                       main=paste(\"C\",k,\": \",size[k])) \n    for (dd in 1:ncol(bpdata)) { \n      cen <- quantile(bpdata[,dd],  probs=c(5,95)/100) \n      segments(dd,boxinfo$stats[4,dd],dd,as.numeric(cen[2]),col=\"black\",lwd=1,lty=3) \n      segments(dd,boxinfo$stats[2,dd],dd,as.numeric(cen[1]),col=\"black\",lwd=1,lty=3) \n      points(dd,as.numeric(cen[1]),col=\"black\",pch=1) \n      points(dd,as.numeric(cen[2]),col=\"black\",pch=1) \n    } \n     mean.cl <- c(colMeans(bpdata,na.rm=T)) \n     points(c(1:length(feat)),mean.cl,pch=1,col=9,cex=1.4)   \n  } } \n############################################################# \n## PARAMETERS THAT HAVE TO BE DEFINED BY USER ~~~~~~~~ \n## \n  namIndicat <- \"choose name\"                     \n  namIndDir =   \"choose directory\" \n  colIndFile <- 9 \n  featurenames <- c(\"choose list of feature names\") \n  feat <- c(3:9)              ## feature columns - for clustering ! 
\n  NmaxCluster <- 8            ## upper boundary for the consistency measure calculation, \n                              ## or the number of clusters for the best cluster result \n  cm = T                      ## T: consistency measure calculation; F: only clustering with the best cluster number \n## \n########################## \n \nnamIndFile = paste(namIndDir,namIndicat,sep=\"\")    ## reading data ~~~~~~~~~~~~~~~~~ \nclus_dat <- matrix(scan(namIndFile,sep=\"\"),ncol=colIndFile,byrow=T) \nclu_dim  <- dim(clus_dat)   \n \nis.na(clus_dat) <- which(clus_dat==-9999,arr.ind=T) ## erase missing values ~~~~~~~~~~~ \nclus_dat <- na.omit(clus_dat) \n \nx11(7,4);par(mar=c(2.1,4,2.3,0.5),mfrow=c(3,3))  \n## Histogram of Cluster Data ~~~~~ \nfor (i in feat)  hist(clus_dat[,i],main=featurenames[i]) \n \nifelse(cm,master<-200,master<-50) \nclus_res <- consistency(clus_dat[,feat],NmaxCluster,master,cm) ## Clustering ~~~~~~~~~~ \n \n## Plotting of Result ~~~~~~~~~~~~~~ \nif(cm) {x11(6,4);plot(c(2:NmaxCluster),clus_res$MeanC[2:NmaxCluster],cex.main=0.9,xlab=\"# Cluster\",ylab=paste(master,\"-Loops\"),panel.first=grid())} else { \n  clus_graphs(clus_res$Gold,NmaxCluster,clu_dim)} \n \nAppendix C: Data for comparing clustering methods \n \n(see http://www.ima.umn.edu/~iwen/REU/REU_cluster.html#code) \nMatlab code for generating random datasets  \n \n• An example `.m' file that creates a 2D dataset with 3 clusters. It can also be \nmodified to generate other artificial data (with different numbers of clusters, \ndimensions, and underlying distributions).  \n• The following Matlab package contains a file called \"generate_samples.m\" for \ngenerating hybrid linear models. It is part of the larger GPCA package. In order to \navoid intersection of subspaces (so that standard clustering could be applied) one \nneeds to set the parameter avoidIntersection = TRUE (and also have affine \nsubspaces instead of linear).  
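The Matlab generator mentioned in the first bullet is not reproduced in this report. As a rough R counterpart, the following minimal sketch creates a 2D dataset with 3 Gaussian clusters. The cluster centres, the spread sigma, the sample size and all variable names are illustrative assumptions, not values taken from the referenced `.m' file.

```r
# Minimal sketch: a 2D toy dataset with 3 Gaussian clusters.
# All centres, spreads and sizes below are illustrative assumptions.
set.seed(42)
n_per_cluster <- 100
centres <- rbind(c(0, 0), c(5, 5), c(0, 5))   # one row per cluster centre
sigma <- 0.7                                   # common standard deviation

# Draw n_per_cluster points around each centre and stack them row-wise
X <- do.call(rbind, lapply(seq_len(nrow(centres)), function(k) {
  cbind(rnorm(n_per_cluster, mean = centres[k, 1], sd = sigma),
        rnorm(n_per_cluster, mean = centres[k, 2], sd = sigma))
}))
truth <- rep(seq_len(nrow(centres)), each = n_per_cluster)  # generating cluster of each row

# Cluster the toy data and cross-tabulate against the generating labels
fit <- kmeans(X, centers = 3, nstart = 20)
print(table(truth, fit$cluster))
```

Changing centres, sigma or the number of columns gives datasets with other numbers of clusters, dimensions or spreads, in the spirit of the modifications described in the first bullet.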
\n \n \nOther data and data repositories  \n \n• Clustering datasets at UCI Repository  \n• Complete UCI Machine Learning Repository  \n• Yale Face Database B  \n• Some processed face datasets saved as Matlab data can be found here. Two \nmatrices, X and Y, are included. If you plot Y(1:3,:) you will see three clearly \nseparated clusters. The first 64 points are in one cluster, the next 64 points in \nanother cluster, etc.. The original files are on the Yale Face Database B webpage \n(above). The folder names are yaleB5_P00, yaleB8_P00, yaleB10_P00. They \nhave been processed following the steps described in Section 4.2.2 of the \nfollowing paper. The matlab code used for processing them is here.  \n• Here is an example of spectral clustering data. It contains points from 2 noisy \ncircles: after loading the `.mat' file type \"plot(X(:,1),X(:,2),'LineStyle','.');\" to see \nthem. You can embed them into 2D space for clustering with EmbedCircles.m. \nNote that changing sigma in this file will lead to different problems. \n• See also http://dbkgroup.org/handl/generators/\n\n84 \n \n \nAppendix D: On determining variable importance for \nclustering \n \nA plethora of methods has been proposed to select informative subsets of variables/features in \nthe context of clustering analysis, as illustrated by recent literature on feature/variable \nselection (cf. Saeys et al., 2007, Steinley and Brusco, 2008b, Varshavsky et al. 2006, 2006). \n \nBelow we discuss three straightforward (univariate) methods which can be applied easily to \nexpress variable importance in a clustering context. In presenting the methods, we restrict \nourselves to continuous variables. \nWe notice beforehand that the proposed techniques are univariate and consider each variable \nseparately, thereby ignoring variable dependencies. This may lead to worse clustering \nperformance when compared to other more advanced feature selection techniques (see e.g. \nSaeys et al. 2007).  \n \n \n \nA. 
ANOVA-based method (for complete cluster-partitioning) \n \nThis method compares the contribution of a specific variable/feature to the within-cluster variability with its contribution to the between-cluster variability. The resulting importance index is expressed as the ratio BSS(j)/WSS(j) (see also Dudoit et al., 2002), defined by \n \n$$ \\frac{BSS(j)}{WSS(j)} = \\frac{\\sum_{k=1}^{n} \\sum_{i=1}^{N_k} \\left( x_{k\\cdot}(j) - x_{\\cdot\\cdot}(j) \\right)^2}{\\sum_{k=1}^{n} \\sum_{i=1}^{N_k} \\left( x_{k,i}(j) - x_{k\\cdot}(j) \\right)^2} \\qquad (1) $$ \n \nwhere BSS refers to the between-cluster sum of squares and WSS to the within-cluster sum of squares. The ratio is used as an indication of the contribution of the variable j to the overall clustering.  \nHere j refers to the features/variables, k to the clusters (k=1, …, n), and i to the $N_k$ objects within the k-th cluster. $x_{k,i}(j)$ refers to the value of the j-th variable (feature/component) of object i in cluster k; $x_{\\cdot\\cdot}(j)$ refers to the j-th component of the overall mean (population mean), while $x_{k\\cdot}(j)$ refers to the j-th component of the cluster mean of the k-th cluster.  \n \nVariables with the highest BSS(j)/WSS(j) are considered to have the largest ‘explanatory performance’ relative to the ‘unexplained’ part, and are therefore labeled as more important. See also the following textbox, which advises some caution in using this kind of indicator. \n \n \nRemark: On the relation with ANOVA: \n(a) Note that the total sum of squares can be written as the sum of the sums of squares of all p variables/components, and can be split into a within- and a between-cluster part: \n$$ \\begin{aligned} TSS = \\sum_{j=1}^{p} TSS(j) &= \\sum_{j=1}^{p} \\sum_{k=1}^{n} \\sum_{i=1}^{N_k} \\left( x_{k,i}(j) - x_{\\cdot\\cdot}(j) \\right)^2 \\\\ &= \\sum_{j=1}^{p} \\sum_{k=1}^{n} \\sum_{i=1}^{N_k} \\left( \\left( x_{k,i}(j) - x_{k\\cdot}(j) \\right) + \\left( x_{k\\cdot}(j) - x_{\\cdot\\cdot}(j) \\right) \\right)^2 \\\\ &= \\sum_{j=1}^{p} \\sum_{k=1}^{n} \\sum_{i=1}^{N_k} \\left( x_{k,i}(j) - x_{k\\cdot}(j) \\right)^2 + \\sum_{j=1}^{p} \\sum_{k=1}^{n} \\sum_{i=1}^{N_k} \\left( x_{k\\cdot}(j) - x_{\\cdot\\cdot}(j) \\right)^2 \\\\ &= \\sum_{j=1}^{p} WSS(j) + \\sum_{j=1}^{p} BSS(j) = WSS + BSS \\end{aligned} $$ \nThe cross term vanishes because the deviations $x_{k,i}(j) - x_{k\\cdot}(j)$ sum to zero within each cluster. Here BSS(j) refers to the explained part and WSS(j) to the unexplained part of the sums of squares. \nThe k-means method is intended to minimize the total within-sum of squares WSS (= Σj WSS(j)) \n(unexplained) and thus in fact maximizes the in-between differences BSS (= Σj BSS(j)) (explained). \nThis however does not imply that the various components WSS(j) are minimized individually (or, \nequivalently, that the BSS(j) are maximized individually), since trade-offs between the various WSS(j) can \nbe involved in minimizing their sum. \n \n(b) The ratio BSS(j)/WSS(j) is in fact directly related to the F-ratio in the context of an ANOVA for the \nspecific j-th variable $x_{k,i}(j)$. The F-ratio is $MSS_{between}(j) / MSS_{within}(j)$, where the mean sums of squares are defined as $MSS_{between}(j) = BSS(j)/(n-1)$ and $MSS_{within}(j) = WSS(j)/(N-n)$, with $N = \\sum_{k=1}^{n} N_k$. \nThe F-ratio test is applied to test whether the underlying cluster means $\\mu_{k\\cdot}(j)$ of $x_{k,i}(j)$ are all equal \nfor k=1, …, n, in which case F should be nearly equal to 1. Notice that BSS(j)/WSS(j)=(n-1)/(N-n) × F. \n \n(c) One should however be careful in interpreting this ratio completely in terms of ANOVA, since the \nunderlying assumptions for ANOVA (independence, normality and equal variance) are \ntypically not valid in a clustering context where the clusters have been determined deliberately so as to \nminimize the within sum-of-squares (cf. Milligan and Mahajan (1980), Milligan and Cooper (1987)). \nCompare also Hartigan (1975) and Aldenderfer and Blashfield (1976) who illustrate the statistical \ninappropriateness of the use of (M)ANOVAs for indicating existence of clusters. \n \n \nB. t-test based method (cluster-wise) \n \nAnother way to express the variable importance of the j-th variable in a specific cluster is by \nusing the t-statistic, in fact checking to what extent the mean value of the specific variable, \nwhen constrained to this cluster, differs from the overall mean value. 
The corresponding \nimportance index can be expressed as29:  \n \n$$ t_k(j) = \\frac{x_{k\\cdot}(j) - x_{\\cdot\\cdot}(j)}{\\hat{s}_k(j)} \\qquad (2) $$ \n \nwhere $\\hat{s}_k(j)$ is the standard deviation, defined as: \n \n$$ \\hat{s}_k(j) = \\sqrt{ \\frac{\\sum_{i=1}^{N_k} \\left( x_{k,i}(j) - x_{k\\cdot}(j) \\right)^2}{N_k - 1} } \\qquad (3) $$ \n \n29 As implemented in the TwoStep cluster method in SPSS. \n \nThe idea is that the importance of a variable for a cluster can be measured by the absolute \nvalue of this t-statistic, where variables with larger absolute t-statistics are considered as more \nimportant than variables for which the t-statistic is smaller. This measure is therefore initially \nrelated to a specific cluster (cluster-wise). A measure for the overall importance of the j-th \nvariable for all clusters can e.g. be obtained by summing the absolute values |tk(j)| over all \nclusters k=1, …, n. Another possibility is to consider the maximum value of |tk(j)| over all \nclusters k=1, …, n as a measure for the variable importance. See also Gat-Viks et al. (2003) \nwho apply an ANOVA based test of equality of means amongst the cluster members. \n \n \nC. ‘Fraiman’ index (for complete cluster-partitioning) \n \nFraiman et al. (2008) propose to ‘blind’ (subsets of) variables, by fixing them at their mean \nvalue, and to repeat the clustering analysis subsequently. Then the pairwise agreement (e.g. \nby means of the adjusted Rand index introduced by Hubert and Arabie (1985)) is determined \nbetween the partition thus obtained and the original partition with all variables fully included. \nThis index serves as an indication for the importance of the blinded variable(s). The adjusted \nRand index is at most 1 and is close to 0 (possibly slightly negative) for chance-level agreement; large values (near 1) mean that there is a large \nagreement between the partitions with and without blinding the specific variable. 
To identify \nthe most important variables one therefore should look for variables with small Fraiman \nindices. \n \nFraiman-measure to identify the importance of the different variables for the total cluster \npartition (low values indicate high importance). \n \nFraiman et al. (2008) show that this univariate procedure will falter if there are strong \ncorrelations between variables, since the effect of blinding one variable will be compensated \nby the other (non-blinded) related variables. This will typically result in a large agreement of \nthe clustering partitions in the blinded and non-blinded case. \nTherefore, in case of dependencies Fraiman et al. (2008) propose an alternative measure, \nwhere the blinded variable is not replaced by its marginal mean, but by its conditional mean \nover the set of other (non-blinded) variables.  \n \n \nIntermezzo: Promising alternatives \n \n“Ensemble learning” methods that generate many classifiers and aggregate their results have \nbeen proposed during the last decade as efficient methods for analyzing the structure in data. \nEspecially the procedure of random forests (RF), which uses a multitude of regression trees \non different bootstrap samples of the data (cf. Breiman (2001)), is a popular and user-friendly \nmethod. This method renders a measure for the variable importance of the involved \n(predictor) variables, and also gives a measure of the internal structure of the data (proximity \nof different data points to one another).  \nAlthough this method was first established for classification and regression problems (i.e. \nforms of supervised learning) the random-forest idea can also be applied for clustering \npurposes (unsupervised learning). The trick for this is to distinguish two datasets: the original \ndataset is called “class 1”, while a synthetic dataset, constructed using information on the marginal \ndistributions of the original data, is called “class 2”. 
Next one uses the \nrandom-forest machinery to classify the combined data with a random forest. The underlying \nidea is that real data points that are similar to one another will tend to be classified in the same \nterminal node of the tree, as measured by the proximity matrix that can be returned using the \nRF-technique. Thus the proximity matrix can be taken as a similarity measure30, which can be \napplied for dividing the original matrix into groups for visual exploration on basis of \nclustering or multi-dimensional scaling. See the example in Liaw and Wiener (2002) as a \nwork-out how to perform this analysis with the randomForest package in R.  \nAlong similar lines this method has been further applied and analysed by Horvath and Shi in a \nseries of papers (Shi et al. 2005, 2006). They underline the attractiveness of the method since \nit enables handling mixed variable types, is invariant to monotonic transformation of the input \nvariables and is robust to outlying observations. Moreover the RF-based dissimilarity easily \ndeals with a large number of variables. \n \nThe above reframing of clustering in terms of random forest procedure offers a link to recent \ninteresting literature (Strobl et al. 2007, 2008) on measuring the importance of variables in a \nrandom forest context explicitly accounting for the (conditional) effects of correlated \nvariables. These results suggest ways to do this also for clustering, but this will not be worked \nout here. See also R-software like part(y)itioning (Hothorn et al. 2006) which can be applied \nin this context. \n \nAnother interesting related approach which deserves further exploration is offered by Questier \net al. (2005), Smyth et al. (2006a) who put forward an extension of classification and \nregression trees, namely multivariate regression trees31, for (supervised and unsupervised) \nfeature selection as well as for cluster analysis. 
The idea is to use the original data as \nexplanatory variables (x) and also as response variables (y=x), giving rise to so-called Auto-Associative \nMultivariate Regression Trees. The suitability of this approach for clustering is \nfurther explored in Smyth et al. (2006b), while in Smyth et al. (2007) proposals are given to \nenhance the performance of the method by weighting the resulting cluster ensemble \nappropriately on the basis of the prediction quality of the individual models. Also suggestions are \ngiven for determining the variable importance and the number of clusters. For R-software on \nmultivariate regression trees see the CRAN package mvpart32.  \n \n30 Concerning this similarity measure provided by the random forest method, one should realize that \nthe choice of the (dis)similarity measure should ideally be determined by the kind of patterns \none hopes to find, so that there are situations where other dissimilarities are preferable. \n31 R-software has been developed for multivariate regression trees, namely mvpart. \n \n \nAppendix E: Commonly used internal validation \nindexes \n \n \nIn the sequel we present various internal validation indices (see also Günter and Bunke, \n2003): \n \n• Silhouette index: this composite index reflects the compactness and separation of clusters. \nA larger Silhouette index indicates a better overall quality of the clustering result \n(Kaufman & Rousseeuw, 1990). \nThe Silhouette index (SI) calculates for each point a width depending on its membership \nin any cluster. 
This silhouette index is then the average of the silhouette widths of all \npoints/objects: \n$$ SI = \\frac{1}{N} \\sum_{i=1}^{N} \\frac{b_i - a_i}{\\max(a_i, b_i)} $$ \nwhere bi is the minimum of the average distances between the specific point i and the \npoints in each of the other clusters, and ai is the average distance between the point i and all other \npoints in the cluster of which i is a member. The values s(i)=[b(i)-a(i)]/max[a(i),b(i)] vary \nbetween -1 and 1, where values close to -1 mean that the point is on average closer to \nanother cluster than to the one it belongs to, in fact indicating that the object i is \n‘misclassified’. Values close to 1 mean that the average distance to its own cluster is \nmuch smaller than to any other cluster, indicating that object i is ‘well classified’. \nWhen the width is near zero it is not clear whether the object should have been assigned \nto its current cluster or to the neighbouring cluster. The higher the silhouette index, the \nmore compact and separated are the clusters. Kaufman and Rousseeuw (1990) give \nguidance for the desirable size of the silhouette width; they consider a reasonable \nclassification to be characterized by an average silhouette width above 0.5. An average \nsilhouette width below 0.2 should be interpreted as a lack of substantial cluster structure. \n• Davies-Bouldin index: This measure tries to maximize the between-cluster distance while \nminimizing the distance between the cluster centroid and the other points. It expresses the \naverage similarity between each cluster and its most similar one. Small values correspond \nto clusters that are compact and have well-separated centres. Therefore its minimum value \ndetermines the optimal number of clusters. 
\n• Calinski-Harabasz index: This index measures the between-cluster isolation and the \nwithin-cluster compactness, in terms of: \n$$ CH(K) = \\frac{\\mathrm{Trace}(S_B)/(K-1)}{\\mathrm{Trace}(S_W)/(N-K)} $$ \nwith N being the number of objects, K the number of clusters, and SB and SW being the between- and within-class \nscatter matrices \n$$ S_W = \\sum_{i=1}^{K} \\sum_{j=1}^{N} \\gamma_{ij} (x_j - m_i)(x_j - m_i)^T ; \\qquad S_B = \\sum_{i=1}^{K} N_i (m_i - m)(m_i - m)^T $$ \nwhere Г={γij} is a partition matrix, with γij =1 if xj belongs to cluster i and 0 otherwise, \nwhere moreover $\\sum_{i=1}^{K} \\gamma_{ij} = 1$ for all j. M=[m1, …, mK] is the cluster prototype or centroid \nmatrix, m is the overall mean of the data, and $m_i = \\frac{1}{N_i} \\sum_{j=1}^{N} \\gamma_{ij} x_j$ is the mean for the i-th cluster with Ni objects. The optimal \nnumber of clusters is determined by maximizing the CH-index. \n \n32 http://cran.nedmirror.nl/web/packages/mvpart/index.html \n \n• Dunn index: this index is defined as the ratio between the minimum distance between two \nclusters and the size of the largest cluster. Depending on the choice of the distance \nmeasure and the size of the cluster, various Dunn indices can be defined. Maximizing this \nindex reflects to a certain extent the maximization of the inter-cluster distances while \nsimultaneously minimizing the intra-cluster distances.  \n• RMSSTD index (Root Mean Square Standard Deviation): This index is designed for \nhierarchical clustering, but can equally well be used for any clustering algorithm, and \nmeasures the homogeneity of the formed clusters (or the variance of clusters) at each step \nof the hierarchical clustering algorithm. A lower RMSSTD value indicates better \nclustering. \n• C index: This index (Hubert and Schultz, 1976) is defined as follows: \n$$ C = \\frac{S - S_{\\min}}{S_{\\max} - S_{\\min}} $$ \nwhere S is the sum of distances over all pairs of objects from the same cluster. 
Let r be \nthe number of those pairs. Then Smin is the sum of the r smallest distances if all pairs of \nobjects are considered (i.e. also objects that can belong to different clusters). Similarly \nSmax is the sum of the r largest distances out of all pairs. Hence a small value of C \nindicates a good clustering. \n• Maulik-Bandyopadhyay index: This index is a combination of three terms \n$$ MB_k = \\left( \\frac{1}{k} \\cdot \\frac{E_1}{E_k} \\cdot D_k \\right)^p $$ \nwhere the intra-cluster distance is defined by $E_k = \\sum_{i=1}^{k} \\sum_{x \\in c_i} \\| x - z_i \\|$ and the inter-cluster \ndistance by $D_k = \\max_{i,j=1,\\dots,k} \\| z_i - z_j \\|$, where zi is the centre of cluster ci. p is chosen to be \ntwo and the number of clusters k is determined by maximizing MBk. \n• The Cophenetic correlation coefficient (CPCC) is an index to validate hierarchical \nclustering structures, and is based on the proximity matrix P={pij} of the data X. It \nmeasures the degree of similarity between P and the cophenetic matrix Q={qij}, the \nelements of which express the proximity level at which pairs of data points are grouped in \nthe same cluster for the first time. \nCPCC is defined as: \n$$ CPCC = \\frac{ \\frac{1}{M} \\sum_{i=1}^{N-1} \\sum_{j=i+1}^{N} p_{ij} q_{ij} - \\mu_P \\mu_Q }{ \\sqrt{ \\left( \\frac{1}{M} \\sum_{i=1}^{N-1} \\sum_{j=i+1}^{N} p_{ij}^2 - \\mu_P^2 \\right) \\left( \\frac{1}{M} \\sum_{i=1}^{N-1} \\sum_{j=i+1}^{N} q_{ij}^2 - \\mu_Q^2 \\right) } } $$ \nwhere μP and μQ are the means of P and Q: \n$$ \\mu_P = \\frac{1}{M} \\sum_{i=1}^{N-1} \\sum_{j=i+1}^{N} p_{ij} ; \\qquad \\mu_Q = \\frac{1}{M} \\sum_{i=1}^{N-1} \\sum_{j=i+1}^{N} q_{ij} $$ \nwith M=N(N-1)/2. The value of CPCC lies in the range of [-1,1] with an index value \nclose to 1 indicating a significant similarity between P and Q. However for group average \nlinkage (UPGMA) even large CPCC values (such as 0.9) cannot assure sufficient \nsimilarity between the two matrices. 
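Several of the indices above can be computed with a few lines of R. The following is a minimal sketch, assuming the built-in iris data as a toy example, a k-means partition with K = 3 and an average-linkage tree; none of these choices come from the report. It uses silhouette() from the cluster package, the sums of squares returned by kmeans() for the Calinski-Harabasz index, and cor() with cophenetic() for the CPCC.

```r
# Minimal sketch: three internal validation indices for a toy dataset.
# Assumptions (not from the report): iris measurements as data, K = 3.
library(cluster)                      # provides silhouette()

X <- scale(iris[, 1:4])               # standardise the four numeric features
d <- dist(X)                          # Euclidean distance matrix
K <- 3
N <- nrow(X)

set.seed(1)
fit <- kmeans(X, centers = K, nstart = 25)

# Silhouette index: average of the widths s(i) = (b_i - a_i) / max(a_i, b_i)
sil <- silhouette(fit$cluster, d)
SI  <- mean(sil[, 'sil_width'])

# Calinski-Harabasz index: [Trace(S_B)/(K-1)] / [Trace(S_W)/(N-K)];
# kmeans() returns both traces as betweenss and tot.withinss
CH <- (fit$betweenss / (K - 1)) / (fit$tot.withinss / (N - K))

# Cophenetic correlation between the distances and the tree of a
# hierarchical (average-linkage) clustering of the same data
hc   <- hclust(d, method = 'average')
CPCC <- cor(d, cophenetic(hc))

print(c(silhouette = SI, calinski_harabasz = CH, cophenetic = CPCC))
```

Repeating the k-means part for a range of K and comparing SI and CH is one simple way to apply the selection rules stated in the bullets (larger values indicate a better partition for both indices).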
\nRemark: Also for Fuzzy clustering internal validation indices have been proposed, such as \nthe partition coefficient (PC) and partition entropy (PE), the (extended) Xie-Beni index and \nthe Fukuyama-Sugeno index, c.f. Pal-Bezdek (1995), Hammah and Curran (2000), Wu and \nYang, 2005; cf. section 10.4.3 in Xue and Wunsch (2008). Wang and Zhang (2007) \nperformed an extensive evaluation of the fuzzy clustering indices, while Zhang et al. 2008 \ntested a newly proposed index. They conclude that cluster validation is a very difficult task \nand that ‘no matter how good your index is, there is a dataset out there waiting to trick it (and \nyou)’ (Pal and Bezdek (1997)). Wu et al. (2009) recently analyse the robustness of the cluster \nindices for noise and outliers, and propose ways to robustify them.\n\nPIK Report-Reference: \nNo. 1 \n3. Deutsche Klimatagung, Potsdam 11.-14. April 1994 \nTagungsband der Vorträge und Poster (April 1994) \nNo. 2 \nExtremer Nordsommer '92 \nMeteorologische Ausprägung, Wirkungen auf naturnahe und vom Menschen beeinflußte \nÖkosysteme, gesellschaftliche Perzeption und situationsbezogene politisch-administrative bzw. \nindividuelle Maßnahmen (Vol. 1 - Vol. 4) \nH.-J. Schellnhuber, W. Enke, M. Flechsig (Mai 1994) \nNo. 3 \nUsing Plant Functional Types in a Global Vegetation Model \nW. Cramer (September 1994) \nNo. 4 \nInterannual variability of Central European climate parameters and their relation to the large-\nscale circulation \nP. C. Werner (Oktober 1994) \nNo. 5 \nCoupling Global Models of Vegetation Structure and Ecosystem Processes - An Example from \nArctic and Boreal Ecosystems \nM. Plöchl, W. Cramer (Oktober 1994) \nNo. 6 \nThe use of a European forest model in North America: A study of ecosystem response to \nclimate gradients \nH. Bugmann, A. Solomon (Mai 1995) \nNo. 7 \nA comparison of forest gap models: Model structure and behaviour \nH. Bugmann, Y. Xiaodong, M. T. Sykes, Ph. Martin, M. Lindner, P. V. Desanker, \nS. G. 
Cumming (Mai 1995) \nNo. 8 \nSimulating forest dynamics in complex topography using gridded climatic data \nH. Bugmann, A. Fischlin (Mai 1995) \nNo. 9 \nApplication of two forest succession models at sites in Northeast Germany \nP. Lasch, M. Lindner (Juni 1995) \nNo. 10 \nApplication of a forest succession model to a continentality gradient through Central Europe \nM. Lindner, P. Lasch, W. Cramer (Juni 1995) \nNo. 11 \nPossible Impacts of global warming on tundra and boreal forest ecosystems - Comparison of \nsome biogeochemical models \nM. Plöchl, W. Cramer (Juni 1995) \nNo. 12 \nWirkung von Klimaveränderungen auf Waldökosysteme \nP. Lasch, M. Lindner (August 1995) \nNo. 13 \nMOSES - Modellierung und Simulation ökologischer Systeme - Eine Sprachbeschreibung mit \nAnwendungsbeispielen \nV. Wenzel, M. Kücken, M. Flechsig (Dezember 1995) \nNo. 14 \nTOYS - Materials to the Brandenburg biosphere model / GAIA \nPart 1 - Simple models of the \"Climate + Biosphere\" system \nYu. Svirezhev (ed.), A. Block, W. v. Bloh, V. Brovkin, A. Ganopolski, V. Petoukhov, \nV. Razzhevaikin (Januar 1996) \nNo. 15 \nÄnderung von Hochwassercharakteristiken im Zusammenhang mit Klimaänderungen - Stand  \nder Forschung \nA. Bronstert (April 1996) \nNo. 16 \nEntwicklung eines Instruments zur Unterstützung der klimapolitischen Entscheidungsfindung \nM. Leimbach (Mai 1996) \nNo. 17 \nHochwasser in Deutschland unter Aspekten globaler Veränderungen - Bericht über das DFG-\nRundgespräch am 9. Oktober 1995 in Potsdam \nA. Bronstert (ed.) (Juni 1996) \nNo. 18 \nIntegrated modelling of hydrology and water quality in mesoscale watersheds \nV. Krysanova, D.-I. Müller-Wohlfeil, A. Becker (Juli 1996) \nNo. 19 \nIdentification of vulnerable subregions in the Elbe drainage basin under global change impact \nV. Krysanova, D.-I. Müller-Wohlfeil, W. Cramer, A. Becker (Juli 1996) \nNo. 20 \nSimulation of soil moisture patterns using a topography-based model at different scales \nD.-I. Müller-Wohlfeil, W. 
Lahmer, W. Cramer, V. Krysanova (Juli 1996) \nNo. 21 \nInternational relations and global climate change \nD. Sprinz, U. Luterbacher (1st ed. July, 2nd ed. December 1996) \nNo. 22 \nModelling the possible impact of climate change on broad-scale vegetation structure - \nexamples from Northern Europe \nW. Cramer (August 1996)\n\nNo. 23 \nA method to estimate the statistical security for cluster separation \nF.-W. Gerstengarbe, P. C. Werner (Oktober 1996) \nNo. 24 \nImproving the behaviour of forest gap models along drought gradients \nH. Bugmann, W. Cramer (Januar 1997) \nNo. 25 \nThe development of climate scenarios \nP. C. Werner, F.-W. Gerstengarbe (Januar 1997) \nNo. 26 \nOn the Influence of Southern Hemisphere Winds on North Atlantic Deep Water Flow \nS. Rahmstorf, M. H. England (Januar 1997) \nNo. 27 \nIntegrated systems analysis at PIK: A brief epistemology \nA. Bronstert, V. Brovkin, M. Krol, M. Lüdeke, G. Petschel-Held, Yu. Svirezhev, V. Wenzel \n(März 1997) \nNo. 28 \nImplementing carbon mitigation measures in the forestry sector - A review \nM. Lindner (Mai 1997) \nNo. 29 \nImplementation of a Parallel Version of a Regional Climate Model \nM. Kücken, U. Schättler (Oktober 1997) \nNo. 30 \nComparing global models of terrestrial net primary productivity (NPP): Overview and key results \nW. Cramer, D. W. Kicklighter, A. Bondeau, B. Moore III, G. Churkina, A. Ruimy, A. Schloss, \nparticipants of \"Potsdam '95\" (Oktober 1997) \nNo. 31 \nComparing global models of terrestrial net primary productivity (NPP): Analysis of the seasonal \nbehaviour of NPP, LAI, FPAR along climatic gradients across ecotones \nA. Bondeau, J. Kaduk, D. W. Kicklighter, participants of \"Potsdam '95\" (Oktober 1997) \nNo. 32 \nEvaluation of the physiologically-based forest growth model FORSANA \nR. Grote, M. Erhard, F. Suckow (November 1997) \nNo. 33 \nModelling the Global Carbon Cycle for the Past and Future Evolution of the Earth System \nS. Franck, K. Kossacki, Ch. 
Bounama (Dezember 1997) \nNo. 34 \nSimulation of the global bio-geophysical interactions during the Last Glacial Maximum \nC. Kubatzki, M. Claussen (Januar 1998) \nNo. 35 \nCLIMBER-2: A climate system model of intermediate complexity. Part I: Model description and \nperformance for present climate \nV. Petoukhov, A. Ganopolski, V. Brovkin, M. Claussen, A. Eliseev, C. Kubatzki, S. Rahmstorf \n(Februar 1998) \nNo. 36 \nGeocybernetics: Controlling a rather complex dynamical system under uncertainty \nH.-J. Schellnhuber, J. Kropp (Februar 1998) \nNo. 37 \nUntersuchung der Auswirkungen erhöhter atmosphärischer CO2-Konzentrationen auf \nWeizenbestände des Free-Air Carbondioxid Enrichment (FACE) - Experimentes Maricopa \n(USA) \nT. Kartschall, S. Grossman, P. Michaelis, F. Wechsung, J. Gräfe, K. Waloszczyk, \nG. Wechsung, E. Blum, M. Blum (Februar 1998) \nNo. 38 \nDie Berücksichtigung natürlicher Störungen in der Vegetationsdynamik verschiedener \nKlimagebiete \nK. Thonicke (Februar 1998) \nNo. 39 \nDecadal Variability of the Thermohaline Ocean Circulation \nS. Rahmstorf (März 1998) \nNo. 40 \nSANA-Project results and PIK contributions \nK. Bellmann, M. Erhard, M. Flechsig, R. Grote, F. Suckow (März 1998) \nNo. 41 \nUmwelt und Sicherheit: Die Rolle von Umweltschwellenwerten in der empirisch-quantitativen \nModellierung \nD. F. Sprinz (März 1998) \nNo. 42 \nReversing Course: Germany's Response to the Challenge of Transboundary Air Pollution \nD. F. Sprinz, A. Wahl (März 1998) \nNo. 43 \nModellierung des Wasser- und Stofftransportes in großen Einzugsgebieten. Zusammenstellung \nder Beiträge des Workshops am 15. Dezember 1997 in Potsdam \nA. Bronstert, V. Krysanova, A. Schröder, A. Becker, H.-R. Bork (eds.) (April 1998) \nNo. 44 \nCapabilities and Limitations of Physically Based Hydrological Modelling on the Hillslope Scale \nA. Bronstert (April 1998) \nNo. 45 \nSensitivity Analysis of a Forest Gap Model Concerning Current and Future Climate Variability \nP. Lasch, F. 
Suckow, G. Bürger, M. Lindner (Juli 1998) \nNo. 46 \nWirkung von Klimaveränderungen in mitteleuropäischen Wirtschaftswäldern \nM. Lindner (Juli 1998)\n\nNo. 47 \nSPRINT-S: A Parallelization Tool for Experiments with Simulation Models \nM. Flechsig (Juli 1998) \nNo. 48 \nThe Odra/Oder Flood in Summer 1997: Proceedings of the European Expert Meeting in \nPotsdam, 18 May 1998 \nA. Bronstert, A. Ghazi, J. Hladny, Z. Kundzewicz, L. Menzel (eds.) (September 1998) \nNo. 49 \nStruktur, Aufbau und statistische Programmbibliothek der meteorologischen Datenbank am \nPotsdam-Institut für Klimafolgenforschung \nH. Österle, J. Glauer, M. Denhard (Januar 1999) \nNo. 50 \nThe complete non-hierarchical cluster analysis \nF.-W. Gerstengarbe, P. C. Werner (Januar 1999) \nNo. 51 \nStruktur der Amplitudengleichung des Klimas \nA. Hauschild (April 1999) \nNo. 52 \nMeasuring the Effectiveness of International Environmental Regimes \nC. Helm, D. F. Sprinz (Mai 1999) \nNo. 53 \nUntersuchung der Auswirkungen erhöhter atmosphärischer CO2-Konzentrationen innerhalb des \nFree-Air Carbon Dioxide Enrichment-Experimentes: Ableitung allgemeiner Modellösungen \nT. Kartschall, J. Gräfe, P. Michaelis, K. Waloszczyk, S. Grossman-Clarke (Juni 1999) \nNo. 54 \nFlächenhafte Modellierung der Evapotranspiration mit TRAIN \nL. Menzel (August 1999) \nNo. 55 \nDry atmosphere asymptotics \nN. Botta, R. Klein, A. Almgren (September 1999) \nNo. 56 \nWachstum von Kiefern-Ökosystemen in Abhängigkeit von Klima und Stoffeintrag - Eine \nregionale Fallstudie auf Landschaftsebene \nM. Erhard (Dezember 1999) \nNo. 57 \nResponse of a River Catchment to Climatic Change: Application of Expanded Downscaling to \nNorthern Germany \nD.-I. Müller-Wohlfeil, G. Bürger, W. Lahmer (Januar 2000) \nNo. 58 \nDer \"Index of Sustainable Economic Welfare\" und die Neuen Bundesländer in der \nÜbergangsphase \nV. Wenzel, N. Herrmann (Februar 2000) \nNo. 
59 \nWeather Impacts on Natural, Social and Economic Systems (WISE, ENV4-CT97-0448) \nGerman report \nM. Flechsig, K. Gerlinger, N. Herrmann, R. J. T. Klein, M. Schneider, H. Sterr, H.-J. Schellnhuber \n(Mai 2000) \nNo. 60 \nThe Need for De-Aliasing in a Chebyshev Pseudo-Spectral Method \nM. Uhlmann (Juni 2000) \nNo. 61 \nNational and Regional Climate Change Impact Assessments in the Forestry Sector \n- Workshop Summary and Abstracts of Oral and Poster Presentations \nM. Lindner (ed.) (Juli 2000) \nNo. 62 \nBewertung ausgewählter Waldfunktionen unter Klimaänderung in Brandenburg \nA. Wenzel (August 2000) \nNo. 63 \nEine Methode zur Validierung von Klimamodellen für die Klimawirkungsforschung hinsichtlich \nder Wiedergabe extremer Ereignisse \nU. Böhm (September 2000) \nNo. 64 \nDie Wirkung von erhöhten atmosphärischen CO2-Konzentrationen auf die Transpiration eines \nWeizenbestandes unter Berücksichtigung von Wasser- und Stickstofflimitierung \nS. Grossman-Clarke (September 2000) \nNo. 65 \nEuropean Conference on Advances in Flood Research, Proceedings, (Vol. 1 - Vol. 2) \nA. Bronstert, Ch. Bismuth, L. Menzel (eds.) (November 2000) \nNo. 66 \nThe Rising Tide of Green Unilateralism in World Trade Law - Options for Reconciling the \nEmerging North-South Conflict \nF. Biermann (Dezember 2000) \nNo. 67 \nCoupling Distributed Fortran Applications Using C++ Wrappers and the CORBA Sequence  \nType \nT. Slawig (Dezember 2000) \nNo. 68 \nA Parallel Algorithm for the Discrete Orthogonal Wavelet Transform \nM. Uhlmann (Dezember 2000) \nNo. 69 \nSWIM (Soil and Water Integrated Model), User Manual \nV. Krysanova, F. Wechsung, J. Arnold, R. Srinivasan, J. Williams (Dezember 2000)\n\nNo. 70 \nStakeholder Successes in Global Environmental Management, Report of Workshop, \nPotsdam, 8 December 2000 \nM. Welp (ed.) (April 2001) \nNo. 71 \nGIS-gestützte Analyse globaler Muster anthropogener Waldschädigung - Eine sektorale \nAnwendung des Syndromkonzepts \nM. 
Cassel-Gintz (Juni 2001) \nNo. 72 \nWavelets Based on Legendre Polynomials \nJ. Fröhlich, M. Uhlmann (Juli 2001) \nNo. 73 \nDer Einfluß der Landnutzung auf Verdunstung und Grundwasserneubildung - Modellierungen \nund Folgerungen für das Einzugsgebiet des Glan \nD. Reichert (Juli 2001) \nNo. 74 \nWeltumweltpolitik - Global Change als Herausforderung für die deutsche Politikwissenschaft \nF. Biermann, K. Dingwerth (Dezember 2001) \nNo. 75 \nAngewandte Statistik - PIK-Weiterbildungsseminar 2000/2001 \nF.-W. Gerstengarbe (Hrsg.) (März 2002) \nNo. 76 \nZur Klimatologie der Station Jena \nB. Orlowsky (September 2002) \nNo. 77 \nLarge-Scale Hydrological Modelling in the Semi-Arid North-East of Brazil \nA. Güntner (September 2002) \nNo. 78 \nPhenology in Germany in the 20th Century: Methods, Analyses and Models \nJ. Schaber (November 2002) \nNo. 79 \nModelling of Global Vegetation Diversity Pattern \nI. Venevskaia, S. Venevsky (Dezember 2002) \nNo. 80 \nProceedings of the 2001 Berlin Conference on the Human Dimensions of Global Environmental \nChange “Global Environmental Change and the Nation State” \nF. Biermann, R. Brohm, K. Dingwerth (eds.) (Dezember 2002) \nNo. 81 \nPOTSDAM - A Set of Atmosphere Statistical-Dynamical Models: Theoretical Background \nV. Petoukhov, A. Ganopolski, M. Claussen (März 2003) \nNo. 82 \nSimulation der Siedlungsflächenentwicklung als Teil des Globalen Wandels und ihr Einfluß auf \nden Wasserhaushalt im Großraum Berlin \nB. Ströbl, V. Wenzel, B. Pfützner (April 2003) \nNo. 83 \nStudie zur klimatischen Entwicklung im Land Brandenburg bis 2055 und deren Auswirkungen \nauf den Wasserhaushalt, die Forst- und Landwirtschaft sowie die Ableitung erster Perspektiven \nF.-W. Gerstengarbe, F. Badeck, F. Hattermann, V. Krysanova, W. Lahmer, P. Lasch, M. Stock, \nF. Suckow, F. Wechsung, P. C. Werner (Juni 2003) \nNo. 84 \nWell Balanced Finite Volume Methods for Nearly Hydrostatic Flows \nN. Botta, R. Klein, S. Langenberg, S. 
Lützenkirchen (August 2003) \nNo. 85 \nOrts- und zeitdiskrete Ermittlung der Sickerwassermenge im Land Brandenburg auf der Basis \nflächendeckender Wasserhaushaltsberechnungen \nW. Lahmer, B. Pfützner (September 2003) \nNo. 86 \nA Note on Domains of Discourse - Logical Know-How for Integrated Environmental Modelling, \nVersion of October 15, 2003 \nC. C. Jaeger (Oktober 2003) \nNo. 87 \nHochwasserrisiko im mittleren Neckarraum - Charakterisierung unter Berücksichtigung \nregionaler Klimaszenarien sowie dessen Wahrnehmung durch befragte Anwohner \nM. Wolff (Dezember 2003) \nNo. 88 \nAbflußentwicklung in Teileinzugsgebieten des Rheins - Simulationen für den Ist-Zustand und für \nKlimaszenarien \nD. Schwandt (April 2004) \nNo. 89 \nRegionale Integrierte Modellierung der Auswirkungen von Klimaänderungen am Beispiel des \nsemi-ariden Nordostens von Brasilien \nA. Jaeger (April 2004) \nNo. 90 \nLebensstile und globaler Energieverbrauch - Analyse und Strategieansätze zu einer \nnachhaltigen Energiestruktur \nF. Reusswig, K. Gerlinger, O. Edenhofer (Juli 2004) \nNo. 91 \nConceptual Frameworks of Adaptation to Climate Change and their Applicability to Human \nHealth \nH.-M. Füssel, R. J. T. Klein (August 2004)\n\nNo. 92 \nDouble Impact - The Climate Blockbuster ’The Day After Tomorrow’ and its Impact on the \nGerman Cinema Public \nF. Reusswig, J. Schwarzkopf, P. Polenz (Oktober 2004)  \nNo. 93 \nHow Much Warming are we Committed to and How Much Can be Avoided? \nB. Hare, M. Meinshausen (Oktober 2004) \nNo. 94 \nUrbanised Territories as a Specific Component of the Global Carbon Cycle \nA. Svirejeva-Hopkins, H.-J. Schellnhuber (Januar 2005) \nNo. 95 \nGLOWA-Elbe I - Integrierte Analyse der Auswirkungen des globalen Wandels auf Wasser, \nUmwelt und Gesellschaft im Elbegebiet \nF. Wechsung, A. Becker, P. Gräfe (Hrsg.) (April 2005) \nNo. 96 \nThe Time Scales of the Climate-Economy Feedback and the Climatic Cost of Growth \nS. Hallegatte (April 2005) \nNo. 
97 \nA New Projection Method for the Zero Froude Number Shallow Water Equations \nS. Vater (Juni 2005) \nNo. 98 \nTable of EMICs - Earth System Models of Intermediate Complexity \nM. Claussen (ed.) (Juli 2005) \nNo. 99 \nKLARA - Klimawandel - Auswirkungen, Risiken, Anpassung \nM. Stock (Hrsg.) (Juli 2005) \nNo. 100 \nKatalog der Großwetterlagen Europas (1881-2004) nach Paul Hess und Helmut Brezowsky \n6., verbesserte und ergänzte Auflage \nF.-W. Gerstengarbe, P. C. Werner (September 2005) \nNo. 101 \nAn Asymptotic, Nonlinear Model for Anisotropic, Large-Scale Flows in the Tropics \nS. Dolaptchiev (September 2005) \nNo. 102 \nA Long-Term Model of the German Economy: lagomd_sim \nC. C. Jaeger (Oktober 2005) \nNo. 103 \nStructuring Distributed Relation-Based Computations with SCDRC \nN. Botta, C. Ionescu, C. Linstead, R. Klein (Oktober 2006) \nNo. 104 \nDevelopment of Functional Irrigation Types for Improved Global Crop Modelling \nJ. Rohwer, D. Gerten, W. Lucht (März 2007) \nNo. 105 \nIntra-Regional Migration in Formerly Industrialised Regions: Qualitative Modelling of Household \nLocation Decisions as an Input to Policy and Plan Making in Leipzig/Germany and \nWirral/Liverpool/UK \nD. Reckien (April 2007) \nNo. 106 \nPerspektiven der Klimaänderung bis 2050 für den Weinbau in Deutschland (Klima 2050) - \nSchlußbericht zum FDW-Vorhaben: Klima 2050 \nM. Stock, F. Badeck, F.-W. Gerstengarbe, D. Hoppmann, T. Kartschall, H. Österle, P. C. Werner, \nM. Wodinski (Juni 2007) \nNo. 107 \nClimate Policy in the Coming Phases of the Kyoto Process: Targets, Instruments, and the Role \nof Cap and Trade Schemes - Proceedings of the International Symposium, February 20-21, \n2006, Brussels \nM. Welp, L. Wicke, C. C. Jaeger (eds.) (Juli 2007) \nNo. 108 \nCorrelation Analysis of Climate Variables and Wheat Yield Data on Various Aggregation Levels \nin Germany and the EU-15 Using GIS and Statistical Methods, with a Focus on Heat Wave \nYears \nT. Sterzel (Juli 2007) \nNo. 
109 \nMOLOCH - Ein Strömungsverfahren für inkompressible Strömungen - Technische Referenz 1.0 \nM. Münch (Januar 2008) \nNo. 110 \nRationing & Bayesian Expectations with Application to the Labour Market \nH. Förster (Februar 2008) \nNo. 111 \nFinding a Pareto-Optimal Solution for Multi-Region Models Subject to Capital Trade and  \nSpillover Externalities \nM. Leimbach, K. Eisenack (November 2008) \nNo. 112 \nDie Ertragsfähigkeit ostdeutscher Ackerflächen unter Klimawandel \nF. Wechsung, F.-W. Gerstengarbe, P. Lasch, A. Lüttger (Hrsg.) (Dezember 2008) \nNo. 113 \nKlimawandel und Kulturlandschaft Berlin \nH. Lotze-Campen, L. Claussen, A. Dosch, S. Noleppa, J. Rock, J. Schuler, G. Uckert  \n(Juni 2009) \nNo. 114 \nDie landwirtschaftliche Bewässerung in Ostdeutschland seit 1949 - Eine historische Analyse vor \ndem Hintergrund des Klimawandels \nM. Simon (September 2009)\n\nNo. 115 \nContinents under Climate Change - Conference on the Occasion of the 200th Anniversary of the \nHumboldt-Universität zu Berlin, Abstracts of Lectures and Posters of the Conference, \nApril 21-23, 2010, Berlin \nW. Endlicher, F.-W. Gerstengarbe (eds.) (April 2010) \nNo. 116 \nNach Kopenhagen: Neue Strategie zur Realisierung des 2°max-Klimazieles \nL. Wicke, H. J. Schellnhuber, D. Klingenfeld (April 2010) \nNo. 117 \nEvaluating Global Climate Policy - Taking Stock and Charting a New Way Forward \nD. Klingenfeld (April 2010) \nNo. 118 \nUntersuchungen zu anthropogenen Beeinträchtigungen der Wasserstände am Pegel \nMagdeburg-Strombrücke \nM. Simon (September 2010) \nNo. 119 \nKatalog der Großwetterlagen Europas (1881-2009) nach Paul Hess und Helmut Brezowsky \n7., verbesserte und ergänzte Auflage \nP. C. Werner, F.-W. Gerstengarbe (Oktober 2010) \nNo. 120 \nEnergy taxes, resource taxes and quantity rationing for climate protection \nK. Eisenack, O. Edenhofer, M. Kalkuhl (November 2010) \nNo. 121 \nKlimawandel in der Region Havelland-Fläming \nA. Lüttger, F.-W. Gerstengarbe, M. Gutsch, F. 
Hattermann, P. Lasch, A. Murawski, \nJ. Petraschek, F. Suckow, P. C. Werner (Januar 2011) \nNo. 122 \nAdaptation to Climate Change in the Transport Sector: A Review \nK. Eisenack, R. Stecker, D. Reckien, E. Hoffmann (Mai 2011) \nNo. 123 \nSpatial-temporal changes of meteorological parameters in selected circulation patterns \nP. C. Werner, F.-W. Gerstengarbe (November 2011) \nNo. 124 \nAssessment of Trade-off Decisions for Sustainable Bioenergy Development in the Philippines: \nAn Application of Conjoint Analysis  \nL. A. Acosta, D. B. Magcale-Macandog, W. Lucht, K. G. Engay, M. N. Q. Herrera, \nO. B. S. Nicopior, M. I. V. Sumilang, V. Espaldon (November 2011) \nNo. 125 \nHistorisch vereinbarte minimale mittlere Monatsabflüsse der Elbe im tschechisch-deutschen \nGrenzprofil bei Hřensko/Schöna – Eine Analyse der Niedrigwasseraufhöhung im Grenzprofil \ninfolge des Talsperrenbaus im tschechischen Einzugsgebiet der Elbe \nM. Simon, J. Böhme (März 2012) \nNo. 126 \nCluster Analysis to Understand Socio-Ecological Systems: A Guideline \nP. Janssen, C. Walther, M. Lüdeke (September 2012)"}},{"section":"Evidence","id":"sl-1267680-zaxgg","mode":"approved","raw_type":"section-title","component_id":null,"component_props":null,"has_approved":true,"has_discovery":false,"raw":{"id":"sl-1267680-zaxgg","mode":"approved","approved":{"id":"sl-1267680-zaxgg","sub":"0 claims · 1 passages retrieved","type":"section-title","headline":"Evidence","sectionNum":"2","speaker_notes":""},"speaker_notes":""}},{"section":"Evidence","id":"sl-1267680-mj0di","mode":"approved","raw_type":"bullets","component_id":null,"component_props":null,"has_approved":true,"has_discovery":false,"raw":{"id":"sl-1267680-mj0di","mode":"approved","approved":{"id":"sl-1267680-mj0di","type":"bullets","bullets":["PIK  Report\nNo. 
126\nFOR\nPOTSDAM INSTITUTE\nCLIMATE IMPACT RESEARCH (PIK)\nCLUSTER ANALYSIS TO UNDERSTAND\nSOCIO-ECOLOGICAL …"],"headline":"Key Passages","speaker_notes":"- : "},"speaker_notes":"- : "}},{"section":"Summary","id":"sl-1267680-3k2vs","mode":"approved","raw_type":"bullets","component_id":null,"component_props":null,"has_approved":true,"has_discovery":false,"raw":{"id":"sl-1267680-3k2vs","mode":"approved","approved":{"id":"sl-1267680-3k2vs","type":"bullets","bullets":["Query: Test Content 1","Sources retrieved: 1","Question type: semantic","Generated by EngineHouse Interface"],"headline":"Summary","speaker_notes":"Full query: Test Content 1"},"speaker_notes":"Full query: Test Content 1"}}],"static_registry":{"stat-card":{"component_id":"stat-card","shadcn_name":"Card","radix_primitive":null,"slot_type":"chart","description":"Large statistic with label and optional trend indicator","props":{"value":{"type":"string","required":true,"example":"40%"},"label":{"type":"string","required":true,"example":"El Niño probability by May–July"},"trend":{"type":"string","required":false,"example":"+12% vs last forecast"},"context":{"type":"string","required":false,"example":"WMO GPC, Feb 2026"}}},"data-table":{"component_id":"data-table","shadcn_name":"Table","radix_primitive":null,"slot_type":"chart","description":"Structured data table with headers and rows","props":{"columns":{"type":"string[]","required":true,"example":["Period","La Niña %","ENSO-Neutral %","El Niño %"]},"rows":{"type":"string[][]","required":true,"example":[["Mar–May 2026","30","60","10"]]},"caption":{"type":"string","required":false,"example":"Source: WMO GPC"}}},"comparison-tabs":{"component_id":"comparison-tabs","shadcn_name":"Tabs","radix_primitive":"Radix Tabs (@radix-ui/react-tabs)","slot_type":"comparison_block","description":"Side-by-side scenario comparison with tabbed navigation","props":{"tabs":{"type":"Array<{label, content}>","required":true,"example":[{"label":"El Niño scenario","content":"Warmer 
SSTs, drought in SE Asia, flooding in South America"},{"label":"ENSO-Neutral","content":"Near-average conditions, reduced extreme weather probability"}]}}},"blockquote-card":{"component_id":"blockquote-card","shadcn_name":"Card + Separator","radix_primitive":"Radix Separator (@radix-ui/react-separator)","slot_type":"quote","description":"Pull quote with attribution and source","props":{"quote":{"type":"string","required":true,"example":"The climate system's uncertainty is becoming our economic uncertainty."},"attribution":{"type":"string","required":false,"example":"WMO El Niño/La Niña Update, February 2026"},"source_url":{"type":"string","required":false,"example":"https://wmo.int/..."}}},"alert-callout":{"component_id":"alert-callout","shadcn_name":"Alert","radix_primitive":null,"slot_type":"diagram","description":"Highlighted alert or key insight callout","props":{"title":{"type":"string","required":true,"example":"Critical Transition Window"},"description":{"type":"string","required":true,"example":"ENSO shifts are most predictable 3–4 months ahead; planning must begin now."},"variant":{"type":"string","required":false,"example":"destructive | default"}}},"progress-indicator":{"component_id":"progress-indicator","shadcn_name":"Progress + Badge","radix_primitive":"Radix Progress (@radix-ui/react-progress)","slot_type":"chart","description":"Probability or progress bar with percentage label","props":{"items":{"type":"Array<{label, value, color}>","required":true,"example":[{"label":"ENSO-Neutral","value":60,"color":"blue"},{"label":"El Niño","value":40,"color":"orange"},{"label":"La Niña","value":30,"color":"green"}]}}},"accordion-qa":{"component_id":"accordion-qa","shadcn_name":"Accordion","radix_primitive":"Radix Accordion (@radix-ui/react-accordion)","slot_type":"questionnaire","description":"Expandable Q&A or questionnaire with multiple items","props":{"items":{"type":"Array<{question, answer}>","required":true,"example":[{"question":"What drives food price 
volatility?","answer":"ENSO-driven crop failure uncertainty..."},{"question":"Which regions are most exposed?","answer":"SE Asia, East Africa, and South America..."}]}}},"image-card":{"component_id":"image-card","shadcn_name":"Card","radix_primitive":null,"slot_type":"image","description":"Image with caption and optional credit","props":{"imageUrl":{"type":"string","required":true,"example":"https://..."},"imageAlt":{"type":"string","required":true,"example":"ENSO probability map"},"caption":{"type":"string","required":false,"example":"WMO ENSO probability chart, February 2026"},"credit":{"type":"string","required":false,"example":"© WMO"}}},"badge-list":{"component_id":"badge-list","shadcn_name":"Badge","radix_primitive":null,"slot_type":"diagram","description":"Grouped labelled badges for entities, signals, or categories","props":{"headline":{"type":"string","required":true,"example":"Active Climate Signals"},"items":{"type":"Array<{label, variant}>","required":true,"example":[{"label":"La Niña fading","variant":"secondary"},{"label":"El Niño risk rising","variant":"destructive"},{"label":"ENSO-neutral likely","variant":"default"}]}}}}}