Recent Question/Assignment
I have an assignment for an Advanced Data Analytics course. The assignment should be done in R using RStudio.
Cluster Analysis
(hierarchical & non-hierarchical)
• Grouping/clustering similar objects/cases (or also variables) into groups.
• Homogeneous/heterogeneous groups?
• Segments? - Segmentation
• Profiles?
• Grouping variables?
[see also: N. K. Malhotra & D. F. Birks, 2007, Marketing Research: An Applied Approach (Chapter 23: Cluster Analysis), 3rd European Edition, Prentice Hall, Inc., Pearson Education Limited, Essex, England.]
Modul University Vienna
Aim
• Objects or variables are clustered into homogeneous groups: members of a group are similar to each other and dissimilar to members of other groups.
• Group/cluster membership is not known in advance. There is no a priori information. A data-driven grouping solution is produced.
• With hierarchical clustering the number of clusters is not fixed in advance but is selected after the procedure. With non-hierarchical clustering the number of clusters has to be pre-specified. Different solutions should be compared.
• The optimum result for k clusters is not necessarily the same as the hierarchical result for the kth step.
• The result may depend heavily on the procedure chosen!
• You will always get some cluster solution, even if there are no reasonable clusters!
Importing data in R: .csv-files
Locate the file and enter the path and file name to import the dataset
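A minimal sketch of the import step, assuming a comma-separated file with a header row. The file name `activities.csv` and the column names `culture` and `sports` are hypothetical; the small demo file is written to a temporary folder only so that the example runs on its own.

```r
# Hypothetical file name and columns; replace csv_path with the path to your own file.
csv_path <- file.path(tempdir(), "activities.csv")

# Demo file so that the sketch is self-contained (not the course data).
write.csv(data.frame(culture = c(2, 10, 3), sports = c(1, 2, 9)),
          csv_path, row.names = FALSE)

# Import the dataset: first row holds the variable names, comma as separator.
activities <- read.csv(csv_path, header = TRUE)

str(activities)    # variable names and types
head(activities)   # first rows
```

In RStudio the path can also be picked interactively with `file.choose()` and passed to `read.csv()`.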
Scatterplot
How many cultural and sporty activities would you plan for a one month trip?
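A scatterplot of the two clustering variables could be drawn as sketched below. The data are simulated stand-ins for the 39 survey answers, and the column names `culture` and `sports` are assumptions.

```r
# Simulated stand-in for the survey data (39 respondents; not the course dataset).
set.seed(123)
activities <- data.frame(
  culture = c(rnorm(15, 2, 1), rnorm(12, 10, 1), rnorm(12, 2, 1)),
  sports  = c(rnorm(15, 2, 1), rnorm(12, 2, 1), rnorm(12, 10, 1))
)

# One point per respondent: planned cultural vs. sports activities.
plot(activities$culture, activities$sports,
     xlab = "Cultural activities", ylab = "Sports activities",
     main = "Planned activities for a one-month trip")
```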
Optional: Standardization
If the variables used for cluster analysis are measured on different scales, they have to be standardized beforehand (z-scores are used most frequently). Otherwise differences in measurement scale may influence the result!
Standardization: the mean is subtracted from every observation and the difference is divided by the standard deviation, z = (x - mean) / sd.
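As an illustration, z-scores can be computed with base R's `scale()`, which uses the sample standard deviation. The data below are simulated and the column names are assumptions.

```r
# Simulated stand-in data (column names are assumptions).
set.seed(123)
activities <- data.frame(
  culture = c(rnorm(15, 2, 1), rnorm(12, 10, 1), rnorm(12, 2, 1)),
  sports  = c(rnorm(15, 2, 1), rnorm(12, 2, 1), rnorm(12, 10, 1))
)

# scale() subtracts the column mean and divides by the column standard deviation.
activities_z <- scale(activities)

# The same calculation by hand for one variable.
manual_z <- (activities$culture - mean(activities$culture)) / sd(activities$culture)

round(colMeans(activities_z), 10)   # means of the z-scores
apply(activities_z, 2, sd)          # standard deviations of the z-scores
```

After standardization every variable has mean 0 and standard deviation 1, so no variable dominates the distance calculation because of its scale.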
Hierarchical clustering procedure
The clustering procedure for hierarchical clustering can be
• agglomerative: every object starts in a separate cluster, and clusters are merged into bigger and bigger clusters until all objects are in one cluster
• or divisive: a single cluster with all objects is split up until all objects are in separate clusters (also see Dendrogram)
Linkage methods:
• Single linkage = nearest neighbour
• Complete linkage = farthest neighbour
• Average linkage = average distance between all pairs
• Centroid method = distance between cluster centroids
• Variance methods (minimize within-cluster variance), e.g. Ward's method (most frequently used!): combines the clusters with the smallest increase in the overall sum of squared distances to the cluster means
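The linkage methods listed above correspond to the `method` argument of `hclust()`. The sketch below compares the three-cluster solutions on simulated data; the data and column names are assumptions, and `ward.D2` on unsquared Euclidean distances is one common way to obtain Ward's criterion in R.

```r
# Simulated stand-in data (column names are assumptions).
set.seed(123)
activities <- data.frame(
  culture = c(rnorm(15, 2, 1), rnorm(12, 10, 1), rnorm(12, 2, 1)),
  sports  = c(rnorm(15, 2, 1), rnorm(12, 2, 1), rnorm(12, 10, 1))
)

d <- dist(activities)   # Euclidean distances between all pairs of objects
methods <- c("single", "complete", "average", "centroid", "ward.D2")

# Cluster sizes of the three-cluster solution for every linkage method.
sizes <- sapply(methods, function(m) {
  # Centroid linkage is meant to be used with squared Euclidean distances.
  dd <- if (m == "centroid") d^2 else d
  as.vector(table(cutree(hclust(dd, method = m), k = 3)))
})
sizes
```

On clearly separated groups the methods tend to agree; on real data the sizes can differ noticeably between methods.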
Hierarchical clustering
Distance measure
• Similarity is determined by the distance between groups
• Default: squared Euclidean distance, the most frequently used measure for interval-scaled data:
$$ d(X, Y) = \sum_{i=1}^{v} (X_i - Y_i)^2 $$
(v = number of variables, X and Y are the objects to be compared)
• Various alternative distance measures are available for interval, count or binary data, e.g. the city-block or Manhattan distance (sum of absolute distances); for binary data: -distance
Depending on the chosen distance measure results may change!
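A small worked example of the two distance measures named above, using three made-up objects with two variables each. `dist()` returns unsquared Euclidean distances by default, so the result is squared here.

```r
# Three made-up objects (rows) measured on two variables (columns).
x <- rbind(a = c(1, 2), b = c(4, 6), c = c(1, 3))

d_euclid  <- dist(x, method = "euclidean")   # square root of the sum of squared differences
d_squared <- d_euclid^2                      # squared Euclidean distance
d_city    <- dist(x, method = "manhattan")   # sum of absolute differences

as.matrix(d_squared)
as.matrix(d_city)
```

For objects a and b the squared Euclidean distance is (1 - 4)^2 + (2 - 6)^2 = 25 and the city-block distance is |1 - 4| + |2 - 6| = 7.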
Perform cluster analysis
Agglomeration schedule
• X1 and X2: the two clusters merged at this stage. A negative value refers to a single observation that is merged at this stage (singleton agglomeration). A positive value refers to a cluster that was formed at a former stage of the algorithm (non-singleton); the value is the number of that stage.
• Cluster height: the value of the criterion used for the agglomeration procedure (here the squared Euclidean distance).
• One can observe a dramatic increase at step 37. Further collapsing the three clusters into two will be problematic.
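The agglomeration schedule can be assembled from the `merge` and `height` components of an `hclust` object, as sketched below. The data are simulated, and the combination of squared Euclidean distances with `method = "ward.D"` is an assumption about how the slide output was produced.

```r
# Simulated stand-in data (39 objects; column names are assumptions).
set.seed(123)
activities <- data.frame(
  culture = c(rnorm(15, 2, 1), rnorm(12, 10, 1), rnorm(12, 2, 1)),
  sports  = c(rnorm(15, 2, 1), rnorm(12, 2, 1), rnorm(12, 10, 1))
)

# Ward's method on squared Euclidean distances (assumed setup).
hc <- hclust(dist(activities)^2, method = "ward.D")

# One row per merge step: negative entries are single observations,
# positive entries are clusters formed at that earlier step.
schedule <- data.frame(step   = seq_len(nrow(hc$merge)),
                       X1     = hc$merge[, 1],
                       X2     = hc$merge[, 2],
                       height = hc$height)
tail(schedule, 5)   # the last steps show where the height jumps
```

With 39 objects there are 38 merge steps, so step 37 is the merge from three clusters to two.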
Dendrogram
• Vertical lines represent distances between clusters that are put together.
• Coefficients are rescaled, here 0-50.
How many clusters
• Distances of last two stages are very large.
• Decision on three clusters? Or two? Depends on objectives!
• ...are relevant in terms of practical/managerial considerations?
• Theoretically based? Literature?
• Useful sizes?
• Meaningful interpretation of cluster characteristics possible?
• Distance between clusters?
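To support the decision on the number of clusters, the dendrogram can be plotted and a candidate solution marked with `rect.hclust()`. This is a sketch on simulated data; Ward's method on squared Euclidean distances is an assumed setup.

```r
# Simulated stand-in data (column names are assumptions).
set.seed(123)
activities <- data.frame(
  culture = c(rnorm(15, 2, 1), rnorm(12, 10, 1), rnorm(12, 2, 1)),
  sports  = c(rnorm(15, 2, 1), rnorm(12, 2, 1), rnorm(12, 10, 1))
)

hc <- hclust(dist(activities)^2, method = "ward.D")

# Dendrogram: the vertical axis shows the height at which clusters are merged.
plot(hc, labels = FALSE, main = "Dendrogram (Ward's method)", xlab = "", sub = "")

# Draw boxes around the clusters of the three-cluster solution.
rect.hclust(hc, k = 3, border = "red")
```

Changing `k` in `rect.hclust()` makes it easy to compare, for example, the two-cluster and three-cluster solutions visually.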
Cluster membership and information
• Cluster membership variable of the 3 cluster solution is produced.
• The 1st group has 15 observations, the 2nd and 3rd have 12 each.
• The 1st group is neither interested in culture nor sports. The 2nd group is interested in culture but not in sports. The 3rd group is interested in sports but not in culture.
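A membership variable for a chosen number of clusters can be obtained with `cutree()`. The sketch uses simulated data with group sizes chosen to mirror the slide (15, 12 and 12); it is not the course dataset, and the column names are assumptions.

```r
# Simulated stand-in data with three groups of 15, 12 and 12 objects.
set.seed(123)
activities <- data.frame(
  culture = c(rnorm(15, 2, 1), rnorm(12, 10, 1), rnorm(12, 2, 1)),
  sports  = c(rnorm(15, 2, 1), rnorm(12, 2, 1), rnorm(12, 10, 1))
)

hc <- hclust(dist(activities)^2, method = "ward.D")

# Cluster membership of every object in the three-cluster solution.
membership <- cutree(hc, k = 3)

table(membership)   # cluster sizes

# Mean of each clustering variable per cluster, used to describe the groups.
aggregate(activities, by = list(cluster = membership), FUN = mean)
```

The cluster means are what allows statements such as "interested in culture but not in sports".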
Non-hierarchical clustering: k-means
• Disadvantage: The number of clusters has to be fixed a priori!!!
• Advantage: computationally less burdensome compared with hierarchical cluster analysis if many observations are contained in the dataset
• Optimising partitioning: Objects are reassigned iteratively between clusters and do not necessarily stay within one cluster once assigned to it (contrary to hierarchical clustering)
• Iteration:
1. Each object is assigned to the cluster with the nearest cluster center (least squared Euclidean distance)
2. Recalculation of cluster centers
3. Loop: Continue with step 1
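The three iteration steps can be written out by hand as below. This is a teaching sketch of the basic assign-and-update loop on simulated data, not the algorithm that R's `kmeans()` uses by default (Hartigan-Wong); the starting centers are randomly chosen objects.

```r
# Simulated stand-in data (column names are assumptions).
set.seed(123)
activities <- data.frame(
  culture = c(rnorm(15, 2, 1), rnorm(12, 10, 1), rnorm(12, 2, 1)),
  sports  = c(rnorm(15, 2, 1), rnorm(12, 2, 1), rnorm(12, 10, 1))
)
x <- as.matrix(activities)
k <- 3

# Starting values: k randomly chosen objects serve as initial centers.
centers <- x[sample(nrow(x), k), , drop = FALSE]

for (iter in 1:100) {
  # Step 1: squared Euclidean distance of every object to every center,
  # then assignment to the nearest center.
  d2 <- sapply(1:k, function(j) rowSums(sweep(x, 2, centers[j, ])^2))
  cluster <- max.col(-d2, ties.method = "first")

  # Step 2: recalculate each center as the mean of its members
  # (an empty cluster keeps its previous center).
  new_centers <- centers
  for (j in 1:k) {
    if (any(cluster == j)) {
      new_centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
    }
  }

  # Step 3: stop when the centers no longer move, otherwise loop.
  if (max(abs(new_centers - centers)) < 1e-12) break
  centers <- new_centers
}

table(cluster)
centers
```

A single run like this can end in a poor local solution, which is why several random starts are recommended.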
Distance measure
• Similarity (between preferably interval scaled variables) is determined by the squared Euclidean distance
• Notation: n = number of observations, i = 1,…,n; x and y are the objects to be compared
• The variance (squared Euclidean distances between all clustering variables and the centroid of each cluster), the so-called within-cluster variation, is minimized.
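The within-cluster variation can be checked by hand: summing the squared Euclidean distances of all objects to their own cluster center reproduces the `tot.withinss` value reported by `kmeans()`. The data are simulated and the column names are assumptions.

```r
# Simulated stand-in data (column names are assumptions).
set.seed(123)
activities <- data.frame(
  culture = c(rnorm(15, 2, 1), rnorm(12, 10, 1), rnorm(12, 2, 1)),
  sports  = c(rnorm(15, 2, 1), rnorm(12, 2, 1), rnorm(12, 10, 1))
)

km <- kmeans(activities, centers = 3, nstart = 25)

# Sum of squared distances between every object and the center of its cluster.
wss_manual <- sum(sapply(1:3, function(j) {
  members <- as.matrix(activities[km$cluster == j, , drop = FALSE])
  sum(sweep(members, 2, km$centers[j, ])^2)
}))

c(manual = wss_manual, kmeans = km$tot.withinss)
```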
Number of clusters, iteration and random starts
• The number of clusters must be specified a priori!!!
• k-means uses an iterative algorithm to determine the cluster centers (1. objects are assigned to nearest cluster center, 2. calculation of cluster center, 3. continues with step 1). iter.max sets the maximum number of iterations. During classification the algorithm will continue iterating until iter.max iterations have been conducted or the convergence criterion is reached.
• Hint: A high iter.max value is recommended (e.g. 1,000) to allow for a high number of iteration steps and the algorithm to converge.
• As the final result depends on the starting values, k-means clustering should be run with several random starting values, here 25. The one with the lowest within-cluster variation will automatically be selected.
Random starts
• The a priori selected number of clusters must be specified!!!
• k-means uses an iterative algorithm to determine the cluster centers. iter.max sets the maximum number of iterations. During classification the algorithm will continue iterating until iter.max iterations have been done or the convergence criterion is fulfilled.
• If the convergence criterion is not met, the maximum number of iterations has to be increased until enough iteration steps (1. objects are assigned to nearest cluster center, 2. calculation of cluster center, 3. continues with step 1) have been processed.
• Hint: A high iter.max value is recommended (e.g. 1,000) to allow for a high number of iteration steps and the algorithm to converge.
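A call that follows these recommendations might look like the sketch below: three clusters, `iter.max = 1000` and `nstart = 25` random starts. The data are simulated, and the object name `kcluster` is taken from the column name mentioned later in the slides.

```r
# Simulated stand-in data (column names are assumptions).
set.seed(123)
activities <- data.frame(
  culture = c(rnorm(15, 2, 1), rnorm(12, 10, 1), rnorm(12, 2, 1)),
  sports  = c(rnorm(15, 2, 1), rnorm(12, 2, 1), rnorm(12, 10, 1))
)

# The seed makes the random starting values reproducible.
set.seed(2024)
kcluster <- kmeans(activities,
                   centers  = 3,      # number of clusters, fixed a priori
                   iter.max = 1000,   # upper limit for the number of iterations
                   nstart   = 25)     # number of random starts; the best run is kept

kcluster$iter           # iterations used by the selected run
kcluster$ifault         # 0 indicates that no algorithm problem was flagged
kcluster$tot.withinss   # within-cluster variation of the selected run
```

If the algorithm stops because `iter.max` was reached, `kmeans()` issues a warning; in that case `iter.max` should be raised.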
Perform k-means clustering
• The number of cases in each cluster shows the size of each cluster in the dataset.
• Cluster means are the means of variables within clusters.
• Cluster vector = cluster membership
Cluster membership
• The cluster membership shows the case number in the row names (values 1 to 39) and the cluster number in the kcluster.cluster column.
• Case number 1 belongs to cluster 3, case number 13 belongs to cluster 1...
Print k-means solution and cluster center
• Final Cluster Centers are the means of variables within clusters.
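The output elements described above can be accessed as components of the fitted object. The sketch uses simulated data; wrapping `kcluster$cluster` in `data.frame()` produces the `kcluster.cluster` column with the case numbers as row names.

```r
# Simulated stand-in data (39 cases; column names are assumptions).
set.seed(123)
activities <- data.frame(
  culture = c(rnorm(15, 2, 1), rnorm(12, 10, 1), rnorm(12, 2, 1)),
  sports  = c(rnorm(15, 2, 1), rnorm(12, 2, 1), rnorm(12, 10, 1))
)
kcluster <- kmeans(activities, centers = 3, iter.max = 1000, nstart = 25)

kcluster$size      # number of cases in each cluster
kcluster$centers   # cluster means of the clustering variables

# Cluster membership as a data frame: row names are the case numbers.
membership <- data.frame(kcluster$cluster)
head(membership)
```

The cluster numbers themselves are arbitrary labels and may change between runs with different random starts.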
Cluster comparison
• Attention!
Can differences between clusters on the variables used in the algorithm be judged via t-test or ANOVA?
This is no hypothesis test in the usual sense, just descriptive!
It is just an indicator of which variables are relevant for clustering.
Proper validation is only possible by means of an external criterion not involved in the cluster analysis, i.e. profiling.
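A descriptive comparison of this kind could be run with `aov()`, as sketched below on simulated data. Because the clusters were built from these very variables, the F values only indicate which variables separate the clusters most strongly; the p-values are not meaningful as tests.

```r
# Simulated stand-in data (column names are assumptions).
set.seed(123)
activities <- data.frame(
  culture = c(rnorm(15, 2, 1), rnorm(12, 10, 1), rnorm(12, 2, 1)),
  sports  = c(rnorm(15, 2, 1), rnorm(12, 2, 1), rnorm(12, 10, 1))
)
kcluster <- kmeans(activities, centers = 3, iter.max = 1000, nstart = 25)
activities$cluster <- factor(kcluster$cluster)

# One-way ANOVA per clustering variable, read descriptively only.
anova_culture <- summary(aov(culture ~ cluster, data = activities))
anova_sports  <- summary(aov(sports ~ cluster, data = activities))

# F values as a rough indicator of each variable's contribution.
f_culture <- anova_culture[[1]][["F value"]][1]
f_sports  <- anova_sports[[1]][["F value"]][1]
c(culture = f_culture, sports = f_sports)
```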
Profiling
• First, groups are described on the basis of the variables used for k-means clustering.
• Second, profiling describes clusters by means of other relevant variables not used during the clustering procedure (e.g. demographic, psychographic, geographic... characteristics).
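Both profiling steps can be sketched with `aggregate()`. The data are simulated, and `age` is a hypothetical external variable that was not used for clustering; it is generated at random here purely for illustration.

```r
# Simulated stand-in data (column names are assumptions).
set.seed(123)
activities <- data.frame(
  culture = c(rnorm(15, 2, 1), rnorm(12, 10, 1), rnorm(12, 2, 1)),
  sports  = c(rnorm(15, 2, 1), rnorm(12, 2, 1), rnorm(12, 10, 1))
)
kcluster <- kmeans(activities, centers = 3, iter.max = 1000, nstart = 25)
activities$cluster <- factor(kcluster$cluster)

# Hypothetical external variable, not involved in the cluster analysis.
activities$age <- round(runif(39, 18, 70))

# First: describe the clusters by the variables used for clustering.
profile_used <- aggregate(cbind(culture, sports) ~ cluster,
                          data = activities, FUN = mean)

# Second: describe the clusters by the external variable.
profile_external <- aggregate(age ~ cluster, data = activities, FUN = mean)

profile_used
profile_external
```

For categorical external variables a cross-tabulation with `table()` serves the same purpose.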