
Hotel Market Segmentation Analysis

Unsupervised Learning with PCA, Hierarchical Clustering and K-Means

Project Overview

This data science project applies unsupervised learning techniques to segment hotels based on their service offerings, amenities, and pricing structures. The analysis reveals natural market segments that differ from traditional star-based classifications, providing valuable insights for strategic positioning and marketing in the hospitality industry.

The project explores whether data-driven clustering can identify more nuanced market segments that better reflect actual service combinations and pricing strategies, challenging the conventional wisdom that star ratings alone determine hotel categorization.

  • Hotels Analyzed: 39
  • Optimal Clusters: 6
  • Silhouette Score: 0.453
  • PCA Variance Explained: 67.1%

Skills & Technologies

Languages & Libraries

  • Python
  • Pandas & NumPy
  • Scikit-learn
  • SciPy
  • Matplotlib & Seaborn
  • Jupyter Notebook

Technical Concepts

  • Principal Component Analysis (PCA)
  • Hierarchical Clustering
  • K-Means Clustering
  • Data Standardization
  • Silhouette Analysis
  • Dendrogram Visualization

Methodologies

  • Exploratory Data Analysis
  • Correlation Analysis
  • Dimensionality Reduction
  • Cluster Validation
  • Business Insight Generation
  • Comparative Analysis

Part 1: Exploratory Analysis & PCA

Dataset Overview

The analysis began with 39 hotels characterized by 6 numerical features: Comfort, Room Count, Cuisine Quality, Sports Facilities, Beach Access, and Price. Traditional star ratings (0-5 stars) were used as a baseline for comparison.

Correlation Analysis

Initial exploration revealed interesting relationships between variables:

  • Strongest correlation: Cuisine quality and Price (0.57)
  • Second strongest: Comfort and Cuisine quality (0.56)
  • Weakest correlation: Room count and Price (-0.03)
  • Beach access showed minimal correlation with comfort (-0.05)
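
A correlation ranking like the one above can be produced along these lines. This is a minimal sketch on synthetic data: the column names and the generated values are assumptions for illustration, not the project's dataset, so the printed coefficients will not match the figures quoted here.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the hotel table (hypothetical column names and values).
rng = np.random.default_rng(0)
n_hotels = 39
comfort = rng.normal(size=n_hotels)
hotels = pd.DataFrame({
    "comfort": comfort,
    "rooms": rng.normal(size=n_hotels),
    "cuisine": 0.6 * comfort + rng.normal(scale=0.8, size=n_hotels),
    "sports": rng.normal(size=n_hotels),
    "beach": rng.normal(size=n_hotels),
    "price": rng.normal(size=n_hotels),
})

corr = hotels.corr()  # Pearson correlation is the pandas default

# Keep each feature pair once (upper triangle, diagonal excluded).
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Rank pairs by absolute correlation, strongest first.
ranked = pairs.sort_values(key=np.abs, ascending=False)
print(ranked.round(2))
```

Ranking by absolute value keeps strong negative relationships from being mistaken for weak ones.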

Principal Component Analysis

PCA was applied to reduce dimensionality while preserving maximum variance:

  • Data standardization ensured equal feature weighting
  • Two principal components captured 67.1% of total variance (the individually rounded shares below sum to 67.2%)
  • PC1 explained 43.6% of variance (primarily comfort, cuisine, price)
  • PC2 explained 23.6% of variance (sports facilities, beach access)
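
The standardize-then-project step can be sketched with scikit-learn as follows. The feature matrix here is random data with deliberately mismatched scales, an assumption standing in for the six hotel features, so the variance shares it prints are not the project's results.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic 39 x 6 matrix; the scale factors mimic features such as
# room count and price that live on much larger ranges than ratings.
rng = np.random.default_rng(0)
X = rng.normal(size=(39, 6)) * np.array([1, 50, 1, 1, 1, 200])

# Standardize so every feature has zero mean and unit variance.
X_std = StandardScaler().fit_transform(X)

# Project onto the first two principal components.
pca = PCA(n_components=2)
scores = pca.fit_transform(X_std)       # one (PC1, PC2) point per hotel

explained = pca.explained_variance_ratio_
print("per component:", explained.round(3))
print("cumulative:", round(float(explained.sum()), 3))

# Rows are components, columns are the original features; large absolute
# values show which features drive each component.
loadings = pca.components_
print(loadings.round(2))
```

Without the standardization step, the large-scale columns would dominate the first component regardless of how informative they are.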

Key Findings

Visual Cluster Patterns

The PCA visualization revealed natural groupings that didn't align perfectly with star ratings. Hotels with similar service profiles clustered together regardless of their official star classification.

This suggested that traditional star ratings might not fully capture the multidimensional nature of hotel quality and service offerings.

Feature Importance

Comfort, cuisine quality, and price were the most influential factors in the first principal component, while sports facilities and beach access dominated the second component.

Room count contributed little to either component, suggesting that hotel size does little to differentiate the segments.

Part 2: Hierarchical Clustering

Methodology

Agglomerative hierarchical clustering was implemented using multiple linkage methods to identify the natural structure of the data:

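  • Single linkage: Merges on the minimum distance between any two points in different clusters
  • Complete linkage: Merges on the maximum distance between any two points in different clusters
  • Average linkage: Merges on the mean pairwise distance between points in different clusters
  • Ward's method: Merges the pair of clusters that least increases within-cluster variance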
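
The four linkage methods listed above can be compared with SciPy's `linkage` function. The sketch below runs on random standardized data (an assumption in place of the hotel features) and only illustrates the mechanics.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Synthetic stand-in for the standardized 39 x 6 feature matrix.
rng = np.random.default_rng(0)
X_std = rng.normal(size=(39, 6))

# Build one merge tree per linkage method (Ward requires Euclidean distance).
methods = ["single", "complete", "average", "ward"]
trees = {m: linkage(X_std, method=m, metric="euclidean") for m in methods}

# Each tree has n - 1 rows: [cluster_a, cluster_b, merge_distance, new_size].
for m, Z in trees.items():
    print(f"{m:>8}: final merge height = {Z[-1, 2]:.3f}")
```

Each resulting matrix can be passed to `scipy.cluster.hierarchy.dendrogram` for plotting.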

Dendrogram Analysis

Complete linkage produced the most balanced and interpretable dendrogram, with well-separated clusters. The dendrogram suggested a natural cut point at a distance threshold of 3.0.
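
Cutting a dendrogram at a fixed height can be expressed with SciPy's `fcluster`. This is a sketch under the assumption of random input data, so the number of clusters it yields at a threshold of 3.0 will differ from the six reported for the hotels.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Synthetic stand-in for the standardized hotel features.
rng = np.random.default_rng(0)
X_std = rng.normal(size=(39, 6))

Z = linkage(X_std, method="complete")

# Cut the tree at height 3.0: points stay together only if their
# cluster was formed at a merge distance of 3.0 or less.
labels = fcluster(Z, t=3.0, criterion="distance")

n_clusters = len(np.unique(labels))
sizes = np.bincount(labels)[1:]   # fcluster labels start at 1
print("clusters:", n_clusters)
print("sizes:", sizes)
```

With complete linkage, the threshold also bounds the largest pairwise distance inside any resulting cluster.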

Cluster Validation

The hierarchical clustering results were evaluated using silhouette analysis:

Linkage Method | Number of Clusters | Silhouette Score | Interpretability
Complete       | 6                  | 0.453            | High
Ward           | 5                  | 0.441            | Medium
Average        | 7                  | 0.428            | Low
Single         | 8                  | 0.402            | Very Low
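
A silhouette comparison across linkage methods might be computed as below. The cluster counts per method are taken from the table, while the data comes from `make_blobs` as an assumed stand-in, so the scores will not reproduce the values reported here.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data with some group structure (not the hotel dataset).
X, _ = make_blobs(n_samples=39, centers=6, n_features=6, random_state=0)
X_std = StandardScaler().fit_transform(X)

# Number of clusters evaluated for each linkage method, as in the table.
n_clusters_by_method = {"complete": 6, "ward": 5, "average": 7, "single": 8}

results = {}
for method, k in n_clusters_by_method.items():
    Z = linkage(X_std, method=method)
    # "maxclust" cuts the tree so that at most k flat clusters remain.
    labels = fcluster(Z, t=k, criterion="maxclust")
    results[method] = silhouette_score(X_std, labels)

for method, score in results.items():
    print(f"{method:>8}: silhouette = {score:.3f}")
```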

Optimal Configuration

Complete linkage with 6 clusters provided the best balance between cluster cohesion and separation. The resulting segmentation created meaningful hotel groups with distinct service profiles.

This configuration was selected for further analysis and comparison with K-Means results.

Business Interpretation

The hierarchical approach validated that natural hotel segments exist beyond star classifications. Each cluster represented a different value proposition and service mix that could inform targeted marketing strategies.

Part 3: K-Means Clustering & Validation

Optimal K Determination

Both elbow method and silhouette analysis were used to determine the optimal number of clusters:

  • Elbow method: Inertia values showed diminishing returns beyond K=6
  • Silhouette analysis: Peak silhouette score (0.453) achieved at K=6
  • Stability analysis: K-means++ initialization provided consistent results
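
The K selection procedure described above can be sketched as a single loop that records both inertia (for the elbow plot) and the silhouette score. The data, the range of K, and `random_state` are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the standardized hotel features.
X, _ = make_blobs(n_samples=39, centers=6, n_features=6, random_state=0)
X_std = StandardScaler().fit_transform(X)

inertia, silhouette = {}, {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    km.fit(X_std)
    inertia[k] = km.inertia_                       # within-cluster sum of squares
    silhouette[k] = silhouette_score(X_std, km.labels_)

# K with the highest silhouette score; the elbow is read off the inertia curve.
best_k = max(silhouette, key=silhouette.get)
for k in inertia:
    print(f"K={k:2d}  inertia={inertia[k]:8.2f}  silhouette={silhouette[k]:.3f}")
print("best K by silhouette:", best_k)
```

Rerunning the loop with several `random_state` values is one simple way to check how stable the chosen K is.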

Algorithm Comparison

K-Means was compared with hierarchical clustering using multiple metrics:

Algorithm               | Number of Clusters | Silhouette Score | ARI with Star Ratings | Stability
K-Means (K=6)           | 6                  | 0.453            | 0.179                 | High
Hierarchical (Complete) | 6                  | 0.453            | 0.172                 | Medium
K-Means (K=5)           | 5                  | 0.441            | 0.165                 | High
Hierarchical (Ward)     | 5                  | 0.441            | 0.158                 | Medium

Low ARI Interpretation

The Adjusted Rand Index (ARI) of 0.179 between clusters and star ratings indicates limited alignment between data-driven segmentation and traditional classification:

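  • ARI is chance-adjusted rather than a percentage of pairs: 0 corresponds to the agreement expected from random labeling and 1 to identical groupings, so 0.179 is only modestly better than chance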
  • 7 hotels showed significant deviation from expected categories
  • This suggests star ratings don't fully capture service quality dimensions
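
A toy example helps calibrate how ARI values should be read. The label vectors below are invented for illustration and are not the hotel data.

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical star ratings for eight hotels.
stars = [1, 1, 2, 2, 3, 3, 4, 4]

# Same grouping with different label names: ARI ignores the names.
same_groups = [5, 5, 7, 7, 0, 0, 9, 9]

# A grouping that separates every pair of hotels sharing a star rating.
split_pairs = [0, 1, 0, 1, 0, 1, 0, 1]

ari_same = adjusted_rand_score(stars, same_groups)
ari_split = adjusted_rand_score(stars, split_pairs)

print(ari_same)    # 1.0, identical partitions
print(ari_split)   # negative, i.e. less agreement than expected by chance
```

On this scale, a value such as 0.179 sits much closer to chance-level agreement than to a full match.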

Cluster Profiles

Cluster 1: Urban Business Hotels

13 hotels | Avg stars: 2.15 | Avg price: €469

High comfort focus with less emphasis on beach/sports amenities. Mixed star ratings suggesting inconsistent classification.

Cluster 2: Luxury Resorts

4 hotels | Avg stars: 4.50 | Avg price: €796

Premium service across all dimensions: top scores for comfort, cuisine, sports, and beach access.

Cluster 3: Budget Accommodations

6 hotels | Avg stars: 1.67 | Avg price: €436

Lower-priced options with basic amenities, primarily 1-2 star hotels with one 4-star outlier.

Cluster 4: Mid-Range Hotels

9 hotels | Avg stars: 4.11 | Avg price: €589

Strong service profile with high comfort and cuisine ratings, positioned as quality alternatives to luxury resorts.
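
Summary lines like the ones in these profiles (hotel count, average stars, average price) can be derived with a pandas group-by. The table below is randomly generated and its column names are assumptions, so the output will not match the profiles above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: one row per hotel with an assigned cluster label.
rng = np.random.default_rng(0)
hotels = pd.DataFrame({
    "cluster": rng.integers(1, 5, size=39),          # labels 1..4
    "stars": rng.integers(0, 6, size=39),            # 0-5 stars
    "price": rng.uniform(300, 900, size=39).round(),
})

# One summary row per cluster.
profiles = hotels.groupby("cluster").agg(
    n_hotels=("stars", "size"),
    avg_stars=("stars", "mean"),
    avg_price=("price", "mean"),
).round(2)

print(profiles)
```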

Key Insights & Business Implications

Star Rating Limitations

The low ARI (0.179) indicates that traditional star ratings don't fully capture the multidimensional nature of hotel quality. Data-driven clustering reveals more nuanced segments based on actual service combinations rather than standardized classifications.

Pricing Strategy Insights

Price doesn't always correlate directly with star ratings. Some lower-star hotels offer premium services while some higher-star hotels provide basic amenities, creating pricing anomalies and market opportunities.

Marketing Applications

Hotels can develop targeted strategies based on their actual cluster profile rather than star rating alone. This enables more precise positioning and competitive differentiation in the marketplace.

Methodological Strengths

Combining PCA for dimensionality reduction with two clustering algorithms provided a cross-check on the segment structure. Hierarchical clustering with complete linkage and K-Means both reached a silhouette score of 0.453 at six clusters, and this agreement across methods strengthens confidence in the identified segments.

Future Research Directions

Potential extensions include incorporating additional features like location data, customer reviews, and seasonal pricing patterns. Applying this methodology to larger datasets across different geographic markets could reveal broader industry trends and segmentation patterns.