
Iris Classification Project

Multi-class classification analysis with multiple Machine Learning algorithms

Project Summary

This project presents a comprehensive analysis of the Iris dataset, a classic benchmark in machine learning. Multiple supervised classification algorithms are compared for predicting iris species from sepal and petal morphological measurements. The project covers exploratory analysis, advanced visualizations, and rigorous model evaluation.

  • 3 Classes: Iris species
  • 98.00%: Best accuracy (LDA)
  • 5 Models: Algorithms compared
  • 4 Features: Morphological measurements

Iris Dataset

Dataset Characteristics

The Iris dataset is possibly the best-known database in pattern recognition literature. It contains 3 classes of 50 instances each, where each class refers to a type of iris plant.

Features

  • Sepal Length (cm)
  • Sepal Width (cm)
  • Petal Length (cm)
  • Petal Width (cm)

Classes (Target)

  • Iris Setosa: 50 samples
  • Iris Versicolour: 50 samples
  • Iris Virginica: 50 samples

Important property: The Iris Setosa class is linearly separable from the other two, while Iris Versicolour and Iris Virginica are not linearly separable from each other.
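The dataset ships with scikit-learn, so the class balance and feature layout described above can be verified directly (a minimal sketch using `sklearn.datasets.load_iris`):

```python
import numpy as np
from sklearn.datasets import load_iris

# Load the bundled copy of Fisher's Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

print(X.shape)             # (150, 4) -> 150 samples, 4 morphological features
print(iris.feature_names)  # sepal/petal length and width, in cm
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']
print(np.bincount(y))      # [50 50 50] -> perfectly balanced classes
```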

Methodology and Algorithms

Exploratory Analysis

  • Pairplots with Seaborn
  • Dimensionality Reduction (t-SNE)
  • Correlation Analysis
  • Distribution Visualization

Implemented Models

  • Linear Discriminant Analysis (LDA)
  • Quadratic Discriminant Analysis (QDA)
  • SVM with RBF Kernel
  • Perceptron
  • Decision Tree Classifier
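In scikit-learn terms, the five classifiers might be set up as below (default hyperparameters, apart from the RBF kernel, are an assumption; the project's exact settings are not shown here):

```python
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.linear_model import Perceptron
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# The five models compared in this project
models = {
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "SVM (RBF)": SVC(kernel="rbf"),
    "Perceptron": Perceptron(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}
```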

Evaluation

  • Cross-Validation (5-fold)
  • Confusion Matrices
  • Feature Importance
  • Tree Visualization
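The evaluation protocol, shown here for LDA as a representative model, is standard 5-fold cross-validation; the confusion matrix is built from out-of-fold predictions (a sketch, not the project's exact code):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict, cross_val_score

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis()

# 5-fold cross-validated accuracy: mean and spread
scores = cross_val_score(lda, X, y, cv=5)
print(f"{scores.mean():.4f} +/- {scores.std():.4f}")

# Confusion matrix aggregated over the out-of-fold predictions
y_pred = cross_val_predict(lda, X, y, cv=5)
print(confusion_matrix(y, y_pred))
```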

Results and Comparison

Model Performance

| Model         | Mean Accuracy | Std. Deviation | Observations                           |
|---------------|---------------|----------------|----------------------------------------|
| LDA           | 98.00%        | ±1.63%         | Best overall performance               |
| QDA           | 97.33%        | ±1.33%         | Excellent performance                  |
| SVM (RBF)     | 96.67%        | ±2.98%         | Good performance, higher variability   |
| Decision Tree | 95.56%        | ±2.98%         | Good performance, easy interpretation  |
| Perceptron    | 72.76%        | ±17.31%        | Inconsistent performance               |

Results Analysis

LDA proved to be the most effective model, with 98% accuracy and low variability. The discriminant-based models (LDA and QDA) outperformed the more flexible approaches on this particular dataset.

The Perceptron performed inconsistently because Iris Versicolour and Iris Virginica are not linearly separable, and the Perceptron only converges when the classes are linearly separable.

Feature Importance

Decision tree analysis revealed that petal length is the most important feature (89.3%), followed by petal width (8.8%).

Sepal features showed less discriminatory power for this classification problem.
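The importances quoted above come from a fitted decision tree's `feature_importances_` attribute; a minimal sketch (how the importance splits between the two petal features depends on the tree's random seed):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
tree = DecisionTreeClassifier(random_state=42).fit(iris.data, iris.target)

# Gini-based importance of each feature; petal measurements dominate
for name, importance in zip(iris.feature_names, tree.feature_importances_):
    print(f"{name}: {importance:.3f}")
```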

Key Findings

Class Separability

Visualization with t-SNE and pairplots confirmed that Iris Setosa is clearly separable, while Iris Versicolour and Iris Virginica show some overlap in the feature space.

Discriminatory Features

Petal measurements (length and width) are significantly more informative than sepal measurements for distinguishing between species, with petal length being the most important feature.

Model Performance

The discriminant analysis models (LDA, QDA) outperformed more complex approaches, suggesting that simple linear and quadratic decision boundaries adequately capture the underlying structure of the Iris dataset.

Technologies Used

Languages & Libraries

  • Python 3.8+
  • scikit-learn
  • pandas & numpy
  • matplotlib & seaborn

Algorithms

  • LDA & QDA
  • SVM with RBF Kernel
  • Perceptron
  • Decision Trees

Techniques

  • Cross-Validation
  • Dimensionality Reduction (t-SNE)
  • Exploratory Analysis
  • Model Evaluation

Conclusions

This project demonstrated the effective application of multiple classification algorithms to the Iris dataset. The results confirm that for problems with underlying linear structure, simple models like LDA can outperform more complex approaches.

Main Contributions

  • Comparative implementation of 5 ML algorithms
  • Comprehensive feature importance analysis
  • Advanced visualizations for result interpretation
  • Rigorous evaluation with cross-validation

Key Learnings

  • Model simplicity doesn't always compromise performance
  • Exploratory analysis is crucial for understanding data
  • Cross-validation provides more robust estimates
  • Model interpretability is as important as accuracy