
Iris Classification Project

Multi-class classification analysis with multiple Machine Learning algorithms

Project Summary

This project presents a comprehensive analysis of the Iris dataset, a classic benchmark in machine learning. Multiple supervised classification algorithms are compared for predicting iris species from sepal and petal morphological measurements. The project covers exploratory analysis, advanced visualizations, and rigorous model evaluation.

  • 3 Classes: Iris species
  • 98.00%: Best accuracy (LDA)
  • 5 Models: Algorithms compared
  • 4 Features: Morphological measurements

Iris Dataset

Dataset Characteristics

The Iris dataset is possibly the best-known database in pattern recognition literature. It contains 3 classes of 50 instances each, where each class refers to a type of iris plant.

Features

  • Sepal Length (cm)
  • Sepal Width (cm)
  • Petal Length (cm)
  • Petal Width (cm)

Classes (Target)

  • Iris Setosa: 50 samples
  • Iris Versicolour: 50 samples
  • Iris Virginica: 50 samples

Important property: The Iris Setosa class is linearly separable from the other two, while Iris Versicolour and Iris Virginica are not linearly separable from each other.
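The dataset ships with scikit-learn, so the class balance and feature layout described above can be verified directly (a minimal sketch using `sklearn.datasets.load_iris`):

```python
import numpy as np
from sklearn.datasets import load_iris

# Load the bundled copy of Fisher's Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

print(X.shape)             # (150, 4) -> 150 samples, 4 morphological features
print(iris.feature_names)  # sepal/petal length and width, in cm
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']
print(np.bincount(y))      # [50 50 50] -> perfectly balanced classes
```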

Methodology and Algorithms

Exploratory Analysis

  • Pairplots with Seaborn
  • Dimensionality Reduction (t-SNE)
  • Correlation Analysis
  • Distribution Visualization

Implemented Models

  • Linear Discriminant Analysis (LDA)
  • Quadratic Discriminant Analysis (QDA)
  • SVM with RBF Kernel
  • Perceptron
  • Decision Tree Classifier
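In scikit-learn terms, the five classifiers might be set up as below (default hyperparameters, apart from the RBF kernel, are an assumption; the project's exact settings are not shown here):

```python
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.linear_model import Perceptron
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# The five models compared in this project
models = {
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "SVM (RBF)": SVC(kernel="rbf"),
    "Perceptron": Perceptron(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}
```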

Evaluation

  • Cross-Validation (5-fold)
  • Confusion Matrices
  • Feature Importance
  • Tree Visualization
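The evaluation protocol, shown here for LDA as a representative model, is standard 5-fold cross-validation; the confusion matrix is built from out-of-fold predictions (a sketch, not the project's exact code):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict, cross_val_score

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis()

# 5-fold cross-validated accuracy: mean and spread
scores = cross_val_score(lda, X, y, cv=5)
print(f"{scores.mean():.4f} +/- {scores.std():.4f}")

# Confusion matrix aggregated over the out-of-fold predictions
y_pred = cross_val_predict(lda, X, y, cv=5)
print(confusion_matrix(y, y_pred))
```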

Results and Comparison

Model Performance

| Model         | Mean Accuracy | Std. Deviation | Observations                           |
|---------------|---------------|----------------|----------------------------------------|
| LDA           | 98.00%        | ±1.63%         | Best overall performance               |
| QDA           | 97.33%        | ±1.33%         | Excellent performance                  |
| SVM (RBF)     | 96.67%        | ±2.98%         | Good performance, higher variability   |
| Decision Tree | 95.56%        | ±2.98%         | Good performance, easy interpretation  |
| Perceptron    | 72.76%        | ±17.31%        | Inconsistent performance               |

Results Analysis

LDA proved to be the most effective model, with 98% accuracy and low variability. The discriminant-based models (LDA and QDA) outperformed the more flexible approaches on this particular dataset.

The Perceptron performed inconsistently because Iris Versicolour and Iris Virginica are not linearly separable, and the Perceptron only converges when the classes are linearly separable.

Feature Importance

Decision tree analysis revealed that petal length is the most important feature (89.3%), followed by petal width (8.8%).

Sepal features showed less discriminatory power for this classification problem.
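The importances quoted above come from a fitted decision tree's `feature_importances_` attribute; a minimal sketch (how the importance splits between the two petal features depends on the tree's random seed):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
tree = DecisionTreeClassifier(random_state=42).fit(iris.data, iris.target)

# Gini-based importance of each feature; petal measurements dominate
for name, importance in zip(iris.feature_names, tree.feature_importances_):
    print(f"{name}: {importance:.3f}")
```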

Key Findings

Class Separability

Visualization with t-SNE and pairplots confirmed that Iris Setosa is clearly separable, while Iris Versicolour and Iris Virginica show some overlap in the feature space.

Discriminatory Features

Petal measurements (length and width) are significantly more informative than sepal measurements for distinguishing between species, with petal length being the most important feature.

Model Performance

The discriminant analysis models (LDA, QDA) outperformed more complex approaches, suggesting that simple linear and quadratic decision boundaries adequately capture the underlying structure of the Iris dataset.

Technologies Used

Languages & Libraries

  • Python 3.8+
  • scikit-learn
  • pandas & numpy
  • matplotlib & seaborn

Algorithms

  • LDA & QDA
  • SVM with RBF Kernel
  • Perceptron
  • Decision Trees

Techniques

  • Cross-Validation
  • Dimensionality Reduction (t-SNE)
  • Exploratory Analysis
  • Model Evaluation

Conclusions

This project demonstrated the effective application of multiple classification algorithms to the Iris dataset. The results confirm that for problems with underlying linear structure, simple models like LDA can outperform more complex approaches.

Main Contributions

  • Comparative implementation of 5 ML algorithms
  • Comprehensive feature importance analysis
  • Advanced visualizations for result interpretation
  • Rigorous evaluation with cross-validation

Key Learnings

  • Model simplicity doesn't always compromise performance
  • Exploratory analysis is crucial for understanding data
  • Cross-validation provides more robust estimates
  • Model interpretability is as important as accuracy