AI Development Lesson 5: Machine Learning with sklearn

🤖 AI Development Course · Lesson 5 of 10 · 50% complete

scikit-learn (imported as sklearn) is the go-to Python library for classical machine learning. Its estimators share one consistent interface: fit to train, predict to make predictions, and score to evaluate.

The sklearn API

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# 1. Prepare data (`features` and `labels` are placeholders for your own arrays)
X = features  # numpy array: shape (n_samples, n_features)
y = labels    # numpy array: shape (n_samples,)

# 2. Split: train (80%) and test (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

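# 3. Scale features (important for many algorithms, e.g. logistic regression, SVMs, k-NN)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # learn mean/std from the training set, then transform
X_test = scaler.transform(X_test)        # reuse the training statistics; never fit on test data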

# 4. Train model (a single call to fit)
model = LogisticRegression()
model.fit(X_train, y_train)

# 5. Predict
y_pred = model.predict(X_test)

# 6. Evaluate
print(accuracy_score(y_test, y_pred))         # e.g. 0.87 (the value depends on your data)
print(classification_report(y_test, y_pred))  # precision, recall, F1
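
The six steps above can be tried end to end with the sketch below. It is only an illustration under stated assumptions: the data is synthetic (generated with make_classification, since the lesson does not name a dataset), and the scaler and model are wrapped in a pipeline via make_pipeline, which is one common way to make sure the scaler is fitted on the training data only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; the lesson does not name a dataset).
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# stratify=y keeps the class balance similar in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The pipeline fits the scaler on the training data only, then reuses
# those statistics when it transforms the test data inside predict().
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"accuracy: {accuracy:.3f}")

# For classifiers, .score() reports the same mean accuracy.
assert abs(pipeline.score(X_test, y_test) - accuracy) < 1e-12
```

With real data, only the line that creates X and y should need to change.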

Common Algorithms

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso

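# All use the same API: .fit(), .predict(), .score()
# .score() returns mean accuracy for classifiers and R^2 for regressors
# For regression: LinearRegression, Ridge, Lasso, RandomForestRegressor
# For classification: LogisticRegression, RandomForestClassifier, SVC

# Often a strong choice for tabular data (still compare against simpler models):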
model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)
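
Because every estimator exposes the same methods, several models can be compared in one loop. The sketch below assumes synthetic data from make_classification and arbitrary hyperparameters chosen for illustration; the ranking it prints applies to this one split only and says nothing general about which algorithm is better.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not part of the lesson).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Three classifiers behind the same fit / score interface.
models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForestClassifier": RandomForestClassifier(
        n_estimators=100, random_state=0
    ),
    "GradientBoostingClassifier": GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.1, random_state=0
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # mean accuracy on the test set
    print(f"{name}: {scores[name]:.3f}")

best_name = max(scores, key=scores.get)
print("best on this split:", best_name)
```

Swapping in a regressor such as Ridge works the same way, except that .score() then returns R^2 instead of accuracy.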

🏋️ Practice Task

Predict Titanic survival. Load the Titanic dataset and use the features Pclass, Sex (encoded as male=0, female=1), Age (missing values filled with the median), and Fare, with Survived as the target. Train three models: LogisticRegression, RandomForestClassifier, and GradientBoostingClassifier. Compare their accuracy on the test set, then print the classification report for the best model.

💡 Hint: encode Sex with df["Sex"].map({"male": 0, "female": 1}) or one-hot encode it with pd.get_dummies(df["Sex"]). Either drop rows containing NaN or fill Age with df["Age"].fillna(df["Age"].median()).
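
The preprocessing in the hint might look like the sketch below. The four rows are invented purely for illustration and are not real Titanic records; only the column names come from the task. Loading the actual dataset and training the models are left as the exercise.

```python
import pandas as pd

# Toy rows invented for illustration only; they are NOT real Titanic records.
df = pd.DataFrame(
    {
        "Pclass": [3, 1, 3, 2],
        "Sex": ["male", "female", "female", "male"],
        "Age": [22.0, 38.0, None, 30.0],
        "Fare": [7.25, 71.28, 7.92, 13.0],
        "Survived": [0, 1, 1, 0],
    }
)

# Encode Sex as integers: male=0, female=1.
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

# Fill the missing Age with the median of the known ages.
df["Age"] = df["Age"].fillna(df["Age"].median())

# Feature matrix and target, ready for train_test_split.
X = df[["Pclass", "Sex", "Age", "Fare"]]
y = df["Survived"]
print(X)
```

For a stricter evaluation, the median could be computed on the training split alone and then applied to the test split, in the same spirit as fitting the scaler on training data only.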
