AI Development Lesson 5: Machine Learning with sklearn
scikit-learn (imported as sklearn) is the go-to Python ML library. Its supervised models all follow the same simple interface: fit, predict, score.
The Sklearn API
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
# 1. Prepare data (features and labels are placeholders for your own arrays)
X = features # numpy array: shape (n_samples, n_features)
y = labels # numpy array: shape (n_samples,)
# 2. Split: train (80%) and test (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 3. Scale features (important for many algorithms!)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train) # fit AND transform
X_test = scaler.transform(X_test) # only transform (not fit!)
# 4. Train model (fitting is ONE LINE!)
model = LogisticRegression()
model.fit(X_train, y_train)
# 5. Predict
y_pred = model.predict(X_test)
# 6. Evaluate
print(accuracy_score(y_test, y_pred)) # e.g. 0.87 (depends on your data)
print(classification_report(y_test, y_pred)) # precision, recall, F1
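The six steps above can be run end to end once `features` and `labels` are replaced by real arrays. The sketch below assumes scikit-learn's built-in breast cancer dataset (`load_breast_cancer`) as a stand-in; that dataset is not part of this lesson and is used only so the code runs as written.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: a built-in dataset stands in for `features` and `labels`
X, y = load_breast_cancer(return_X_y=True)

# Split: train (80%) and test (20%)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training split only, then reuse it on the test split
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train, predict, evaluate
model = LogisticRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"accuracy: {accuracy:.3f}")
print(classification_report(y_test, y_pred))
```

For a classifier, `model.score(X_test, y_test)` returns the same mean accuracy as `accuracy_score(y_test, y_pred)`, so either call works for a quick check.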
Common Algorithms
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression, Ridge, Lasso
# All use the same API: .fit(), .predict(), .score()
# For regression: LinearRegression, Ridge, Lasso, RandomForestRegressor
# For classification: LogisticRegression, RandomForestClassifier, SVC
# Often a strong choice in practice for tabular data:
model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)
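Because the classifiers share one API, several of them can be trained and scored in the same loop. The following is a minimal sketch, assuming synthetic data from `make_classification` and an arbitrary choice of four models; the dataset, the `models` dictionary, and the hyperparameters are illustrative assumptions, not recommendations from this lesson.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: synthetic data, generated only to have something to fit
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hypothetical selection of models; each one exposes fit / predict / score
models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "KNeighbors": KNeighborsClassifier(),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
    "GradientBoosting": GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.1, random_state=0
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # mean accuracy for classifiers
    print(f"{name}: {scores[name]:.3f}")

best_name = max(scores, key=scores.get)
print("best on this split:", best_name)
```

The ranking holds only for this one split of this one dataset; a different `random_state` or real data may well change which model comes out on top.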
🏋️ Practice Task
Predict Titanic survival. Load the Titanic dataset. Features: Pclass, Sex (encode: male=0, female=1), Age (fill missing values with the median), Fare. Target: Survived. Train 3 models: LogisticRegression, RandomForestClassifier, GradientBoostingClassifier. Compare their accuracy on the test set. Print the classification report for the best model.
💡 Hint: pd.get_dummies(df["Sex"]) or df["Sex"].map({"male": 0, "female": 1}). Drop NaN rows or fill Age with df.Age.median().
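The encoding and fill steps from the hint can be tried on a tiny DataFrame before loading the real file. This is only a sketch: the four rows below are invented for illustration and are not actual Titanic records, and loading the dataset and training the models is left to the task.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows, invented for illustration (not real Titanic records)
df = pd.DataFrame({
    "Pclass": [1, 3, 2, 3],
    "Sex": ["female", "male", "male", "female"],
    "Age": [30.0, np.nan, 40.0, 20.0],
    "Fare": [80.0, 7.5, 13.0, 8.0],
    "Survived": [1, 0, 0, 1],
})

# Encode Sex as numbers: male=0, female=1
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

# Fill missing ages with the median of the known ages
df["Age"] = df["Age"].fillna(df["Age"].median())

# Build the feature matrix and the target vector
X = df[["Pclass", "Sex", "Age", "Fare"]].to_numpy()
y = df["Survived"].to_numpy()
print(X.shape, y.shape)
```

From here, `X` and `y` can go through the same split, scale, fit, and evaluate steps shown earlier in the lesson.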