Libraries: scikit-learn, XGBoost, matplotlib, pandas, numpy
Dataset: ISOT Fake News Detection Dataset
In this project we use the Python libraries scikit-learn and XGBoost to build a machine learning model that classifies news articles as fake or real. We combine classical machine learning techniques with engineered textual features to improve the model's generalisability and performance.
- Text vectorisation: Bag of Words (BoW)
- Feature engineering: % of special characters & % of capitalised characters
- Baseline model: DecisionTreeClassifier with GridSearchCV
- Ensemble model: XGBClassifier with RandomizedSearchCV
- Robustness: Removed dataset-specific artefacts (e.g. reuters) from BoW to improve generalisability
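The two engineered features and the artefact filtering listed above can be sketched in plain Python. This is only an illustration under stated assumptions: "special" is taken to mean any character that is neither alphanumeric nor whitespace, both percentages are computed over all characters in the text, the `headline_special` name and the sample strings are hypothetical, and the hand-rolled `bag_of_words` helper stands in for whatever vectoriser the project actually uses.

```python
from collections import Counter
import re

# Assumed artefact list; the project notes only name "reuters" as an example.
ARTEFACT_TOKENS = {"reuters"}


def pct_special(text: str) -> float:
    """Percentage of characters that are neither alphanumeric nor whitespace."""
    if not text:
        return 0.0
    special = sum(1 for ch in text if not ch.isalnum() and not ch.isspace())
    return 100.0 * special / len(text)


def pct_capitalised(text: str) -> float:
    """Percentage of characters that are uppercase letters."""
    if not text:
        return 0.0
    upper = sum(1 for ch in text if ch.isupper())
    return 100.0 * upper / len(text)


def bag_of_words(text: str, exclude=ARTEFACT_TOKENS) -> Counter:
    """Lower-cased token counts with dataset-specific artefacts dropped."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tok for tok in tokens if tok not in exclude)


# Invented sample article, for illustration only.
headline = "BREAKING: You Won't BELIEVE This!!!"
body = "WASHINGTON (Reuters) - The senate voted on the bill on Tuesday."

features = {
    "headline_capitalised": pct_capitalised(headline),
    "headline_special": pct_special(headline),  # hypothetical feature name
    "bow": bag_of_words(body),
}
```

With scikit-learn, a similar filtering effect can be obtained by passing the artefact tokens as a list to the `stop_words` parameter of `CountVectorizer`.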
- 🚀 XGBoost ensemble achieved ~99.8% accuracy, precision, recall, and F1 score
- Top feature: headline_capitalised (engineered)
- Fun insight: second most important vectorised word for classification → "Trump" 🇺🇸
- Test on more diverse, real-world datasets
- Experiment with advanced text vectorisation (e.g. word embeddings, transformer models)
- Compare with alternative classifiers (e.g. Support Vector Machines)
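One way the classifier comparison in the last bullet might be set up is sketched below. It is not the project's code: the eight-sentence corpus is invented, `make_search` is a hypothetical helper, the parameter grids are arbitrary, and `LinearSVC` is just one possible Support Vector Machine variant. The sketch reuses the Bag of Words step with the "reuters" artefact removed and tunes each candidate with `GridSearchCV` so that both models are scored the same way.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in corpus (invented for illustration); 1 = fake, 0 = real.
texts = [
    "SHOCKING secret they do not want you to know",
    "You will not BELIEVE what happened next",
    "Miracle cure DESTROYS doctors overnight",
    "EXPOSED the hidden truth about the election",
    "Senate passes budget bill after long debate reuters",
    "Central bank holds interest rates steady reuters",
    "Minister announces new trade agreement reuters",
    "Court upholds ruling on data privacy reuters",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]


def make_search(estimator, param_grid):
    """BoW (artefact token removed) followed by a grid-searched classifier."""
    pipeline = make_pipeline(CountVectorizer(stop_words=["reuters"]), estimator)
    return GridSearchCV(pipeline, param_grid, cv=2, scoring="f1")


# make_pipeline names each step after its lower-cased class name.
candidates = {
    "decision_tree": make_search(
        DecisionTreeClassifier(random_state=0),
        {"decisiontreeclassifier__max_depth": [2, 4, None]},
    ),
    "linear_svc": make_search(
        LinearSVC(),
        {"linearsvc__C": [0.1, 1.0, 10.0]},
    ),
}

results = {}
for name, search in candidates.items():
    search.fit(texts, labels)
    results[name] = (search.best_score_, search.best_params_)
    print(f"{name}: mean CV F1 = {search.best_score_:.2f}, params = {search.best_params_}")
```

Scores on a corpus this small say nothing about real performance. The point is only that every candidate goes through the same vectoriser, the same folds, and the same metric.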


