This project classifies news articles into distinct topics. The dataset comprises 20 article categories, and the objective is to assign each article to its correct category. The project combines machine learning models, feature engineering, and preprocessing strategies to improve classification accuracy.
- Notebooks:
  - 01_Looking_into_data.ipynb: Initial exploration of the dataset.
  - 02_baseline.ipynb: Creation of the baseline with various classifiers.
  - 03_preprocessing.ipynb: Feature engineering and preprocessing steps.
  - 04_feature_extraction.ipynb: Analysis of discriminative features.
  - 05_Grid_search_VC.ipynb: Grid search for model tuning.
  - 06_GD.ipynb: Further exploration, gradient descent, and additional feature engineering.
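As a rough illustration of what a baseline with several classifiers can look like, the sketch below fits a TF-IDF pipeline with three common scikit-learn classifiers and reports test accuracy for each. The choice of TF-IDF features and of these particular classifiers is an assumption made for illustration; the notebooks may use different features and models. `build_baselines` and `compare_baselines` are hypothetical helper names, not functions from this repository.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def build_baselines():
    """Return a dict mapping a name to an untrained TF-IDF + classifier pipeline."""
    # Assumed classifier selection; swap in whatever models you want to compare.
    classifiers = {
        "naive_bayes": MultinomialNB(),
        "logistic_regression": LogisticRegression(max_iter=1000),
        "linear_svc": LinearSVC(),
    }
    return {
        name: make_pipeline(TfidfVectorizer(), clf)
        for name, clf in classifiers.items()
    }


def compare_baselines(train_texts, train_labels, test_texts, test_labels):
    """Fit every baseline on the training split and return test accuracy per model."""
    scores = {}
    for name, pipeline in build_baselines().items():
        pipeline.fit(train_texts, train_labels)
        scores[name] = pipeline.score(test_texts, test_labels)
    return scores
```

One way to obtain the data is scikit-learn's `fetch_20newsgroups` loader, which downloads the corpus on first use; its `data` and `target` attributes can be passed directly to `compare_baselines`. Whether the notebooks load the data this way is not stated here.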
- Clone the Repository:
  git clone https://github.com/anush-data-portfolio/Classification-20NewsGroup
  cd Classification-20NewsGroup
- Run the Notebooks:
  - Each notebook is self-contained and handles the installation of necessary libraries. Execute the notebooks in the following order:
    - 01_Looking_into_data.ipynb
    - 02_baseline.ipynb
    - 03_preprocessing.ipynb
    - 04_feature_extraction.ipynb
    - 05_Grid_search_VC.ipynb
    - 06_GD.ipynb
- Follow Notebook Instructions:
  - Each notebook provides detailed explanations, code comments, and instructions. Follow the steps outlined in each notebook to understand the project's progression and outcomes.
- Internet Connection:
  - Ensure a stable internet connection, as the notebooks handle library installations from online repositories.
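If you prefer to run the notebooks non-interactively, the execution order described above could be scripted roughly as follows. This is a minimal sketch that assumes the `jupyter nbconvert` command is available on your PATH and that the notebooks sit in a single directory; the `notebook_dir` parameter and the helper names are hypothetical, and the repository itself does not ship such a script as far as this README states.

```python
import subprocess
from pathlib import Path

# Execution order as listed in this README.
NOTEBOOKS = [
    "01_Looking_into_data.ipynb",
    "02_baseline.ipynb",
    "03_preprocessing.ipynb",
    "04_feature_extraction.ipynb",
    "05_Grid_search_VC.ipynb",
    "06_GD.ipynb",
]


def build_command(notebook):
    """Build the nbconvert command that executes one notebook in place."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute",
        "--inplace",
        str(notebook),
    ]


def run_all(notebook_dir=".", dry_run=False):
    """Execute the notebooks in order; stop at the first one that fails.

    With dry_run=True the commands are returned without being executed.
    """
    commands = [build_command(Path(notebook_dir) / name) for name in NOTEBOOKS]
    if dry_run:
        return commands
    for command in commands:
        # check=True raises CalledProcessError if a notebook fails to run.
        subprocess.run(command, check=True)
    return commands
```

Calling `run_all(dry_run=True)` first is a cheap way to inspect the commands before anything is executed.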
The project relies on the following Python libraries:
- scikit-learn (sklearn)
- pandas
- numpy
- nltk
- matplotlib
- Recreating Results:
  - Users can rerun the notebooks to recreate the results. The notebooks are designed to handle the necessary library installations.
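Before rerunning anything, it can be handy to see which of the libraries listed above are already importable in your environment. The snippet below is a small sketch using only the standard library; `REQUIRED` and `missing_packages` are hypothetical names introduced here, and the mapping simply pairs each package name with the module name it is imported under.

```python
from importlib.util import find_spec

# Package name (as installed) -> module name (as imported).
REQUIRED = {
    "scikit-learn": "sklearn",
    "pandas": "pandas",
    "numpy": "numpy",
    "nltk": "nltk",
    "matplotlib": "matplotlib",
}


def missing_packages(required=REQUIRED):
    """Return the package names whose module cannot be found for import."""
    return [
        package
        for package, module in required.items()
        if find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Not installed yet:", ", ".join(missing))
    else:
        print("All listed libraries are importable.")
```

An empty result means the notebooks' own installation steps should have nothing left to download for these five libraries.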