Fetching and Processing Emails
The extraction.py tool fetches all of the user's sent emails, processes them into the desired format, and saves them.
Usage:
$ python extraction.py -h
usage: extraction.py [-h] --out OUT [--reload RELOAD] [--info INFO]
Tool for extracting emails from a user's account
optional arguments:
-h, --help show this help message and exit
required arguments:
--out OUT Output directory
optional arguments:
--reload RELOAD If true, remove any existing account.
--info INFO If true, create an info file containing the headers.
A token.pickle file is created automatically when the authorization flow completes for the first time. To discard the stored account, fetch all sent emails from a new email account, and save them in the emails directory, pass the --reload True argument:
$ python extraction.py --out emails --reload True
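The flags in the usage text could be declared roughly as follows. This is a minimal sketch rather than the tool's actual source: the `str_to_bool` and `forget_account` helpers are hypothetical names, and the idea that reloading amounts to deleting token.pickle is an assumption based on the description above.

```python
import argparse
import os


def str_to_bool(value):
    """Interpret common spellings of true/false given on the command line."""
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser():
    """Build a parser that mirrors the flags listed in the usage text."""
    parser = argparse.ArgumentParser(
        prog="extraction.py",
        description="Tool for extracting emails from a user's account",
    )
    required = parser.add_argument_group("required arguments")
    required.add_argument("--out", required=True, help="Output directory")
    optional = parser.add_argument_group("optional arguments")
    optional.add_argument("--reload", type=str_to_bool, default=False,
                          help="If true, remove any existing account.")
    optional.add_argument("--info", type=str_to_bool, default=False,
                          help="If true, create an info file containing the headers.")
    return parser


def forget_account(token_path="token.pickle"):
    """Assumed reload behaviour: delete the cached token if it exists."""
    if os.path.exists(token_path):
        os.remove(token_path)
        return True
    return False
```

With this sketch, `build_parser().parse_args(["--out", "emails", "--reload", "True"])` yields a namespace whose `reload` attribute is the boolean `True`, which the caller could use to decide whether to call `forget_account()` before starting the authorization flow.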
To connect to an email account, the tool uses the Gmail API, which provides flexible RESTful access to the emails of a Gmail account. As a result, only Gmail accounts are supported, although the tool could be extended to other email providers.
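As an illustration of the fetching step, the sketch below pages through the messages that carry Gmail's SENT label. It assumes `service` is a Gmail API client from google-api-python-client; the function name `list_sent_message_ids` is hypothetical and is not taken from extraction.py.

```python
def list_sent_message_ids(service, user_id="me"):
    """Return the ids of all messages that carry the SENT label."""
    ids = []
    page_token = None
    while True:
        # users.messages.list returns one page of results at a time.
        response = (
            service.users()
            .messages()
            .list(userId=user_id, labelIds=["SENT"], pageToken=page_token)
            .execute()
        )
        ids.extend(message["id"] for message in response.get("messages", []))
        page_token = response.get("nextPageToken")
        if not page_token:
            # No further pages: every sent message id has been collected.
            return ids
```

The `service` object would typically come from `googleapiclient.discovery.build("gmail", "v1", credentials=creds)`, where `creds` are the credentials cached in token.pickle. Each id can then be passed to `users().messages().get(...)` to download the full message.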
After fetching, the body of each email contains a lot of unwanted content that must be removed. The clean body should contain only Greek words, since it is used as input to the language model tool. To achieve this, we use:
- BeautifulSoup library to remove all HTML markup.
- alphabet-detector library to detect and keep Greek words.
Also, some emails contain the whole history of the conversation. Since we need only the newly sent email, the earlier messages are removed. Finally, we remove all punctuation and non-alphabetic characters and convert all characters to lowercase.
Before:
Καλησπέρα σας,
Θα ήθελα να ρωτήσω πόσο πήρα στο μάθημα Machine Learning με κωδικό 12345.
--
Αντωνιάδης Παναγιώτης
After:
καλησπέρα σας θα ήθελα να ρωτήσω πόσο πήρα στο μάθημα με κωδικό
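The cleaning steps can be approximated with the sketch below. To stay self-contained it uses only the standard library in place of BeautifulSoup and alphabet-detector: `html.parser` strips the markup and Unicode character names identify Greek letters. Treating a line that contains only `--` as the start of the signature, and lines starting with `>` as quoted history, are assumptions; all function names are hypothetical.

```python
import re
import unicodedata
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(raw):
    """Return the text content of raw with all tags removed."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return "".join(parser.parts)


def is_greek(word):
    """True if every character is a letter whose Unicode name mentions GREEK."""
    return bool(word) and all(
        ch.isalpha() and "GREEK" in unicodedata.name(ch, "") for ch in word
    )


def clean_body(raw):
    """Reduce an email body to lowercase Greek words separated by spaces."""
    kept = []
    for line in strip_html(raw).splitlines():
        stripped = line.strip()
        if stripped == "--":          # assumed signature delimiter
            break
        if stripped.startswith(">"):  # assumed quoted-history marker
            continue
        kept.append(stripped)
    # Turn punctuation, digits and underscores into spaces, then lowercase.
    text = re.sub(r"[\W\d_]+", " ", " ".join(kept)).lower()
    return " ".join(word for word in text.split() if is_greek(word))
```

Applied to the "Before" email above, `clean_body` returns the "After" string: the English words, the number, the punctuation and the signature are all dropped.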
Finally, each clean email is saved in the out directory as email_{id}. Also, passing the --info True argument saves an info file that contains the headers of the emails in the following format:
sender | receiver | subject
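The saving step might look like the following sketch. The dictionary keys, the `save_emails` name, the info file name `info`, its location inside the output directory and the one-line-per-email layout are all assumptions made for illustration; only the `email_{id}` naming and the `sender | receiver | subject` format come from the description above.

```python
import os


def save_emails(out_dir, emails, write_info=False, info_name="info"):
    """Write each clean body to email_{id} and optionally an info file.

    emails is a list of dicts with the hypothetical keys
    id, body, sender, receiver and subject.
    """
    os.makedirs(out_dir, exist_ok=True)
    for email in emails:
        path = os.path.join(out_dir, f"email_{email['id']}")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(email["body"])
    if write_info:
        info_path = os.path.join(out_dir, info_name)
        with open(info_path, "w", encoding="utf-8") as handle:
            for email in emails:
                # One header line per email: sender | receiver | subject
                handle.write(
                    f"{email['sender']} | {email['receiver']} | {email['subject']}\n"
                )
```

Opening the files with an explicit UTF-8 encoding keeps the Greek text intact regardless of the platform's default encoding.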