Fetching and Processing Emails

Panagiotis Antoniadis edited this page Jun 22, 2019 · 4 revisions

Tools

The extraction.py tool fetches all of the user's sent emails, processes them into the desired format, and saves them.

Usage:

$ python extraction.py -h
usage: extraction.py [-h] --out OUT [--reload RELOAD] [--info INFO]

Tool for extracting emails from a user's account

optional arguments:
  -h, --help       show this help message and exit

required arguments:
  --out OUT        Output directory

optional arguments:
  --reload RELOAD  If true, remove any existing account.
  --info INFO      If true, create an info file containing the headers.

A token.pickle file is created automatically the first time the authorization flow completes. To fetch all sent emails from a new email account and save them in the emails directory, pass --reload True so that any previously stored account is removed:

$ python extraction.py --out emails --reload True
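
As a rough illustration of the interface shown in the usage text, the sketch below rebuilds the command line with argparse and shows one way the --reload flag could be honoured. The helper names str2bool, build_parser and reset_account are hypothetical, and deleting token.pickle is only an assumed reading of "remove any existing account"; the real extraction.py may work differently.

```python
import argparse
import os


def str2bool(value):
    """Parse 'True'/'False'-style strings, since both flags take a value."""
    lowered = value.lower()
    if lowered in ("true", "t", "yes", "1"):
        return True
    if lowered in ("false", "f", "no", "0"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser():
    """Recreate the documented options: --out (required), --reload, --info."""
    parser = argparse.ArgumentParser(
        prog="extraction.py",
        description="Tool for extracting emails from a user's account",
    )
    required = parser.add_argument_group("required arguments")
    required.add_argument("--out", required=True, help="Output directory")
    optional = parser.add_argument_group("optional arguments")
    optional.add_argument("--reload", type=str2bool, default=False,
                          help="If true, remove any existing account.")
    optional.add_argument("--info", type=str2bool, default=False,
                          help="If true, create an info file containing the headers.")
    return parser


def reset_account(token_path="token.pickle"):
    """Assumption: 'removing the account' means deleting the cached token."""
    if os.path.exists(token_path):
        os.remove(token_path)
        return True
    return False


if __name__ == "__main__":
    args = build_parser().parse_args()
    if args.reload:
        reset_account()
```

With this layout, running the script with --out emails --reload True would drop the cached token before a new authorization flow starts.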

Libraries and APIs

Connection

To connect to an email account, the tool uses the Gmail API, which provides flexible RESTful access to the emails of a Gmail account. As a result, only Gmail accounts are currently supported, although the tool could be extended to other email providers.
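
A minimal sketch of how the sent messages could be enumerated through the Gmail API is shown below. It assumes that service is a client object built with the google-api-python-client package, for example via googleapiclient.discovery.build("gmail", "v1", credentials=creds), and that the SENT system label is used to select sent mail. The function name list_sent_message_ids is hypothetical and is not taken from extraction.py.

```python
def list_sent_message_ids(service, user_id="me"):
    """Return the ids of all messages labelled SENT, following pagination.

    `service` is assumed to be a Gmail API client; only the
    users().messages().list(...).execute() call chain is used here.
    """
    ids = []
    page_token = None
    while True:
        params = {"userId": user_id, "labelIds": ["SENT"]}
        if page_token:
            # Only send pageToken once the API has handed one back.
            params["pageToken"] = page_token
        response = service.users().messages().list(**params).execute()
        ids.extend(message["id"] for message in response.get("messages", []))
        page_token = response.get("nextPageToken")
        if not page_token:
            return ids
```

Each returned id can then be passed to users().messages().get(...) to download the full message for processing.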

Processing

After fetching, the body of each email still contains unwanted content such as HTML markup, quoted history, and non-Greek text, which should be removed. The clean body should contain only Greek words, since it is used as input to the language model tool. To achieve this, we use:

  • the BeautifulSoup library to strip all HTML markup.
  • the alphabet-detector library to detect and keep Greek words.

Also, some emails contain the whole history of the conversation. Since only the newly sent email is needed, the earlier messages are removed. Finally, all punctuation and non-alphabetic characters are removed and all characters are converted to lowercase.

Before:

Καλησπέρα σας,

Θα ήθελα να ρωτήσω πόσο πήρα στο μάθημα Machine Learning με κωδικό 12345.

--
Αντωνιάδης Παναγιώτης 

After:

καλησπέρα σας θα ήθελα να ρωτήσω πόσο πήρα στο μάθημα με κωδικό
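
The cleaning steps can be sketched roughly as follows. To stay self-contained, this sketch swaps in the standard library for the two packages named above: html.parser stands in for BeautifulSoup and unicodedata character names stand in for alphabet-detector. The rules for dropping history and signatures (lines starting with ">" and everything after a "--" line) are assumptions inferred from the Before/After example, and all function names are hypothetical.

```python
import re
import unicodedata
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and drops tags (stand-in for BeautifulSoup)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(body):
    parser = _TextExtractor()
    parser.feed(body)
    parser.close()
    return "".join(parser.parts)


def drop_history_and_signature(text):
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "--":          # assumed signature delimiter
            break
        if stripped.startswith(">"):  # assumed quoted-reply marker
            continue
        kept.append(line)
    return "\n".join(kept)


def is_greek_word(word):
    """True if every character is a letter whose Unicode name is Greek."""
    return bool(word) and all(
        ch.isalpha() and unicodedata.name(ch, "").startswith("GREEK")
        for ch in word
    )


def clean_body(body):
    text = drop_history_and_signature(strip_html(body))
    text = unicodedata.normalize("NFC", text)  # keep accented letters composed
    # Turn punctuation, digits and other non-letters into word separators.
    text = re.sub(r"[\W\d_]+", " ", text)
    return " ".join(w.lower() for w in text.split() if is_greek_word(w))
```

Applied to the Before text above, clean_body drops the English words, the course code, the punctuation and the signature, leaving only the lowercase Greek words.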

Finally, each clean email is saved in the out directory as email_{id}. In addition, passing the --info True argument saves an info file that contains the headers of the emails in the following format: sender | receiver | subject.
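
The saving step might look like the sketch below. The email_{id} file names and the sender | receiver | subject line format come from the description above; the dictionary keys, the function name save_emails, and the info file name "info" are hypothetical, since the page does not state them.

```python
from pathlib import Path


def save_emails(emails, out_dir, write_info=False, info_name="info"):
    """Write each clean body to email_{id}; optionally write the headers file.

    `emails` is assumed to be a list of dicts with the keys
    'id', 'body', 'sender', 'receiver' and 'subject'.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    info_lines = []
    for email in emails:
        path = out / f"email_{email['id']}"
        path.write_text(email["body"], encoding="utf-8")
        info_lines.append(
            " | ".join(email[key] for key in ("sender", "receiver", "subject"))
        )
    if write_info:
        # One header line per email, in the same order as the saved files.
        (out / info_name).write_text("\n".join(info_lines) + "\n", encoding="utf-8")
```

Writing with an explicit UTF-8 encoding keeps the Greek text intact regardless of the platform's default encoding.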