Between the <start> and <end> lines is a sample configuration file for extracting text from emails.

The <start> and <end> lines are not part of the sample file.

<start>
collect_conf collected.conf
collected collected
media_types ~/MediaTypes
extracted extracts
text_content_type text/plain
pdf_content_type application/pdf
csv_content_type text/comma-separated-values
ss_content_type application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
docx_content_type application/vnd.openxmlformats-officedocument.wordprocessingml.document
odt_content_type application/vnd.oasis.opendocument.text
xlsx_content_type application/vnd.openxmlformats-officedocument.wordprocessingml.document
ods_content_type application/vnd.oasis.opendocument.text
earliestdate 2017-09-01
mostrecentdate 30 June 2018
emailsender a.sender@verdant.net
include_csv_file datafile.csv
exclude_csv_file csvdatafile.csv
include_ss_file_sheet data
exclude_ss_file_sheet sheet2

ignore 20171008021048a.sender@verdant.net+0000.mbs
<end>

The lines can be in any order.

From the top:


The emailstore package collects emails from one or more mailboxes and puts them in the directory named in the collected line of it's configuration file.  The configuration file for emailstore is called 'collected.conf' by default.

The emailextract package extracts text from emails from one of these places in order, ignoring any others: the directory named in the collected line in the 'collect_conf' file, the directory named in 'collected' line, the 'collected' directory in the directory containing the emailextract configuration file.

collect_conf collected.conf
collected collected


Media types registered with IANA are available in a set of csv files which can be downloaded from www.iana.org/assignments/media-types.xhtml.

The download location is in the media_types line. Note emailextract does not use this information at present.

media_types ~/MediaTypes


Text extracted from emails is stored in files, one per mail, in the directory named on the extracted line in the directory containing the emailextract configuration file.

These files hold two versions of the text: the one supplied in the email, and the one with any edits made.

Text is extracted from the emails and copied to the directory named on the extracted line provided all the existing copies match the original emails.  The absence of an email counts as not matching.

extracted extracts


A number of *_content_type lines can be used to identify the media types which select email body and attachments from which text is extracted.

Eight kinds of content type line are available:

text: body or attachment is extracted as it appears in the email.
pdf:  body or attachment is extracted using pdf2text tool from Xpdf.
csv:  body or attachment is extracted as it appears in the email.
ss:   body or attachment is extracted using ssconvert tool from Gnumeric.
docx: body or attachment text is extracted using Python's xml functions.
odt:  body or attachment text is extracted using Python's xml functions.
xlsx: body or attachment text is extracted using Python's xml functions.
ods:  body or attachment text is extracted using Python's xml functions.

ss takes priority over xlsx and ods where the same media type is named in both kinds of line.  Gnumeric's ssconvert utility is used for ss lines and no text is extracted if Gnumeric is not available.  Gnumeric's ssconvert utility is used for xlsx and ods lines if available, but Python's xml functions are used if not available.

text_content_type text/plain
pdf_content_type application/pdf
csv_content_type text/comma-separated-values
ss_content_type application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
docx_content_type application/vnd.openxmlformats-officedocument.wordprocessingml.document
odt_content_type application/vnd.oasis.opendocument.text
xlsx_content_type application/vnd.openxmlformats-officedocument.wordprocessingml.document
ods_content_type application/vnd.oasis.opendocument.text


Text is extracted from emails sent on or after the date on the earliestdate line and on or before the date on the mostrecentdate line.

The absence of a date implies no limit in that direction.

Several date formats are accepted but the two preferred formats appear in the sample file.

earliestdate 2017-09-01
mostrecentdate 30 June 2018


Emails can be selected by sender address.  When any emailsender lines are present only emails from the named addresses are selected.

emailsender a.sender@verdant.net


Particular csv file attachments and sheets from spreadsheet attachments may be included or excluded from the extract.

One or more include_csv_file lines imply any csv file not named in these lines is excluded from the extract.  One or more exclude_csv_file lines imply any csv file not named is included in the extract.  When both kinds of line are present the exclude lines are ignored.

One or more include_ss_file_sheet lines imply any spreadsheet sheet not named in these lines is excluded from the extract.  One or more exclude_ss_file_sheet lines imply any spreadsheet sheet not named is included in the extract.  When both kinds of line are present the exclude lines are ignored.

The csv and ss include and exclude instructions apply to all attachments to all selected emails.

include_csv_file datafile.csv
exclude_csv_file csvdatafile.csv
include_ss_file_sheet data
exclude_ss_file_sheet sheet2


Emails may be ignored when extracting text to the directory named on the extracted line.  The file name can be typed, but is usually generated by right-click over the display of the full content of the email which appends the ignore line to the configuration file.

ignore 20171008021048a.sender@verdant.net+0000.mbs
