Mining Enron data

Post by KBleivik » Sat Mar 03, 2012 1:45 pm

1. Enron - Energy giant that was transformed into a casino.

Some of you who read this post may know the book ... 1591840082 and the related video. Personally, I saw a related play in London. It was a thought-provoking play about structured finance and self-proclaimed geniuses. In sum, a renowned energy company was transformed into a casino over an oil pool. The data is available online.

2. Enron Data.

The Enron case is considered important enough to have its own data website. The Enron email data set is also available.

See also: ... -set-files

Keyword search for additional information: Enron data

3. Using Python to mine the data.

To avoid the risk of a broken link, one code example is reproduced here:

# -*- coding: utf-8 -*-

import sys
import mailbox
import email
import quopri
from BeautifulSoup import BeautifulSoup
import dateutil.parser as parser # pip install python-dateutil==1.5 for python2.6

try:
    import jsonlib2 as json  # much faster than Python 2.6.x's stdlib
except ImportError:
    import json

MBOX = sys.argv[1]

def cleanContent(msg):

    # Decode message from "quoted printable" format

    msg = quopri.decodestring(msg)

    # Strip out HTML tags, if any are present

    soup = BeautifulSoup(msg)
    return ''.join(soup.findAll(text=True))

def jsonifyMessage(msg):

    # Collect headers under lower-cased keys, one list of values per header

    headers = {}
    for (k, v) in msg.items():
        k = k.lower()
        v = v.decode('utf-8', 'ignore')
        headers[k] = [v]
        if k == "date":
            # Normalize the date header to ISO 8601
            date = parser.parse(v)
            headers['date'] = [date.isoformat()]
    json_msg = {'parts': [], 'headers': headers}
    try:
        for part in msg.walk():
            # Multipart containers carry no body text of their own
            if part.get_content_maintype() == 'multipart':
                continue
            json_part = {"headers": {"content-type": [part.get_content_type()]}}
            # TODO store attachments in _attachments key for couchdb upload
            content = part.get_payload(decode=False).decode('utf-8', 'ignore')
            json_part['bodytext'] = cleanContent(content)
            json_msg['parts'].append(json_part)
    except Exception, e:
        sys.stderr.write('Skipping message - error encountered (%s)\n' % (str(e), ))
    return json_msg

# Note: opening in binary mode is recommended
mbox = mailbox.UnixMailbox(open(MBOX, 'rb'), email.message_from_file) 
json_msgs = []
while 1:
    msg = mbox.next()
    if msg is None:
        break
    json_msgs.append(jsonifyMessage(msg))

print json.dumps(json_msgs, indent=4)

Source: ...

Related thread: ... mbox2couch
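
The listing above targets Python 2 (BeautifulSoup 3, mailbox.UnixMailbox, the print statement). For readers on Python 3, the same idea can be sketched with the standard library only. This is a rough port under my own assumptions, not the original author's code: the names clean_content, jsonify_message and mbox_to_json are hypothetical, html.parser stands in for BeautifulSoup, and email.utils.parsedate_to_datetime stands in for dateutil.

```python
import json
import mailbox
import quopri
from email.utils import parsedate_to_datetime
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and drops the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def clean_content(text):
    """Decode quoted-printable text, then strip HTML tags if any are present."""
    raw = text.encode("utf-8", "surrogateescape")
    decoded = quopri.decodestring(raw).decode("utf-8", "ignore")
    extractor = _TextExtractor()
    extractor.feed(decoded)
    extractor.close()
    return "".join(extractor.chunks)


def jsonify_message(msg):
    """Turn an email.message.Message into a JSON-serializable dict."""
    headers = {}
    for key, value in msg.items():
        headers[key.lower()] = [str(value)]
    if "date" in headers:
        try:
            parsed = parsedate_to_datetime(headers["date"][0])
            headers["date"] = [parsed.isoformat()]
        except (TypeError, ValueError):
            pass  # keep the raw date string if it cannot be parsed
    json_msg = {"parts": [], "headers": headers}
    for part in msg.walk():
        # Multipart containers carry no body text of their own
        if part.get_content_maintype() == "multipart":
            continue
        payload = part.get_payload(decode=False)
        if not isinstance(payload, str):
            continue
        json_msg["parts"].append({
            "headers": {"content-type": [part.get_content_type()]},
            "bodytext": clean_content(payload),
        })
    return json_msg


def mbox_to_json(path):
    """Serialize every message in the mbox file at `path` as a JSON array."""
    messages = [jsonify_message(m) for m in mailbox.mbox(path)]
    return json.dumps(messages, indent=4)
```

Calling print(mbox_to_json(sys.argv[1])) after importing sys would mirror the command-line behaviour of the original script. Attachments are still not handled, as in the original TODO.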

4. Database platform and literature.

Python and databases. Good enough is sometimes best

5. Exercise.

Translate the code to C / C++ and measure any efficiency or speed improvement.
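
To have something to compare a C / C++ version against, it helps to time the Python version first. The helper below is a minimal sketch in Python 3; time_call is a hypothetical name of my own, and taking the best of several runs is just one common way to reduce noise.

```python
import time


def time_call(func, *args, repeat=5):
    """Return the best wall-clock time in seconds over `repeat` calls."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best
```

Pass it whatever function performs the conversion in your own script, together with the path to the mbox file, and run the compiled version on the same file for a fair comparison.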
