Mining Enron data

Post by KBleivik

1. Enron - Energy giant that was transformed into a casino.

Some of you reading this post may know the book http://www.amazon.com/Smartest-Guys-Roo ... 1591840082 and the related video: http://www.imdb.com/title/tt1016268/ Personally, I saw a related play in London. It was a thought-provoking play about structured finance and self-proclaimed geniuses. In sum, a renowned energy company was transformed into a casino over an oil pool. The data from the case is available online.

2. Enron Data.

The Enron case may be regarded as so important that it has its own data website: http://enrondata.org/ The Enron Email Dataset http://www.cs.cmu.edu/~enron/ is also available.

See also: http://www.edrm.net/resources/data-sets ... -set-files

Keyword search for additional information: Enron data

3. Using Python to mine the data.

To avoid the risk of a broken link, one code example is reproduced here:

# -*- coding: utf-8 -*-

import sys
import mailbox
import email
import quopri
from BeautifulSoup import BeautifulSoup
import dateutil.parser as parser # pip install python-dateutil==1.5 for python2.6

try:
    import jsonlib2 as json  # much faster than Python 2.6.x's stdlib json
except ImportError:
    import json

MBOX = sys.argv[1]

def cleanContent(msg):

    # Decode message from "quoted printable" format

    msg = quopri.decodestring(msg)

    # Strip out HTML tags, if any are present

    soup = BeautifulSoup(msg)
    return ''.join(soup.findAll(text=True))


def jsonifyMessage(msg):
    headers = {}
    for (k, v) in msg.items():
        k = k.lower()
        v = v.decode('utf-8', 'ignore')
        headers[k] = [v]
        # Normalize the Date header to ISO 8601
        if k == "date":
            date = parser.parse(v)
            headers['date'] = [date.isoformat()]
    
    json_msg = {'parts': [], 'headers': headers}
    
    try:
        for part in msg.walk():
            if part.get_content_maintype() == 'multipart':
                continue
            json_part = {"headers": {"content-type": []}}
            # TODO store attachments in _attachments key for couchdb upload
            json_part['headers']['content-type'].append(part.get_content_type())
            content = part.get_payload(decode=False).decode('utf-8', 'ignore')
            json_part['bodytext'] = cleanContent(content)
            json_msg['parts'].append(json_part)
    except Exception, e:
        # Report the problem and keep whatever parts were parsed so far
        sys.stderr.write('Skipping message - error encountered (%s)\n' % (str(e), ))
    return json_msg

# Note: opening in binary mode is recommended
mbox = mailbox.UnixMailbox(open(MBOX, 'rb'), email.message_from_file)  
json_msgs = []
while 1:
    msg = mbox.next()
    if msg is None:
        break
    json_msgs.append(jsonifyMessage(msg))

print json.dumps(json_msgs, indent=4)
Source: https://raw.github.com/maxogden/couchma ... fy_mbox.py

Related thread: https://github.com/maxogden/couchmail/t ... mbox2couch
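
The script above is written for Python 2: it uses print as a statement, the "except Exception, e" syntax and mailbox.UnixMailbox, none of which exist in Python 3. What follows is a minimal sketch of the same idea for Python 3, using only the standard library. It is not the original author's code. The names clean_content, jsonify_message and jsonify_mbox are my own, html.parser stands in for BeautifulSoup, and email.utils.parsedate_to_datetime stands in for dateutil, so results on unusual date formats or badly broken HTML may differ from the original.

```python
import email.utils
import json
import mailbox
import sys
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def clean_content(text):
    """Strip HTML tags, if any are present, and return the remaining text."""
    extractor = _TextExtractor()
    extractor.feed(text)
    extractor.close()  # flush any text still buffered by the parser
    return "".join(extractor.chunks)


def jsonify_message(msg):
    """Convert one email message into a JSON-serializable dict."""
    headers = {}
    for key, value in msg.items():
        key = key.lower()
        headers[key] = [str(value)]
        if key == "date":
            try:
                parsed = email.utils.parsedate_to_datetime(str(value))
                headers["date"] = [parsed.isoformat()]
            except (TypeError, ValueError):
                pass  # keep the raw header if it cannot be parsed

    json_msg = {"parts": [], "headers": headers}
    for part in msg.walk():
        if part.get_content_maintype() == "multipart":
            continue
        # decode=True undoes quoted-printable or base64 transfer encoding
        payload = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "utf-8"
        try:
            text = payload.decode(charset, "ignore")
        except LookupError:  # unknown charset name in the message
            text = payload.decode("utf-8", "ignore")
        json_msg["parts"].append({
            "headers": {"content-type": [part.get_content_type()]},
            "bodytext": clean_content(text),
        })
    return json_msg


def jsonify_mbox(path):
    """Convert every message of the mbox file at `path`."""
    box = mailbox.mbox(path, create=False)
    try:
        return [jsonify_message(msg) for msg in box]
    finally:
        box.close()


if __name__ == "__main__" and len(sys.argv) > 1:
    print(json.dumps(jsonify_mbox(sys.argv[1]), indent=4))
```

Saved under a file name of your choice, it is called the same way as the original: the path to the mbox file is the only argument, and the JSON is written to standard output.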

4. Database platform and literature.

Python and databases. Good enough is sometimes best

5. Exercise.

Port the code to C or C++ and measure any improvement in efficiency or speed.
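
To observe a speed improvement you first need a baseline for the Python version. Here is a minimal sketch of a timing helper for that. The names time_call and speedup are hypothetical, and wall-clock time over a few repeats is only a rough measure, not a rigorous benchmark.

```python
import statistics
import time


def time_call(func, *args, repeats=5):
    """Run func(*args) `repeats` times; return (best, median) wall-clock seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        timings.append(time.perf_counter() - start)
    return min(timings), statistics.median(timings)


def speedup(baseline_seconds, candidate_seconds):
    """Return how many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds
```

A compiled port can be timed with the same helper by passing subprocess.run together with the command line of your binary as the argument. Both versions should be run on the same mbox file.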
