Some of you who read this post may know the book http://www.amazon.com/Smartest-Guys-Roo ... 1591840082 and the related video: http://www.imdb.com/title/tt1016268/ Personally, I saw a related play in London. It was a thought-provoking play about structured finance and self-proclaimed geniuses. In sum, a renowned energy company was transformed into a casino built over an oil pool. The data is available online.
2. Enron Data.
The Enron case is regarded as so important that it has its own data website: http://enrondata.org/ The Enron Email Dataset http://www.cs.cmu.edu/~enron/ is also available.
See also: http://www.edrm.net/resources/data-sets ... -set-files
Keyword search for additional information: Enron data
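Before mining anything, it helps to take a quick look at the raw messages. The following is only a minimal Python 3 sketch. It assumes the email data set has been unpacked into a directory tree with one plain-text message per file; check the layout of your own copy first. The function names iter_messages and summarize are made up for this illustration.

```python
import email
import os
from email import policy


def iter_messages(root):
    """Yield (path, message) for every file found under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # Read as bytes so odd characters in old mail do not raise errors
            with open(path, "rb") as fh:
                yield path, email.message_from_binary_file(fh, policy=policy.default)


def summarize(root, limit=5):
    """Return up to `limit` (sender, subject) pairs from the messages under root."""
    rows = []
    for _path, msg in iter_messages(root):
        rows.append((str(msg.get("From", "")), str(msg.get("Subject", ""))))
        if len(rows) >= limit:
            break
    return rows
```

Calling summarize on the top-level directory of the unpacked data returns a short list of (sender, subject) tuples, which is usually enough to confirm that the files parse as expected.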
3. Using Python to mine the data.
To avoid the risk of a broken link, one code example is reproduced here:
# -*- coding: utf-8 -*-
# Note: this is Python 2 code (print statement, mailbox.UnixMailbox, BeautifulSoup 3).

import sys
import mailbox
import email
import quopri
from BeautifulSoup import BeautifulSoup
import dateutil.parser as parser  # pip install python-dateutil==1.5 for python2.6

try:
    import jsonlib2 as json  # much faster than Python 2.6.x's stdlib
except ImportError:
    import json

MBOX = sys.argv[1]


def cleanContent(msg):
    # Decode message from "quoted printable" format
    msg = quopri.decodestring(msg)
    # Strip out HTML tags, if any are present
    soup = BeautifulSoup(msg)
    return ''.join(soup.findAll(text=True))


def jsonifyMessage(msg):
    # Collect the headers, lower-casing the names; each value is stored in a list
    headers = {}
    for (k, v) in msg.items():
        k = k.lower()
        v = v.decode('utf-8', 'ignore')
        headers[k] = [v]
        if k == "date":
            # Normalize the date header to ISO 8601
            date = parser.parse(v)
            headers['date'] = [date.isoformat()]
    json_msg = {'parts': [], 'headers': headers}
    try:
        for part in msg.walk():
            # Skip the multipart containers; only the leaf parts carry content
            if part.get_content_maintype() == 'multipart':
                continue
            json_part = {"headers": {"content-type": []}}
            # TODO store attachments in _attachments key for couchdb upload
            json_part['headers']['content-type'].append(part.get_content_type())
            content = part.get_payload(decode=False).decode('utf-8', 'ignore')
            json_part['bodytext'] = cleanContent(content)
            json_msg['parts'].append(json_part)
    except Exception, e:
        sys.stderr.write('Skipping message - error encountered (%s)\n' % (str(e), ))
    finally:
        # Returns whatever was collected, even if a part failed to parse
        return json_msg


# Note: opening in binary mode is recommended
mbox = mailbox.UnixMailbox(open(MBOX, 'rb'), email.message_from_file)
json_msgs = []
while 1:
    msg = mbox.next()
    if msg is None:
        break
    json_msgs.append(jsonifyMessage(msg))

print json.dumps(json_msgs, indent=4)
Related thread: https://github.com/maxogden/couchmail/t ... mbox2couch
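The script in this section is written for Python 2 and relies on mailbox.UnixMailbox, which no longer exists in Python 3. For readers on Python 3, here is a rough sketch of the same mbox-to-JSON idea using only the standard library. It is not the original author's code. It deliberately leaves out the HTML stripping (no BeautifulSoup) and replaces dateutil with email.utils.parsedate_to_datetime. The function names jsonify_message and mbox_to_json are my own.

```python
import json
import mailbox
from email.utils import parsedate_to_datetime


def jsonify_message(msg):
    """Convert one message into a JSON-serializable dict."""
    headers = {}
    for key, value in msg.items():
        headers[key.lower()] = [str(value)]
    if "date" in headers:
        try:
            parsed = parsedate_to_datetime(headers["date"][0])
            headers["date"] = [parsed.isoformat()]
        except (TypeError, ValueError):
            pass  # keep the raw date string if it cannot be parsed
    parts = []
    for part in msg.walk():
        if part.get_content_maintype() == "multipart":
            continue
        # decode=True undoes quoted-printable / base64 transfer encodings
        payload = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "utf-8"
        try:
            text = payload.decode(charset, "replace")
        except LookupError:  # unknown charset name in the message
            text = payload.decode("utf-8", "replace")
        parts.append({
            "headers": {"content-type": [part.get_content_type()]},
            "bodytext": text,
        })
    return {"headers": headers, "parts": parts}


def mbox_to_json(path):
    """Return a list of dicts, one per message in the mbox file at path."""
    return [jsonify_message(m) for m in mailbox.mbox(path)]
```

To get the same kind of output as the Python 2 script, pass the result to json.dumps with indent=4 and print it.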
4. Database platform and literature.
Python and databases. Good enough is sometimes best
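As one possible illustration of "good enough", the sketch below stores a few headers in SQLite through Python's built-in sqlite3 module. This is my own choice for the example, not a platform named in this post (the code in section 3 mentions CouchDB instead). It assumes each message is a dict with a 'headers' key whose values are lists, as in section 3. The table name, column names and function names are made up.

```python
import sqlite3


def store_headers(conn, messages):
    """Insert one (sender, subject, date) row per message; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages (sender TEXT, subject TEXT, date TEXT)"
    )
    rows = [
        (
            m["headers"].get("from", [""])[0],
            m["headers"].get("subject", [""])[0],
            m["headers"].get("date", [""])[0],
        )
        for m in messages
    ]
    conn.executemany("INSERT INTO messages VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def top_senders(conn, limit=10):
    """Return (sender, message_count) pairs, most active sender first."""
    return conn.execute(
        "SELECT sender, COUNT(*) AS n FROM messages "
        "GROUP BY sender ORDER BY n DESC, sender LIMIT ?",
        (limit,),
    ).fetchall()
```

With conn = sqlite3.connect(":memory:") nothing is written to disk, which is handy for experiments; pass a file name instead to keep the data.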
5. Exercise.
Port the code to C or C++ and measure any improvement in efficiency or speed.
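To have something to compare a C or C++ port against, you first need a baseline timing of the Python version. Here is a small sketch of one way to measure it; the helper name time_call is made up, and taking the best of a few runs is only one of several reasonable conventions.

```python
import time


def time_call(func, *args, repeat=3):
    """Return the best wall-clock time in seconds over `repeat` runs of func(*args)."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best
```

Wrap the conversion step in a function, time it with time_call on the same input file you feed to the compiled version, and compare the two numbers. Remember that much of the run time may be spent reading the file rather than parsing it.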