|
According to Carnegie Mellon University's William W. Cohen, who posted the dataset on the Web, the Enron e-mail dataset is proving to be "a resource for researchers who are interested in improving current e-mail tools, or understanding how e-mail is currently used. This data is valuable; to my knowledge it is the only substantial collection of 'real' e-mail that is public." As a result, researchers at such institutions as MIT, UC Berkeley, University of Massachusetts, University of Southern California and SRI have used the data to study social networks as evidenced by the exchange of e-mail messages.
The Berkeley Lab group decided to run a series of searches on the Enron e-mail dataset to see how FastBit, an efficient compressed bitmap indexing technology developed by the group, stacked up against MySQL, the database in which the data were stored and which bills itself as "the world's most popular open source database".
In a report published in January 2006, the group evaluated how MySQL and FastBit handled a number of queries on a dataset of 250,000 unique e-mail messages sent by 151 Enron employees. FastBit outperformed MySQL, running between 10 and 1,000 times faster depending on the size of the search result. To achieve their results, group members conducted several experiments.
"In our first set of experiments we measured the performance of searching for specific senders and receivers of the e-mails", wrote Kurt Stockinger, lead author of the group's report. "We built an index for each of these two attributes. Since both senders and receivers are in different database tables, this kind of search requires an expensive 'join operation'."
To reduce the processing time needed for such join operations, the group combined the two separate tables into a new table containing the names of all senders and receivers. This new table, called the materialized table, contains 2 million records. Since there were 250,000 original messages, each message has, on average, 8 recipients.
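The idea of a materialized table can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module rather than MySQL, and the table and column names (messages, recipients, materialized, mid, sender, receiver) are assumptions made for illustration; the report's actual schema is not given here.

```python
import sqlite3


def build_materialized(conn):
    """Pre-compute the sender/receiver join once and index both attributes."""
    conn.executescript("""
        DROP TABLE IF EXISTS materialized;
        CREATE TABLE materialized AS
            SELECT m.mid AS mid, m.sender AS sender, r.receiver AS receiver
            FROM messages AS m
            JOIN recipients AS r ON r.mid = m.mid;
        CREATE INDEX idx_mat_sender ON materialized(sender);
        CREATE INDEX idx_mat_receiver ON materialized(receiver);
    """)


def make_toy_db():
    """Build a tiny in-memory database with a hypothetical two-table layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE messages (mid INTEGER PRIMARY KEY, sender TEXT);
        CREATE TABLE recipients (mid INTEGER, receiver TEXT);
    """)
    conn.executemany(
        "INSERT INTO messages VALUES (?, ?)",
        [(1, "alice"), (2, "bob"), (3, "alice")],
    )
    conn.executemany(
        "INSERT INTO recipients VALUES (?, ?)",
        [(1, "bob"), (1, "carol"), (2, "alice"), (3, "dave")],
    )
    build_materialized(conn)
    return conn


def recipients_of(conn, sender):
    """Answer the query from the single materialized table, with no join."""
    rows = conn.execute(
        "SELECT DISTINCT receiver FROM materialized "
        "WHERE sender = ? ORDER BY receiver",
        (sender,),
    )
    return [row[0] for row in rows]


conn = make_toy_db()
record_count = conn.execute("SELECT COUNT(*) FROM materialized").fetchone()[0]
message_count = conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
avg_recipients = record_count / message_count
```

The materialized table holds one row per sender and receiver pair, so dividing its row count by the number of messages gives the average number of recipients per message. This is the same arithmetic as 2 million records over 250,000 messages.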
"We also built indices for sender and receiver on the materialized table", Kurt Stockinger wrote. "In order to build bitmap indices for the materialized table, we needed to export the data into binary files. In particular, we stored each attribute in a separate file and then built a bitmap index for the attributes sender and receiver."
First, the group measured the performance of queries of the form "Retrieve the recipients of all e-mails that were sent by person P". For these experiments, 100 names were randomly selected and one query was executed for each person. For each of the 100 queries, the group measured the retrieval time, including the time to extract the result after the search. The MySQL-Join approach took the most time, about 8 seconds, while MySQL-Materialized, which used the expanded table, took from 0.01 to 0.9 seconds to complete the queries, depending on the number of hits. FastBit took about 0.0075 seconds regardless of the number of hits, making it up to 100 times faster than the best MySQL result.
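As a rough illustration of how a bitmap index answers a query of this form, the sketch below keeps one bitmap per distinct sender and uses it to pick matching rows out of a separately stored receiver column. It is not FastBit's API and it uses plain, uncompressed Python integers as bitmaps, whereas FastBit compresses its bitmaps; the function names and toy data are hypothetical.

```python
def build_bitmap_index(column):
    """Map each distinct value to an integer whose set bits mark its rows."""
    index = {}
    for row, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row)
    return index


def rows_of(bitmap):
    """Return the row numbers whose bits are set in the bitmap."""
    rows = []
    row = 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows


# Each attribute is held as its own column, one entry per table row.
sender = ["alice", "alice", "bob", "alice"]
receiver = ["bob", "carol", "alice", "dave"]

sender_index = build_bitmap_index(sender)

# "Retrieve the recipients of all e-mails that were sent by person P"
hits = rows_of(sender_index.get("alice", 0))
recipients = [receiver[row] for row in hits]
```

The lookup itself is a single dictionary access. The remaining work is reading the matching rows from the receiver column.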
When the group measured the performance of queries of the form "Retrieve all senders of e-mails that were received by person P", FastBit was again up to a factor of 100 faster than MySQL-Materialized.
In the next experiments, the researchers measured query performance on a larger dataset, created by duplicating the Enron dataset 10 times. The resulting materialized table contains some 20 million records. In experiments with one specific search criterion, FastBit was again up to a factor of 100 faster than MySQL.
In its last set of experiments, the group measured the performance of queries with multiple search criteria (multidimensional queries), such as "Count the number of e-mails that were sent by person P in the time interval T before date D". The results showed that FastBit's relative advantage over MySQL grows as the number of query dimensions increases. For these types of queries, FastBit is up to 1,000 times faster than MySQL.
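A multidimensional query of this kind can be sketched with bitmaps as well: each search criterion yields one bitmap, and the criteria are combined with a bitwise AND. The code below is an uncompressed toy version under assumed names, with dates simplified to integer day numbers. It is not the report's implementation and not FastBit's interface.

```python
def build_bitmap_index(column):
    """Map each distinct value to an integer whose set bits mark its rows."""
    index = {}
    for row, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row)
    return index


def count_sent_in_window(sender_index, day_index, person, end_day, window):
    """Count e-mails sent by `person` in the `window` days before `end_day`."""
    # Range criterion: OR together the bitmaps of every day in the interval.
    in_window = 0
    for day, bitmap in day_index.items():
        if end_day - window <= day < end_day:
            in_window |= bitmap
    # Combine both criteria with a bitwise AND, then count the set bits.
    matches = sender_index.get(person, 0) & in_window
    return bin(matches).count("1")


# One entry per message; `day` is a hypothetical integer day number.
sender = ["alice", "bob", "alice", "alice", "bob"]
day = [10, 11, 12, 15, 15]

sender_index = build_bitmap_index(sender)
day_index = build_bitmap_index(day)

n = count_sent_in_window(sender_index, day_index, "alice", end_day=15, window=5)
```

Adding another search criterion means one more bitmap and one more AND, and no rows are fetched to produce the count.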
The full report with detailed comparisons of the results can be found at http://www-library.lbl.gov/docs/LBNL/594/37/PDF/LBNL-59437.pdf.
|