EnterTheGrid - Primeur Weekly

EnterTheGrid - Primeur is the premier Grid and Supercomputing information source in the world. Primeur Weekly delivers the news each week in your e-mail box.

Primeur Weekly 13 February 2006
Enron e-mail database proves easy pickings for LBNL's FastBit Search technology
Berkeley, 13 February 2006. As the trial of former Enron executives gets under way, the extensive e-mail trails left by employees of the Houston energy firm are expected to provide both compelling evidence and entertaining insight. In 2003, as part of an investigation into Enron's business dealings in California, the Federal Energy Regulatory Commission made public a database containing more than 500,000 e-mails sent by 151 Enron employees. Subjects ranged from corporate decisions to jokes to personal matters. While the subject matter makes for intriguing reading, the entire database also proved an interesting subject for a number of researchers around the USA, including members of the Scientific Data Management Research Group at Lawrence Berkeley National Laboratory.

According to Carnegie Mellon University's William W. Cohen, who posted the dataset on the Web, the Enron e-mail dataset is proving to be "a resource for researchers who are interested in improving current e-mail tools, or understanding how e-mail is currently used. This data is valuable; to my knowledge it is the only substantial collection of 'real' e-mail that is public." As a result, researchers at such institutions as MIT, UC Berkeley, University of Massachusetts, University of Southern California and SRI have used the data to study social networks as evidenced by the exchange of e-mail messages.

The Berkeley Lab group decided to run a series of searches on the Enron e-mail dataset to compare two systems. One was FastBit, an efficient, compressed bitmap indexing technology developed by the group. The other was MySQL, the database in which the data were stored and which bills itself as "the world's most popular open source database".

In a report published in January 2006, the group evaluated the performance of MySQL and FastBit in handling a number of queries on a dataset of 250,000 unique e-mail messages sent by 151 Enron employees. They found that FastBit outperformed MySQL by a factor of between 10 and 1000, depending on the size of the search result. To achieve their results, group members conducted several experiments.

"In our first set of experiments we measured the performance of searching for specific senders and receivers of the e-mails", wrote Kurt Stockinger, lead author of the group's report. "We built an index for each of these two attributes. Since both senders and receivers are in different database tables, this kind of search requires an expensive 'join operation'."

To reduce the processing time needed for such join operations, the group combined the two separate tables into a new table containing the names of all the senders and receivers. This newly created table, called the materialized table, contains 2 million records. Since the number of original messages was 250,000, this indicates that each message has, on average, 8 recipients (2,000,000 / 250,000 = 8).
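The difference between the two approaches can be illustrated with a small sketch. It uses Python's built-in SQLite module as a stand-in for MySQL, and the table names, column names and sample rows are all hypothetical; the report's actual schema is not given in this article.

```python
import sqlite3

# Hypothetical two-table layout: one row per message, one row per (message, recipient).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE messages   (msg_id INTEGER PRIMARY KEY, sender TEXT);
    CREATE TABLE recipients (msg_id INTEGER, receiver TEXT);
""")
conn.executemany("INSERT INTO messages VALUES (?, ?)",
                 [(1, "alice"), (2, "bob"), (3, "alice")])
conn.executemany("INSERT INTO recipients VALUES (?, ?)",
                 [(1, "bob"), (1, "carol"), (2, "alice"), (3, "dave")])

# Join approach: the sender/receiver pairing is recomputed for every query.
join_rows = conn.execute("""
    SELECT r.receiver FROM messages m
    JOIN recipients r ON r.msg_id = m.msg_id
    WHERE m.sender = ? ORDER BY r.receiver
""", ("alice",)).fetchall()

# Materialized approach: run the join once and store one row per (sender, receiver) pair.
conn.execute("""
    CREATE TABLE materialized AS
    SELECT m.msg_id, m.sender, r.receiver
    FROM messages m JOIN recipients r ON r.msg_id = m.msg_id
""")
mat_rows = conn.execute(
    "SELECT receiver FROM materialized WHERE sender = ? ORDER BY receiver",
    ("alice",)).fetchall()

# Rows in the materialized table divided by messages gives the average recipient count.
n_messages = conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
n_pairs = conn.execute("SELECT COUNT(*) FROM materialized").fetchone()[0]
avg_recipients = n_pairs / n_messages
```

Both queries return the same recipients; the materialized table trades extra storage (one row per sender-receiver pair) for not having to repeat the join on every search.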

"We also built indices for sender and receiver on the materialized table", Kurt Stockinger wrote. "In order to build bitmap indices for the materialized table, we needed to export the data into binary files. In particular, we stored each attribute in a separate file and then built a bitmap index for the attributes sender and receiver."
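For readers unfamiliar with the idea, a bitmap index keeps one bit vector per distinct value of an attribute, with bit i set when row i holds that value. The following is a minimal sketch of that general idea only. It is uncompressed, uses Python integers as bit vectors, and does not reflect FastBit's file format, compression scheme or API; the function names and sample data are hypothetical.

```python
from collections import defaultdict

def build_bitmap_index(column):
    """Map each distinct value to an integer whose bit i is set when column[i] == value."""
    index = defaultdict(int)
    for row, value in enumerate(column):
        index[value] |= 1 << row
    return dict(index)

def rows_of(bitmap):
    """Return the row numbers whose bits are set in the bitmap."""
    rows, row = [], 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows

# Toy materialized table, stored column by column (one list per attribute),
# mirroring the one-file-per-attribute layout described above.
sender   = ["alice", "alice", "bob", "alice"]
receiver = ["bob", "carol", "alice", "dave"]

sender_index = build_bitmap_index(sender)
receiver_index = build_bitmap_index(receiver)

# Look up the rows sent by one person, then read the receiver column at those rows.
hits = rows_of(sender_index["alice"])
recipients_of_alice = [receiver[i] for i in hits]
```

A search for one person then amounts to fetching a single bit vector and reading off its set bits, rather than scanning the table.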

First, the group measured the performance of queries of the form "Retrieve the recipients of all e-mails that were sent by person P". For these experiments, 100 names were randomly selected and a query was executed for each person. The group measured the retrieval time for all 100 queries, including the time to extract the result after the search. The results showed that the MySQL-Join approach took the most time, about 8 seconds, while MySQL-Materialized - which used the expanded table - took from 0.01 to 0.9 seconds to complete the queries, depending on the number of hits. FastBit took about 0.0075 seconds, regardless of the number of hits, making it up to 100 times faster than the best MySQL result.
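A measurement of this kind could be set up roughly as follows. This is only a sketch of the procedure as described here (random sample of 100 names, one query each, timing that includes result extraction); the helper `time_queries`, the synthetic `outbox` data and the fixed random seed are assumptions made for illustration, not details from the report.

```python
import random
import time

def time_queries(run_query, names, sample_size=100, seed=0):
    """Time run_query(name) for a random sample of names.

    Returns a list of (name, seconds, number_of_hits) tuples.
    """
    rng = random.Random(seed)              # fixed seed so the sample is repeatable
    sample = rng.sample(names, sample_size)
    timings = []
    for name in sample:
        start = time.perf_counter()
        result = list(run_query(name))     # list() forces result extraction inside the timed region
        elapsed = time.perf_counter() - start
        timings.append((name, elapsed, len(result)))
    return timings

# Hypothetical stand-in data: 151 synthetic people, each with a synthetic recipient list.
outbox = {f"person{i}": [f"person{j}" for j in range(i % 7)] for i in range(151)}
timings = time_queries(lambda p: outbox[p], list(outbox))
```

Recording the number of hits alongside each timing makes it possible to see whether query time grows with result size, which is the distinction the article draws between MySQL-Materialized and FastBit.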

When the group measured the performance of queries of the form "Retrieve all senders of e-mails that were received by person P", FastBit was again up to a factor of 100 faster than MySQL-Materialized.

In the next experiments the researchers measured the query performance of a larger dataset by duplicating the Enron dataset 10 times. The resulting materialized table contains some 20 million records. In experiments with one specific search criterion, FastBit was again up to a factor of 100 faster than MySQL.

In the group's last set of experiments, they measured the performance of queries with multiple search criteria (multidimensional queries), such as "Count the number of e-mails that were sent by person P in the time interval T before date D". The results showed that as the number of query dimensions increases, the relative performance improvement of FastBit over MySQL increases even more. For these types of queries, FastBit is up to 1000 times faster than MySQL.
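One reason bitmap indices suit such queries is that conditions on different attributes can be combined with bitwise operations on the bit vectors. The sketch below illustrates this for the example query, again with uncompressed Python integers rather than FastBit itself. The sample data, the function `count_sent` and the choice of a half-open interval [D - T, D) are assumptions for illustration.

```python
from datetime import date, timedelta

def build_bitmap_index(column):
    """Map each distinct value to an integer whose bit i is set when column[i] == value."""
    index = {}
    for row, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row)
    return index

# Toy table with one row per message (hypothetical data).
sender  = ["alice", "bob", "alice", "alice", "bob"]
sent_on = [date(2001, 5, 1), date(2001, 5, 2), date(2001, 5, 3),
           date(2001, 6, 1), date(2001, 5, 3)]

sender_index = build_bitmap_index(sender)
date_index = build_bitmap_index(sent_on)

def count_sent(person, interval, before):
    """Count rows with sender == person and before - interval <= sent_on < before."""
    in_range = 0
    for day, bitmap in date_index.items():
        if before - interval <= day < before:
            in_range |= bitmap                          # OR the bitmaps of all qualifying days
    matches = sender_index.get(person, 0) & in_range    # AND combines the two dimensions
    return bin(matches).count("1")                      # number of set bits = number of rows

n_alice = count_sent("alice", timedelta(days=7), date(2001, 5, 8))
n_bob = count_sent("bob", timedelta(days=7), date(2001, 5, 8))
```

Each additional search criterion adds one more bitwise AND, and a pure count never needs to touch the underlying rows.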

The full report with detailed comparisons of the results can be found at http://www-library.lbl.gov/docs/LBNL/594/37/PDF/LBNL-59437.pdf

Ad Emmen

EnterTheGrid - Primeur

James Stewartstraat 248

1325 JN Almere

The Netherlands

http://www.hoise.com/primeur

mailto:primeur@hoise.com

© EnterTheGrid - Primeur Weekly