Combining information on supercomputers from several sources with XML - an example using the TOP500 and the "Overview of recent Supercomputers"
Almere, 12 May 2000. There are many sources of information on supercomputers. One of the most important is the TOP500 list of the most powerful supercomputers in the world. It contains data on each machine's position in the TOP500, its performance, and where it is located. Another source is the "Overview of Recent Supercomputers", which describes the supercomputer systems available, with system parameters and benchmark information. Using XML technology, we have combined information from the two sources to generate a new document that contains a description of each architecture, extended with the sublist of TOP500 entries for that machine. The new document is generated with XSL Transformation (XSLT) stylesheets that leave the original sources untouched. When a new version of, for instance, the TOP500 appears, running the stylesheet is enough to generate an updated version of the combined document.
In November we produced an XML version of the TOP500 and showed that, by using stylesheets, it can easily be used to select and combine information from several lists. Although this is already very useful, combining it with information from other sources can make the TOP500 even more valuable. The report Overview of Recent Supercomputers by Aad van der Steen is such an information source.
The Overview of Recent Supercomputers describes in detail all the supercomputer systems currently available. For each system it lists, for instance, the operating systems, compilers, and parameters such as clock cycle, peak performance, memory size and interconnect speed. It also contains information on benchmarks.
The previous nine editions of the Overview report were available on paper as the primary source, with an HTML version on the web derived from the LaTeX input source.
Looking up and comparing information in the Overview report simply meant scanning through the document with the traditional table of contents as guidance. Combining information from the report with that from the TOP500 was a tedious process. Suppose, for example, that you want a list of all the TOP500 sites that belong to a certain system architecture from the Overview report. You go to the TOP500 web site, create a sublist for system A, print it out and stick it into the Overview report. Then you do the same for systems B, C and so on. I guess only a handful of people went through this process.
With XML technology, that becomes a simple task. But before we describe how to do it, let us recall what XML is. With XML you can not only write information in a document, but also describe what the information is about. When you are writing a book, you can tell what a chapter is, what a section is, and so on. When you write about the TOP500 list, you can tell which item is the TOP500 position and which is the address of the site that has the supercomputer. Each information element is enclosed in XML tags that look like HTML tags. The difference is that you can define your own tags to describe exactly the information in your document.
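As a minimal sketch of this idea, the Python fragment below builds one TOP500 entry with self-chosen tags. The element names `site`, `position` and `address`, and the sample values, are assumptions made for illustration; they are not taken from the actual TOP500 XML document.

```python
import xml.etree.ElementTree as ET


def build_site(position, address):
    """Build one hypothetical TOP500 entry with self-describing tags."""
    site = ET.Element("site")
    # Each piece of information gets a tag that says what it is.
    ET.SubElement(site, "position").text = str(position)
    ET.SubElement(site, "address").text = address
    return site


# Placeholder values, not real TOP500 data.
entry = build_site(1, "Example Laboratory, Example City")
xml_text = ET.tostring(entry, encoding="unicode")
print(xml_text)
```

Because the tags name the information they hold, a program can later pick out the position or the address without guessing from the layout.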
When you have XML documents, you can use XSLT (Extensible Stylesheet Language Transformations) to extract information from one or more documents and combine it into a new one. You can specify in detail what information you need. The new document can be another XML document, an HTML document, or plain text. When you apply a formatter to it, you can generate RTF (Word), PDF, LaTeX, a database record or whatever format you need.
XSLT is a powerful tool. You can, for instance, ask for information like: "Give me the country name of the last system in the TOP500 list that, according to the Overview report, could have an HPF compiler on it." I do not know what you can do with this information, but you can ask for it.
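To make the shape of such a question concrete, here is a rough Python sketch of the same lookup across two documents. It is not the XSLT that was actually used. The namespace URIs, the element names `top500:system`, `top500:country`, `ors:name` and `ors:compiler`, the idea of matching on a system name, and all sample data are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs; the real documents may use different ones.
NS = {"top500": "urn:example:top500", "ors": "urn:example:ors"}

# Placeholder data, listed in assumed rank order.
TOP500_XML = """<top500:list xmlns:top500="urn:example:top500">
  <top500:site><top500:system>System A</top500:system><top500:country>Country X</top500:country></top500:site>
  <top500:site><top500:system>System B</top500:system><top500:country>Country Y</top500:country></top500:site>
  <top500:site><top500:system>System C</top500:system><top500:country>Country Z</top500:country></top500:site>
</top500:list>"""

OVERVIEW_XML = """<ors:report xmlns:ors="urn:example:ors">
  <ors:system><ors:name>System A</ors:name><ors:compiler>HPF</ors:compiler></ors:system>
  <ors:system><ors:name>System B</ors:name><ors:compiler>C</ors:compiler><ors:compiler>HPF</ors:compiler></ors:system>
  <ors:system><ors:name>System C</ors:name><ors:compiler>Fortran 90</ors:compiler></ors:system>
</ors:report>"""


def country_of_last_hpf_system(top500_root, overview_root):
    """Country of the last listed site whose system has an HPF compiler."""
    # Names of Overview systems that list an HPF compiler.
    hpf_systems = {
        system.findtext("ors:name", namespaces=NS)
        for system in overview_root.findall("ors:system", NS)
        if any(c.text == "HPF" for c in system.findall("ors:compiler", NS))
    }
    # Walk the list in document order and remember the last match.
    last_country = None
    for site in top500_root.findall("top500:site", NS):
        if site.findtext("top500:system", namespaces=NS) in hpf_systems:
            last_country = site.findtext("top500:country", namespaces=NS)
    return last_country


result = country_of_last_hpf_system(
    ET.fromstring(TOP500_XML), ET.fromstring(OVERVIEW_XML)
)
print(result)
```

The point of the sketch is only that the answer is found by navigating two documents by their tags, not by querying tables.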
Although this looks a bit like a database query, it is important to realise that we are locating information in documents, not in a relational database.
With the TOP500 already available in XML, the first task was to make the Overview report available in XML. The source text of the report is written in LaTeX. Although LaTeX gives good control over the page layout, it does not give us many clues as to what type of information is described. Most of the important data on machines is hidden in tables. Hence we decided to first turn that information into XML, designing a new Document Type Definition suited to describing supercomputing systems as done in the report.
This is an excerpt showing what a description of a supercomputer model looks like.
<ors:model>
  <ors:name>VPP5000</ors:name>
  <ors:clock-cycle unit="nsec">3.3</ors:clock-cycle>
  <ors:processor-performance unit="gflops">9.6</ors:processor-performance>
  <ors:peak-performance unit="tflops">1.22</ors:peak-performance>
  <ors:memory-node unit="gbyte">16</ors:memory-node>
  <ors:memory-maximal unit="tbyte">2</ors:memory-maximal>
  <ors:number-of-processors>
    <ors:min>4</ors:min>
    <ors:max>128</ors:max>
  </ors:number-of-processors>
  <ors:communication-bandwidth>
    <ors:point-to-point unit="gbitpsec">1.6</ors:point-to-point>
    <ors:processor-memory unit="gbitpsec">38.4</ors:processor-memory>
  </ors:communication-bandwidth>
</ors:model>
Note that from the names of the tags, one can easily deduce what the information means. We defined an attribute "unit" to indicate the unit in which a number is expressed.
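A short Python sketch shows how a program can read both the number and its unit from such a description. The element names follow the excerpt; the namespace URI bound to the `ors` prefix and the helper name `read_quantity` are assumptions, since the excerpt does not show the namespace declaration.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI for the "ors" prefix.
NS = {"ors": "urn:example:ors"}

# Part of the model description, with a namespace declaration added.
MODEL_XML = """<ors:model xmlns:ors="urn:example:ors">
  <ors:name>VPP5000</ors:name>
  <ors:clock-cycle unit="nsec">3.3</ors:clock-cycle>
  <ors:processor-performance unit="gflops">9.6</ors:processor-performance>
  <ors:peak-performance unit="tflops">1.22</ors:peak-performance>
</ors:model>"""


def read_quantity(model, tag):
    """Return (value, unit) for a child element such as 'clock-cycle'."""
    element = model.find(f"ors:{tag}", NS)
    return float(element.text), element.get("unit")


model = ET.fromstring(MODEL_XML)
name = model.findtext("ors:name", namespaces=NS)
clock = read_quantity(model, "clock-cycle")
peak = read_quantity(model, "peak-performance")
print(name, clock, peak)
```

Keeping the unit in an attribute means the number itself stays clean, so it can be converted or compared without parsing free text.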
When you define your own tags, you can choose to make them as short as possible or as descriptive as possible. Although documents tend to become longer in the latter case, I nevertheless chose that option: it makes the documents much more self-explanatory.
Next we create an XSLT stylesheet. It takes the XML document of the systems from the Overview report, and for each system matches it with manufacturer information in the TOP500 list. The lines that do this are:
<xsl:template match="ors:system">
  ...
  <xsl:for-each
      select="document('top500-199911.xml')
              /top500:list/top500:site/top500:manufacturer">
    ...
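The matching step can be pictured with the following Python sketch. It is a stand-in for illustration, not the stylesheet itself. The paths `top500:list/top500:site/top500:manufacturer` and `ors:system` come from the fragment above; the namespace URIs, the elements `ors:name`, `ors:manufacturer` and `top500:rank`, the root element `ors:report`, and all sample data are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs for the two prefixes.
NS = {"top500": "urn:example:top500", "ors": "urn:example:ors"}

# Placeholder documents; real input would be read from files instead.
OVERVIEW_XML = """<ors:report xmlns:ors="urn:example:ors">
  <ors:system><ors:name>System A</ors:name><ors:manufacturer>Vendor X</ors:manufacturer></ors:system>
  <ors:system><ors:name>System B</ors:name><ors:manufacturer>Vendor Y</ors:manufacturer></ors:system>
</ors:report>"""

TOP500_XML = """<top500:list xmlns:top500="urn:example:top500">
  <top500:site><top500:rank>1</top500:rank><top500:manufacturer>Vendor X</top500:manufacturer></top500:site>
  <top500:site><top500:rank>2</top500:rank><top500:manufacturer>Vendor Y</top500:manufacturer></top500:site>
  <top500:site><top500:rank>3</top500:rank><top500:manufacturer>Vendor X</top500:manufacturer></top500:site>
</top500:list>"""


def sublists_per_system(overview_root, top500_root):
    """Map each Overview system name to the ranks of matching TOP500 sites."""
    result = {}
    for system in overview_root.findall("ors:system", NS):
        maker = system.findtext("ors:manufacturer", namespaces=NS)
        # Keep every site whose manufacturer equals the system's manufacturer.
        ranks = [
            int(site.findtext("top500:rank", namespaces=NS))
            for site in top500_root.findall("top500:site", NS)
            if site.findtext("top500:manufacturer", namespaces=NS) == maker
        ]
        result[system.findtext("ors:name", namespaces=NS)] = ranks
    return result


sublists = sublists_per_system(
    ET.fromstring(OVERVIEW_XML), ET.fromstring(TOP500_XML)
)
print(sublists)
```

Neither input document is modified; the sublists are computed fresh each time, which is why a new TOP500 edition only requires running the transformation again.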
The output document, also XML, has a traditional report structure, with chapters, sections, and tables. For this we already have another stylesheet that we apply to reports and that can turn the document into, for instance, HTML. We have also made this version available on the Web.
More information can be found at:
Overview report merged with TOP500 information,
The original report:
Overview of recent supercomputers.
The TOP500 report in XML,
The original TOP500 site
XML publishing software used.
Ad Emmen