|
The computing characteristics of High Energy Physics are: independent events (collisions), which make parallel processing easy (read: trivial); bulk data that is read-only, with new versions rather than updates; metadata in databases linking to files; and compute power measured in SPECint (not SPECfp). There are very large aggregate requirements for computation, data, and input/output; a chaotic workload in the research environment; physics extracted by iterative analysis; collaborating groups of physicists; and unpredictable, unlimited demand.
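Because each collision event is independent and the input is read-only, the processing can be split across workers with no coordination between them. The following is a minimal sketch of that idea, not code from the talk: the `reconstruct` function, the event layout, and the worker count are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def reconstruct(event):
    """Toy stand-in for per-event processing: sum the hit values of one collision."""
    return sum(event["hits"])

def process_serial(events):
    return [reconstruct(e) for e in events]

def process_parallel(events, workers=4):
    # Events share no state, so they can be handed to any worker in any order.
    # Executor.map still returns the results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(reconstruct, events))

# Hypothetical read-only input: 1000 fake events with three integer "hits" each.
events = [{"id": i, "hits": [i, 2 * i, 3 * i]} for i in range(1000)]

serial_result = process_serial(events)
parallel_result = process_parallel(events)
```

Both functions return the same list, which is the point: no event depends on another. Threads are used here only to keep the sketch self-contained; a real farm would spread events over separate processes or batch nodes.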
The goal of the Large Hadron Collider (LHC), in Sverre Jarp's words: "Find new physics, such as the Higgs particle, and get the Nobel Prize!"
The Data Acquisition is characterised by:
- Multi-level trigger
- Filters out background
- Reduces data volume
- Records data 24 hours a day, 7 days a week
- Equivalent to writing a CD every 2 seconds
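To put the "CD every 2 seconds" figure into more familiar units, the arithmetic can be sketched as follows. The CD capacity of 650 MB is an assumption (the talk does not state it), so the resulting numbers are only indicative.

```python
CD_CAPACITY_MB = 650      # assumption: nominal data CD capacity, not stated in the talk
SECONDS_PER_CD = 2        # from the talk
SECONDS_PER_DAY = 24 * 60 * 60

rate_mb_per_s = CD_CAPACITY_MB / SECONDS_PER_CD
# Decimal units: 1 TB = 10^6 MB.
volume_tb_per_day = rate_mb_per_s * SECONDS_PER_DAY / 1_000_000

print(f"{rate_mb_per_s:.0f} MB/s, about {volume_tb_per_day:.1f} TB per day")
# prints: 325 MB/s, about 28.1 TB per day
```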
CERN's responsibilities as a large data centre:
- Manage home directories
- With worldwide access
- Transfer data from experimental areas
- Record to tape
- Re-export "raw" data to other physics labs
- Manage the physics data at every level in the analysis chain
- Data accessed locally
- Data accessed via the "GRID"
CASTOR (CERN Advanced STORage Manager)
CASTOR is the hierarchical storage manager used to store user and physics files; it manages the secondary and tertiary storage. Currently it holds more than 9 million files and 2000 TB of data. Development started in 1999, based on SHIFT, CERN's tape and disk management system since the beginning of the 1990s. The SHIFT (Scalable Heterogeneous Integrated Facility) architecture connects tape servers, disk servers, batch servers, interactive servers, and batch and disk SMPs via a network based on Ethernet and AFS (Andrew File System).
The main characteristics of CASTOR are:
- The components in CASTOR have well-defined roles and interfaces, so a component can be changed without affecting the whole system
- Highly distributed system: CERN uses a very distributed configuration with many disk servers and tape servers, but CASTOR can also run in a more limited environment
- The number of disk servers, tape servers, and name servers is unlimited; the use of an RDBMS (Oracle, MySQL) improves the scalability of some critical components
- Components include a name server, a volume manager, and a volume and drive queue manager
- Disk subsystem
- Tape Subsystem
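The general idea of a hierarchical storage manager, a limited disk layer in front of a much larger tape layer, can be illustrated with a toy model. This is a sketch under stated assumptions and not CASTOR's actual design or API: the `ToyHSM` class, its method names, and the least-recently-used eviction policy are all hypothetical.

```python
from collections import OrderedDict

class ToyHSM:
    """Toy hierarchical storage: a small disk cache in front of an unbounded tape store."""

    def __init__(self, disk_slots):
        self.disk_slots = disk_slots
        self.disk = OrderedDict()   # path -> data, ordered from least to most recently used
        self.tape = {}              # path -> data, stands in for tertiary storage
        self.recalls = 0            # number of tape-to-disk stagings

    def write(self, path, data):
        self.tape[path] = data      # simplification: every file gets a tape copy at once
        self._stage(path, data)

    def read(self, path):
        if path in self.disk:
            self.disk.move_to_end(path)   # disk hit: mark as recently used
            return self.disk[path]
        data = self.tape[path]            # disk miss: recall the file from tape
        self.recalls += 1
        self._stage(path, data)
        return data

    def _stage(self, path, data):
        self.disk[path] = data
        self.disk.move_to_end(path)
        while len(self.disk) > self.disk_slots:
            self.disk.popitem(last=False)  # evict the least recently used file from disk

hsm = ToyHSM(disk_slots=2)
for name in ("run1.raw", "run2.raw", "run3.raw"):
    hsm.write(name, f"data of {name}")
first = hsm.read("run1.raw")   # no longer on disk, so this triggers a tape recall
again = hsm.read("run1.raw")   # now served from disk
```

The user only ever calls `read` and `write` with a file name; whether the data comes from disk or has to be recalled from tape is hidden behind the interface.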
He then mentioned openlab, a technology-focused industrial collaboration with Enterasys, HP, IBM, and Intel as partners. The technology is aimed at the LHC era: 10 Gbit/s network switches, rack-mounted HP servers, Itanium 2 processors, and a StorageTank storage system. He expects the cluster to evolve from 32 systems (64 processors) in 2002 to 64 systems ("Madison" processors) in 2003 and possibly 128 systems ("Montecito" processors) in 2004/05. One opencluster result is the "10 Gbit/s network" challenge, which brings together three openlab partners and CERN. The current result is a single-stream TCP/IP connection running at 755 MB/s over a 10 km fibre, back-to-back, memory-to-memory.
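The throughput is quoted in megabytes per second while the link is rated in gigabits per second, so a quick conversion helps to compare the two. The sketch below assumes decimal units (1 MB = 10^6 bytes, 1 Gbit = 10^9 bits) and ignores protocol overhead.

```python
LINK_GBIT_PER_S = 10       # nominal link speed from the talk
MEASURED_MB_PER_S = 755    # single-stream result from the talk

# Assumption: decimal units (1 MB = 10^6 bytes, 1 Gbit = 10^9 bits).
measured_gbit_per_s = MEASURED_MB_PER_S * 8 / 1000
utilisation = measured_gbit_per_s / LINK_GBIT_PER_S

print(f"{measured_gbit_per_s:.2f} Gbit/s, {utilisation:.0%} of the nominal link rate")
# prints: 6.04 Gbit/s, 60% of the nominal link rate
```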
Grid = Virtual Computing Center
This seems to be the solution for the LHC: the user sees the image of a single cluster and does not need to know where the data is, where the processing capacity is, how things are interconnected, or the details of the different hardware, and is not concerned with the local policies of the equipment owners and managers. The vision of Grid Data Management is distributed shared data storage, ubiquitous data access, transparent data transfer and migration, and consistency and robustness as well as optimisation. CERN is busily preparing for the first arrival of LHC data in 2007. New and exciting technologies are needed to manage the data seamlessly around the globe. Together with its partners (industry, other physics labs, other sciences), CERN expects to come up with interesting proofs of concept and technological spin-offs! Petabyte Data Centers are here to stay! |