|
First he presented the evolution of science, which he divided into four stages:
Observational Science
- Scientist gathers data by direct observation
- Scientist analyses data
Analytical Science
- Scientist builds analytical model
- Scientist makes predictions.
Computational Science
- Scientist simulates analytical model
- Scientist validates model and makes predictions
Data Exploration Science
- Data captured by instruments
- Or data generated by a simulator
- Processed by software
- Placed in a database / files
- Scientist analyses database / files
Information is growing into an avalanche because better observational instruments and better simulations produce huge amounts of data. He gave some examples: a turbulence simulation produces 100 TB, which the scientist then has to mine for information. A more extreme example is CERN, where the LHC will generate 1 GB/s, which adds up to 10 PB per year.
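The two LHC figures can be related with a little arithmetic. The sketch below assumes decimal units (1 GB = 10^9 bytes, 1 PB = 10^15 bytes) and a 365-day year. The report does not say how many seconds per year data is actually recorded, so the recording time and duty cycle computed here are inferences from the two quoted numbers, not figures from the talk.

```python
# Relating the quoted LHC rate (1 GB/s) to the quoted yearly volume (10 PB).
GB = 10**9                             # decimal gigabyte (assumption)
PB = 10**15                            # decimal petabyte (assumption)
SECONDS_PER_YEAR = 365 * 24 * 3600     # 365-day year (assumption)

rate_bytes_per_s = 1 * GB              # figure quoted in the report
yearly_volume_bytes = 10 * PB          # figure quoted in the report

# Volume if data were recorded around the clock for a whole year.
continuous_pb_per_year = rate_bytes_per_s * SECONDS_PER_YEAR / PB

# Recording time per year implied by the two quoted figures.
implied_seconds = yearly_volume_bytes / rate_bytes_per_s
implied_duty_cycle = implied_seconds / SECONDS_PER_YEAR

print(f"continuous recording: {continuous_pb_per_year:.1f} PB/year")
print(f"implied recording time: {implied_seconds:.0f} s/year "
      f"({implied_duty_cycle:.0%} of the year)")
```

Under these assumptions the quoted 10 PB per year corresponds to recording at 1 GB/s for about a third of the year rather than continuously.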
The next generation of data analysis looks for needles in haystacks: the Higgs particle, for example. The haystacks are, for example, dark matter and dark energy. Global statistics scale poorly: correlation functions are N^2 and likelihood techniques N^3. Because data and computing power grow at the same rate, only algorithms that scale as N log N can keep up. He presented a way out: discard the notion of an optimal answer (data is fuzzy, answers are approximate) and do not assume infinite computational resources or memory. Solving these problems requires a combination of statistics and computer science.
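The scaling argument can be made concrete with a small calculation. The sketch below is illustrative only: the function names and the starting size of 10^6 items are assumptions, and the cost models are the bare N log N, N^2 and N^3 terms with no constant factors.

```python
import math

def cost(n, kind):
    """Idealised operation count for n items (constant factors ignored)."""
    if kind == "n_log_n":
        return n * math.log2(n)
    if kind == "n2":
        return n ** 2
    if kind == "n3":
        return n ** 3
    raise ValueError(f"unknown cost model: {kind}")

def relative_runtime(n0, growth, kind):
    """Runtime after data AND compute both grow by `growth`, relative to today."""
    work_ratio = cost(n0 * growth, kind) / cost(n0, kind)
    return work_ratio / growth  # the faster computer absorbs one factor of growth

n0, growth = 10**6, 1000        # hypothetical starting size and growth factor
for kind in ("n_log_n", "n2", "n3"):
    print(kind, relative_runtime(n0, growth, kind))
```

With these numbers the N log N job takes about 1.5 times as long as today, while the N^2 job takes 1,000 times as long and the N^3 job 1,000,000 times as long, even though the computer is 1,000 times faster.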
Another important issue is data access. It is hitting a wall, because FTP and GREP are no longer adequate. One can GREP 1 MB in a second, 1 TB in 2 days and 1 PB in 3 years, and a petabyte means ~5,000 disks.
At some point one therefore needs indices to limit the search, and parallel data search and analysis.
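The case for parallel search can be checked against the quoted figures. The sketch below assumes decimal units, 365-day years and perfectly even parallelism across disks, none of which is stated in the report; the function names are illustrative.

```python
PB = 10**15                       # decimal petabyte (assumption)
YEAR = 365 * 24 * 3600            # seconds in a 365-day year (assumption)

def implied_rate(size_bytes, seconds):
    """Sequential scan rate implied by scanning `size_bytes` in `seconds`."""
    return size_bytes / seconds

def parallel_scan_seconds(size_bytes, rate_per_disk, n_disks):
    """Scan time when each disk scans an equal share at the same time."""
    return size_bytes / (rate_per_disk * n_disks)

# "1 PB in 3 years" implies this sequential rate, in bytes per second.
rate = implied_rate(PB, 3 * YEAR)

# The same petabyte spread over ~5,000 disks that are searched in parallel.
hours = parallel_scan_seconds(PB, rate, 5000) / 3600

print(f"implied sequential rate: {rate / 1e6:.1f} MB/s")
print(f"parallel scan over 5,000 disks: {hours:.1f} hours")
```

Under these idealised assumptions the three-year sequential scan shrinks to a few hours once every disk searches its own share, and an index would reduce the amount each disk has to read further still.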
Smart data (active databases) make it possible to take the analysis to the data and do all data manipulation inside the database. One can build custom procedures and functions in the database and use its integrated tools. Jim Gray proposed using clever data structures (trees, cubes) and fast, approximate heuristic algorithms, and taking the cost of computation into account: the best result in a given time, given the available computing resources.
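A minimal sketch of taking the analysis to the data, using Python's built-in sqlite3 module as a stand-in for a full database server. The table, its columns, the sample rows and the mag_to_flux helper are all hypothetical and are not taken from the talk.

```python
import sqlite3

def mag_to_flux(mag):
    """Hypothetical helper: relative flux for a magnitude, 10^(-0.4 * mag)."""
    return 10 ** (-0.4 * mag)

conn = sqlite3.connect(":memory:")
# Register the Python function so SQL statements can call it inside the database.
conn.create_function("mag_to_flux", 1, mag_to_flux)

conn.execute(
    "CREATE TABLE objects (id INTEGER PRIMARY KEY, ra REAL, dec REAL, mag REAL)"
)
conn.executemany(
    "INSERT INTO objects (ra, dec, mag) VALUES (?, ?, ?)",
    [(5.0, 1.0, 15.0), (12.0, -3.0, 0.0), (15.0, 2.0, 5.0),
     (18.0, 0.5, 2.5), (30.0, 4.0, 10.0)],   # made-up sample rows
)
# An index on ra limits how much of the table a range query has to touch.
conn.execute("CREATE INDEX idx_objects_ra ON objects (ra)")

def flux_in_ra_range(connection, ra_min, ra_max):
    """Count objects and sum their flux inside the database, not in the client."""
    return connection.execute(
        "SELECT COUNT(*), SUM(mag_to_flux(mag)) FROM objects "
        "WHERE ra BETWEEN ? AND ?",
        (ra_min, ra_max),
    ).fetchone()

count, total_flux = flux_in_ra_range(conn, 10.0, 20.0)
print(count, total_flux)
```

Only the two aggregate values leave the database; the rows themselves are filtered and reduced where they are stored.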
Data Federations of Web Services
The massive datasets live near their owners: near the instrument's software pipeline, near the applications, and near data knowledge and curation. Supercomputer centres become superdata centres. Each archive can be published as a web service, so that scientists get "personalised" extracts and have uniform access to multiple archives. Web services can be the key. A web server, given a URL plus parameters, returns a web page (often dynamic). A web service, given an XML document (a SOAP message), returns an XML document.
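The "XML document in, XML document out" contract can be sketched with Python's standard xml.etree.ElementTree module. This is not real SOAP: there is no envelope, no namespaces and no network transport, and the element names, the tiny in-memory archive and the handle_request function are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical archive standing in for a published dataset.
ARCHIVE = [
    {"id": "A", "ra": 10.0, "dec": 1.0},
    {"id": "B", "ra": 15.5, "dec": -2.0},
    {"id": "C", "ra": 40.0, "dec": 3.0},
]

def handle_request(request_xml: str) -> str:
    """Take an XML request document and return an XML response document."""
    req = ET.fromstring(request_xml)
    ra_min = float(req.findtext("raMin"))
    ra_max = float(req.findtext("raMax"))

    resp = ET.Element("SearchResponse")
    for obj in ARCHIVE:
        if ra_min <= obj["ra"] <= ra_max:
            # Each match becomes one <object> element in the response.
            ET.SubElement(resp, "object", id=obj["id"],
                          ra=str(obj["ra"]), dec=str(obj["dec"]))
    return ET.tostring(resp, encoding="unicode")

request = "<SearchRequest><raMin>5</raMin><raMax>20</raMax></SearchRequest>"
response = handle_request(request)
print(response)
```

Because both sides exchange structured documents rather than rendered pages, the same request format could in principle be sent to several archives and the extracts combined by a program.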
He then discussed the issues facing the World Wide Telescope virtual observatory. The Internet is the world's best telescope: it has data on every part of the sky in every measured spectral band (optical, X-ray, radio), and it is up when you are up. He discussed some of the questions researchers ask in astronomy and presented example queries that find stars, some easily and some only with great effort.
He ended with a call to action:
- If you do data visualisation: we need you (and we know it).
- If you do databases, here is some data you can practise on.
- If you do distributed systems, here is a federation you can practise on.
- If you do data mining, here is a dataset to test your algorithms.
- If you do astronomy educational outreach, here is a tool for you.
http://research.microsoft.com/~gray
http://www.astro.caltech.edu/nvoconf/
http://www.voforum.org
http://www.sdss.jhu.edu/ScienceArchive/sxqt/sxQT/Example_Queries.html
|