Information about http://research.microsoft.com/~Gray/Papers/MSR_TR_2001_77_Virtual_Observatory.pdf

                                      The World-Wide Telescope 1


                                     Alexander Szalay, The Johns Hopkins University

                                                     Jim Gray, Microsoft

                                                         August 2001



                                                    Technical Report

                                               MSR-TR-2001-77



                                                  Microsoft Research
                                                 Microsoft Corporation
                                               301 Howard Street, #830
                                               San Francisco, CA, 94105




1
    This article appears in Science V. 293 pp. 2037-2040 14 Sept 2001. Copyright © 2001 by The American Association for the
Advancement of Science.




Abstract
All astronomy data and literature will soon be online and accessible via the Internet. The
community is building the Virtual Observatory, an organization of this worldwide data into a
coherent whole that can be accessed by anyone, in any form, from anywhere. The resulting system
will dramatically improve our ability to do multi-spectral and temporal studies that integrate
data from multiple instruments. The virtual observatory data also provides a wonderful base for
teaching astronomy, scientific discovery, and computational science.

Many fields are now coping with a rapidly mounting problem: how to organize, use, and make sense
of the enormous amounts of data generated by today's instruments and experiments. The data should
be accessible to scientists and educators so that the gap between cutting edge research and
education and public knowledge is minimized and presented in a form that will facilitate
integrative research. This problem is becoming particularly acute in many fields, notably
genomics, neuroscience, and astrophysics. In turn, the availability of the internet is allowing
new ideas and concepts for data sharing and use. Here we describe a plan to develop an internet
data resource in astronomy to help address this problem in which, because of the nature of the
data and analyses required of them, the data remain widely distributed rather than coalesced in
one or a few databases (e.g., Genbank). This approach may have applicability in many other
fields. The goal is to make the Internet act as the world's best telescope--a World-Wide
Telescope.

The problem
Today, there are many impressive archives painstakingly constructed from observations associated
with an instrument. The Hubble Space Telescope [HST], the Chandra X-Ray Observatory [Chandra],
the Sloan Digital Sky Survey [SDSS], the Two Micron All Sky Survey [2MASS], and the Digitized
Palomar All Sky Survey [DPOSS] are examples of this. Each of these archives is interesting in
itself, but temporal and multi-spectral studies require combining data from multiple instruments.
Furthermore, yearly advances in electronics enable new instruments that double the data we
collect each year (see Figure 1). For example, about a gigapixel is deployed on all telescopes
today, and new gigapixel instruments are under construction. A night's observation is a few
hundred gigabytes. The processed data for a single spectral band over the whole sky is a few
terabytes. It is not even possible for each astronomer to have a private copy of all their data.
Many of these new instruments are being used for systematic surveys of our galaxy and of the
distant universe. Together they will give us an unprecedented catalog to study the evolving
universe--provided that the data can be systematically studied in an integrated fashion.

Already online archives contain raw and derived astronomical observations of billions of objects
-- both temporal and multi-spectral surveys. Together, they have an order of magnitude more data
than any single instrument. In addition, all the astronomy literature is online, and is
cross-indexed with the observations [Simbad, NED].

[Figure 1: plot on a logarithmic axis from 0.1 to 1000 against year, 1970 to 2000; series: CCDs,
Glass.]
Figure 1. Telescope area doubles every 25 years, while telescope CCD pixels double every two
years. This rate seems to be accelerating. It implies a yearly data doubling. Huge advances in
storage, computing, and communications technologies have enabled the Internet and will enable
the VO.

Why is it necessary to study the sky in such detail? Celestial objects radiate energy over an
extremely wide range of wavelengths, from the radio, to infrared, optical, ultraviolet, x-rays
and even gamma-rays. Each of these observations carries important information about the nature
of the objects. The same physical object can appear to be totally different in different
wavebands (See Figure 2). A young spiral galaxy appears as many concentrated `blobs', the
so-called HII regions in the ultraviolet, while it shows smooth spiral arms in the optical. A
galaxy cluster can only be seen as an aggregation of galaxies in the optical, while x-ray
observations show the hot and diffuse gas between the galaxies.

The physical processes inside these objects can only be understood by combining observations at
several wavelengths. Today we already have large sky coverage in 10 spectral regions; soon we
will have additional data in at least 5 more bands. These will reside in different archives,
making their integration all the more complicated.

[Figure 2: four images of M82; panels: X-ray, Optical, Infrared, Radio.]
Figure 2: M82, at a distance of 11 million light years, is a rare nearby starburst galaxy where
stars are forming and expiring at a rate ten times higher than in our galaxy. A close encounter
with the large galaxy M81 in the last 100 million years is thought to be the cause of this
activity. The images taken at different wavelengths unveil different aspects of the star forming
activity inside the galaxy. Images courtesy of: (X-ray) NASA/CXC/SAO/PSU/CMU, (Optical)
AURA/NOAO/NSF, (Infrared) SAO, (Radio) MERLIN/VLA, compilation from
http://chandra.harvard.edu/photo/0094/index.html

Raw astronomy data is complex. It can be in the form of fluxes measured in finite size pixels on
the sky, it can be spectra (flux as a function of wavelength), it can be individual photon
events, or even phase information from the interference of radio waves.

In many other disciplines, once data is collected, it can be frozen and distributed to other
locations. This is not the case for astronomy. Astronomy data needs to be calibrated for the
transmission of the atmosphere, and for the response of the instruments. This requires an
exquisite understanding of all the properties of the whole system, which sometimes takes several
years. With each new understanding of how corrections should be made, the data are reprocessed
and recalibrated. As a result, data in astronomy stays `live' much longer than in other
disciplines -- it needs an active `curation', mostly by the expert group that collected the data.

Consequently, astronomy data reside at many different geographical locations, and things are
going to stay that way. There will not be a central Astronomy database. Each group has its own
historical reasons to archive the data one way or another. Any solution that tries to federate
the astronomy data sets must start with the premise that this trend is not going to change
substantially in the near future; there is no top-down way to simultaneously rebuild all data
sources.

The World Wide Telescope
So to solve these problems, the astrophysical community is developing the World-Wide Telescope
-- often called the Virtual Observatory [VO]. In this approach, the data will primarily be
accessed via digital archives that are widely distributed. The actual telescopes will either be
dedicated to surveys that feed the archives, or telescopes will be scheduled to follow-up
`interesting' phenomena found in the archives. Astronomers will look for patterns in the data,
both spectral and temporal, known and unknown, and use these to study various object classes.
They will have a variety of tools at their fingertips: a unified search engine, to collect and
aggregate data from several large archives simultaneously, and a huge distributed computing
resource, to perform the analyses close to the data, in order to avoid moving petabytes of data
across the networks.

Other sciences have comparable efforts of putting all their data online and in the public domain
-- Genbank® in genomics is a good example -- but so far these are centralized rather than
federated systems.

The Virtual Observatory will give everyone access to data that span the entire spectrum, the
entire sky, all historical observations, and all the literature. For publications, data will
reside at a few sites maintained by the publishers. These archive sites will support simple
searches. More complex analyses will be done with imported data extracts at the user's facility.

And time on the instrument will be available to all. The Virtual Observatory should thus make it
easy to conduct such temporal and multi-spectral studies, by automating the discovery and the
assembly of the necessary data.

The typical and the rare
One of the main uses of the VO will be to facilitate searches where statistics are critical. We
need large samples of galaxies in order to understand the fine details of the expanding universe,
and of galaxy formation. These statistical studies require multicolor imaging of millions of
galaxies, and measurement of their distances. We need to perform statistical analyses as a
function of their observed type, environment, and distance.

Other projects study rare objects, ones that do not fit typical patterns -- the needles in the
haystack. Again, having multi-spectral observations is an enormous help. Colors of objects
reflect their temperature. At the same time, in the expanding Universe, the light emitted by
distant objects is redshifted. Searching for extremely red
objects thus finds either extremely cold objects, or extremely distant ones. Data mining studies
of extremely red objects discovered distant quasars, the latest at a redshift of 6.28 [QSO].
Mining the 2MASS and SDSS archives found many cold objects, brown dwarfs, bigger than a planet,
yet smaller than a star. These are good examples of multi-wavelength searches that are not
possible with a single observation of the sky. Such searches are done by hand today; in the
future they will be automated, and will discover on the fly data that we did not even know
existed.

The time dimension
Most celestial objects are essentially static; the characteristic timescale for variations in
their light output is measured in millions or billions of years. There are time-varying
phenomena on much shorter timescales as well. Variations are either transient, like supernovae,
or regular, like variable stars. If a dark object in our galaxy passes in front of a star or
galaxy, we can measure a sudden brightening of the background object, due to gravitational
microlensing. Asteroids can be recognized by their rapid motion. All these variations can happen
on a few days' timescale. Stars of the Milky Way Galaxy are all moving in its gravitational
field. Although few stars can be seen to move in the matter of days, comparing observations ten
years apart measures such motions accurately.

Identifying and following object variability is time-consuming, and adds an additional dimension
to the observations. Not only do we need to map the Universe at many different wavelengths, we
need to do it often, so that we can find the temporal variations on many timescales. Once this
ambitious gathering of possibly petabyte-size datasets is under way, we will need summaries of
light curves, and also extremely rapid triggers. For example, in gamma-ray bursts, much of the
action happens within seconds after the burst is detected. This puts stringent demands on data
archive performance.

Agenda
The architecture must be designed with a 50-year horizon. Things will be different in 50 years
-- computers will be several orders of magnitude faster, cheaper, and smarter. So the
architecture must not make short-term technology compromises. On the other hand, the system must
work today on today's technology.

The Virtual Observatory will be a federation of astronomy archives, each with unique assets.
Archives will typically be associated with the institutions that gathered the data and with the
people who best understand the data. Some archives might contain data derived from others and
some might contain synthetic data from simulations. A few archives might specialize in
organizing the astrophysical literature and cross-indexing it with the data, while others might
just be indices of the data itself -- analogous to Yahoo! for the text-based web.

Astronomers own the data they collect -- but the field has a long tradition of making all data
public after a year. This gives the astronomer time to analyze data and publish early results,
and it also gives other astronomers timely access to the data. Given that data are doubling
every year, and given that the data become public within a year, about half the world's
astronomy data is available to all. A few astronomers have access to a private data stream from
some instrument; so, we estimate everyone has 50% of the data and some people have 55% of the
data.

Uniform views of diverse data
The social dynamics of the Virtual Observatory will always have a tension between coherence and
creativity -- between uniformity and autonomy. It is our hope that the Virtual Observatory will
act as a catalyst to homogenize the data. It will constantly struggle with the diversity of the
different collections, and the creativity of scientists who want to innovate and who discover
new concepts and new ways of looking at things. These two forces need to be balanced.

Each individual archive will be an autonomous unit run by scientists. The challenge is to
translate this heterogeneous mix of data sources into a uniform body of knowledge for the
scientists and educators who want to use data from multiple sources. Each archive needs to
easily present its data in compatible formats and the archives must be able to exchange data.

This uniform view will require agreement on terminology, on units, and on representations -- a
unified conceptual schema (data model) that describes all the data in all the archives in a
common terminology. This schema will evolve with time, and there will always be things that are
outside the schema, but VO users will see all the archives via this unifying schema, and data
interchange will be done in terms of the schema.

We believe that the base representations will likely be done using the emerging standards for
XML, Schemas, and SOAP, and web services [W3C], but beyond that there will have to be tools that
automatically transform the diverse and heterogeneous archives to this common format. This is
beyond the current state of computer science, yet solving this schema integration problem will
be a key enabler for the Virtual Observatory.

Users will want to query the virtual observatory using nice graphical tools, both to pose
questions and to analyze and visualize the results. The users will range in skill from
professional astronomers to bright grammar school students, so a variety of tools will be needed.

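To make the idea of a unifying schema concrete, the following is a minimal Python sketch, not
part of any VO design. It assumes two hypothetical archives, A and B, that record the same kind
of source with different field names and units, and maps both onto a hypothetical common record
(CommonSource). All names, fields, and unit choices here are illustrative assumptions. The
adapters are written by hand, which is exactly the labor-intensive step that the text hopes
future tools will automate.

```python
import math
from dataclasses import dataclass


# Hypothetical common schema: positions in decimal degrees, flux in janskys.
@dataclass(frozen=True)
class CommonSource:
    archive: str
    source_id: str
    ra_deg: float
    dec_deg: float
    flux_jy: float


def from_archive_a(rec: dict) -> CommonSource:
    """Hypothetical archive A: RA in hours, Dec in degrees, flux in millijanskys."""
    return CommonSource(
        archive="A",
        source_id=str(rec["id"]),
        ra_deg=rec["ra_hours"] * 15.0,  # 24 hours of right ascension = 360 degrees
        dec_deg=rec["dec_deg"],
        flux_jy=rec["flux_mjy"] / 1000.0,
    )


def from_archive_b(rec: dict) -> CommonSource:
    """Hypothetical archive B: positions in radians, flux already in janskys."""
    return CommonSource(
        archive="B",
        source_id=str(rec["name"]),
        ra_deg=math.degrees(rec["alpha_rad"]),
        dec_deg=math.degrees(rec["delta_rad"]),
        flux_jy=rec["s_jy"],
    )


# One adapter per archive; each archive stays autonomous behind its adapter.
ADAPTERS = {"A": from_archive_a, "B": from_archive_b}


def to_common(archive: str, rec: dict) -> CommonSource:
    """Translate one archive-specific record into the common schema."""
    return ADAPTERS[archive](rec)


a = to_common("A", {"id": 7, "ra_hours": 12.0, "dec_deg": -30.0, "flux_mjy": 250.0})
b = to_common("B", {"name": "x1", "alpha_rad": math.pi, "delta_rad": 0.0, "s_jy": 1.5})
```

With adapters like these, a query tool only needs to understand the common record; supporting a
new archive means adding one adapter rather than changing every tool.
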
The Virtual Observatory query systems will have to find data among the various versions and
replicas stored at the various archives. The user might ask for the correlation between radio,
optical, and X-ray sources with certain properties. It will be up to the query system to locate
the appropriate datasets, subset them and cross-correlate them, returning the requested
attributes from each data source. As with schema integration, there is good technology for
automatic indexing and for location-transparent, parallel data search. But the scenario just
described is beyond the limits of what computer science researchers can do today.

At its core, the Virtual Observatory is a data manager. There is a well-established discipline
of defining and managing data. There are good languages for defining data schemas (structures)
and for mapping these data schemas into existing data management systems. But, most of these
tools are designed for homogeneous environments with central control -- for example managing the
assets of an individual corporation, organization, or agency. Tools that integrate heterogeneous
databases are still primitive and labor intensive.

The raw astronomy data from a telescope runs through a substantial `software pipeline' that
extracts objects (stars, galaxies, clouds, planets, asteroids) from the data and assigns
attributes (luminosity, morphology, classification) to them. These pipelines are constantly
improving as we learn new science and as we discover flaws in the pipeline software, so the
derived data is constantly evolving into new versions.

Each archive's data will likely be replicated at one or more mirror sites so that a catastrophe
at one site will not cause any long-term data loss. Data may also be wholly or partially
replicated at other sites so as to speed local computations: it is often faster and cheaper to
access local data than to fetch it from a remote site.

We assume the Internet will evolve so that copying larger datasets will be feasible and economic
(today, datasets in the terabyte range are typically moved by parcel post). Still, it will often
be best to move the computation to the petabyte-scale datasets to minimize data movement and
speed the computation.

Current data management systems propagate and manage data replicas, but they require
administrators to decide when and where the replicas are stored. Administrators must design and
manage the replication strategy and must track data versions and data lineage. The computer
science challenge is to automate these tasks: to automatically configure and track replicas and
to automatically track data lineage and dependencies on different derived data versions as the
programs evolve.

Better algorithms
At the micro-level, we expect major advances in computer science algorithms. Hardware
performance has improved about a hundred-fold every decade since 1970. There has been a
comparable improvement in software algorithms -- the simplest example being sorting: sort
performance has doubled each year since 1985, half due to improvements in hardware, and half due
to improvements in parallel sorting algorithms. We must continue to invest in and investigate
new algorithms and data structures for data loading, data cleansing, data search, data
organization, and data mining. Statistical methods lie at the heart of most of our data search
algorithms, so advances in computational statistics will be essential to the success of the
Virtual Observatory.

Current computer architectures are moving towards huge arrays of loosely-coupled computers with
local storage. These Beowulf clusters harness commodity components and so provide very
inexpensive computing. Cellular architectures like IBM's BlueGene project or the Japanese Earth
Simulator project carry this to an extreme -- millions of computers in one cluster.
Computational Grids are clusters of computer clusters that can dynamically allocate resources to
the tasks at hand [Grid]. But, not all our algorithms fit this computational structure -- some
require fine-grain parallelism and shared memory. So, we must either find new algorithms that
work well on cluster computer architectures, or we must invest in massive-memory computer
machines that can solve these problems.

Education
The Virtual Observatory offers the opportunity to teach science in a participatory way. We can
give students direct access to a wonderful scientific instrument. They can use it to make
discoveries on their own. Very interesting projects and lectures can be built using the Virtual
Observatory tools and data.

Astronomy has a special attraction for all, as demonstrated by planetariums, amateur telescopes,
and textbooks. Even very young children can be engaged in many different sciences via astronomy
-- astronomy has strong ties to physics, chemistry, and mathematics. Astronomy can be used as a
vehicle for introducing the basic concepts of all these fields and also used to teach the
process of scientific discovery. The two of us, with a lot of help from others, are in the
process of creating such a pilot project, using data from the SDSS [SkyServer].

The Virtual Observatory can also be used to teach computational science. Traditionally, science
has been either theoretical or empirical. In the last 50 years, computational science emerged as
a third approach, first in doing simulations and now increasingly in mining scientific data.
Indeed, most scientific departments now have a strong computational program. The Virtual
Observatory is a unique tool to teach these skills.
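
As one example of such a computational exercise, the cross-archive correlation mentioned earlier
(matching radio, optical, and X-ray sources) reduces at its simplest to a positional
cross-match: pairing sources from two catalogs that lie within a small angle of each other on
the sky. The Python sketch below is a brute-force illustration under stated assumptions:
catalogs are small in-memory lists of hypothetical (source_id, ra_deg, dec_deg) tuples, the
match radius is chosen by the user, and the sample coordinates are made up. A real archive would
need a spatial index and would have to account for positional errors.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine form, stable at small angles)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    h = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(h))))


def cross_match(catalog_a, catalog_b, radius_arcsec):
    """Pair each catalog_a source with its nearest catalog_b source within the radius.

    Each catalog is a list of (source_id, ra_deg, dec_deg) tuples.
    Returns (id_a, id_b, separation_arcsec) tuples.
    Brute force, O(len(a) * len(b)); illustrative only.
    """
    radius_deg = radius_arcsec / 3600.0
    matches = []
    for id_a, ra_a, dec_a in catalog_a:
        best = None  # (id_b, separation_deg) of the closest candidate so far
        for id_b, ra_b, dec_b in catalog_b:
            sep = angular_separation_deg(ra_a, dec_a, ra_b, dec_b)
            if sep <= radius_deg and (best is None or sep < best[1]):
                best = (id_b, sep)
        if best is not None:
            matches.append((id_a, best[0], best[1] * 3600.0))
    return matches


# Made-up sample catalogs (hypothetical identifiers and coordinates).
optical = [("opt-1", 10.0000, 20.0000), ("opt-2", 150.0, -5.0)]
xray = [("x-1", 10.0002, 20.0001), ("x-2", 200.0, 45.0)]
pairs = cross_match(optical, xray, radius_arcsec=2.0)
```

Here only the first optical source has an X-ray counterpart within the chosen radius; the
second is left unmatched. The nested loop makes the cost of a naive cross-match obvious, which
is one reason the text argues for better indexing and for moving the computation to the data.
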
A World-Wide Effort
Like astronomy, the Virtual Observatory is a world-wide
effort. Initiatives are underway in many countries with a
common goal: join the diverse worldwide astronomical
databases into a single federated entity facilitating new
research for the worldwide astronomy community. The
European national and international astronomy data centers
are leading an effort funded by the European Union.
The Astronomical Virtual Observatory (AVO) is led by
the European Southern Observatory and also backed by
the European Space Agency. The EU also sponsors research
in pipeline processing technology to handle the
anticipated terabyte data streams from future large survey
telescopes. The UK funds the AstroGrid Project to
investigate distributed data archives and Grid technology.
Japan and Australia are setting up their own large archives.
In the United States, the National Science Foundation
sponsors development of the information technology
infrastructure necessary for a National Virtual Observatory
(NVO). There is a close cooperation with the
particle physics community, through the Grid Physics
Network (GriPhyN). NASA supports astronomy mission
archives and discipline data centers, while developing a
roadmap for their federation.
Impressively, these projects are all cooperating, and are
working toward a future Global Virtual Observatory to
benefit the international astronomical community and the
public alike. There are similar efforts under way in other
areas of science as well. The VO has had and will have
significant interactions with other science communities,
both learning from some and providing a model for oth-
ers.
References
[Chandra] Chandra X-Ray Observatory Center,
      http://chandra.harvard.edu/
[DPOSS] Digitized Palomar Observatory Sky Survey,
      http://www.astro.caltech.edu/~george/dposs/
[Grid] The Grid: Blueprint for a New Computing Infrastructure, I.
      Foster, C. Kesselman eds., Morgan Kaufmann, 1998.
[HST] The Hubble Space Telescope, http://www.stsci.edu/
[NED] NASA/IPAC Extragalactic Database,
      http://nedwww.ipac.caltech.edu/
[QSO] Fan, X. et al., http://xxx.lanl.gov/abs/astro-ph/0108063
[SDSS] The Sloan Digital Sky Survey website, http://www.sdss.org/
[Simbad] SIMBAD Astronomical Database, http://simbad.u-strasbg.fr/
[SkyServer] Public access website to the SDSS data,
      http://skyserver.sdss.org/
[VizieR] VizieR Service, http://vizier.u-strasbg.fr/viz-bin/VizieR
[VO] Virtual Observatories of the Future, ASP Conf. Series, V. 225,
      R.J. Brunner, S.G. Djorgovski, A.S. Szalay eds., Sept 2000.
[W3C] The Semantic Web, http://www.w3.org/2001/sw/ and Web
      Services, http://www.w3.org/TR/wsdl
[2MASS] The Two Micron All Sky Survey,
      http://www.ipac.caltech.edu/2mass/



