Information about http://www-personal.umich.edu/~mejn/papers/016131.pdf

Pages: 8
Language: english
Created: Mon Jun 25 15:33:32 2001
                                            PHYSICAL REVIEW E, VOLUME 64, 016131

         Scientific collaboration networks. I. Network construction and fundamental results
                                                           M. E. J. Newman
                              Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, New Mexico 87501
                   and Center for Applied Mathematics, Cornell University, Rhodes Hall, Ithaca, New York 14853
                  Received 5 December 2000; revised manuscript received 1 February 2001; published 28 June 2001
                Using computer databases of scientific papers in physics, biomedical research, and computer science, we
              have constructed networks of collaboration between scientists in each of these disciplines. In these networks
              two scientists are considered connected if they have coauthored one or more papers together. We study a
              variety of statistical properties of our networks, including numbers of papers written by authors, numbers of
              authors per paper, numbers of collaborators that scientists have, existence and size of a giant component of
              connected scientists, and degree of clustering in the networks. We also highlight some apparent differences in
              collaboration patterns between the subjects studied. In the following paper, we study a number of measures of
              centrality and connectedness in the same networks.

              DOI: 10.1103/PhysRevE.64.016131                  PACS number(s): 89.75.Hc, 89.65.-s, 89.70.+c, 01.30.-y


                     I. INTRODUCTION

   A social network [1,2] is a set of people or groups each of which has connections of some
kind to some or all of the others. In the language of social network analysis, the people or
groups are called "actors" and the connections "ties." Both actors and ties can be defined in
different ways depending on the questions of interest. An actor might be a single person, a
team, or a company. A tie might be a friendship between two people, a collaboration or common
member between two teams, or a business relationship between companies.
   Social network analysis has a history stretching back at least half a century, and has
produced many results concerning social influence, social groupings, inequality, disease
propagation, communication of information, and indeed almost every topic that has interested
20th century sociology. The Physical Review is, however, a physics journal. Why should a
physicist be interested in social networks? There has, in fact, been a substantial surge of
interest in social networks within the physics community recently, as evidenced by the large
body of papers on the topic--see Refs. [3-24] and references therein. The techniques of
statistical physics in particular turn out to be well suited to the study of these networks.
Profitable use has been made of a variety of physical modeling techniques [5-7], exact
solutions [8-13], Monte Carlo simulation [14-17], scaling and renormalization group methods
[15-17], mean-field theory [18,19], percolation theory [20-22], the replica method [23],
generating functions [20,22,24], and a host of other techniques familiar to the readers of
this publication.
   In this paper and the following one, we make use of some of these techniques in the study
of some specific examples of social networks. However, our subject matter will be of interest
to physicists for another reason: it's about them. In these two papers, we study networks in
which the actors are scientists, and the ties between them are scientific collaborations, as
documented by the learned articles that they write.

            II. COLLABORATION NETWORKS

   Traditional investigations of social networks have been carried out through field studies.
Typically one looks at a fairly self-contained community such as a business community
[25-27], a school [28,29], a religious or ethnic community [30], and so forth, and constructs
the network of ties by interviewing participants, or by circulating questionnaires. A study
will ask respondents to name those with whom they have the closest ties, probably ranked by
subjective closeness, and may optionally call for additional information about those people
or about the nature of the ties.
   Studies of this kind have revealed much about the structure of communities, but they suffer
from two substantial problems that make them poor sources of data for the kind of approach to
network analysis that physics has adopted. First, the data they return are not numerous.
Collecting and compiling data from these studies is an arduous process and most data sets
contain no more than a few tens or hundreds of actors. It is a rare study that exceeds 1000
actors. This makes the statistical accuracy of many results poor, a particular difficulty for
the large-system-size methods used in statistical physics. Second, they contain significant
and uncontrolled errors as a result of the subjective nature of respondents' replies. What one
respondent considers to be a friendship or acquaintance, for example, may be completely
different from what another respondent does. In studies of schoolchildren [28,29], for
instance, it is found that some children will claim friendship with every single one of their
hundreds of schoolmates, while others will name only one or two friends. Clearly these
respondents are employing different definitions of friendship.
   Reliable statistics do exist for some other types of networks. Examples include the
world-wide web [14,31,32], power grids [5], telephone call graphs [33], and airline
timetables [34]. These graphs are certainly interesting in their own right, and furthermore
might loosely be regarded as social networks, since their structure clearly reflects something
about the structure of the society that created them. However, their connection to the "true"
social networks discussed here is tenuous at best and so, for our purposes, they cannot offer
a great deal of insight.
   A more promising source of data is the affiliation network. An affiliation network is a
network of actors connected by common membership in groups of some sort, such

1063-651X/2001/64(1)/016131(8)/$20.00                         64 016131-1                          ©2001 The American Physical Society

as clubs, teams, or organizations. Examples that have been studied in the past include company
CEOs and the clubs they frequent [26], company directors and the boards of directors on which
they sit [25,35], women and the social events they attend [36], and movie actors and the
movies in which they appear [5,34]. Data on affiliation networks tend to be more reliable than
those on other social networks, since membership of a group can often be determined with a
precision not available when considering friendship or other types of acquaintance. Very large
networks can be assembled in this way as well, since in many cases group membership can be
ascertained from membership lists, making time-consuming interviews or questionnaires
unnecessary. A network of movie actors, for example, and the movies in which they appear has
been compiled using the resources of the Internet Movie Database [37], and contains the names
of nearly half a million actors--a much better sample on which to perform statistics than most
social networks, although it is unclear whether this particular network has any real social
interest.
   In this paper we construct affiliation networks of scientists in which a link between two
scientists is established by their coauthorship of one or more scientific papers. Thus the
groups to which scientists belong in this network are the groups of coauthors of a single
paper. This network is in some ways more truly a social network than many affiliation
networks; it is probably fair to say that most pairs of people who have written a paper
together are genuinely acquainted with one another, in a way that movie actors who appeared
together in a movie may not be. There are exceptions--some very large collaborations, for
example in high-energy physics, will contain coauthors who have never even met--and we will
discuss these at the appropriate point. By and large, however, the network reflects genuine
professional interaction between scientists, and may be the largest social network ever
studied [38].
   The idea of constructing a network of coauthorship is not new. Many readers will be
familiar with the concept of the Erdős number, named after Paul Erdős, the Hungarian
mathematician, one of the founding fathers of graph theory among other things [39]. At some
point, it became a popular cocktail party pursuit for mathematicians to calculate how far
removed they were in terms of publication from Erdős. Those who had published a paper with
Erdős were given an Erdős number of 1, those who had published with one of those people but
not with Erdős a number of 2, and so forth. The present author, for example, has an Erdős
number of 3, via Robert Ziff and Mark Kac [40]. In the jargon of social networks, your Erdős
number is the geodesic distance between you and Erdős in the coauthorship network. In recent
studies [41-43], it has been found that the average Erdős number is about 4.7, and the maximum
known finite Erdős number within mathematics is 15. These results are probably influenced to
some extent by Erdős' prodigious mathematical output: he published at least 1512 papers, more
than any other mathematician ever except possibly Leonhard Euler. However, quantitatively
similar, if not quite so impressive, results are in most cases found if the network is
centered on another mathematician. On the other hand, the fifth most published mathematician,
Lucien Godeaux, produced 644 papers, on 643 of which he was the sole author. He has no finite
Erdős number [41]. Clearly sheer size of output is not a sufficient condition for high
connectedness.
   There is also a substantial body of work in bibliometrics (a specialty within information
science) on extraction of collaboration patterns from publication data--see Refs. [44-48] and
references therein. However, these studies have not so far attempted to reconstruct entire
collaboration networks from bibliographic data, concentrating more on organizational and
institutional aspects of collaboration [49].
   In this paper and the following one, we study networks of scientists using bibliographic
data drawn from four publicly available databases of papers.
   (1) Los Alamos e-Print Archive: a database of unrefereed preprints in physics,
self-submitted by their authors, running from 1992 to the present. This database is subdivided
into specialties within physics, such as condensed matter and high-energy physics.
   (2) Medline: a database of articles on biomedical research published in refereed journals,
stretching from 1961 to the present. Entries in the database are updated by the database's
maintainers, rather than papers' authors, giving it relatively thorough coverage of its
subject area. The inclusion of biomedicine is crucial in a study such as this one. In most
countries biomedical research easily dwarfs civilian research on any other topic, in terms of
both expenditure and human effort. Any study that omitted it would be leaving out the largest
part of current scientific research.
   (3) Stanford Public Information Retrieval System (SPIRES): a database of preprints and
published papers in high-energy physics, both theoretical and experimental, from 1974 to the
present. The contents of this database are professionally maintained. High-energy physics is
an interesting case socially, having a tradition of much larger experimental collaborations
than other disciplines.
   (4) Networked Computer Science Technical Reference Library (NCSTRL): a database of
preprints in computer science, submitted by participating institutions and stretching back
about ten years.
   We have constructed complete collaboration networks for each of these databases separately,
and analyzed them using a variety of techniques, some standard, some invented for the purpose.
A brief report of some of the work described here has appeared previously as Ref. [50].

                       III. FUNDAMENTAL RESULTS

   For this study, we constructed collaboration networks using data from a five year period
from 1995 to 1999 inclusive, although data for much longer periods were available in some of
the databases. There were several reasons for using this fairly short time window. First,
older data are less complete than newer for all databases. Second, we wanted to study the same
time period for all databases, so as to be able to make valid comparisons between
collaboration patterns in different fields. The coverage provided by both the Los Alamos
Archive and the NCSTRL database is relatively poor before 1995, and this sets a limit on how
far back we can look. Third, networks change over time, both because people


                 TABLE I. Some fundamental statistics for the scientific collaboration networks studied here. Numbers in
              parentheses are standard errors on the least significant figures.

                                                      Los Alamos e-Print Archive
                                         Medline      complete   astro-ph   cond-mat   hep-th     SPIRES       NCSTRL

              Total number of papers     2163923      98502      22029      22016      19085      66652        13169
              Total number of authors    1520251      52909      16706      16726      8361       56627        11994
                First initial only       1090584      45685      14303      15451      7676       47445        10998
              Mean papers per author     6.4(6)       5.1(2)     4.8(2)     3.65(7)    4.8(1)     11.6(5)      2.55(5)
              Mean authors per paper     3.754(2)     2.530(7)   3.35(2)    2.66(1)    1.99(1)    8.96(18)     2.22(1)
              Collaborators per author   18.1(1.3)    9.7(2)     15.1(3)    5.86(9)    3.87(5)    173(6)       3.59(5)
              Size of giant component    1395693      44337      14845      13861      5835       49002        6396
                First initial only       1019418      39709      12874      13324      5593       43089        6706
                As a percentage          92.6(4)%     85.4(8)%   89.4(3)%   84.6(8)%   71.4(8)%   88.7(1.1)%   57.2(1.9)%
              2nd largest component      49           18         19         16         24         69           42
              Clustering coefficient     0.066(7)     0.43(1)    0.414(6)   0.348(6)   0.327(2)   0.726(8)     0.496(6)
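
As a rough illustration of how rows of Table I could be computed from a list of papers, the following Python sketch builds a coauthorship network and reports a few of the same statistics. It is a minimal sketch under stated assumptions, not the code used for the paper: the function names (`author_key`, `build_network`, `giant_component_size`, `summarize`) are hypothetical, names are assumed to be written with given names first and surname last, and a hash table (Python `dict`) stands in for the ordered binary tree described in the text.

```python
from collections import defaultdict, deque


def author_key(name, all_initials=True):
    """Reduce a name such as 'F. L. Wright' to (surname, initials).

    Assumption for this sketch: given names come first, surname last.
    With all_initials=False only the first initial is kept.
    """
    parts = name.replace(".", " ").split()
    surname = parts[-1].lower()
    initials = "".join(p[0].lower() for p in parts[:-1])
    if not all_initials:
        initials = initials[:1]
    return surname, initials


def build_network(papers, all_initials=True):
    """Return (neighbor sets, papers-per-author counts) for a list of papers.

    Each paper is a list of author name strings. An edge is added between
    every pair of authors that share a paper.
    """
    neighbors = defaultdict(set)
    n_papers = defaultdict(int)
    for authors in papers:
        keys = {author_key(a, all_initials) for a in authors}
        for k in keys:
            n_papers[k] += 1
            neighbors[k] |= keys - {k}  # also creates an entry for sole authors
    return neighbors, n_papers


def giant_component_size(neighbors):
    """Size of the largest connected component, found by breadth-first search."""
    seen, best = set(), 0
    for start in neighbors:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            v = queue.popleft()
            size += 1
            for w in neighbors[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        best = max(best, size)
    return best


def summarize(papers, all_initials=True):
    """Compute a few Table I style statistics for one version of the network."""
    neighbors, n_papers = build_network(papers, all_initials)
    n_authors = len(n_papers)
    return {
        "papers": len(papers),
        "authors": n_authors,
        "papers_per_author": sum(n_papers.values()) / n_authors,
        "authors_per_paper": sum(len(p) for p in papers) / len(papers),
        "collaborators_per_author":
            sum(len(s) for s in neighbors.values()) / n_authors,
        "giant_component": giant_component_size(neighbors),
    }
```

Calling `summarize(papers, all_initials=True)` and `summarize(papers, all_initials=False)` on the same data gives the two versions of each quantity. The "As a percentage" row appears consistent with the midpoint of the two versions: for Medline, 1395693/1520251 is about 91.8% and 1019418/1090584 is about 93.5%, whose midpoint is about 92.6%.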


enter and leave the professions they represent and because practices of scientific
collaboration and publishing change. In this particular study we have not examined time
evolution in the network, although this is certainly an interesting topic for research and
indeed is currently under investigation [51,52]. For our purposes, a short window of data is
desirable, to ensure that the collaboration network is roughly static during the study.
   The raw data for the networks described here are computer files containing lists of papers,
including authors' names and possibly other information such as title, abstract, date, journal
reference, and so forth. Construction of the collaboration networks is straightforward. The
files are parsed to extract author names and as names are found a list is maintained of the
ones seen so far--vertices already in the network--so that recurring names can be correctly
assigned to extant vertices. Edges are added between each pair of authors on each paper. A
naive implementation of this calculation, in which names are stored in a simple array, would
take time O(pn), where p is the total number of papers in the database and n the number of
authors. This, however, turns out to be prohibitively slow for large networks since p and n
are of similar size and may be a million or more. Instead therefore, we store the names of the
authors in an ordered binary tree, which reduces the running time to O(p log n), making the
calculation tractable, even for the largest databases studied here.
   In Table I we give a summary of some of the basic results for the networks studied here. We
discuss these results in detail in the rest of this section.

                    A. Number of authors

   The size of the databases varies considerably from about a million authors for Medline to
about ten thousand for NCSTRL. In fact, it is difficult to say with precision how many authors
there are. One can say how many distinct names appear in a database, but the number of names
is not the same as the number of authors. A single author may report their name differently on
different papers. For example, F. L. Wright, Francis Wright, and Frank Lloyd Wright could all
be the same person. Also, two authors may have the same name. Grossman and Ion [41] point out
that there are two American mathematicians named Norman Lloyd Johnson, who are known to be
distinct people but between whom computer programs such as ours cannot hope to distinguish.
Even additional clues such as home institution or field of specialization cannot be used to
distinguish such people, since many scientists have more than one institution or publish in
more than one field. The present author, for example, has addresses at the Santa Fe Institute
and Cornell University, and publishes in both statistical physics and paleontology.
   In order to control for these biases, we constructed two different versions of each of the
collaboration networks studied here, as follows. In the first, we identify each author by his
or her surname and first initial only. This method is clearly prone to confusing two people
for one, but will rarely fail to identify two names which genuinely refer to the same person.
In the second version of each network, we identify authors by surname and all initials. This
method can much more reliably distinguish authors from one another, but will also identify one
person as two if they give their initials differently on different papers. Indeed this second
measure appears to overestimate the number of authors in a database substantially. Networks
constructed in these two different fashions therefore give upper and lower bounds on the
number of authors, and hence also give bounds on many of the other quantities studied here. In
Table I we give numbers of authors in each network using both methods, but for many of the
other quantities we give only an error estimate based on the separation of the bounds.

                    B. Number of papers per author

   The average number of papers per author in the various subject areas is in the range of
around three to six over the five year period. The only exception is the SPIRES database,
covering high-energy physics, in which the figure is significantly higher at 11.6. One
possible explanation for this is that SPIRES is the only database that contains both preprints


    FIG. 1. Histograms of the number of papers written by authors in Medline, the Los Alamos
Archive, and NCSTRL. The dotted lines are fits to the data as described in the text. Inset:
the equivalent histogram for the SPIRES database.

and published papers. It is possible that the high figure for papers per author reflects
duplication of papers in both preprint and published form. However, the maintainers of the
database go to some lengths to avoid this [53], and a more probable explanation is perhaps
that publication rates are higher for the large collaborations favored by high-energy physics,
since a large group of scientists has more person-hours available for the writing of papers.
   In addition to the average numbers of papers per author in each database, it is interesting
to look at the distribution p_k of numbers k of papers per author. In 1926, Alfred Lotka [54]
showed, using a data set compiled by hand, that this distribution followed a power law, with
exponent approximately -2, a result that is sometimes referred to as Lotka's law of scientific
productivity. In other words, in addition to the many authors who publish only a small number
of papers, one expects to see a "fat tail" consisting of a small number of authors who publish
a very large number of papers. In Fig. 1 we show on logarithmic scales histograms for each of
our four databases of the numbers of papers published. These histograms and all the others
shown here were created using the "all initials" versions of the collaboration networks. For
the Medline and NCSTRL databases these histograms follow a power law quite closely, at least
in their tails, with exponents of -2.86(3) and -3.41(7), respectively--somewhat steeper than
those found by Lotka, but in reasonable agreement with other more recent studies [44,55,56].
For the Los Alamos Archive the pure power law is a poor fit. An exponentially truncated power
law does much better:

                          p_k = C k^{-\tau} e^{-k/\kappa},                      (1)

where \tau and \kappa are constants and C is fixed by the requirement of normalization. The
probability p_0 of having zero papers is taken to be zero, since the names of scientists who
have not written any papers do not appear in the database. The exponential cutoff we attribute
to the finite time window of five years used in this study, which prevents any one author from
publishing a very large number of papers. Lotka and subsequent authors who have confirmed his
law have not usually used such a window.
   It is interesting to speculate why the cutoff appears only in physics and not in computer
science or biomedicine. Surely the five year window limits everyone's ability to publish very
large numbers of papers, regardless of their area of specialization? For the case of Medline
one possible explanation is suggested by an inspection of the list of the most published
authors: it transpires that most of these authors have names that are known to occur
frequently. It is thus conceivable that these apparently highly published authors are really
each several different people who have been conflated in our analysis, and hence that there is
not after all any fat tail in the distribution, only the illusion of one produced by the large
number of scientists with commonly occurring names. (This does not, however, explain why the
tail appears to follow a power law.) This argument is strengthened by the sheer numbers of
papers involved. For instance, the number 1 author in the Medline database published, it
appears, 1697 papers, or about one paper a day, including weekends and holidays, every day for
the entire course of our five year study. This seems to be an improbably large output.
   Interestingly, the names that top the list in physics and computer science are not ones
that are known to be common. Thus it is still unclear why the NCSTRL database should have a
power-law tail, although this database is small and it is possible that it does possess a
cutoff in the productivity distribution which is just not visible because of the limits of the
data set.
   For the SPIRES database, which is shown separately in the inset of the figure, neither pure
nor truncated power law fits the data well, the histogram displaying a significant bump around
the 100 paper mark. A possible explanation for this is that a small number of large
collaborations published around this number of papers during the time period studied. Since
each author in such a collaboration is then credited with publishing 100 papers, the
statistics in the tail of the distribution can be substantially skewed by such practices.

                    C. Numbers of authors per paper

   Grossman and Ion [41] have given results showing that the average number of authors on
papers in mathematics has increased steadily over the last 60 years, from a little over 1 to
its current value of about 1.5. Higher numbers still seem to apply to current studies in the
sciences. Purely theoretical papers appear to be typically the work of two scientists, with
high-energy theory and computer science showing averages of 1.99 and 2.22 authors per paper in
our calculations. For databases covering experimental or partly experimental subject areas the
averages are, not surprisingly, higher: 3.75 for biomedicine, 3.35 for astrophysics, 2.66 for
condensed matter physics. The SPIRES high-energy physics database, however, shows the most
startling results, with an average of 8.96 authors per paper, obviously a result of the
presence of papers in the database written by very large collaborations.
 Perhaps what is most surprising about this result is actually





    FIG. 2. Histograms of the number of authors on papers in Medline, the Los Alamos Archive,
and NCSTRL. The dotted lines are the best fit power-law forms. Inset: the equivalent histogram
for the SPIRES database, showing a clear peak in the 200 to 500 author range.

    FIG. 3. Histograms of the number of collaborators of authors in Medline, the Los Alamos
Archive, and NCSTRL. The dotted lines show how power-law distributions with exponents -2 and
-3 would look on the same axes. Inset: the equivalent histogram for the SPIRES database, which
is well fitted by a single power law (dotted line).
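
The guide lines mentioned in these captions are pure power laws, which appear as straight lines on logarithmic axes. As a small illustration, the Python sketch below tabulates a normalized discrete power law, optionally with the exponential cutoff of the truncated form p_k = C k^{-tau} e^{-k/kappa}, and measures its slope on log-log axes. It is only a sketch under stated assumptions: the function names and the upper limit `k_max` used for normalization are choices made here for illustration and do not come from the paper.

```python
import math


def truncated_power_law(tau, kappa=math.inf, k_max=10_000):
    """Return [p_1, ..., p_k_max] with p_k = C * k**(-tau) * exp(-k / kappa).

    kappa = inf gives a pure power law. C is fixed by normalization over
    k = 1..k_max (k_max is an assumption of this sketch), and p_0 = 0.
    """
    weights = [k ** (-tau) * math.exp(-k / kappa) for k in range(1, k_max + 1)]
    norm = sum(weights)
    return [w / norm for w in weights]


def log_log_slope(p, k1, k2):
    """Slope of log p_k against log k between k1 and k2 (k counted from 1)."""
    rise = math.log(p[k2 - 1]) - math.log(p[k1 - 1])
    run = math.log(k2) - math.log(k1)
    return rise / run
```

For a pure power law with `tau=2.0` the measured slope is -2 between any two values of k, whereas with a finite `kappa` the curve bends downward and the slope becomes steeper as k approaches `kappa`, which is the kind of curvature an exponential cutoff would produce.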
how small it is. The hundreds strong megacollaborations of
CERN and Fermilab are sufficiently diluted by theoretical                   peak in the SPIRES data around 700--presumably again a
and smaller experimental groups that the number is only 9,                  result of the presence of large collaborations.
and not 100.                                                                   For the other three databases, the distributions show some
   Distributions of numbers of authors per paper are shown                  curvature. This may, as we have previously suggested 50 ,
in Fig. 2, and appear to have power-law tails with widely                   be the signature of an exponential cutoff, produced once
varying exponents of 6.2(3) Medline , 3.34(5) Los                           again by the finite time window of the study. Redner 57 has
Alamos Archive ,        4.6(1) NCSTRL , and          2.18(7)                suggested an alternative origin for the cutoff using growth
 SPIRES . The SPIRES data, which are again shown in a                       models of networks--see Ref. 10 . Another possibility has
separate inset, also display a pronounced peak in the distri-                                              ´
                                                                            been put forward by Barabasi 58 , based on models of the
bution around 200­500 authors. This peak presumably cor-                    collaboration process. In one such model 51 , the distribu-
responds to the large experimental collaborations that domi-                tion of the number of collaborators of an author follows a
nate the upper end of this histogram.                                       power law with slope 2 initially, changing to slope 3 in
   The largest number of authors on a single paper was 1681                 the tail, the position of the crossover depending on the length
 in high-energy physics, of course .                                        of time for which the collaboration network has been evolv-
                                                                            ing. We show slopes 2 and 3 as dotted lines on the
            D. Numbers of collaborators per author                          figure, and the agreement with the curvature seen in the data
                                                                            is moderately good, particularly for the Medline data. For
    The differences between the various disciplines repre-
                                                                            the Los Alamos and NCSTRL databases, the slope in the tail
sented in the databases are emphasized still more by the
                                                                            seems to be somewhat steeper than 3.
numbers of collaborators that a scientist has, the total num-
ber of people with whom a scientist wrote papers during the
                                                                                             E. Size of the giant component
five year period. The average number of collaborators is
markedly lower in the purely theoretical disciplines (3.87 in                   In the theory of random graphs 24,59­61 it is known
high-energy theory, 3.59 in computer science than in the                    that there is a continuous phase transition with increasing
wholly or partly experimental ones (18.1 in biomedicine,                    density of edges in a graph at which a ``giant component''
15.1 in astrophysics . But the SPIRES high-energy physics                   forms, i.e., a connected subset of vertices whose size scales
database takes the prize once again, with scientists having an              extensively. Well above this transition, in the region where
impressive 173 collaborators, on average, over a five year                  the giant component exists, the giant component fills a large
period. This clearly begs the question whether the high-                    portion of the graph, and all other components i.e., con-
energy coauthorship network can be considered an accurate                   nected subsets of vertices are small, with average size inde-
representation of the high-energy physics community at all;                 pendent of the number n of vertices in the graph. We see a
it seems unlikely that many authors would know 173 col-                     situation reminiscent of this in all of the graphs studied here:
leagues well.                                                               a single large component of connected vertices that fills the
    The distributions of numbers of collaborators are shown                 majority of the volume of the graph, and a number of much
in Fig. 3. In all cases they appear to have long tails, but only            smaller components filling the rest. In Table I we show the
the SPIRES data inset fit a power-law distribution well,                    size of the giant component for each of our databases, both
with a low measured exponent of 1.20. Note also the small                   as total number of vertices and as a fraction of system size.
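
   The quantities discussed in the two preceding subsections (collaborators per author and
component sizes) can be illustrated with a short Python sketch. This is only a minimal example
under stated assumptions, not the procedure used to produce Table I: the function names and the toy
author lists are hypothetical, and every author name is assumed to identify exactly one person.

```python
from collections import defaultdict, deque


def build_coauthorship_graph(papers):
    """Map each author to the set of people they have coauthored with.

    `papers` is an iterable of author lists (hypothetical input format).
    """
    graph = defaultdict(set)
    for authors in papers:
        unique = set(authors)
        for a in unique:
            # Accessing graph[a] also registers sole authors as vertices.
            graph[a].update(unique - {a})
    return dict(graph)


def mean_collaborators(graph):
    """Average number of distinct collaborators per author."""
    return sum(len(nbrs) for nbrs in graph.values()) / len(graph)


def component_sizes(graph):
    """Sizes of all connected components, largest first (breadth-first search)."""
    seen = set()
    sizes = []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            v = queue.popleft()
            size += 1
            for w in graph[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        sizes.append(size)
    return sorted(sizes, reverse=True)


# Toy data invented for illustration only.
papers = [
    ["A", "B"],
    ["B", "C", "D"],
    ["E", "F"],
    ["G"],
]
graph = build_coauthorship_graph(papers)
sizes = component_sizes(graph)
giant_fraction = sizes[0] / len(graph)  # largest component as a fraction of all authors
second_largest = sizes[1]
```

   In this toy example the largest component contains four of the seven authors and the second
largest contains two; on real data the same two numbers correspond to the giant component and
second largest component sizes.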

In all cases the giant component fills around 80% or 90% of the total volume, except for
high-energy theory and computer science, which give smaller figures. A possible explanation of
these two anomalies may be that the corresponding databases give poorer coverage of their subjects.
The hep-th high-energy database is quite widely used in the field, but overlaps to an extent with
the longer established SPIRES database, and it is possible that some authors neglect it for this
reason [53]. The NCSTRL computer science database differs from the others in this study in that the
preprints it contains are submitted by participating institutions, of which there are about 160.
Preprints from institutions not participating are mostly left out of the database, and its coverage
of the subject area is, as a result, incomplete.
   We also show in Table I the size of the second largest component in each of our networks. This
component is in all cases far smaller than the giant component--typically consisting of only 20 or
30 authors--in qualitative agreement with our expectations from the theory of random graphs.
   The figure of 80-90% for the size of the giant component is a promising one. It indicates that
the vast majority of scientists are connected via collaboration, and hence via personal contact,
with the rest of their field. Furthermore, as we show in the following paper [62], the path through
the network that connects two scientists is typically very short. Despite the prevalence of journal
publishing and conferences in the sciences, person-to-person contact is still of paramount
importance in the communication of scientific information, and it is reasonable to suppose that the
scientific enterprise would be significantly hindered if scientists were not so well connected to
one another.

                   F. Clustering coefficients

   An interesting idea circulating in the social networks community currently is that of
"transitivity," which, along with its sibling "structural balance," describes symmetry of
interaction among trios of actors. "Transitivity" has a different meaning in sociology from its
meaning in mathematics and physics, although the two are related. It refers to the extent to which
the existence of ties between actors A and B and between actors B and C implies a tie between A and
C. The transitivity, or more precisely the fraction of transitive triples, is that fraction of
connected triples of vertices which also form "triangles" of interaction. Here a connected triple
means an actor who is connected to two others. In the physics literature, this quantity is usually
called the clustering coefficient C [5], and can be written

$$ C = \frac{3 \times \text{number of triangles on the graph}}{\text{number of connected triples of vertices}}. \tag{2} $$

The factor of 3 in the numerator compensates for the fact that each complete triangle of three
vertices contributes three connected triples, one centered on each of the three vertices, and
ensures that $C = 1$ on a completely connected graph. On all random graphs $C = O(n^{-1})$ [5,24],
where n is the number of vertices, and hence goes to zero in the limit of large graph size. In
social networks it is believed that the clustering coefficient will take a nonzero value even in
very large networks, because there is a finite and probably quite large probability that two people
will be acquainted if they have another acquaintance in common. This is a hypothesis we can test
with our collaboration networks. In Table I we show values of the clustering coefficient C,
calculated from Eq. (2), for each of the databases studied, and as we see the values are indeed
large, as large as 0.7 in the case of the SPIRES database, and around 0.3 or 0.4 for most of the
others.
   There are a number of possible explanations for these high values of C. First of all, it may be
that they indicate simply that collaborations of three or more people are common in science. Every
paper that has three authors clearly contributes a triangle to the numerator of Eq. (2) and hence
increases the clustering coefficient. This is, in a sense, a trivial form of clustering, although
it is by no means socially uninteresting.
   In fact it turns out that this effect can account for some but not all of the clustering seen in
our graphs. One can construct a random graph model of a collaboration network that mimics the
trivial clustering effect, and the results indicate that only about a half of the clustering that
we see is a result of authors collaborating in groups of three or more [24]. The rest of the
clustering must have a social explanation, and there are some obvious possibilities.
   (1) A scientist may collaborate with two colleagues individually, who may then become acquainted
with one another through their common collaborator, and so end up collaborating themselves. This is
the usual explanation for transitivity in acquaintance networks [1].
   (2) Three scientists may all revolve in the same circles--read the same journals, attend the
same conferences--and, as a result, independently start up separate collaborations in pairs, and so
contribute to the value of C, although only the workings of the community, and not any specific
personal interaction, are responsible for introducing them.
   (3) As a special case of the previous possibility--and perhaps the most likely case--three
scientists may all work at the same institution, and as a result may collaborate with one another
in pairs.
   Interesting studies could no doubt be made of these processes by combining our network data with
data on, for instance, institutional affiliations of scientists. Such studies are, however, perhaps
better left to the social scientists.
   The clustering coefficient of the Medline database is worthy of brief mention, since its value
is far smaller than those for the other databases. One possible explanation of this comes from the
unusual social structure of biomedical research, which, unlike the other sciences, has
traditionally been organized into laboratories, each with a "principal investigator" supervising a
large number of postdoctoral associates, students, and technicians working on different projects.
This organization produces a treelike hierarchy of collaborative ties. A tree has no loops in it,
and hence no triangles to contribute to the clustering coefficient. Although the biomedicine
hierarchy is certainly not a perfect tree, it may be sufficiently treelike for the difference to
show up in the value of C. Another possible explanation comes from the

generous tradition of authorship in the biomedical sciences. It is common, for example, for a
researcher to be made a coauthor of a paper in return for synthesizing reagents used in an
experimental procedure. Such a researcher will in many cases have a less than average likelihood of
developing new collaborations with their collaborators' friends, and therefore of increasing the
clustering coefficient.

                     IV. CONCLUSIONS

   In this paper we have studied social networks of scientists in which the actors are authors of
scientific papers, and a tie between two actors represents coauthorship of one or more papers.
Drawing on the lists of authors in four databases of papers in physics, biomedical research, and
computer science, we have constructed explicit networks for papers appearing between the beginning
of 1995 and the end of 1999. We have calculated a large number of statistics for our networks,
including typical numbers of papers per author, authors per paper, and numbers of collaborators per
author in the various fields. We note that the distributions of these quantities roughly follow a
power-law form, although there are some deviations which may be due to the finite time window used
for the study. We also note that in all the networks studied there exists a giant component of
scientists any two of whom can be connected by a short path of intermediate collaborators.
   A number of differences are apparent between the fields studied. Researchers in experimental
disciplines are found to have larger numbers of collaborators on average than those in theoretical
disciplines, with high-energy physicists having easily the largest average number of collaborators.
We also find that in biomedicine the degree of network clustering is much lower than in other
fields, possibly indicating differences in social organization between biomedical and other
research communities.
   In the following paper [62], we continue the study of the networks introduced here, looking at a
variety of nonlocal network properties. Among other things, we look at the typical distances
between pairs of scientists through the network, evaluate a number of centrality indices for our
networks, and propose a method for calculating the strength of collaboration between scientists.

                     ACKNOWLEDGMENTS

   The author would particularly like to thank Paul Ginsparg for his invaluable help in obtaining
the data used for this study. The data were generously made available by Oleg Khovayko, David
Lipman, and Grigoriy Starchenko (Medline), Paul Ginsparg and Geoffrey West (Los Alamos e-Print
Archive), Heath O'Connell (SPIRES), and Carl Lagoze (NCSTRL). The Los Alamos e-Print Archive is
funded by the NSF under Grant No. PHY-9413208. NCSTRL is funded through the DARPA/CNRI test suites
program under DARPA Grant No. N66001-98-1-8908. The author would also like to thank László Barabási
and Erzsébet Ravasz for making available an early version of Ref. [51], and László Barabási, Sankar
Das Sarma, Paul Ginsparg, Rick Grannis, Jon Kleinberg, Laura Landweber, Sid Redner, Ronald
Rousseau, Steve Strogatz, Duncan Watts, and Doug White for many useful comments and suggestions.
This work was funded in part by the National Science Foundation and Intel Corporation.
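
   As an illustration of the clustering coefficient C defined in Eq. (2), the following Python
sketch counts connected triples and triangles directly. It is a minimal example under stated
assumptions rather than the code used for Table I: the function name and the small test graphs are
hypothetical, and the network is assumed to be stored as a symmetric mapping from each vertex to
the set of its neighbors.

```python
from itertools import combinations


def clustering_coefficient(graph):
    """Return C = 3 * (number of triangles) / (number of connected triples).

    `graph` maps each vertex to the set of its neighbors and must be
    symmetric (if b is in graph[a] then a is in graph[b]).
    """
    triples = 0  # connected triples, counted at their central vertex
    closed = 0   # triples whose two outer vertices are also linked
    for center, nbrs in graph.items():
        k = len(nbrs)
        triples += k * (k - 1) // 2
        for a, b in combinations(nbrs, 2):
            if b in graph[a]:
                closed += 1
    # Every triangle closes three triples (one per corner), so `closed`
    # already equals 3 * (number of triangles), the numerator of Eq. (2).
    return closed / triples if triples else 0.0


# A single three-author paper: one triangle, three connected triples.
triangle = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B"}}

# A treelike "laboratory": one central author with three separate coauthors.
star = {"PI": {"X", "Y", "Z"}, "X": {"PI"}, "Y": {"PI"}, "Z": {"PI"}}

# The triangle plus one extra collaborator D attached to C.
mixed = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}

c_triangle = clustering_coefficient(triangle)
c_star = clustering_coefficient(star)
c_mixed = clustering_coefficient(mixed)
```

   The triangle gives C = 1 and the star gives C = 0, matching the statements that a completely
connected graph has C = 1 and that a tree contributes no triangles. The mixed graph has five
connected triples, three of them closed, so C = 3/5.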




 [1] S. Wasserman and K. Faust, Social Network Analysis (Cambridge University Press, Cambridge, 1994).
 [2] J. Scott, Social Network Analysis: A Handbook, 2nd ed. (Sage Publications, London, 2000).
 [3] M.E.J. Newman, J. Stat. Phys. 101, 819 (2000).
 [4] S.H. Strogatz, Nature (London) 410, 268 (2001).
 [5] D.J. Watts and S.H. Strogatz, Nature (London) 393, 440 (1998).
 [6] A.-L. Barabási and R. Albert, Science 286, 509 (1999).
 [7] R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal, in Proceedings of the IEEE Symposium on Foundations of Computer Science (IEEE, New York, 2000).
 [8] C.F. Moukarzel, Phys. Rev. E 60, 6263 (1999).
 [9] S.N. Dorogovtsev and J.F.F. Mendes, Europhys. Lett. 50, 1 (2000).
[10] P.L. Krapivsky, S. Redner, and F. Leyvraz, Phys. Rev. Lett. 85, 4629 (2000).
[11] S.N. Dorogovtsev, J.F.F. Mendes, and A.N. Samukhin, Phys. Rev. Lett. 85, 4633 (2000).
[12] R.V. Kulkarni, E. Almaas, and D. Stroud, Phys. Rev. E 61, 4268 (2000).
[13] J.M. Kleinberg, Nature (London) 406, 845 (2000).
[14] R. Albert, H. Jeong, and A.-L. Barabási, Nature (London) 401, 130 (1999).
[15] M. Barthélémy and L.A.N. Amaral, Phys. Rev. Lett. 82, 3180 (1999).
[16] M.E.J. Newman and D.J. Watts, Phys. Lett. A 263, 341 (1999); Phys. Rev. E 60, 7332 (1999).
[17] M.A. de Menezes, C.F. Moukarzel, and T.J.P. Penna, Europhys. Lett. 50, 574 (2000).
[18] A.-L. Barabási, R. Albert, and H. Jeong, Physica A 272, 173 (1999).
[19] M.E.J. Newman, C. Moore, and D.J. Watts, Phys. Rev. Lett. 84, 3201 (2000).
[20] C. Moore and M.E.J. Newman, Phys. Rev. E 61, 5678 (2000); 62, 7059 (2000).
[21] R. Cohen, K. Erez, D. ben-Avraham, and S. Havlin, Phys. Rev. Lett. 85, 4626 (2000).
[22] D.S. Callaway, M.E.J. Newman, S.H. Strogatz, and D.J. Watts, Phys. Rev. Lett. 85, 5468 (2000).
[23] A. Barrat and M. Weigt, Eur. Phys. J. B 13, 547 (2000).
[24] M.E.J. Newman, S.H. Strogatz, and D.J. Watts, e-print cond-mat/0007235.
[25] P. Mariolis, Soc. Sci. Q. 56, 425 (1975).
[26] J. Galaskiewicz and P.V. Marsden, Soc. Sci. Res. 7, 89 (1978).

[27] J.F. Padgett and C.K. Ansell, Am. J. Sociol. 98, 1259 (1993).
[28] C.C. Foster, A. Rapoport, and C.J. Orwant, Behav. Sci. 8, 56 (1963).
[29] T.J. Fararo and M. Sunshine, A Study of a Biased Friendship Network (Syracuse University Press, Syracuse, NY, 1964).
[30] H.R. Bernard, P.D. Kilworth, M.J. Evans, C. McCarty, and G.A. Selley, Ethnology 2, 155 (1988).
[31] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener, Comput. Networks 33, 309 (2000).
[32] R. Albert, H. Jeong, and A.-L. Barabási, Nature (London) 406, 378 (2000).
[33] J. Abello, A. Buchsbaum, and J. Westbrook, in Proceedings of the Sixth European Symposium on Algorithms, edited by G. Bilardi (Springer, Berlin, 1998).
[34] L.A.N. Amaral, A. Scala, M. Barthélémy, and H.E. Stanley, Proc. Natl. Acad. Sci. U.S.A. 97, 11 149 (2000).
[35] G.F. Davis and H.R. Greve, Am. J. Sociol. 103, 1 (1997).
[36] A. Davis, B.B. Gardner, and M.R. Gardner, Deep South (University of Chicago Press, Chicago, 1941).
[37] http://www.imdb.com/
[38] If one considers the worldwide web to be a social network (an issue of some debate--see B. Wellman, J. Salaff, D. Dimitrova, L. Garton, M. Gulia, and C. Haythornthwaite, Annu. Rev. Sociol. 22, 213 (1996)), then it certainly dwarfs the networks studied here, having, it is estimated, about a billion vertices at the time of writing.
[39] P. Hoffman, The Man Who Loved Only Numbers (Hyperion, New York, 1998).
[40] P. Erdős and M. Kac, Am. J. Math. 26, 738 (1940); R.M. Ziff, G.E. Uhlenbeck, and M. Kac, Phys. Rep. 32, 169 (1977); M.E.J. Newman and R.M. Ziff, Phys. Rev. Lett. 85, 4104 (2000).
[41] J.W. Grossman and P.D.F. Ion, Congressus Numerantium 108, 129 (1995).
[42] R. De Castro and J.W. Grossman, Math. Intelligencer 21, 51 (1999).
[43] V. Batagelj and A. Mrvar, Soc. Networks 22, 173 (2000).
[44] L. Egghe and R. Rousseau, Introduction to Informetrics (Elsevier, Amsterdam, 1990).
[45] O. Persson and M. Beckmann, Scientometrics 33, 351 (1995).
[46] G. Melin and O. Persson, Scientometrics 36, 363 (1996).
[47] H. Kretschmer, Z. Sozialpsychol. 29, 307 (1998).
[48] Y. Ding, S. Foo, and G. Chowdhury, Int. Inf. Lib. Rev. 30, 367 (1999).
[49] There has been a considerable amount of work on networks of citations between papers, both in information science (see D.J. de Solla Price, Science 149, 510 (1965), for instance), and more recently in physics (see S. Redner, Eur. Phys. J. B 4, 131 (1998)). In these networks, the actors are papers and the directed ties between them are citations of one paper by another. However, while citation data are plentiful and many results are known, citation networks are not true social networks since the authors of two papers need not be acquainted for one of them to cite the other's work. On the other hand, citation probably does imply a certain congruence in the subject matter of the two papers, which, although not a social relationship, may certainly be of interest for other reasons.
[50] M.E.J. Newman, Proc. Natl. Acad. Sci. U.S.A. 98, 404 (2001).
[51] A.-L. Barabási, H. Jeong, Z. Néda, E. Ravasz, A. Schubert, and T. Vicsek, e-print cond-mat/0104162.
[52] M.E.J. Newman, e-print cond-mat/0104209.
[53] H.B. O'Connell, e-print physics/0007040.
[54] A.J. Lotka, J. Wash. Acad. Sci. 16, 317 (1926).
[55] H. Voos, J. Am. Soc. Inf. Sci. 25, 270 (1974).
[56] M.L. Pao, J. Am. Soc. Inf. Sci. 37, 26 (1986).
[57] S. Redner (private communication).
[58] A.-L. Barabási (private communication).
[59] P. Erdős and A. Rényi, Publ. Math. Inst. Hung. Acad. Sci. 5, 17 (1960).
[60] B. Bollobás, Random Graphs (Academic Press, New York, 1985).
[61] M. Molloy and B. Reed, Random Struct. Algorithms 6, 161 (1995); Combinatorics, Prob. Comput. 7, 295 (1998).
[62] M.E.J. Newman, following paper, Phys. Rev. E 64, 016132 (2001).



