PHYSICAL REVIEW E, VOLUME 64, 016131
Scientific collaboration networks. I. Network construction and fundamental results
M. E. J. Newman
Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, New Mexico 87501
and Center for Applied Mathematics, Cornell University, Rhodes Hall, Ithaca, New York 14853
Received 5 December 2000; revised manuscript received 1 February 2001; published 28 June 2001
Using computer databases of scientific papers in physics, biomedical research, and computer science, we
have constructed networks of collaboration between scientists in each of these disciplines. In these networks
two scientists are considered connected if they have coauthored one or more papers together. We study a
variety of statistical properties of our networks, including numbers of papers written by authors, numbers of
authors per paper, numbers of collaborators that scientists have, existence and size of a giant component of
connected scientists, and degree of clustering in the networks. We also highlight some apparent differences in
collaboration patterns between the subjects studied. In the following paper, we study a number of measures of
centrality and connectedness in the same networks.
DOI: 10.1103/PhysRevE.64.016131    PACS number(s): 89.75.Hc, 89.65.-s, 89.70.+c, 01.30.-y
I. INTRODUCTION

A social network [1,2] is a set of people or groups each of which has connections of some kind to some or all of the others. In the language of social network analysis, the people or groups are called "actors" and the connections "ties." Both actors and ties can be defined in different ways depending on the questions of interest. An actor might be a single person, a team, or a company. A tie might be a friendship between two people, a collaboration or common member between two teams, or a business relationship between companies.

Social network analysis has a history stretching back at least half a century, and has produced many results concerning social influence, social groupings, inequality, disease propagation, communication of information, and indeed almost every topic that has interested 20th century sociology.

The Physical Review is, however, a physics journal. Why should a physicist be interested in social networks? There has, in fact, been a substantial surge of interest in social networks within the physics community recently, as evidenced by the large body of papers on the topic--see Refs. [3-24] and references therein. The techniques of statistical physics in particular turn out to be well suited to the study of these networks. Profitable use has been made of a variety of physical modeling techniques [5-7], exact solutions [8-13], Monte Carlo simulation [14-17], scaling and renormalization group methods [15-17], mean-field theory [18,19], percolation theory [20-22], the replica method [23], generating functions [20,22,24], and a host of other techniques familiar to the readers of this publication.

In this paper and the following one, we make use of some of these techniques in the study of some specific examples of social networks. However, our subject matter will be of interest to physicists for another reason: it's about them. In these two papers, we study networks in which the actors are scientists, and the ties between them are scientific collaborations, as documented by the learned articles that they write.

II. COLLABORATION NETWORKS

Traditional investigations of social networks have been carried out through field studies. Typically one looks at a fairly self-contained community such as a business community [25-27], a school [28,29], a religious or ethnic community [30], and so forth, and constructs the network of ties by interviewing participants, or by circulating questionnaires. A study will ask respondents to name those with whom they have the closest ties, probably ranked by subjective closeness, and may optionally call for additional information about those people or about the nature of the ties.

Studies of this kind have revealed much about the structure of communities, but they suffer from two substantial problems that make them poor sources of data for the kind of approach to network analysis that physics has adopted. First, the data they return are not numerous. Collecting and compiling data from these studies is an arduous process and most data sets contain no more than a few tens or hundreds of actors. It is a rare study that exceeds 1000 actors. This makes the statistical accuracy of many results poor, a particular difficulty for the large-system-size methods used in statistical physics. Second, they contain significant and uncontrolled errors as a result of the subjective nature of respondents' replies. What one respondent considers to be a friendship or acquaintance, for example, may be completely different from what another respondent does. In studies of schoolchildren [28,29], for instance, it is found that some children will claim friendship with every single one of their hundreds of schoolmates, while others will name only one or two friends. Clearly these respondents are employing different definitions of friendship.

Reliable statistics do exist for some other types of networks. Examples include the world-wide web [14,31,32], power grids [5], telephone call graphs [33], and airline timetables [34]. These graphs are certainly interesting in their own right, and furthermore might loosely be regarded as social networks, since their structure clearly reflects something about the structure of the society that created them. However, their connection to the "true" social networks discussed here is tenuous at best and so, for our purposes, they cannot offer a great deal of insight.

A more promising source of data is the affiliation network. An affiliation network is a network of actors connected by common membership in groups of some sort, such
1063-651X/2001/64(1)/016131(8)/$20.00    64 016131-1    ©2001 The American Physical Society
as clubs, teams, or organizations. Examples that have been studied in the past include company CEOs and the clubs they frequent [26], company directors and the boards of directors on which they sit [25,35], women and the social events they attend [36], and movie actors and the movies in which they appear [5,34]. Data on affiliation networks tend to be more reliable than those on other social networks, since membership of a group can often be determined with a precision not available when considering friendship or other types of acquaintance. Very large networks can be assembled in this way as well, since in many cases group membership can be ascertained from membership lists, making time-consuming interviews or questionnaires unnecessary. A network of movie actors, for example, and the movies in which they appear has been compiled using the resources of the Internet Movie Database [37], and contains the names of nearly half a million actors--a much better sample on which to perform statistics than most social networks, although it is unclear whether this particular network has any real social interest.

In this paper we construct affiliation networks of scientists in which a link between two scientists is established by their coauthorship of one or more scientific papers. Thus the groups to which scientists belong in this network are the groups of coauthors of a single paper. This network is in some ways more truly a social network than many affiliation networks; it is probably fair to say that most pairs of people who have written a paper together are genuinely acquainted with one another, in a way that movie actors who appeared together in a movie may not be. There are exceptions--some very large collaborations, for example in high-energy physics, will contain coauthors who have never even met--and we will discuss these at the appropriate point. By and large, however, the network reflects genuine professional interaction between scientists, and may be the largest social network ever studied [38].

The idea of constructing a network of coauthorship is not new. Many readers will be familiar with the concept of the Erdős number, named after Paul Erdős, the Hungarian mathematician, one of the founding fathers of graph theory (among other things) [39]. At some point, it became a popular cocktail party pursuit for mathematicians to calculate how far removed they were in terms of publication from Erdős. Those who had published a paper with Erdős were given an Erdős number of 1, those who had published with one of those people but not with Erdős a number of 2, and so forth. The present author, for example, has an Erdős number of 3, via Robert Ziff and Mark Kac [40]. In the jargon of social networks, your Erdős number is the geodesic distance between you and Erdős in the coauthorship network. In recent studies [41-43], it has been found that the average Erdős number is about 4.7, and the maximum known finite Erdős number within mathematics is 15. These results are probably influenced to some extent by Erdős' prodigious mathematical output: he published at least 1512 papers, more than any other mathematician ever except possibly Leonhard Euler. However, quantitatively similar, if not quite so impressive, results are in most cases found if the network is centered on another mathematician. On the other hand, the fifth most published mathematician, Lucien Godeaux, produced 644 papers, on 643 of which he was the sole author. He has no finite Erdős number [41]. Clearly sheer size of output is not a sufficient condition for high connectedness.

There is also a substantial body of work in bibliometrics (a specialty within information science) on extraction of collaboration patterns from publication data--see Refs. [44-48] and references therein. However, these studies have not so far attempted to reconstruct entire collaboration networks from bibliographic data, concentrating more on organizational and institutional aspects of collaboration [49].

In this paper and the following one, we study networks of scientists using bibliographic data drawn from four publicly available databases of papers.

(1) Los Alamos e-Print Archive: a database of unrefereed preprints in physics, self-submitted by their authors, running from 1992 to the present. This database is subdivided into specialties within physics, such as condensed matter and high-energy physics.

(2) Medline: a database of articles on biomedical research published in refereed journals, stretching from 1961 to the present. Entries in the database are updated by the database's maintainers, rather than papers' authors, giving it relatively thorough coverage of its subject area. The inclusion of biomedicine is crucial in a study such as this one. In most countries biomedical research easily dwarfs civilian research on any other topic, in terms of both expenditure and human effort. Any study that omitted it would be leaving out the largest part of current scientific research.

(3) Stanford Public Information Retrieval System (SPIRES): a database of preprints and published papers in high-energy physics, both theoretical and experimental, from 1974 to the present. The contents of this database are professionally maintained. High-energy physics is an interesting case socially, having a tradition of much larger experimental collaborations than other disciplines.

(4) Networked Computer Science Technical Reference Library (NCSTRL): a database of preprints in computer science, submitted by participating institutions and stretching back about ten years.

We have constructed complete collaboration networks for each of these databases separately, and analyzed them using a variety of techniques, some standard, some invented for the purpose. A brief report of some of the work described here has appeared previously as Ref. [50].

III. FUNDAMENTAL RESULTS

For this study, we constructed collaboration networks using data from a five year period from 1995 to 1999 inclusive, although data for much longer periods were available in some of the databases. There were several reasons for using this fairly short time window. First, older data are less complete than newer for all databases. Second, we wanted to study the same time period for all databases, so as to be able to make valid comparisons between collaboration patterns in different fields. The coverage provided by both the Los Alamos Archive and the NCSTRL database is relatively poor before 1995, and this sets a limit on how far back we can look. Third, networks change over time, both because people
TABLE I. Some fundamental statistics for the scientific collaboration networks studied here. Numbers in parentheses are standard errors on the least significant figures.

                                      Los Alamos e-Print Archive
                          Medline     complete    astro-ph    cond-mat    hep-th      SPIRES      NCSTRL
Total number of papers    2163923     98502       22029       22016       19085       66652       13169
Total number of authors   1520251     52909       16706       16726       8361        56627       11994
  First initial only      1090584     45685       14303       15451       7676        47445       10998
Mean papers per author    6.4(6)      5.1(2)      4.8(2)      3.65(7)     4.8(1)      11.6(5)     2.55(5)
Mean authors per paper    3.754(2)    2.530(7)    3.35(2)     2.66(1)     1.99(1)     8.96(18)    2.22(1)
Collaborators per author  18.1(1.3)   9.7(2)      15.1(3)     5.86(9)     3.87(5)     173(6)      3.59(5)
Size of giant component   1395693     44337       14845       13861       5835        49002       6396
  First initial only      1019418     39709       12874       13324       5593        43089       6706
  As a percentage         92.6(4)%    85.4(8)%    89.4(3)%    84.6(8)%    71.4(8)%    88.7(1.1)%  57.2(1.9)%
2nd largest component     49          18          19          16          24          69          42
Clustering coefficient    0.066(7)    0.43(1)     0.414(6)    0.348(6)    0.327(2)    0.726(8)    0.496(6)
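As an illustration of how the first few rows of Table I can be obtained from a list of papers, the following is a minimal Python sketch, not the program used for this study. It assumes each paper is supplied as a non-empty list of author-name strings; the function names `build_network` and `basic_stats` and the toy data are hypothetical. Names are looked up in Python's hash-based dictionaries and sets, which is a substitute for, not a reproduction of, the ordered binary tree used in the text.

```python
from itertools import combinations


def build_network(papers):
    """Build a coauthorship network from an iterable of author-name lists.

    Returns (papers_per_author, neighbors), where neighbors maps each
    author to the set of distinct coauthors. Hypothetical helper.
    """
    papers_per_author = {}
    neighbors = {}
    for authors in papers:
        distinct = sorted(set(authors))  # ignore a name repeated on one paper
        for name in distinct:
            papers_per_author[name] = papers_per_author.get(name, 0) + 1
            neighbors.setdefault(name, set())  # sole authors are vertices too
        # Add an edge between each pair of authors on the paper.
        for a, b in combinations(distinct, 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return papers_per_author, neighbors


def basic_stats(papers):
    """Compute Table I style averages for a non-empty list of papers."""
    papers = [list(p) for p in papers]
    papers_per_author, neighbors = build_network(papers)
    n_papers = len(papers)
    n_authors = len(papers_per_author)
    return {
        "papers": n_papers,
        "authors": n_authors,
        "mean_papers_per_author": sum(papers_per_author.values()) / n_authors,
        "mean_authors_per_paper": sum(len(set(p)) for p in papers) / n_papers,
        "mean_collaborators": sum(len(s) for s in neighbors.values()) / n_authors,
    }


# Toy data (made up): three papers, four distinct author names.
toy_papers = [["A", "B"], ["A", "B", "C"], ["D"]]
stats = basic_stats(toy_papers)
```

In this sketch repeated collaborations between the same pair of authors are counted once, so the number of collaborators is the vertex degree rather than the number of joint papers.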
enter and leave the professions they represent and because practices of scientific collaboration and publishing change. In this particular study we have not examined time evolution in the network, although this is certainly an interesting topic for research and indeed is currently under investigation [51,52]. For our purposes, a short window of data is desirable, to ensure that the collaboration network is roughly static during the study.

The raw data for the networks described here are computer files containing lists of papers, including authors' names and possibly other information such as title, abstract, date, journal reference, and so forth. Construction of the collaboration networks is straightforward. The files are parsed to extract author names and as names are found a list is maintained of the ones seen so far--vertices already in the network--so that recurring names can be correctly assigned to extant vertices. Edges are added between each pair of authors on each paper. A naive implementation of this calculation, in which names are stored in a simple array, would take time O(pn), where p is the total number of papers in the database and n the number of authors. This, however, turns out to be prohibitively slow for large networks since p and n are of similar size and may be a million or more. Instead therefore, we store the names of the authors in an ordered binary tree, which reduces the running time to O(p log n), making the calculation tractable, even for the largest databases studied here.

In Table I we give a summary of some of the basic results for the networks studied here. We discuss these results in detail in the rest of this section.

A. Number of authors

The size of the databases varies considerably from about a million authors for Medline to about ten thousand for NCSTRL. In fact, it is difficult to say with precision how many authors there are. One can say how many distinct names appear in a database, but the number of names is not the same as the number of authors. A single author may report their name differently on different papers. For example, F. L. Wright, Francis Wright, and Frank Lloyd Wright could all be the same person. Also, two authors may have the same name. Grossman and Ion [41] point out that there are two American mathematicians named Norman Lloyd Johnson, who are known to be distinct people but between whom computer programs such as ours cannot hope to distinguish. Even additional clues such as home institution or field of specialization cannot be used to distinguish such people, since many scientists have more than one institution or publish in more than one field. The present author, for example, has addresses at the Santa Fe Institute and Cornell University, and publishes in both statistical physics and paleontology.

In order to control for these biases, we constructed two different versions of each of the collaboration networks studied here, as follows. In the first, we identify each author by his or her surname and first initial only. This method is clearly prone to confusing two people for one, but will rarely fail to identify two names which genuinely refer to the same person. In the second version of each network, we identify authors by surname and all initials. This method can much more reliably distinguish authors from one another, but will also identify one person as two if they give their initials differently on different papers. Indeed this second measure appears to overestimate the number of authors in a database substantially. Networks constructed in these two different fashions therefore give upper and lower bounds on the number of authors, and hence also give bounds on many of the other quantities studied here. In Table I we give numbers of authors in each network using both methods, but for many of the other quantities we give only an error estimate based on the separation of the bounds.

B. Number of papers per author

The average number of papers per author in the various subject areas is in the range of around three to six over the five year period. The only exception is the SPIRES database, covering high-energy physics, in which the figure is significantly higher at 11.6. One possible explanation for this is that SPIRES is the only database that contains both preprints
and published papers. It is possible that the high figure for papers per author reflects duplication of papers in both preprint and published form. However, the maintainers of the database go to some lengths to avoid this [53], and a more probable explanation is perhaps that publication rates are higher for the large collaborations favored by high-energy physics, since a large group of scientists has more person-hours available for the writing of papers.

In addition to the average numbers of papers per author in each database, it is interesting to look at the distribution $p_k$ of numbers $k$ of papers per author. In 1926, Alfred Lotka [54] showed, using a data set compiled by hand, that this distribution followed a power law, with exponent approximately -2, a result that is sometimes referred to as Lotka's law of scientific productivity. In other words, in addition to the many authors who publish only a small number of papers, one expects to see a "fat tail" consisting of a small number of authors who publish a very large number of papers. In Fig. 1 we show on logarithmic scales histograms for each of our four databases of the numbers of papers published. These histograms and all the others shown here were created using the "all initials" versions of the collaboration networks. For the Medline and NCSTRL databases these histograms follow a power law quite closely, at least in their tails, with exponents of -2.86(3) and -3.41(7), respectively--somewhat steeper than those found by Lotka, but in reasonable agreement with other more recent studies [44,55,56]. For the Los Alamos Archive the pure power law is a poor fit. An exponentially truncated power law does much better:

    $p_k = C k^{-\tau} e^{-k/\kappa}$,    (1)

where $\tau$ and $\kappa$ are constants and $C$ is fixed by the requirement of normalization. The probability $p_0$ of having zero papers is taken to be zero, since the names of scientists who have not written any papers do not appear in the database. The exponential cutoff we attribute to the finite time window of five years used in this study, which prevents any one author from publishing a very large number of papers. Lotka and subsequent authors who have confirmed his law have not usually used such a window.

FIG. 1. Histograms of the number of papers written by authors in Medline, the Los Alamos Archive, and NCSTRL. The dotted lines are fits to the data as described in the text. Inset: the equivalent histogram for the SPIRES database.

It is interesting to speculate why the cutoff appears only in physics and not in computer science or biomedicine. Surely the five year window limits everyone's ability to publish very large numbers of papers, regardless of their area of specialization? For the case of Medline one possible explanation is suggested by an inspection of the list of the most published authors: it transpires that most of these authors have names that are known to occur frequently. It is thus conceivable that these apparently highly published authors are really each several different people who have been conflated in our analysis, and hence that there is not after all any fat tail in the distribution, only the illusion of one produced by the large number of scientists with commonly occurring names. This does not, however, explain why the tail appears to follow a power law. This argument is strengthened by the sheer numbers of papers involved. For instance, the number 1 author in the Medline database published, it appears, 1697 papers, or about one paper a day, including weekends and holidays, every day for the entire course of our five year study. This seems to be an improbably large output.

Interestingly, the names that top the list in physics and computer science are not ones that are known to be common. Thus it is still unclear why the NCSTRL database should have a power-law tail, although this database is small and it is possible that it does possess a cutoff in the productivity distribution which is just not visible because of the limits of the data set.

For the SPIRES database, which is shown separately in the inset of the figure, neither pure nor truncated power law fits the data well, the histogram displaying a significant bump around the 100 paper mark. A possible explanation for this is that a small number of large collaborations published around this number of papers during the time period studied. Since each author in such a collaboration is then credited with publishing 100 papers, the statistics in the tail of the distribution can be substantially skewed by such practices.

C. Numbers of authors per paper

Grossman and Ion [41] have given results showing that the average number of authors on papers in mathematics has increased steadily over the last 60 years, from a little over 1 to its current value of about 1.5. Higher numbers still seem to apply to current studies in the sciences. Purely theoretical papers appear to be typically the work of two scientists, with high-energy theory and computer science showing averages of 1.99 and 2.22 authors per paper in our calculations. For databases covering experimental or partly experimental subject areas the averages are, not surprisingly, higher: 3.75 for biomedicine, 3.35 for astrophysics, 2.66 for condensed matter physics. The SPIRES high-energy physics database, however, shows the most startling results, with an average of 8.96 authors per paper, obviously a result of the presence of papers in the database written by very large collaborations. Perhaps what is most surprising about this result is actually
how small it is. The hundreds strong megacollaborations of CERN and Fermilab are sufficiently diluted by theoretical and smaller experimental groups that the number is only 9, and not 100.

Distributions of numbers of authors per paper are shown in Fig. 2, and appear to have power-law tails with widely varying exponents of -6.2(3) (Medline), -3.34(5) (Los Alamos Archive), -4.6(1) (NCSTRL), and -2.18(7) (SPIRES). The SPIRES data, which are again shown in a separate inset, also display a pronounced peak in the distribution around 200-500 authors. This peak presumably corresponds to the large experimental collaborations that dominate the upper end of this histogram.

FIG. 2. Histograms of the number of authors on papers in Medline, the Los Alamos Archive, and NCSTRL. The dotted lines are the best fit power-law forms. Inset: the equivalent histogram for the SPIRES database, showing a clear peak in the 200 to 500 author range.

The largest number of authors on a single paper was 1681 (in high-energy physics, of course).

D. Numbers of collaborators per author

The differences between the various disciplines represented in the databases are emphasized still more by the numbers of collaborators that a scientist has, the total number of people with whom a scientist wrote papers during the five year period. The average number of collaborators is markedly lower in the purely theoretical disciplines (3.87 in high-energy theory, 3.59 in computer science) than in the wholly or partly experimental ones (18.1 in biomedicine, 15.1 in astrophysics). But the SPIRES high-energy physics database takes the prize once again, with scientists having an impressive 173 collaborators, on average, over a five year period. This clearly begs the question whether the high-energy coauthorship network can be considered an accurate representation of the high-energy physics community at all; it seems unlikely that many authors would know 173 colleagues well.

The distributions of numbers of collaborators are shown in Fig. 3. In all cases they appear to have long tails, but only the SPIRES data (inset) fit a power-law distribution well, with a low measured exponent of -1.20. Note also the small peak in the SPIRES data around 700--presumably again a result of the presence of large collaborations.

FIG. 3. Histograms of the number of collaborators of authors in Medline, the Los Alamos Archive, and NCSTRL. The dotted lines show how power-law distributions with exponents -2 and -3 would look on the same axes. Inset: the equivalent histogram for the SPIRES database, which is well fitted by a single power law (dotted line).

For the other three databases, the distributions show some curvature. This may, as we have previously suggested [50], be the signature of an exponential cutoff, produced once again by the finite time window of the study. Redner [57] has suggested an alternative origin for the cutoff using growth models of networks--see Ref. [10]. Another possibility has been put forward by Barabási [58], based on models of the collaboration process. In one such model [51], the distribution of the number of collaborators of an author follows a power law with slope -2 initially, changing to slope -3 in the tail, the position of the crossover depending on the length of time for which the collaboration network has been evolving. We show slopes -2 and -3 as dotted lines on the figure, and the agreement with the curvature seen in the data is moderately good, particularly for the Medline data. For the Los Alamos and NCSTRL databases, the slope in the tail seems to be somewhat steeper than -3.

E. Size of the giant component

In the theory of random graphs [24,59-61] it is known that there is a continuous phase transition with increasing density of edges in a graph at which a "giant component" forms, i.e., a connected subset of vertices whose size scales extensively. Well above this transition, in the region where the giant component exists, the giant component fills a large portion of the graph, and all other components (i.e., connected subsets of vertices) are small, with average size independent of the number n of vertices in the graph. We see a situation reminiscent of this in all of the graphs studied here: a single large component of connected vertices that fills the majority of the volume of the graph, and a number of much smaller components filling the rest. In Table I we show the size of the giant component for each of our databases, both as total number of vertices and as a fraction of system size.
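One common way to measure component sizes is a breadth-first search started from each vertex not yet visited. The following Python sketch illustrates the idea and is not the code used for this study. It assumes the network is given as a dictionary mapping every author to the set of his or her coauthors, with every author present as a key; the function name `component_sizes` and the toy network are hypothetical.

```python
from collections import deque


def component_sizes(neighbors):
    """Return the sizes of all connected components, largest first.

    neighbors: dict mapping each vertex to a set of adjacent vertices;
    every vertex, including isolated ones, must appear as a key.
    """
    seen = set()
    sizes = []
    for start in neighbors:
        if start in seen:
            continue
        # Breadth-first search over the component containing `start`.
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            v = queue.popleft()
            size += 1
            for w in neighbors[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        sizes.append(size)
    return sorted(sizes, reverse=True)


# Toy network (made up): components {A, B, C}, {D, E}, and isolated F.
toy_network = {
    "A": {"B", "C"},
    "B": {"A"},
    "C": {"A"},
    "D": {"E"},
    "E": {"D"},
    "F": set(),
}
sizes = component_sizes(toy_network)
giant_fraction = sizes[0] / len(toy_network)
```

Each vertex and edge is visited a bounded number of times, so the running time is linear in the size of the network. The first entry of `sizes` corresponds to the "Size of giant component" row of Table I and the second entry to the "2nd largest component" row.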
In all cases the giant component fills around 80% or 90% of the total volume, except for high-energy theory and computer science, which give smaller figures. A possible explanation of these two anomalies may be that the corresponding databases give poorer coverage of their subjects. The hep-th high-energy database is quite widely used in the field, but overlaps to an extent with the longer established SPIRES database, and it is possible that some authors neglect it for this reason [53]. The NCSTRL computer science database differs from the others in this study in that the preprints it contains are submitted by participating institutions, of which there are about 160. Preprints from institutions not participating are mostly left out of the database, and its coverage of the subject area is, as a result, incomplete.

We also show in Table I the size of the second largest component in each of our networks. This component is in all cases far smaller than the giant component--typically consisting of only 20 or 30 authors--in qualitative agreement with our expectations from the theory of random graphs.

The figure of 80-90% for the size of the giant component is a promising one. It indicates that the vast majority of scientists are connected via collaboration, and hence via personal contact, with the rest of their field. Furthermore, as we show in the following paper [62], the path through the network that connects two scientists is typically very short. Despite the prevalence of journal publishing and conferences in the sciences, person-to-person contact is still of paramount importance in the communication of scientific information, and it is reasonable to suppose that the scientific enterprise would be significantly hindered if scientists were not so well connected to one another.

F. Clustering coefficients

An interesting idea circulating in the social networks community currently is that of "transitivity," which, along with its sibling "structural balance," describes symmetry of interaction among trios of actors. "Transitivity" has a different meaning in sociology from its meaning in mathematics and physics, although the two are related. It refers to the extent to which the existence of ties between actors A and B and between actors B and C implies a tie between A and C. The transitivity, or more precisely the fraction of transitive triples, is that fraction of connected triples of vertices which also form "triangles" of interaction. Here a connected triple means an actor who is connected to two others. In the physics literature, this quantity is usually called the clustering coefficient C [5], and can be written

efficient will take a nonzero value even in very large networks, because there is a finite and probably quite large probability that two people will be acquainted if they have another acquaintance in common. This is a hypothesis we can test with our collaboration networks. In Table I we show values of the clustering coefficient C, calculated from Eq. (2), for each of the databases studied, and as we see the values are indeed large, as large as 0.7 in the case of the SPIRES database, and around 0.3 or 0.4 for most of the others.

There are a number of possible explanations for these high values of C. First of all, it may be that they indicate simply that collaborations of three or more people are common in science. Every paper that has three authors clearly contributes a triangle to the numerator of Eq. (2) and hence increases the clustering coefficient. This is, in a sense, a trivial form of clustering, although it is by no means socially uninteresting.

In fact it turns out that this effect can account for some but not all of the clustering seen in our graphs. One can construct a random graph model of a collaboration network that mimics the trivial clustering effect, and the results indicate that only about a half of the clustering that we see is a result of authors collaborating in groups of three or more [24]. The rest of the clustering must have a social explanation, and there are some obvious possibilities.

(1) A scientist may collaborate with two colleagues individually, who may then become acquainted with one another through their common collaborator, and so end up collaborating themselves. This is the usual explanation for transitivity in acquaintance networks [1].

(2) Three scientists may all revolve in the same circles--read the same journals, attend the same conferences--and, as a result, independently start up separate collaborations in pairs, and so contribute to the value of C, although only the workings of the community, and not any specific personal interaction, is responsible for introducing them.

(3) As a special case of the previous possibility--and perhaps the most likely case--three scientists may all work at the same institution, and as a result may collaborate with one another in pairs.

Interesting studies could no doubt be made of these processes by combining our network data with data on, for instance, institutional affiliations of scientists. Such studies are, however, perhaps better left to the social scientists.

The clustering coefficient of the Medline database is worthy of brief mention, since its value is far smaller than those for the other databases. One possible explanation of this
3 number of triangles on the graph comes from the unusual social structure of biomedical re-
C . 2 search, which, unlike the other sciences, has traditionally
number of connected triples of vertices
been organized into laboratories, each with a ``principal in-
vestigator'' supervising a large number of postdoctoral asso-
The factor of 3 in the numerator compensates for the fact that ciates, students, and technicians working on different
each complete triangle of three vertices contributes three projects. This organization produces a treelike hierarchy of
connected triples, one centered on each of the three vertices, collaborative ties. A tree has no loops in it, and hence no
and ensures that C 1 on a completely connected graph. On triangles to contribute to the clustering coefficient. Although
all random graphs C O(n 1 ) 5,24 , where n is the number the biomedicine hierarchy is certainly not a perfect tree, it
of vertices, and hence goes to zero in the limit of large graph may be sufficiently treelike for the difference to show up in
size. In social networks it is believed that the clustering co- the value of C. Another possible explanation comes from the
generous tradition of authorship in the biomedical sciences. It is common, for example, for a researcher to be made a coauthor of a paper in return for synthesizing reagents used in an experimental procedure. Such a researcher will in many cases have a less than average likelihood of developing new collaborations with their collaborators' friends, and therefore of increasing the clustering coefficient.

IV. CONCLUSIONS

In this paper we have studied social networks of scientists in which the actors are authors of scientific papers, and a tie between two actors represents coauthorship of one or more papers. Drawing on the lists of authors in four databases of papers in physics, biomedical research, and computer science, we have constructed explicit networks for papers appearing between the beginning of 1995 and the end of 1999. We have calculated a large number of statistics for our networks, including typical numbers of papers per author, authors per paper, and numbers of collaborators per author in the various fields. We note that the distributions of these quantities roughly follow a power-law form, although there are some deviations which may be due to the finite time window used for the study. We also note that in all the networks studied there exists a giant component of scientists any two of whom can be connected by a short path of intermediate collaborators.

A number of differences are apparent between the fields studied. Researchers in experimental disciplines are found to have larger numbers of collaborators on average than those in theoretical disciplines, with high-energy physicists having easily the largest average number of collaborators. We also find that in biomedicine the degree of network clustering is much lower than in other fields, possibly indicating differences in social organization between biomedical and other research communities.

In the following paper [62], we continue the study of the networks introduced here, looking at a variety of nonlocal network properties. Among other things, we look at the typical distances between pairs of scientists through the network, evaluate a number of centrality indices for our networks, and propose a method for calculating the strength of collaboration between scientists.

ACKNOWLEDGMENTS

The author would particularly like to thank Paul Ginsparg for his invaluable help in obtaining the data used for this study. The data were generously made available by Oleg Khovayko, David Lipman, and Grigoriy Starchenko (Medline), Paul Ginsparg and Geoffrey West (Los Alamos e-Print Archive), Heath O'Connell (SPIRES), and Carl Lagoze (NCSTRL). The Los Alamos e-Print Archive is funded by the NSF under Grant No. PHY-9413208. NCSTRL is funded through the DARPA/CNRI test suites program under DARPA Grant No. N66001-98-1-8908. The author would also like to thank László Barabási and Erzsébet Ravasz for making available an early version of Ref. [51], and László Barabási, Sankar Das Sarma, Paul Ginsparg, Rick Grannis, Jon Kleinberg, Laura Landweber, Sid Redner, Ronald Rousseau, Steve Strogatz, Duncan Watts, and Doug White for many useful comments and suggestions. This work was funded in part by the National Science Foundation and Intel Corporation.
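As an illustration of the clustering coefficient defined in Eq. (2), the following is a minimal Python sketch, not the code used for this study. It assumes that each paper is supplied as a list of author names; the function names `coauthor_graph` and `clustering_coefficient` and the toy author labels are hypothetical. Every pair of coauthors on a paper is linked, and C is then computed as the fraction of connected triples that are closed. Each triangle closes three triples, one centered on each of its vertices, which supplies the factor of 3 in the numerator.

```python
from collections import defaultdict
from itertools import combinations


def coauthor_graph(papers):
    """Build an undirected coauthorship graph from an iterable of author lists.

    Two authors are linked if they share at least one paper.
    Returns a dict mapping each author to the set of their collaborators.
    """
    adj = defaultdict(set)
    for authors in papers:
        unique = set(authors)
        for a in unique:
            adj[a]  # touch the key so sole authors appear as isolated vertices
        for a, b in combinations(unique, 2):
            adj[a].add(b)
            adj[b].add(a)
    return dict(adj)


def clustering_coefficient(adj):
    """Return C = 3 * (number of triangles) / (number of connected triples)."""
    triples = 0  # connected triples, counted at their central vertex
    closed = 0   # triples whose two end vertices are also linked (= 3 * triangles)
    for nbrs in adj.values():
        k = len(nbrs)
        triples += k * (k - 1) // 2
        for a, b in combinations(nbrs, 2):
            if b in adj.get(a, ()):
                closed += 1
    return closed / triples if triples else 0.0


if __name__ == "__main__":
    # Hypothetical toy data: one three-author paper and one two-author paper.
    papers = [["A", "B", "C"], ["C", "D"]]
    print(clustering_coefficient(coauthor_graph(papers)))
```

In this sketch a single three-author paper gives C = 1, while two two-author papers that share one author form a tree and give C = 0, in line with the remarks on three-author papers and treelike collaboration structures in Sec. III F.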
[1] S. Wasserman and K. Faust, Social Network Analysis (Cambridge University Press, Cambridge, 1994).
[2] J. Scott, Social Network Analysis: A Handbook, 2nd ed. (Sage Publications, London, 2000).
[3] M.E.J. Newman, J. Stat. Phys. 101, 819 (2000).
[4] S.H. Strogatz, Nature (London) 410, 268 (2001).
[5] D.J. Watts and S.H. Strogatz, Nature (London) 393, 440 (1998).
[6] A.-L. Barabási and R. Albert, Science 286, 509 (1999).
[7] R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal, in Proceedings of the IEEE Symposium on Foundations of Computer Science (IEEE, New York, 2000).
[8] C.F. Moukarzel, Phys. Rev. E 60, 6263 (1999).
[9] S.N. Dorogovtsev and J.F.F. Mendes, Europhys. Lett. 50, 1 (2000).
[10] P.L. Krapivsky, S. Redner, and F. Leyvraz, Phys. Rev. Lett. 85, 4629 (2000).
[11] S.N. Dorogovtsev, J.F.F. Mendes, and A.N. Samukhin, Phys. Rev. Lett. 85, 4633 (2000).
[12] R.V. Kulkarni, E. Almaas, and D. Stroud, Phys. Rev. E 61, 4268 (2000).
[13] J.M. Kleinberg, Nature (London) 406, 845 (2000).
[14] R. Albert, H. Jeong, and A.-L. Barabási, Nature (London) 401, 130 (1999).
[15] M. Barthélémy and L.A.N. Amaral, Phys. Rev. Lett. 82, 3180 (1999).
[16] M.E.J. Newman and D.J. Watts, Phys. Lett. A 263, 341 (1999); Phys. Rev. E 60, 7332 (1999).
[17] M.A. de Menezes, C.F. Moukarzel, and T.J.P. Penna, Europhys. Lett. 50, 574 (2000).
[18] A.-L. Barabási, R. Albert, and H. Jeong, Physica A 272, 173 (1999).
[19] M.E.J. Newman, C. Moore, and D.J. Watts, Phys. Rev. Lett. 84, 3201 (2000).
[20] C. Moore and M.E.J. Newman, Phys. Rev. E 61, 5678 (2000); 62, 7059 (2000).
[21] R. Cohen, K. Erez, D. ben-Avraham, and S. Havlin, Phys. Rev. Lett. 85, 4626 (2000).
[22] D.S. Callaway, M.E.J. Newman, S.H. Strogatz, and D.J. Watts, Phys. Rev. Lett. 85, 5468 (2000).
[23] A. Barrat and M. Weigt, Eur. Phys. J. B 13, 547 (2000).
[24] M.E.J. Newman, S.H. Strogatz, and D.J. Watts, e-print cond-mat/0007235.
[25] P. Mariolis, Soc. Sci. Q. 56, 425 (1975).
[26] J. Galaskiewicz and P.V. Marsden, Soc. Sci. Res. 7, 89 (1978).
[27] J.F. Padgett and C.K. Ansell, Am. J. Sociol. 98, 1259 (1993).
[28] C.C. Foster, A. Rapoport, and C.J. Orwant, Behav. Sci. 8, 56 (1963).
[29] T.J. Fararo and M. Sunshine, A Study of a Biased Friendship Network (Syracuse University Press, Syracuse, NY, 1964).
[30] H.R. Bernard, P.D. Kilworth, M.J. Evans, C. McCarty, and G.A. Selley, Ethnology 2, 155 (1988).
[31] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener, Comput. Networks 33, 309 (2000).
[32] R. Albert, H. Jeong, and A.-L. Barabási, Nature (London) 406, 378 (2000).
[33] J. Abello, A. Buchsbaum, and J. Westbrook, in Proceedings of the Sixth European Symposium on Algorithms, edited by G. Bilardi (Springer, Berlin, 1998).
[34] L.A.N. Amaral, A. Scala, M. Barthélémy, and H.E. Stanley, Proc. Natl. Acad. Sci. U.S.A. 97, 11 149 (2000).
[35] G.F. Davis and H.R. Greve, Am. J. Sociol. 103, 1 (1997).
[36] A. Davis, B.B. Gardner, and M.R. Gardner, Deep South (University of Chicago Press, Chicago, 1941).
[37] http://www.imdb.com/
[38] If one considers the worldwide web to be a social network (an issue of some debate--see B. Wellman, J. Salaff, D. Dimitrova, L. Garton, M. Gulia, and C. Haythornthwaite, Annu. Rev. Sociol. 22, 213 (1996)), then it certainly dwarfs the networks studied here, having, it is estimated, about a billion vertices at the time of writing.
[39] P. Hoffman, The Man Who Loved Only Numbers (Hyperion, New York, 1998).
[40] P. Erdős and M. Kac, Am. J. Math. 26, 738 (1940); R.M. Ziff, G.E. Uhlenbeck, and M. Kac, Phys. Rep. 32, 169 (1977); M.E.J. Newman and R.M. Ziff, Phys. Rev. Lett. 85, 4104 (2000).
[41] J.W. Grossman and P.D.F. Ion, Congressus Numerantium 108, 129 (1995).
[42] R. De Castro and J.W. Grossman, Math. Intelligencer 21, 51 (1999).
[43] V. Batagelj and A. Mrvar, Soc. Networks 22, 173 (2000).
[44] L. Egghe and R. Rousseau, Introduction to Informetrics (Elsevier, Amsterdam, 1990).
[45] O. Persson and M. Beckmann, Scientometrics 33, 351 (1995).
[46] G. Melin and O. Persson, Scientometrics 36, 363 (1996).
[47] H. Kretschmer, Z. Sozialpsychol. 29, 307 (1998).
[48] Y. Ding, S. Foo, and G. Chowdhury, Int. Inf. Lib. Rev. 30, 367 (1999).
[49] There has been a considerable amount of work on networks of citations between papers, both in information science (see D.J. de Solla Price, Science 149, 510 (1965), for instance), and more recently in physics (see S. Redner, Eur. Phys. J. B 4, 131 (1998)). In these networks, the actors are papers and the directed ties between them are citations of one paper by another. However, while citation data are plentiful and many results are known, citation networks are not true social networks since the authors of two papers need not be acquainted for one of them to cite the other's work. On the other hand, citation probably does imply a certain congruence in the subject matter of the two papers, which, although not a social relationship, may certainly be of interest for other reasons.
[50] M.E.J. Newman, Proc. Natl. Acad. Sci. U.S.A. 98, 404 (2001).
[51] A.-L. Barabási, H. Jeong, Z. Néda, E. Ravasz, A. Schubert, and T. Vicsek, e-print cond-mat/0104162.
[52] M.E.J. Newman, e-print cond-mat/0104209.
[53] H.B. O'Connell, e-print physics/0007040.
[54] A.J. Lotka, J. Wash. Acad. Sci. 16, 317 (1926).
[55] H. Voos, J. Am. Soc. Inf. Sci. 25, 270 (1974).
[56] M.L. Pao, J. Am. Soc. Inf. Sci. 37, 26 (1986).
[57] S. Redner (private communication).
[58] A.-L. Barabási (private communication).
[59] P. Erdős and A. Rényi, Publ. Math. Inst. Hung. Acad. Sci. 5, 17 (1960).
[60] B. Bollobás, Random Graphs (Academic Press, New York, 1985).
[61] M. Molloy and B. Reed, Random Struct. Algorithms 6, 161 (1995); Combinatorics, Prob. Comput. 7, 295 (1998).
[62] M.E.J. Newman, following paper, Phys. Rev. E 64, 016132 (2001).