Information about http://dpennock.com/papers/popescul-uai-2001-cf.pdf

In Proceedings of the Seventeenth Conference on Uncertainty in…

Tags: explicit ratings, hybrid systems, implicit data, independence way, information science, menders, morgan kaufmann, mystery men, nec research institute, pennock, philadelphia pa, princeton nj, probabilistic framework, probabilistic models, recommender systems, seventeenth conference, sparse data, uncertainty in artificial intelligence, ungar, university of pennsylvania,
Pages: 8
Language: english
Created: Fri Nov 30 16:57:32 2001
Display cached document
Page 1
image
Page 2
image
Page 3
image
Page 4
image
Page 5
image
Page 6
image
Page 7
image
Page 8
image
In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence (UAI-2001),
   pp. 437-444, Morgan Kaufmann, San Francisco, 2001.



                    Probabilistic Models for Unified Collaborative and Content-Based
                             Recommendation in Sparse-Data Environments


                       Alexandrin Popescul and Lyle H. Ungar               David M. Pennock and Steve Lawrence
                    Department of Computer and Information Science                NEC Research Institute
                              University of Pennsylvania                            4 Independence Way
                                Philadelphia, PA 19104                              Princeton, NJ 08540
                         popescul@unagi.cis.upenn.edu                        dpennock@research.nj.nec.com
                             ungar@cis.upenn.edu                             lawrence@research.nj.nec.com
                                 Abstract                               1997). These systems suggest products of interest to con-
                                                                        sumers based on their explicit and implicit preferences, the
            Recommender systems leverage product and                    preferences of other consumers, and consumer and prod-
            community information to target products to                 uct attributes. For example, a movie recommender might
            consumers. Researchers have developed col-                  combine explicit ratings data (e.g., Bob rates X-men a 7
            laborative recommenders, content-based recom-               out of 10), implicit data (e.g., Mary purchased Hannibal),
            menders, and a few hybrid systems. We pro-                  user demographic information (e.g., Mary is female), and
            pose a unified probabilistic framework for merg-            movie content information (e.g., Mystery Men is a comedy)
            ing collaborative and content-based recommen-               to make recommendations to specific users.
            dations. We extend Hofmann's (1999) aspect                  Traditionally, recommender systems have fallen into two
            model to incorporate three-way co-occurrence                main categories. Collaborative filtering methods utilize
            data among users, items, and item content. The              explicit or implicit ratings from many users to recom-
            relative influence of collaboration data versus             mend items to a given user (Breese et al., 1998; Resnick
            content data is not imposed as an exogenous pa-             et al., 1994; Shardanand & Maes, 1995). Content-based
            rameter, but rather emerges naturally from the              or information filtering methods make recommendations
            given data sources. However, global probabilis-             by matching a user's query, or other user information, to
            tic models coupled with standard EM learning al-            descriptive product information (Mooney & Roy, 2000;
            gorithms tend to drastically overfit in the sparse-         Salton & McGill, 1983). Pure collaborative systems tend
            data situations typical of recommendation appli-            to fail when little is known about a user, or when he or she
            cations. We show that secondary content in-                 has uncommon interests. On the other hand, content-based
            formation can often be used to overcome spar-               systems cannot account for community endorsements; for
            sity. Experiments on data from the ResearchIn-              example, an information filter might recommend The Mex-
            dex library of Computer Science publications                ican to a user who likes Brad Pitt and Julia Roberts, even
            show that appropriate mixture models incorpo-               though many like-minded users strongly dislike the film.
            rating secondary data produce significantly better
                                                                        Several researchers are exploring hybrid collaborative and
            quality recommenders than -nearest neighbors                content-based recommenders to smooth out the disadvan-
                 
            ( -NN). Global probabilistic models also allow              tages of each (Basu et al., 1998; Claypool et al., 1999;
            more general inferences than local methods like
                                                                        Good et al., 1999).
              -NN.
                                                                        In this paper, we propose a generative probabilistic model
                                                                        for combining collaborative and content-based recommen-
                                                                        dations in a normative manner. The model builds on previ-
       1 INTRODUCTION                                                   ous two-way co-occurrence models for information filter-
                                                                        ing (Hofmann, 1999) and collaborative filtering (Hofmann
       The Internet offers tremendous opportunities for mass per-
                                                                        & Puzicha, 1999). Our model incorporates three-way co-
       sonalization of commercial transactions. Web businesses
                                                                        occurrence data by presuming that users are interested in a
       ideally strive for global reach, while maintaining the feel of
                                                                        set of latent topics which in turn "generate" both items and
       a neighborhood shop where the customers know the own-
                                                                        item content information. Model parameters are learned
       ers, and the owners are familiar with the customers and
                                                                        using expectation maximization (EM), so the relative con-
       their specific needs. To show a personal face on a mas-
                                                                        tributions of collaborative and content-based data are de-
       sive scale, businesses must turn to automated techniques
                                                                        termined in a sound statistical manner. When data is ex-
       like so-called recommender systems (Resnick & Varian,
tremely sparse, as is typically the case for collaboration       a list-ranking problem (Cohen et al., 1999; Freund et al.,
data, EM can suffer from overfitting. In Sections 4 and 5,       1998; Pennock et al., 2000a). Singular Value Decomposi-
we present two techniques to effectively increase the den-       tion (SVD) was used to improve scalability of collaborative
sity of the data by exploiting secondary data. The first uses    filtering systems by dimensionality reduction (Sarwar et al.,
a similarity measure to fill in the user-item co-occurrence      2000).
matrix by inferring which items users are likely to have ac-
                                                                 Pure information filtering systems use only content to make
cessed without the system's knowledge. The second creates
                                                                 recommendations. For example, search engines recom-
an implicit user-content co-occurrence matrix by treating
                                                                 mend web pages with content similar to (e.g., containing)
each user's access to an item as if it were many accesses to
                                                                 user queries (Salton & McGill, 1983). In contrast to collab-
all of the pieces of content in the item's descriptive infor-
                                                                 orative methods, content-based systems can even recom-
mation. We evaluate these models in the context of a doc-
                                                                 mend new (previously unaccessed) items to users without
ument recommendation system. Specifically, we train and
                                                                 any history in the system. Mooney & Roy (2000) develop
test the models on data from ResearchIndex,1 an online dig-
                                                                 a content-based book recommender using information ex-
ital library of Computer Science papers (Lawrence et al.,
                                                                 traction and machine learning techniques for text catego-
1999; Bollacker et al., 2000). Section 6 presents empiri-
                                                                 rization.
cal results and evaluations. In Section 6.2, we demonstrate
the potential ineffectiveness of EM in sparse-data situa-        Several authors suggest methods for combining collabora-
tions, using both ResearchIndex data and synthetic data. In      tive filtering with information filtering. Basu et al. (1998)
Section 6.3, we show that both of our density-augmenting         present a hybrid collaborative and content-based movie rec-
methods are effective at reducing overfitting and improv-        ommender. Collaborative features (e.g., Bob and Mary like
ing predictive accuracy. Our models yield more accurate
                                                                 Titanic) are encoded as set-valued attributes. These fea-
recommendations than the commonly-employed -nearest              tures are combined with more typical content features (e.g.,
             
neighbors ( -NN) algorithm. Moreover, our global models          Traffic is rated R) to inductively learn a binary classifier that
can produce predictions for any user-item pair, whereas lo-      separates liked and disliked movies. Also in a movie rec-
                   
cal methods like -NN are simply incapable of producing           ommender domain, Good et al. (1999) suggest using con-
meaningful recommendations for many user-item combi-             tent based software agents to automatically generate rat-
nations.                                                         ings to reduce data sparsity. Claypool et al. (1999) employ
                                                                 separate collaborative and content-based recommenders in
                                                                 an online newspaper domain, combining the two predic-
2 BACKGROUND AND RELATED                                         tions using an adaptive weighted average: as the number
  WORK                                                           of users accessing an item increases, the weight of the col-
                                                                 laborative component tends to increase. Web hyperlinks
A variety of collaborative filtering algorithms have been        and document citations can be thought of as implicit en-
designed and deployed. The Tapestry system relied on             dorsements or ratings. Cohn and Hofmann (2001) combine
each user to identify like-minded users manually (Gold-          document content information with this type of connectiv-
berg et al., 1992). GroupLens (Resnick et al., 1994) and         ity information to identify principle topics and authoritative
Ringo (Shardanand & Maes, 1995), developed indepen-              documents in a collection.
dently, were the first to automate prediction. Typical al-
                                                                 Recommender systems technology is in current use in
gorithms compute similarity scores between all pairs of
                                                                 many Internet commerce applications. For example, the
users; predictions for a given user are generated by weight-
                                                                 University of Minnesota's GroupLens and MovieLens2 re-
ing other users' ratings proportionally to their similarity to
                                                                 search projects spawned Net Perceptions,3 a successful In-
the given user. A variety of similarity metrics are possible,
                                                                 ternet startup offering personalization and recommendation
including correlation (Resnick et al., 1994), mean-squared
                                                                 services. Alexa4 is a web browser plug-in that recommends
difference (Shardanand & Maes, 1995), vector similarity
                                                                 related links based in part on other people's web surfing
(Breese et al., 1998), or probability that users are of the
                                                                 habits. A growing number of companies,5 including Ama-
same type (Pennock et al., 2000b). Other algorithms con-
                                                                 zon.com, CDNow.com, and Levis.com, employ or provide
struct a model of underlying user preferences, from which
                                                                 recommender system solutions (Schafer et al., 1999). Rec-
predictions are inferred. Examples include Bayesian net-
                                                                 ommendation tools originally developed at Microsoft Re-
work models (Breese et al., 1998), dependency network
                                                                 search are now included with the Commerce Edition of Mi-
models (Heckerman et al., 2000), clustering models (Un-
                                                                 crosoft's SiteServer,6 and are currently in use at multiple
gar & Foster, 1998), and models of how people rate items
(Pennock et al., 2000b). Collaborative filtering has also           2
                                                                      http://movielens.umn.edu/
been cast as a machine learning problem (Basu et al., 1998;         3
                                                                      http://www.netperceptions.com/
Billsus & Pazzani, 1998; Nakamura & Abe, 1998) and as               4
                                                                      http://www.alexa.com/
                                                                    5
                                                                      http://www.cis.upenn.edu/~ungar/CF/
   1                                                                6
       http://researchindex.org/                                      http://www.microsoft.com/siteserver
sites.                                                                                                                                      P( u)


3 THREE-WAY ASPECT MODEL                                                                                                            u
Hofmann (1999) proposes an aspect model--a latent class                                                                                     P( z | u)
statistical mixture model--for associating word-document
co-occurrence data with a set of latent variables. Hofmann
and Puzicha (1999) apply the aspect model to user-item
                                                                                                            P( d | z)
                                                                                                                                     z                P( w | z)
co-occurrence data for collaborative filtering. In the con-
text of a document recommender system, users                    §¥£¡
                                                               ¦ ¤ ¢
                   ¨
   1# ))("¡# ¨ %¡ #
  0  ! '©   &$ ¡© ¢ , together with the documents they access          1# 43¡2
                                                                      5 
!         ¦                 , form observations         , which                                         d                                                       w
are associated with one of the latent variables                  @976
                                                                ¦ 8 ¢
               B'6))( © 6 ¨
               !A 
              . Conceptually, the latent variables are top-
ics. Users choose among topics according to their interests;                        Figure 1: Graphical representation of the three-way aspect
topic variables in turn "generate" documents. Users are as-                         model.
sumed independent of documents, given the topics. The
joint probability distribution over users, topics, and doc-
                                                                                                                            5 # 2
uments is
                          5 6 G # EC 5 ¡ HF6EC 5 3¡EC
                                  2D G 2D 2D
                                     . An equivalent specifica-
                                                                                    Let
                                                                                       in document . That is,
                                                                                                                        # a YF¡Ec a
                                                                                                                                 5 fa # Ec e1# ¡ 43¡E£¦ 5 a # YF¡Ec
                                                                                                  be the number of times user "saw" word
                                                                                                                                        2 d5  2c
                                                                                                                                          ,                    2
tion of the joint distribution that treats users and documents
                                                                                                                             5 `#  2
               5 6 G # DEC 5 6 IF¡EC 5 F6EC
symmetrically is       2 G 2D 2D             . The joint distri-
                                                                                    where
                                                                                    ment , and
                                                                                                                  5 af # Ec YF¡Ec #
                                                                                                 is the number of times user accessed docu-
                                                                                                                         2#
                                                                                                        is the number of times word occursa
                                                                                                                                                    ¡
bution over just users and documents is                                             in document . Given training data of this form, the log
                                                                                    likelihood of the data is                     g
            5 6 G # VC 5 6 U3¡EC 5 F6ETSQ¦ 1# 43¡EC
                    2D G 2D 2D C RP 5  2D                                            5 a # Y¡FEC v)u15 af # 43¡Ec P hg
                                                                                              2D    t          2 s pr qi ¦
                                                                                                                      p
Model parameters are learned using EM (or variants) to
find a local maximum of the log-likelihood of the training                          The corresponding EM algorithm is:
data. After the model is learned, documents can be ranked
                                         `# YF¡VC X5 ¡ G # VC
                                        5  2D W 2D                                  E step:
for a given user according to                      ; that is,
                                                                                            5 G 2 D 5 # 5 D 5
according to how likely it is that the user will access the
corresponding document. Documents with high
                                                                       5 ¡ G # EC
                                                                               2D    5 `6 GB3a6 w3aC 5VC G6 # 6 G2 VC2 EqC5D 6 I6 F¡U3¡2G VCE5 C `F6F2 6 EECDD2 C yR x ¦ 5 a # Y¡ HF6VC
                                                                                                2E q D G D 
                                                                                                 D 
                                                                                                                                          2                                     G 2D
that the user has not yet seen are good candidates for rec-
ommendation. Note that the aspect model allows multiple                             M step:
topics per user, unlike most clustering algorithms that as-                                        5 6 U3¡EC
                                                                                                  W G 2D                      5 a # Y¡ HF6'EC 5 a # YF¡Ec P
sign each user to a single class.                                                                                                    G 2D            2 s r    p
This model is a pure collaborative filtering model; docu-                                        W5 6 G # EC                 5 a # Y¡ HF6'EC 5 a # YF¡Ec P
ment content is not taken into account. We propose an                                                      2D                        G 2D            2 s i  p
extension of the aspect model to include three-way co-
occurrence data among users, documents, and document
                                                  5 a # YF¡2                                    W5 6 w3aVC
                                                                                                       G 2D                 5 a # Y¡ HF6'EC 5 a # YF¡Ec P
                                                                                                                                     G 2D            2 r qi
content. An observation is a triple                     
                                                correspond-        #                                                                                      p
                                  ¡
ing to an event of a user accessing document contain-                                          W5 F6VC  2D                5 fa # 4¡ HF6'EC 5 fa # 43¡Ec P
                                                                                                                                   G 2D            2 s r qi
           a
ing word . Conceptually, users choose (latent) topics ,                        6                                                                         pp
which in turn generate both documents and their content
words. Users, documents, and words are assumed indepen-                             The E and M steps are repeated alternately until a local
dent, given the topics. An asymmetric specification of the                          maximum of the log-likelihood is reached.
joint distribution corresponding to this conceptual view-
                 5 6 BFaEC 5 G6 # EC 5 ¡ b6VC 5 F¡VC                                As in the two-way model,
                                                                                                                                    5 fa # 43¡VC s x 5 ¡ G # VC
                                                                                                                                            2D
                                                                                                                                            is
                                                                                                                                                     W 2D
point is             G 2D 2D G 2D 2D     . Figure 1 depicts
                                                                                    used to recommend documents to users. Both content and
this model as a Bayesian network. An equivalent symmet-
ric specification (obtained by reversing the arc from users                         collaboration data can influence recommendations. The
         5 6 GBFaEC 5 G6 # EC 5 6 IF¡VC 5 b6VC
to topics) is     2D 2D G 2D 2D              . Marginaliz-                          relative weight of each type of data depends on the nature
ing out , we obtain                                         6                       of the given data; EM automatically exploits whatever data
                                                                                    source is most informative.
     5 6 GBFaEC 5 G6 # DEC 5 6 GI¡FDVC 5 b62VC R P ¦ 5 a # YF¡EC
             2D 2                  2 D                        2D                    Hofmann (1999) proposes a variant of EM called tempered
                                                                                    EM (TEM) to help avoid overfitting and improve general-
ization. TEM makes use of an inverse computational tem-                                   where     and       kj         mj   are vectors with tf-idf coordinates as de-
          
perature . EM is modified by raising the conditionals in                                  scribed above.
the right-hand side of the E step equation to the power .                                 In our setting, the user-document co-occurrence data ma-
          d
          ¦
TEM starts with          , and decreases with the rate                                Q   trix is smoothed by replacing zero entries with average sim-
using        ¥
              ¦   , when the performance on a held-out por-
                                                                                          ilarities above a certain threshold between the correspond-
tion of the training set deteriorates.
                                                                                          ing document and all documents that the user has accessed.
In Section 6.2, we see that even TEM fails to generalize                                  This effectively increases the density (i.e., the fraction of
when data is extremely sparse. In the next two sections, we                               non-zero entries) in the matrix. Figure 2 shows how the
propose two methods that effectively increase data density,                               density of the ResearchIndex data (described in detail in
thereby improving learning performance.                                                   Section 6.1) changes depending on the similarity threshold
                                                                                          used in smoothing.
4 SIMILARITY-BASED DATA




                                                                                                    0.5
  SMOOTHING




                                                                                                    0.4
One approach to overcoming the overfitting problem with
sparse data is to use the similarity between items to smooth




                                                                                                    0.3
the co-occurrence data matrix. The co-occurrence matrix



                                                                                          density
contains integer entries that are the number of times the cor-



                                                                                                    0.2
responding row and column items co-occur in the observed
data set. Similarity between items in the database can be
                                                                                                    0.1
used to fill some zeros in the co-occurrence data matrix,
thus reducing sparsity and helping to address overfitting.
                                                                             q#
                                                                                                    0.0




                      ¡
Consider a user who has accessed document once, and  S#                                                                 0.2         0.4                0.6         0.8   1.0

assume there exists a document          that has not been ac-                                                                              threshold


              ¡
cessed by , and that documents and are very similar
                                                      q#                       S#
                                                                                          Figure 2: Density of the data against the similarity threshold used
in content (e.g., they share many words in common). Con-                                  in smoothing.
sider a similarity metric which yields                       .gfe§¦ 5  #   # 2
                                                                 d              
Informally, we may believe that there is a 70% chance that
     ¡
user actually has seen document , even though the sys-
                                                       #
tem does not know it. Using this reasoning, we propose to                                 5 IMPLICIT USER-WORDS ASPECT
preprocess the initial co-occurrence data matrix, by filling                                MODEL
in some of the zeros with the aggregate similarity between
the corresponding document and the documents definitely                                   As another method to overcome overfitting due to sparsity,
                  ¡
seen by user . The co-occurrence matrix will no longer                                    we propose a model where the co-occurrence data points
be integer valued, but may also contain similarity values                                 represent events corresponding to users looking at words
which range between 0 and 1. The EM algorithm used in                                     in a particular document. The concept of a document is
the original aspect model also converges in this situation.                               removed to create observations
                                                                                                                                                  5 fa43¡2
                                                                                                                               . Sparsity is drasti-  
                                                                                          cally reduced because documents contain many words, and
The most frequently used similarity measure in informa-
                                                                                          many words are contained in multiple documents.
tion retrieval is vector-space cosine similarity (Salton &
McGill, 1983). Each document is viewed as a vector whose                                  In this case, the aspect model produces estimates of con-
dimensions correspond to words in the vocabulary; the
                                                                                                                                                5 G 2D 5 6 B3aEC
component magnitudes are the tf-idf weights of the words.
                                                                                          ditional probabilities
                                                                                          latent class variable priors
                                                                                                                        and
                                                                                                                                           5 b6E6D U3¡VC
                                                                                                                                     , as well as the
                                                                                                                                               2 C
                                                                                                                          , allowing us to compute
                                                                                                                                                           G 2D
Tf-idf is the product of term frequency         --the num-
                                                               1# "FaIfih
                                                              5  2
#                         a
ber of times word occurs in the corresponding document                                                              5 6 GBa3DVC 5 6 GU¡3DEC 5 F6EC R P ¦ 5 fa43¡VC
                                                                                                                            2           2 2D                  2D
  --and inverse document frequency

                               5 IG23aG $ i # q(ut ¦ 5 FaIi # 
                                              v          2                                But we are still interested in estimating probabilities
                                                                                          5 ¡ G # EC
                                                                                                  2Dto produce recommendations of the papers that
                                                                                                                                                 5 ¡ G # VC
G  G$ 5 #
where        is the number of documents in a collection and                               have the highest scores on the             scale for a given   2D
       3aIi
         2
        is the number of documents in which word occurs                           a                       ¡
                                                                                          user . By assuming conditional independence of words in
at least once. The similarity between two documents is then                               a document, we can overcome this problem by treating a
                                                                                          document as a bag of words: the probability of a document
                       )GG m m j j(G)Gok kGn )G j j (GG ¦ 5 eljYbj
                                                            m k 2                         is the product of the probabilities of the words it contains,
                                                                                          adjusted for different document lengths with the geometric
mean:                                                           periments reported in this paper were conducted using the
           rr
         r fq© f55 ¡ G  3aEC  p2 E5 ¡ G # EC
                           2D           W 2D                    relatively dense subset of 1,000 users and 5,000 papers.


where     are words in    G #5G  #
                           and
                                             a      #
                                   is the length of . Con-
                                                                6.2 OVERFITTING
ditional probabilities         ¡ G FaEC
                                     2D
                                  follow directly from the
                                                                6.2.1 User-Document And User-Document-Word
model:             5 aYF¡VC 5                                         Aspect Models
            5 aYF¡2 V2C D s x ¦ ¡ G 3aEC
                  D                       2D                    Training the two-way user-document aspect model on the
                                                                relatively dense set of 1000 users and 5000 documents re-
Inclusion of words through documents, and eliminating           sulted in immediate overfitting of EM, meaning that the test
documents from direct participation in modeling, increased      data log-likelihood began to fall after only the first or sec-
the density of our dataset (described below) from 0.38% to      ond iteration. This immediate overfitting occured for num-
almost 9%.                                                      bers of latent classes ranging from 3 to 50. Using tempered
                                                                EM (under several reasonable temperature change sched-
                                                                ules) only kept the test data log-likelihood approximately
6 RESULTS AND EVALUATION
                                                                at the same level as the initial random seed, without signif-
                                                                icant improvements.
Section 6.1 describes the ResearchIndex data. In Sec-
tion 6.2, we examine under what conditions learning oc-         Including the words contained in the 5,000 documents, and
curs at all, by measuring the increase in the log-likelihood    fitting the three-way aspect model also resulted in imme-
of test data as EM proceeds. We find that if data is too        diate overfitting. Again, TEM failed to yield significant
sparse, neither EM nor TEM succeeds in significantly in-        improvements in the test data log-likelihood.
creasing the test data log-likelihood over a random initial
guess. In Section 6.3, we evaluate the recommendations of       6.2.2 Standard Aspect Model, Synthetic Data
our density-augmented models, according to Breese et al.'s
(1998) rank scoring metric.                                     To examine whether this extreme overfitting was specific
                                                                to the ResearchIndex data, we tested the aspect model on
6.1 RESEARCHINDEX DATA                                          a simple synthetic data set. Users are divided into three
                                                                disjoint groups according to the following scheme:
The data for our experiments was taken from ResearchIn-
dex (formerly CiteSeer), the largest freely available             1. users 0­49 read papers 0­299,
database of scientific literature (Lawrence et al., 1999;
Bollacker et al., 2000). ResearchIndex catalogs scientific        2. users 50­99 read papers 300­599, and
publications available on the web in PostScript and PDF
formats. The full-text of documents as well as the cita-          3. users 100­149 read papers 600­899,
tions made in them are indexed. ResearchIndex supports
keyword-based retrieval and browsing of the database, for
example by following the links between papers formed by         where the probabilities that users read papers in their inter-
citations. Document detail page access information was          est set are uniform.
obtained for July to November, 2000 (multiple accesses by       We designed the data so that the "correct" model with three
the same user were included). Heuristics were used to filter    latent states is obvious. We generated several datasets of
out robots. Words from the first 5 kbytes of the text of each   differing densities and trained a three-latent-variable aspect
document were extracted.                                        model on each to see whether EM converges to the correct
We used the data from July to October as the training set,      model. We performed validation tests at each iteration with
and the data from November as the testing set. Due to           test sets of the same density as the corresponding training
the rapid growth in usage of ResearchIndex, November            set. Figure 3 plots the iteration (averaged over fifty ran-
accounted for 31% of the total five month activity. The         dom restarts of EM) where overfitting7 first occurs versus
data included 33,050 unique users accessing the details         the dataset density. In datasets of density less than 1.5%
of 177,232 documents. Density of this dataset was only          the process consistently overfits from the first iteration. For
0.01%.                                                          datasets of density 2.5%, test performance begins to deteri-
                                                                orate after about five iterations on average. For datasets of
We extracted a relatively dense (0.38%) subset of the 1000      density 4%, overfitting begins after ten iterations.
most active users and the 5000 documents they accessed
the most. We believe these very low density levels are typ-         7
                                                                      Defined as the point where test data log-likelihood starts de-
ical of many real-world recommendation applications. Ex-        teriorating.
                                                                                                                                                                s
                                                                                                                                                                         s
                                                                                                                                   The maximum value achieved in these experiments was
                     10
                                                                                                                                   1.87 for       w ¦  
                                                                                                                                                     . scores have local maxima, suggesting
                                                                                                                                   their sensitivity to the sparsity of the user-document data.
                     8
overfitAtIteration

                     6




                                                                                                                                       1.8
                     4




                                                                                                                                       1.7
                                                                                                                                       1.6
                     2




                                                                                                                                   R

                                                                                                                                       1.5
                                                      0.01                        0.02                       0.03           0.04

                                                                                        density




                                                                                                                                       1.4
Figure 3: Iteration (averaged over fifty random restarts) where
overfitting occurs versus density of the synthetic data.
                                                                                                                                             10                     20                 30                     40    50       60

                                                                                                                                                                                               k


6.3 RECOMMENDATION ACCURACY
                                                                                                                                   Figure 4: Total utility of the ranked lists over all users produced
We find that both EM and TEM fail on very sparse data, in-
                                                                                                                                         
                                                                                                                                   by -NN.
cluding ResearchIndex data and synthetic data. In contrast,
EM is effective on both of our density-augmented models                                                                                                                                                                  s
                                                                                                                                   6.3.3 Smoothed Aspect Model
(Sections 4 and 5). Here we compare these two models
                                   
to the -NN algorithm, commonly employed in commer-                                                                                 Figure 5 shows the total utility of the ranked lists ( ) for
cial recommender systems. We use the rank scoring metric                                                                                          s
                                                                                                                                   all users against the similarity threshold used for smooth-
(Breese et al., 1998) to evaluate recommendations.                                                                                 ing for the example of 25 latent variables. Although the
                                                                                                                                                                    s
                                                                                                                                   values of fluctuate, the pattern is clear through the sig-
6.3.1 Evaluation Criteria                                                                                                                                                                       
                                                                                                                                   nificant linear least squares fit ( -value of the slope coeffi-
                                                                                                                                   cient is 0.02)-- is larger when more content is included
Breese et al. (1998) define the expected utility of a ranked
                                                                                                                                   (smaller similarity threshold). As the similarity threshold
list of items as                                             s                                                                     grows, the initial data matrix becomes sparser, until it be-
                                                                  {}© y 5 )|vux 4f{Fq 3¡© 2 zty  1w  P ¦ i
                                                                                                 x                                 comes impossible to learn (immediate overfitting). Local
                                                                                                                                   fluctuations are due to the stochastic nature of EM; in par-
                                                                                                                                   ticular, its sensitivity to the randomly initialized parameter
                              u
where is the rank of an item in the full list of suggestions                                                                       values and the number of restarts attempted (five in these
proposed by a recommender,               is 1 if user accessed
                                                                                          5 vu43¡2
                                                                                                                    ¡              experiments) when the data matrix becomes sparser as the
                          u
item in the test set and 0 otherwise, and is the viewing                            t                         ~                    similarity threshold grows.
half-life, which is the place of an item in the list such that it
has a 50% chance of being viewed.8 As in their paper, we
                                                                                                                                       2.0




use                       7~
                           ¦
            , and found that our resulting conclusions were
                                                                                                                                       1.8




not sensitive to the precise value of this parameter. The
final score reflecting the utilities of all users in the test set            s                    s
                                                                                                                                       1.6




is                                                                   s x
                                                                   i
                                                                  i i i x %¦
                                                                                                                                       1.4
                                                                                                                                   R




                                      s                                   dd
                                                                                                                                       1.2




where
                                              i
                                              
               is the maximum possible utility obtained
                                                                                                                                       1.0




when all items that user has accessed appear at the top                    ¡
                                                                                                                                       0.8




of the ranked list.

6.3.2
                                                      s
                                              -Nearest Neighbors
                                                                                                                                                      0.2                        0.4

                                                                                                                                                                                            threshold
                                                                                                                                                                                                        0.6        0.8       1.0




Figure 4 gives scores for the experiments with -NN in
                                                                                                                                   Figure 5: Total utility of the ranked lists over all users produced
                                                                                                                                   by the similarity-based User-Document model against the simi-
standard formulation on the user-document data for differ-
                                                                                                                                                                             s
                                                                                                                                   larity threshold used in smoothing (25 latent class variables).
ent values of , ranging from 10 to 60 with an interval of 5.
   8
     We modify Breese et al.'s formula slightly for the case of                                                                    The maximum value has reached is 2.10, which is greater
observed accesses rather than ratings.
                                                                                                                                                             
                                                                                                                                   than the best -NN result (1.87), but not as good as the
User-Words model (2.92), discussed below.                             our sparsity reduction techniques, similarity-based smooth-
                                                                      ing and an equivalent of a user-words aspect model, can be
                            s
6.3.4 User-Words Aspect Model                                         used.

Figure 6 shows the        scores for the User-Words aspect            EM is guaranteed to reach only a local maximum of the
model recommender. Experiments include models with the                training data log-likelihood. Multiple restarts need to be
                                      s          6
number of hidden class variables ranging from 10 to 60                performed if one desires a higher quality model. We are
with an interval of 10 (two restarts were performed for each          planning to investigate ways to intelligently seed EM to
experiment). The maximum value achieved in these ex-             s    reduce the need for multiple restarts, which can be costly
periments is 2.92 for the model with 50 hidden class vari-            when fitting datasets of non-trivial size.
ables, which is significantly higher than 1.87, the best
                                                                      The user-words model does not explicitly use the popu-
value achieved with -NN algorithm.                                    larity of items. Including such information may further
                                                                      improve the quality of the recommendations made by the
                                                                      model, but requires additional work on combining and cal-
    2.8




                                                                      ibrating model predictions with document popularity.
    2.6




                                                                      Finally, predictive accuracy was used to validate our mod-
                                                                      els in this paper. We are planning to deploy our recom-
    2.4




                                                                      menders in ResearchIndex and perform a user study col-
    2.2
R




                                                                      lecting information on which recommendations are actu-
    2.0




                                                                      ally followed by users.
    1.8




                                                                      References
    1.6




          10       20           30                   40   50    60    Basu, C., Hirsh, H., & Cohen, W. (1998). Recommenda-
                                 numberOfLatentClasses
                                                                            tion as classification: Using social and content-based
Figure 6: Total utility of the ranked lists over all users produced         information in recommendation. In Proceedings of
by the User-Words aspect model.                                             the Fifteenth National Conference on Artificial Intel-
                                                                            ligence, pp. 714­720.
                                                                      Billsus, D., & Pazzani, M. J. (1998). Learning collaborative
7 CONCLUSIONS AND FUTURE WORK                                               information filters. In Proceedings of the Fifteenth
                                                                            International Conference on Machine Learning, pp.
We presented three probabilistic mixture models for recom-                  46­54.
mending items based on collaborative and content-based                Bollacker, K., Lawrence, S., & Giles, C. L. (2000). Discov-
evidence merged in a unified manner. Incorporating con-                     ering relevant scientific literature on the web. IEEE
tent into a collaborative filtering system can increase the                 Intelligent Systems, 15(2), 42­47.
flexibility and quality of the recommender. Moreover,
when data is extremely sparse--as is typical in many real-            Breese, J. S., Heckerman, D., & Kadie, C. (1998). Em-
world applications--additional content information seems                   pirical analysis of predictive algorithms for collab-
almost necessary to fit global probabilistic models at all.                orative filtering. In Proceedings of the Fourteenth
The density of ResearchIndex data is only 0.01%. Even                      Conference on Uncertainty in Artificial Intelligence,
the most active users reading the most popular articles in-                pp. 43­52.
duce a subset of density only 0.38%, still too sparse for the
                                                                      Claypool, M., Gokhale, A., & Miranda, T. (1999).
straightforward EM and TEM approaches to work. We find
                                                                           Combining content-based and collaborative filters
that a particularly good way to include content information
                                                                           in an online newspaper. In Proceedings of the
in the context of a document recommendation system is to                   ACM SIGIR Workshop on Recommender Systems--
treat users as reading words of the document, rather than
                                                                           Implementation and Evaluation.
the document itself. In our case, this increased the density
from 0.38% to almost 9%, resulting in recommendations                 Cohen, W. W., Schapire, R. E., & Singer, Y. (1999). Learn-
                
superior to -NN.                                                           ing to order things. Journal of Artificial Intelligence
                                                                           Research, 10, 243­270.
There are many areas for future research. Similar meth-
ods to those presented here might be used to recommend                Cohn, D., & Hofmann, T. (2001). The missing link - a
items such as movies which have attributes other than text.                probabilistic model of document content and hyper-
A movie can be viewed as consisting of the director and                    text connectivity. In Advances in Neural Information
the actors in it, just as a document contains words. Both of               Processing Systems, Vol. 13. The MIT Press.
Freund, Y., Iyer, R., Schapire, R. E., & Singer, Y. (1998).    Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., &
     An efficient boosting algorithm for combining pref-            Riedl, J. (1994). GroupLens: An open architecture
     erences. In Proceedings of the Fifteenth Interna-              for collaborative filtering of netnews. In Proceedings
     tional Conference on Machine Learning, pp. 170­                of the ACM Conference on Computer Supported Co-
     178.                                                           operative Work, pp. 175­186.

Goldberg, D., Nichols, D., Oki, B. M., & Terry, D. (1992).     Resnick, P., & Varian, H. R. (1997). Recommender sys-
     Using collaborative filtering to weave an information          tems. Communications of the ACM, 40(3), 56­58.
     tapestry. Communications of the ACM, 35(12), 61­
     70.                                                       Salton, G., & McGill, M. (1983). Introduction to Modern
                                                                     Information Retrieval. McGraw Hill.
Good, N., Schafer, J. B., Konstan, J. A., Borchers, A., Sar-
     war, B. M., Herlocker, J. L., & Riedl, J. (1999). Com-    Sarwar, B. M., Karypis, G., Konstan, J. A., & Riedl, J. T.
     bining collaborative filtering with personal agents for        (2000). Application of dimensionality reduction in
     better recommendations. In Proceedings of the Six-             recommender system ­ a case study. In ACM We-
     teenth National Conference on Artificial Intelligence,         bKDD Web Mining for E-Commerce Workshop.
     pp. 439­446.                                              Schafer, J. B., Konstan, J., & Riedl, J. (1999). Recom-
Heckerman, D., Chickering, D. M., Meek, C., Rounth-                 mender systems in e-commerce. In Proceedings of
     waite, R., & Kadie, C. (2000). Dependency networks             the ACM Conference on Electronic Commerce, pp.
     for collaborative filtering and data visualization. In         158­166.
     Proceedings of the Sixteenth Conference on Uncer-         Shardanand, U., & Maes, P. (1995). Social information fil-
     tainty in Artificial Intelligence, pp. 264­273.                tering: Algorithms for automating `word of mouth'.
                                                                    In Proceedings of Computer Human Interaction, pp.
Hofmann, T. (1999). Probabilistic latent semantic analy-
    sis. In Proceedings of the Fifteenth Conference on              210­217.
    Uncertainty in Artificial Intelligence, pp. 289­296.       Ungar, L. H., & Foster, D. P. (1998). Clustering methods
                                                                    for collaborative filtering. In Workshop on Recom-
Hofmann, T., & Puzicha, J. (1999). Latent class models
                                                                    mendation Systems at the Fifteenth National Confer-
    for collaborative filtering. In Proceedings of the Six-
                                                                    ence on Artificial Intelligence.
    teenth International Joint Conference on Artificial
    Intelligence, pp. 688­693.

Lawrence, S., Giles, C. L., & Bollacker, K. (1999). Digi-
     tal libraries and autonomous citation indexing. IEEE
     Computer, 32(6), 67­71.

Mooney, R. J., & Roy, L. (2000). Content-based book rec-
    ommending using learning for text categorization. In
    Proceedings of the Fifth ACM Conference on Digital
    Libraries, pp. 195­204.

Nakamura, A., & Abe, N. (1998). Collaborative filtering
    using weighted majority prediction algorithms. In
    Proceedings of the Fifteenth International Confer-
    ence on Machine Learning, pp. 395­403.

Pennock, D. M., Horvitz, E., & Giles, C. L. (2000a). So-
     cial choice theory and recommender systems: Anal-
     ysis of the axiomatic foundations of collaborative fil-
     tering. In Proceedings of the Seventeenth National
     Conference on Artificial Intelligence, pp. 729­734.

Pennock, D. M., Horvitz, E., Lawrence, S., & Giles, C. L.
     (2000b). Collaborative filtering by personality di-
     agnosis: A hybrid memory- and model-based ap-
     proach. In Proceedings of the Sixteenth Conference
     on Uncertainty in Artificial Intelligence, pp. 473­
     480.