Information about http://www.seroundtable.com/TR-2007-27.pdf

Spam Double-Funnel: Connecting Web Spammers with Advertisers…

Pages: 11
Language: english
Created: Wed Mar 14 22:30:30 2007
          Spam Double-Funnel:
Connecting Web Spammers with Advertisers


           A Strider Search Ranger Report




                   Yi-Min Wang
                     Ming Ma
                    Yuan Niu
                    Hao Chen



              Created: November 2006
           Last Updated: February 26, 2007



                 Technical Report
                 MSR-TR-2007-27




                Microsoft Research
               Microsoft Corporation
                One Microsoft Way
               Redmond, WA 98052
Spam Double-Funnel: Connecting Web Spammers with Advertisers

Yi-Min Wang, Ming Ma
Microsoft Research
Redmond, WA 98052, USA
425-882-8080
{ymwang, mingma}@microsoft.com

Yuan Niu, Hao Chen
University of California, Davis
Davis, CA 95616-8562, USA
530-754-5375
{niu, hchen}@cs.ucdavis.edu

ABSTRACT
Spammers use questionable search engine optimization (SEO) techniques to promote their
spam links into top search results. In this paper, we focus on one prevalent type of
spam - redirection spam - where one can identify spam pages by the third-party domains
that these pages redirect traffic to. We propose a five-layer, double-funnel model for
describing end-to-end redirection spam, present a methodology for analyzing the layers,
and identify prominent domains on each layer using two sets of commercial keywords - one
targeting spammers and the other targeting advertisers. The methodology and findings are
useful for search engines to strengthen their ranking algorithms against spam, for
legitimate website owners to locate and remove spam doorway pages, and for legitimate
advertisers to identify unscrupulous syndicators who serve ads on spam pages.

Categories and Subject Descriptors
H.3.5 [Information Storage and Retrieval]: Online Information Services - Commercial
services, Web-based services

General Terms: Measurement, Security, Experimentation

Keywords: Search Spam, Web Spam, Redirection and Cloaking, Advertisement Syndication

Copyright is held by the International World Wide Web Conference Committee (IW3C2).
Distribution of these papers is limited to classroom use, and personal use by others.
WWW 2007, May 8-12, 2007, Banff, Alberta, Canada.
ACM 978-1-59593-654-7/07/0005.

1. INTRODUCTION
Search spammers (or web spammers) refer to those who use questionable search engine
optimization (SEO) techniques to promote their low-quality links into top search
rankings. Common SEO techniques include stuffing keywords, creating link farms (e.g.,
large numbers of mutually linked, made-for-ads websites), posting links to spam pages as
comments at public forums (referred to as comment spamming), and using crawler-browser
cloaking techniques [8] to serve different pages to crawlers and end users. To evade
spam investigation, some spammers in recent years have started using click-through
cloaking techniques [15,22] to display bogus content to spam investigators who visit
their pages directly without clicking through any search results.

We use redirection spam to refer to the web pages that redirect browsers to visit known
spammer-controlled third-party domains. Many redirection spam pages use syndication
where they participate in pay-per-click programs and display ads-portal pages.

We motivate our work using a real example. Around mid-October 2006, the following three
doorway URLs appeared among the top-10 Live Search results for "cheap ticket":
   ·  http://-cheapticket.blogspot.com/
   ·  http://sitegtr.com/all/cheap-ticket.html
   ·  http://cheap-ticketv.blogspot.com/
All these pages appeared to be spam: they used cloaking, their URLs were posted as
comments at numerous open forums [footnote 1], and they redirected traffic to
known-spammer redirection domains vip-online-search.info, searchadv.com, and
webresourses.info. Surprisingly, ads for orbitz.com, a reputable company, appeared on
all three of these spam pages. A search using similar keywords [footnote 2] at Google
and Yahoo! revealed another two spam pages, hosted on hometown.aol.com.au and
megapage.de, that also displayed orbitz.com ads. If we believe that a reputable company
is unlikely to buy service directly from spammers, a natural question to ask is: who are
the middlemen who indirectly sell spammers' service to sites like orbitz.com?

Footnote 1: For ease of presentation, throughout the paper, we use the term "forums" to
include all blogs, bulletin boards, message boards, guest books, web journals, diaries,
galleries, archives, etc. that can be abused by web spammers to promote spam URLs.

Footnote 2: We use the terms "keyword", "query", and "search term" interchangeably in
this paper to refer to the entire query phrase that a user enters into a search box to
perform a query.

We discovered the answer by "following the money": when we clicked the orbitz.com ads on
each of the five pages and monitored the resulting HTTP traffic using the Fiddler tool
[27], we saw that the ads click-through traffic got funneled into either 64.111.210.206
or the block of IP addresses between 66.230.128.0 and 66.230.191.255 [30]. Moreover, the
chain of redirections stopped at http://r.looksmart.com, which then redirected to
orbitz.com using HTTP 302.

In this paper, we analyze end-to-end redirection spam activities comprehensively with an
emphasis on syndication-based spam. We propose a five-layer double-funnel model in which
displayed ads flow in one direction and click-through traffic flows in the other
direction. By constructing two different benchmarks of commercial search terms and using
the Strider Search Ranger system [21] to analyze tens of thousands of spam links that
appeared in top results across three major search engines, we identified the major
domains in each of the five layers and their interesting characteristics.

The paper is organized as follows. Section 2 gives an overview of the Search Ranger
system and introduces the double-funnel model. In Section 3 we construct a
spammer-targeted search benchmark. Section 4 analyzes spam density and double-funnel for
this benchmark. In Section 5 we construct an advertiser-
targeted benchmark and compare the analysis results using this benchmark with those in
Section 4. Section 6 discusses non-redirection spam that also connects to the
double-funnel model. Section 7 surveys related work, and Section 8 concludes the paper.
Since all the analyses in this paper are based on the data gathered in September and
October of 2006, some spam URLs may no longer be active.

2. REDIRECTION SPAM

2.1 Definitions: Search Spam and Redirection
SEO techniques span a wide spectrum. Since the precise boundary between legitimate SEO
techniques and search spam is often subjective and fuzzy, we focus on one type of spam -
redirection spam - which is widely used by large-scale spammers to associate many
doorway pages with a single redirection domain. These doorway pages often exhibit
similar patterns in their appearance, their cloaking and code obfuscation techniques for
avoiding detection, and the way by which their URLs appear in the comment fields of
public forums. These repeated patterns allow human investigators to judge spam pages
more easily and confidently. We will describe the exact steps in detecting spam in the
next subsection. In Sections 4 and 5, we will show that redirection spam accounts for
significant spam densities in both our benchmarks, which indicates that our spam
detection mechanism is effective in practice.

After a user instructs the browser to visit a URL (the primary URL), the browser may
visit other URLs (secondary URLs) automatically. The secondary URLs may contribute to
inline contents (e.g., Google AdSense ads) on the primary page, or may replace the
primary page entirely (i.e., they replace the URL in the address bar). We consider both
types of secondary URLs to be redirection. See [31] for screenshots of sample
redirection spam.

2.2 Strider Search Ranger System
The Strider Search Ranger system [21] is an automated spam detection system with the
following three key features:

1. Web Patrol with Search Monkeys [19] - Since search engine crawlers typically do not
execute scripts, spammers exploit this fact using crawler-browser cloaking techniques,
which serve one page to crawlers for indexing but display a different page to browser
users [8,23]. To defend against cloaking, Search Monkeys visit each web page with a
full-fledged popular browser, which executes all client-side scripts. To combat the
newer click-through cloaking technique, which serves spam content only to users who
click through search results, our monkey programs mimic the click-through by first
retrieving a search-result page to set the browser's document.referrer variable, then
inserting a link to the spam page in the search-result page, and finally clicking
through the inserted link.

2. Follow the Money through Redirection Tracking - Common approaches to detecting
"spammy" content and link structures merely catch "what" spammers are doing today. By
contrast, if we follow the money by tracking traffic redirection, we would be closer to
identifying "who" is behind spam activities, even if their spam techniques evolve.
Search Ranger uses the Strider URL Tracer [20] to intercept browser redirection traffic
at the network layer to record all redirection URLs. As Sections 4 and 5 will
demonstrate, we apply redirection analysis to tracking both the ads-fetching traffic and
the ads click-through traffic.

3. Similarity-based Grouping for Identifying Large-scale Spam - Rather than analyzing
all crawler-indexed pages, Search Ranger focuses on monitoring search results of popular
queries targeted by spammers to obtain a list of URLs with high spam densities. It then
analyzes the similarity between the redirections from these pages to identify related
pages, which are potentially operated by large-scale spammers. In its simplest form,
this similarity analysis identifies doorway pages that share the same redirection
domain. After we verify that the domain is responsible for serving the spam content, we
then use the domain as a seed to perform "backward propagation of distrust" [13] to
detect other related spam pages.

In summary, Search Ranger identifies spam URLs using the process summarized below.

Search Ranger Spam Detection Process

Step 1: Given a set of search terms and a target search engine, Search Monkeys retrieve
the top-N search results for each query, remove duplicates, and scan each unique URL to
produce an XML file that records all URL redirections.

Step 2: At the end of a batched scan, Search Ranger applies redirection analysis to all
the XML files to classify URLs that redirected to known-spammer redirection domains as
spam.

Step 3: Search Ranger groups unclassified URLs by each of the third-party domains that
received redirection traffic.

Step 4: Search Ranger submits sample URLs from each group to a spam verifier, which
gathers evidence of spam activities associated with these URLs. Specifically, the spam
verifier checks if each URL uses crawler-browser cloaking to fool search engines or uses
click-through cloaking to evade manual spam investigation. It also checks if the URL has
been widely comment-spammed at public forums.

Step 5: Search Ranger submits groups of unclassified URLs, ranked by their group sizes
and tagged by spam evidence, to human judges. Once the judges determine a group to be
spam, Search Ranger adds the redirection domains responsible for serving the spam
content to the set of known spam domains, which will be used in Step-2 classification in
future scans.

2.3 Spam Double-Funnel
A typical advertising syndication business consists of three layers: the publishers who
attract traffic by providing quality content on their websites to achieve high search
rankings, the advertisers who pay for displaying their ads on those websites, and the
syndicators who provide the advertising infrastructure to connect the publishers with
the advertisers. The Google AdSense program [29] is an example syndicator. Although some
spammers have abused the AdSense program [28], the abuse is most likely the exception
rather than the norm.

In a questionable advertising business, spammers assume the role of publishers, who set
up websites of low-quality content and use black-hat SEO techniques to attract traffic.
To better survive spam detection and blacklisting by search engines, many spammers have
split their operations into two layers. At the first layer are the doorway pages, whose
URLs the spammers promote into top search results. When users click those links, their
browsers are instructed to fetch spam content from redirection domains, which occupy the
second layer.

To attract prudent legitimate advertisers who do not want to be too closely connected to
the spammers, many syndicators have also split their operations into two or more layers,
which are
connected by multiple redirections, to obfuscate the connection between the advertisers
and the spammers. Since these syndicators are typically smaller companies, they often
join forces through traffic aggregation to attract sufficient traffic providers and
advertisers.

We model this end-to-end search spamming business with the five-layer double-funnel
illustrated in Figure 1: tens of thousands of advertisers (Layer #5) pay a handful of
syndicators (Layer #4) to display their ads. The syndicators buy traffic from a small
number of aggregators (Layer #3), who in turn buy traffic from web spammers to insulate
syndicators and advertisers from spam pages. The spammers set up hundreds to thousands
of redirection domains (Layer #2), create millions of doorway pages (Layer #1) that
fetch ads from these redirection domains, and widely spam the URLs of these doorways at
public forums. If any such URLs are promoted into top search results and are clicked by
users, all click-through traffic is funneled back through the aggregators, who then
de-multiplex the traffic to the right syndicators. Sometimes there is a chain of
redirections between the aggregators and the syndicators due to multiple layers of
traffic affiliate programs, but almost always one domain at the end of each chain is
responsible for redirecting to the target advertiser's website.

     Doorways    Doorways    Doorways    Doorways        Layer #1
        Redirection Domain     Redirection Domain        Layer #2
                      Aggregators                        Layer #3
            Syndicator             Syndicator            Layer #4
            Advertiser             Advertiser            Layer #5
     (arrows in the original figure are labeled "Ads Display" and "Click-Thru")

                 Figure 1: Spam Double-Funnel

In the case of AdSense-based spammers, the single domain googlesyndication.com plays the
role of the middle three layers, responsible for serving ads, receiving click-through
traffic, and redirecting to advertisers. Specifically, browsers fetch AdSense ads from
the redirection domain googlesyndication.com and display them on the doorway pages; ads
click-through traffic goes into the aggregator domain googlesyndication.com before
reaching advertisers' websites.

3. SPAMMER-TARGETED KEYWORDS
To study the common characteristics of redirection spam, our first step was to discover
the keywords and categories heavily targeted by redirection spammers. In this section,
we describe our methodology for deriving 10 spammer-targeted categories and a benchmark
of 1,000 keywords, which serve as the basis for the analyses presented in Section 4.

Redirection spammers often use their targeted keywords as the anchor text of their spam
links at public forums, exploiting a typical algorithm by which common search engines
index and rank URLs. For example, the anchor text for the spam URL
http://coach-handbag-top.blogspot.com/ is typically "coach handbag". Therefore, we
collect spammer-targeted keywords by extracting all the anchor text from a large number
of spammed forums and ranking the keywords by their frequencies.

Between June and August of 2006, we manually investigated spam reports from multiple
sources including search user feedback, heavily spammed forum types, online spam
discussion forums, etc. We compiled a list of 323 keywords that returned spam URLs among
the top 50 results at one of the three major search engines. We then queried these
keywords at all three search engines, extracted the top-50 results, scanned them with an
earlier version of Search Ranger, and identified 4,803 unique redirection-spam URLs.

Next, we issued a "link:" query on each of the 4,803 URLs and retrieved 35,878 unique
pages that contained at least one of these spam URLs. From these pages, we collected a
total of 1,132,099 unique keywords, with a total of 6,026,699 occurrences, and ranked
the keywords by their occurrence counts. The top-5 keywords are all drugs-related:
"phentermine" (8,117), "viagra" (6,438), "cialis" (6,053), "tramadol" (5,788), and
"xanax" (5,663). Among the top one hundred, 74 are drugs-related, 16 are
ringtone-related, and 10 are gambling-related.

Among the above 1,132,099 keywords, we could select a top list, say top 1000, for our
subsequent analyses. However, we observed that keywords related to drugs and ringtones
dominate the top-1000 list. Since it would be useful to study spammers who target
different categories, we decided to construct our benchmark by manually selecting ten of
the most prominent categories from the list. They are:
1. Drugs: phentermine, viagra, cialis, tramadol, xanax, etc.
2. Adult: porn, adult dating, sex, etc.
3. Gambling: casino, poker, roulette, texas holdem, etc.
4. Ringtone: verizon ringtones, free polyphonic ringtones, etc.
5. Money: car insurance, debt consolidation, mortgage, etc.
6. Accessories: rolex replica, authentic gucci handbag, etc.
7. Travel: southwest airlines, cheap airfare, hotels las vegas, etc.
8. Cars: bmw, dodge viper, audi monmouth new jersey, etc.
9. Music: free music downloads, music lyrics, 50 cent mp3, etc.
10. Furniture: bedroom furniture, ashley furniture, etc.
We then selected the top-100 keywords from each category to form our first benchmark of
1,000 spammer-targeted search terms.

4. REDIRECTION-SPAM ANALYSIS
In late September 2006, we submitted the 1,000 keywords to the Search Ranger system,
which retrieved the top-50 results from all three major search engines. In total, we
collected 101,585 unique URLs from 1,000x50x3=150,000 search results. With a set of
approximately 500 known-spammer redirection domains and AdSense IDs at that time, the
system identified 12,635 unique spam URLs, which accounted for 11.6% of all the top-50
appearances. (The actual redirection-spam density should be higher because some of the
doorway pages had been deactivated and no longer caused URL redirections when we scanned
them.) We first give a brief analysis of per-category spam densities in Section 4.1 and
then focus on the double-funnel analysis for the remainder of this section.

4.1 Spam Density Analysis
Figure 2 compares the per-category spam densities across the 10 spammer-targeted
categories. The numbers range from 2.7% for Money to 30.8% for Drugs. Two categories,
Drugs and Ringtone, are well above twice the average (shown on the far right). Three
categories - Money, Cars, and Furniture - are well
below half the average. We also calculated DCG (Discounted Cumulated Gain) [10] spam
densities, which give more weight to spam URLs appearing near the top of the
search-result list, but found no significant difference from Figure 2.

[Bar chart residue; y-axis: "# of Spam Appearance". Bar labels in order: 3,882, 493, 396,
296, 242, 225, 218, 207, 178, 172, 150, 131, 124, 123, 110. The x-axis domain labels did
not survive extraction.]
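The two density measures discussed in Section 4.1 can be illustrated with a minimal sketch. It assumes each query's results are given as a rank-ordered list of booleans (True for a spam URL), and it assumes the common 1/log2(rank + 1) discount for the DCG-style weighting. The paper cites [10] for DCG but does not spell out its exact weighting, so the function names and the discount below are illustrative assumptions, not the authors' implementation.

```python
import math
from typing import Sequence


def spam_density(ranked_flags: Sequence[bool]) -> float:
    """Plain density: fraction of result appearances flagged as spam."""
    if not ranked_flags:
        return 0.0
    return sum(ranked_flags) / len(ranked_flags)


def dcg_spam_density(ranked_flags: Sequence[bool]) -> float:
    """Rank-weighted density: spam near the top of the list counts more.

    Assumption: each rank r (1-based) gets weight 1 / log2(r + 1), and the
    weighted spam mass is normalized by the total weight of all ranks.
    """
    weights = [1.0 / math.log2(rank + 1) for rank in range(1, len(ranked_flags) + 1)]
    total = sum(weights)
    if total == 0:
        return 0.0
    spam_weight = sum(w for w, is_spam in zip(weights, ranked_flags) if is_spam)
    return spam_weight / total


if __name__ == "__main__":
    top_heavy = [True, False, False, False]     # spam at rank 1
    bottom_heavy = [False, False, False, True]  # spam at rank 4
    print(spam_density(top_heavy), dcg_spam_density(top_heavy))
    print(spam_density(bottom_heavy), dcg_spam_density(bottom_heavy))
```

Under these assumptions, both example lists have the same plain density, while the weighted density is higher when the spam URL sits at rank 1 than when it sits at rank 4. If spam is spread roughly evenly across ranks, the two measures come out close to each other.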
[Figure 2 residue; bar chart, y-axis: "Per-Category Spam Density". Surviving bar labels:
30.8%, 27.5%, 14.2%, 11.6%.]




                                                                                                                                                                                ce
                                                                                                                                                                                es
                                                                                                                                                                              n.a
                                                                                              9.7%




                                                                                                                                                                                 a
                                                8.9%




                                                                                                                                                                                o




                                                                                                                                                                             eg
                                                                                                                                                                              eb
                                                                      7.6% 7.8%




                                                                                                                                                                            ri n
                                                                                                                                                                            er .




                                                                                                                                                                            g.h
                                                                                                                                                                            us
                            10%




                                                                                                                                                                           pa
                                                                                                                                                                           ag
                                                                                                                                                                              .
                                                                                                                                                                          gs


                                                                                                                                                                        tow




                                                                                                                                                                         s it
                                                                                                                                                                        ew
                                                                                                                                                                        t sc
                                                                                  3.3% 3.9%




                                                                                                                                                                        ha
                                                               2.7%




                                                                                                                                                                      blo
                                                                                                                                                                       os
                                                                                                                                                                       xp
                                                                                                                                                                       gs
                             5%




                                                                                                                                                               blo




                                                                                                                                                                    ho




                                                                                                                                                                    gs
                                                                                                                                                                  me


                                                                                                                                                                   xo




                                                                                                                                                                  for
                                                                                                                                                                  ma
                                                                                                                                                                 blo
                            0%




                                                                                                                                                                blo
                                                                                                                                                                ho

                                                                                                                                                                ho
                                  am ul t




                                            el
                                  es ey




                                 Fu ar s

                                            re



                                           ge
                                             g




                                            ic
                                           gs




                                             s
                                             e
                                        on




                                         rie
                                          in




                                        av




                                        us
                                       itu
                                                                                                                                       Figure 3: Layer #1: top-15 primary domains/sites by spam
                                       on
                                         d




                                       ra
                                       ru




                                       C
                                       bl
                                      A



                                      gt



                                     so

                                     Tr




                                     M

                                    ve
                                    rn
                            D




                                    M
                                   in




                                                                                                                                       doorway appearance counts



                                  A
                                 R
                                G




                               cc
                             A




                                                                                                                                          Figure 3 is useful for search engines to identify spam-heavy
                                                 Spammer-targeted Categories
                                                                                                                                       sites to scrutinize their URLs. Figure 4 shows that 14 of the top-
                                                                                                                                       15 doorway domains have a spam percentage3 higher than 74%;
Figure 2: Per-category and average redirection-spam densities
                                                                                                                                       that is, 3 out of 4 unique URLs on these domains (that appeared in
4.2 Double-Funnel Analysis                                                                                                             our search results) were detected as spam. To demonstrate the
                                                                                                                                       need for scrutinizing these sites, we scanned the top-1000 results
  We now analyze the five layers of the double-funnel, identify                                                                        from two queries ­ "site:blogspot.com phentermine" and
major domains involved at each layer, and categorize them to                                                                           "site:hometown.aol.com ringtone" ­ and identified more than half
provide insights into the current trends of search spamming.                                                                           of the URLs as spam easily. It is in the interest of the owners of
4.2.1 Layer #1: Doorway Domains                                                                                                        these legitimate websites to clean the heavy spam on their sites to
                                                                                                                                       avoid the reputation of spam magnets. We note that not all large,
   Figure 3 illustrates the top-15 primary domains/hosts by the                                                                        well-established web hosting sites are heavily abused by
occurrences of doorway URLs hosted on them. The first one is                                                                           spammers. For example, in our data, each of tripod.com (#19),
blogspot.com, with 3,882 appearances (of 2,244 unique doorway                                                                          geocities.com (#32), and angelfire.com (#38) had fewer spam
URLs), which is an order of magnitude higher than the others in                                                                        appearances than some newer, smaller web sites that rank among
the chart. This translates into a 2.6% spam density by blogspot                                                                        the top 15 in Figure 3.
URLs alone, which is around 22% of all detected spam
appearances. (By comparison, the last one in the chart                                                                                                                                     91%                 95%         99%
                                                                                                                                                                                                                                           93%
                                                                                                                                                                                                                                                 100% 95% 100%
                                                                                                             % URLs Detected as Spam




                                                                                                                                       100%                                          84%                                         81% 85%
blog.hlx.com has 110 occurrences of 61 unique URLs.) Typically,                                                                                                         77% 74%                      78% 77%
                                                                                                                                        80%
spammers         create       spam       blogs,       such     as                                                                                                                                                    52%
                                                                                                                                                    60%
http://PhentermineNoPrescriptionn.blogspot.com, and use these
                                                                                                                                                    40%
doorway URLs to spam the comment area of other forums. Since
                                                                                                                                                    20%
#2, #3, #4, and #7 in Figure 3 all belong to the same company, an
                                                                                                                                                              0%
alternative analysis would be to combine their numbers, resulting
                                                                                                                                                                                               m




                                                                                                                                                                                            .i t




                                                                                                                                                                                            m
                                                                                                                                                                                             v
                                                                                                                                                                                           rg




in 1,403 occurrences (0.9% density) of 948 unique URLs.
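
As a rough consistency check on the blogspot.com figures quoted in Section 4.2.1, the following Python sketch back-computes the totals they imply. It assumes that "spam density" means spam appearances divided by all search results examined; this is an interpretation, not a definition quoted from the report. The function and variable names are illustrative only, and because the quoted percentages are rounded, the derived totals are approximate.

```python
# Counts and percentages quoted in Section 4.2.1 (percentages are rounded).
BLOGSPOT_APPEARANCES = 3882      # doorway appearances on blogspot.com
BLOGSPOT_UNIQUE_URLS = 2244      # unique doorway URLs on blogspot.com
BLOGSPOT_DENSITY = 0.026         # 2.6% spam density from blogspot alone
BLOGSPOT_SHARE_OF_SPAM = 0.22    # ~22% of all detected spam appearances
COMBINED_APPEARANCES = 1403      # sites #2, #3, #4 and #7 combined
COMBINED_UNIQUE_URLS = 948       # unique URLs on those four sites


def implied_total(count: int, fraction: float) -> float:
    """Back out the total that makes `count` equal to `fraction` of it."""
    return count / fraction


# ASSUMPTION: density = spam appearances / all search results examined.
total_results = implied_total(BLOGSPOT_APPEARANCES, BLOGSPOT_DENSITY)
total_spam = implied_total(BLOGSPOT_APPEARANCES, BLOGSPOT_SHARE_OF_SPAM)

# Density of the combined sites, using the total implied by blogspot.
combined_density = COMBINED_APPEARANCES / total_results

# Average number of appearances per unique doorway URL.
blogspot_repeat = BLOGSPOT_APPEARANCES / BLOGSPOT_UNIQUE_URLS
combined_repeat = COMBINED_APPEARANCES / COMBINED_UNIQUE_URLS

print(f"implied search results examined: {total_results:,.0f}")
print(f"implied spam appearances:        {total_spam:,.0f}")
print(f"combined density:                {combined_density:.1%}")
print(f"appearances per unique URL:      {blogspot_repeat:.2f} (blogspot), "
      f"{combined_repeat:.2f} (combined)")
```

Under this assumption the four combined sites come out at about 0.9%, which matches the density quoted in the text and suggests the interpretation is at least self-consistent.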




   The top-15 domains can be divided into four categories: five
are free blog/forum hosting sites, five are free web-hosting sites
in English, three appear to be free web-hosting sites in foreign
languages, and the remaining two (oas.org and usaid.gov) are
Universal Redirectors, which take an arbitrary URL as an
argument and redirect the browser to that URL [15]. For example,
the known-spammer domain paysefeed.net, which appears to be
exploiting tens of universal redirectors, was behind the following
spam URLs:
http://www.oas.org/main/main.asp?slang=s&slink=http://dir.kzn.ru/hydrocodone/
and
http://www.usaid.gov/cgi-bin/goodbye?http://catalog-online.kzn.ru/free/verizon-ringtones/.
We note that none of these 15 sites hosts only spam, so they
cannot simply be blacklisted by search engines. This confirms the
anecdotal evidence that a significant portion of the web spam
industry has moved towards setting up "throw-away" doorway pages
on legitimate domains, which then redirect to their
behind-the-scenes redirection domains, to be discussed in the
next subsection.

[Bar chart omitted; y-axis: % URLs Detected as Spam]
Figure 4: Layer #1: top doorway domains and their spam
percentages (among the search results in our data)

Spam Pages on .gov and .edu Domains
   When a site within a non-commercial top-level domain, such as
.gov or .edu, occurs prominently in the search results of
spammer-targeted commercial search terms, it often indicates that
the site has been spammed. Figure 5 illustrates the 15 .gov/.edu
domains that host the largest number of spam URLs in our data.
These URLs can be divided into three categories:
3 We note that "spam percentage" is calculated on a per-domain
  basis and is defined as the number of unique spam URLs