












Spam Double-Funnel:
Connecting Web Spammers with Advertisers
A Strider Search Ranger Report
Yi-Min Wang
Ming Ma
Yuan Niu
Hao Chen
Created: November 2006
Last Updated: February 26, 2007
Technical Report
MSR-TR-2007-27
Microsoft Research
Microsoft Corporation
One Microsoft Way
Redmond, WA 98052
Yi-Min Wang, Ming Ma
Microsoft Research
Redmond, WA 98052, USA
425-882-8080
{ymwang, mingma}@microsoft.com

Yuan Niu, Hao Chen
University of California, Davis
Davis, CA 95616-8562, USA
530-754-5375
{niu, hchen}@cs.ucdavis.edu
ABSTRACT
Spammers use questionable search engine optimization (SEO) techniques to promote their spam links into top search results. In this paper, we focus on one prevalent type of spam, redirection spam, where one can identify spam pages by the third-party domains that these pages redirect traffic to. We propose a five-layer, double-funnel model for describing end-to-end redirection spam, present a methodology for analyzing the layers, and identify prominent domains on each layer using two sets of commercial keywords: one targeting spammers and the other targeting advertisers. The methodology and findings are useful for search engines to strengthen their ranking algorithms against spam, for legitimate website owners to locate and remove spam doorway pages, and for legitimate advertisers to identify unscrupulous syndicators who serve ads on spam pages.

Categories and Subject Descriptors
H.3.5 [Information Storage and Retrieval]: Online Information Services - Commercial services, Web-based services

General Terms: Measurement, Security, Experimentation

Keywords: Search Spam, Web Spam, Redirection and Cloaking, Advertisement Syndication

Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others.
WWW 2007, May 8-12, 2007, Banff, Alberta, Canada.
ACM 978-1-59593-654-7/07/0005.

1. INTRODUCTION
Search spammers (or web spammers) are those who use questionable search engine optimization (SEO) techniques to promote their low-quality links into top search rankings. Common SEO techniques include stuffing keywords, creating link farms (e.g., large numbers of mutually linked, made-for-ads websites), posting links to spam pages as comments at public forums (referred to as comment spamming), and using crawler-browser cloaking techniques [8] to serve different pages to crawlers and end users. To evade spam investigation, some spammers in recent years have started using click-through cloaking techniques [15,22] to display bogus content to spam investigators who visit their pages directly without clicking through any search results.

We use redirection spam to refer to web pages that redirect browsers to visit known spammer-controlled third-party domains. Many redirection spam pages use syndication, where they participate in pay-per-click programs and display ads-portal pages.

We motivate our work using a real example. Around mid-October 2006, the following three doorway URLs appeared among the top-10 Live Search results for "cheap ticket":
· http://-cheapticket.blogspot.com/
· http://sitegtr.com/all/cheap-ticket.html
· http://cheap-ticketv.blogspot.com/

All these pages appeared to be spam: they used cloaking, their URLs were posted as comments at numerous open forums¹, and they redirected traffic to the known-spammer redirection domains vip-online-search.info, searchadv.com, and webresourses.info. Surprisingly, ads for orbitz.com, a reputable company, appeared on all three of these spam pages. A search using similar keywords² at Google and Yahoo! revealed another two spam pages, hosted on hometown.aol.com.au and megapage.de, that also displayed orbitz.com ads. If we believe that a reputable company is unlikely to buy service directly from spammers, a natural question to ask is: who are the middlemen who indirectly sell spammers' service to sites like orbitz.com?

¹ For ease of presentation, throughout the paper, we use the term "forums" to include all blogs, bulletin boards, message boards, guest books, web journals, diaries, galleries, archives, etc. that can be abused by web spammers to promote spam URLs.
² We use the terms "keyword", "query", and "search term" interchangeably in this paper to refer to the entire query phrase that a user enters into a search box to perform a query.

We discovered the answer by "following the money": when we clicked the orbitz.com ads on each of the five pages and monitored the resulting HTTP traffic using the Fiddler tool [27], we saw that the ads click-through traffic got funneled into either 64.111.210.206 or the block of IP addresses between 66.230.128.0 and 66.230.191.255 [30]. Moreover, the chain of redirections stopped at http://r.looksmart.com, which then redirected to orbitz.com using HTTP 302.

In this paper, we analyze end-to-end redirection spam activities comprehensively with an emphasis on syndication-based spam. We propose a five-layer double-funnel model in which displayed ads flow in one direction and click-through traffic flows in the other direction. By constructing two different benchmarks of commercial search terms and using the Strider Search Ranger system [21] to analyze tens of thousands of spam links that appeared in top results across three major search engines, we identified the major domains in each of the five layers and their interesting characteristics.

The paper is organized as follows. Section 2 gives an overview of the Search Ranger system and introduces the double-funnel model. In Section 3 we construct a spammer-targeted search benchmark. Section 4 analyzes spam density and double-funnel for this benchmark. In Section 5 we construct an advertiser-
targeted benchmark and compare the analysis results using this benchmark with those in Section 4. Section 6 discusses non-redirection spam that also connects to the double-funnel model. Section 7 surveys related work, and Section 8 concludes the paper. Since all the analyses in this paper are based on the data gathered in September and October of 2006, some spam URLs may no longer be active.

2. REDIRECTION SPAM

2.1 Definitions: Search Spam and Redirection
SEO techniques span a wide spectrum. Since the precise boundary between legitimate SEO techniques and search spam is often subjective and fuzzy, we focus on one type of spam, redirection spam, which is widely used by large-scale spammers to associate many doorway pages with a single redirection domain. These doorway pages often exhibit similar patterns in their appearance, their cloaking and code obfuscation techniques for avoiding detection, and the way by which their URLs appear in the comment fields of public forums. These repeated patterns allow human investigators to judge spam pages more easily and confidently. We will describe the exact steps in detecting spam in the next subsection. In Sections 4 and 5, we will show that redirection spam accounts for significant spam densities in both our benchmarks, which indicates that our spam detection mechanism is effective in practice.

After a user instructs the browser to visit a URL (the primary URL), the browser may visit other URLs (secondary URLs) automatically. The secondary URLs may contribute to inline contents (e.g., Google AdSense ads) on the primary page, or may replace the primary page entirely (i.e., they replace the URL in the address bar). We consider both these types of secondary URLs redirection. See [31] for screenshots of sample redirection spam.

2.2 Strider Search Ranger System
The Strider Search Ranger system [21] is an automated spam detection system with the following three key features:

1. Web Patrol with Search Monkeys [19] - Since search engine crawlers typically do not execute scripts, spammers exploit this fact using crawler-browser cloaking techniques, which serve one page to crawlers for indexing but display a different page to browser users [8,23]. To defend against cloaking, Search Monkeys visit each web page with a full-fledged popular browser, which executes all client-side scripts. To combat the newer click-through cloaking technique, which serves spam content only to users who click through search results, our monkey programs mimic the click-through by first retrieving a search-result page to set the browser's document.referrer variable, then inserting a link to the spam page in the search-result page, and finally clicking through the inserted link.

2. Follow the Money through Redirection Tracking - Common approaches to detecting "spammy" content and link structures merely catch "what" spammers are doing today. By contrast, if we follow the money by tracking traffic redirection, we would be closer to identifying "who" are behind spam activities, even if their spam techniques evolve. Search Ranger uses the Strider URL Tracer [20] to intercept browser redirection traffic at the network layer to record all redirection URLs. As Sections 4 and 5 will demonstrate, we apply redirection analysis to tracking both the ads-fetching traffic and the ads click-through traffic.

3. Similarity-based Grouping for Identifying Large-scale Spam - Rather than analyzing all crawler-indexed pages, Search Ranger focuses on monitoring search results of popular queries targeted by spammers to obtain a list of URLs with high spam densities. It then analyzes the similarity between the redirections from these pages to identify related pages, which are potentially operated by large-scale spammers. In its simplest form, this similarity analysis identifies doorway pages that share the same redirection domain. After we verify that the domain is responsible for serving the spam content, we then use the domain as a seed to perform "backward propagation of distrust" [13] to detect other related spam pages.

In summary, Search Ranger identifies spam URLs using the following process.

Search Ranger Spam Detection Process

Step 1: Given a set of search terms and a target search engine, Search Monkeys retrieve the top-N search results for each query, remove duplicates, and scan each unique URL to produce an XML file that records all URL redirections.

Step 2: At the end of a batched scan, Search Ranger applies redirection analysis to all the XML files to classify URLs that redirected to known-spammer redirection domains as spam.

Step 3: Search Ranger groups unclassified URLs by each of the third-party domains that received redirection traffic.

Step 4: Search Ranger submits sample URLs from each group to a spam verifier, which gathers evidence of spam activities associated with these URLs. Specifically, the spam verifier checks if each URL uses crawler-browser cloaking to fool search engines or uses click-through cloaking to evade manual spam investigation. It also checks if the URL has been widely comment-spammed at public forums.

Step 5: Search Ranger submits groups of unclassified URLs, ranked by their group sizes and tagged by spam evidence, to human judges. Once the judges determine a group to be spam, Search Ranger adds the redirection domains responsible for serving the spam content to the set of known spam domains, which will be used in Step-2 classification in future scans.

2.3 Spam Double-Funnel
A typical advertising syndication business consists of three layers: the publishers who attract traffic by providing quality content on their websites to achieve high search rankings, the advertisers who pay for displaying their ads on those websites, and the syndicators who provide the advertising infrastructure to connect the publishers with the advertisers. The Google AdSense program [29] is an example syndicator. Although some spammers have abused the AdSense program [28], the abuse is most likely the exception rather than the norm.

In a questionable advertising business, spammers assume the role of publishers, who set up websites of low-quality content and use black-hat SEO techniques to attract traffic. To better survive spam detection and blacklisting by search engines, many spammers have split their operations into two layers. At the first layer are the doorway pages, whose URLs the spammers promote into top search results. When users click those links, their browsers are instructed to fetch spam content from redirection domains, which occupy the second layer.

To attract prudent legitimate advertisers who do not want to be too closely connected to the spammers, many syndicators have also split their operations into two or more layers, which are
connected by multiple redirections, to obfuscate the connection between the advertisers and the spammers. Since these syndicators are typically smaller companies, they often join forces through traffic aggregation to attract sufficient traffic providers and advertisers.

We model this end-to-end search spamming business with the five-layer double-funnel illustrated in Figure 1: tens of thousands of advertisers (Layer #5) pay a handful of syndicators (Layer #4) to display their ads. The syndicators buy traffic from a small number of aggregators (Layer #3), who in turn buy traffic from web spammers to insulate syndicators and advertisers from spam pages. The spammers set up hundreds to thousands of redirection domains (Layer #2), create millions of doorway pages (Layer #1) that fetch ads from these redirection domains, and widely spam the URLs of these doorways at public forums. If any such URLs are promoted into top search results and are clicked by users, all click-through traffic is funneled back through the aggregators, who then de-multiplex the traffic to the right syndicators. Sometimes there is a chain of redirections between the aggregators and the syndicators due to multiple layers of traffic affiliate programs, but almost always one domain at the end of each chain is responsible for redirecting to the target advertiser's website.

[Figure 1 diagram, layers from top to bottom: Doorways (Layer #1), Redirection Domains (Layer #2), Aggregators (Layer #3), Syndicators (Layer #4), Advertisers (Layer #5); the two flows between layers are labeled "Ads Display" and "Click-Thru".]
Figure 1: Spam Double-Funnel

In the case of AdSense-based spammers, the single domain googlesyndication.com plays the role of the middle three layers, responsible for serving ads, receiving click-through traffic, and redirecting to advertisers. Specifically, browsers fetch AdSense ads from the redirection domain googlesyndication.com and display them on the doorway pages; ads click-through traffic goes into the aggregator domain googlesyndication.com before reaching advertisers' websites.

3. SPAMMER-TARGETED KEYWORDS
To study the common characteristics of redirection spam, our first step was to discover the keywords and categories heavily targeted by redirection spammers. In this section, we describe our methodology for deriving 10 spammer-targeted categories and a benchmark of 1,000 keywords, which serve as the basis for the analyses presented in Section 4.

Redirection spammers often use their targeted keywords as the anchor text of their spam links at public forums, exploiting a typical algorithm by which common search engines index and rank URLs. For example, the anchor text for the spam URL http://coach-handbag-top.blogspot.com/ is typically "coach handbag". Therefore, we collect spammer-targeted keywords by extracting all the anchor text from a large number of spammed forums and ranking the keywords by their frequencies.

Between June and August of 2006, we manually investigated spam reports from multiple sources including search user feedback, heavily spammed forum types, online spam discussion forums, etc. We compiled a list of 323 keywords that returned spam URLs among the top 50 results at one of the three major search engines. We then queried these keywords at all three search engines, extracted the top-50 results, scanned them with an earlier version of Search Ranger, and identified 4,803 unique redirection-spam URLs.

Next, we issued a "link:" query on each of the 4,803 URLs and retrieved 35,878 unique pages that contained at least one of these spam URLs. From these pages, we collected a total of 1,132,099 unique keywords, with a total of 6,026,699 occurrences, and ranked the keywords by their occurrence counts. The top-5 keywords are all drugs-related: "phentermine" (8,117), "viagra" (6,438), "cialis" (6,053), "tramadol" (5,788), and "xanax" (5,663). Among the top one hundred, 74 are drugs-related, 16 are ringtone-related, and 10 are gambling-related.

Among the above 1,132,099 keywords, we could select a top list, say top 1000, for our subsequent analyses. However, we observed that keywords related to drugs and ringtones dominate the top-1000 list. Since it would be useful to study spammers who target different categories, we decided to construct our benchmark by manually selecting ten of the most prominent categories from the list. They are:
1. Drugs: phentermine, viagra, cialis, tramadol, xanax, etc.
2. Adult: porn, adult dating, sex, etc.
3. Gambling: casino, poker, roulette, texas holdem, etc.
4. Ringtone: verizon ringtones, free polyphonic ringtones, etc.
5. Money: car insurance, debt consolidation, mortgage, etc.
6. Accessories: rolex replica, authentic gucci handbag, etc.
7. Travel: southwest airlines, cheap airfare, hotels las vegas, etc.
8. Cars: bmw, dodge viper, audi monmouth new jersey, etc.
9. Music: free music downloads, music lyrics, 50 cent mp3, etc.
10. Furniture: bedroom furniture, ashley furniture, etc.

We then selected the top-100 keywords from each category to form our first benchmark of 1,000 spammer-targeted search terms.

4. REDIRECTION-SPAM ANALYSIS
In late September 2006, we submitted the 1,000 keywords to the Search Ranger system, which retrieved the top-50 results from all three major search engines. In total, we collected 101,585 unique URLs from 1,000x50x3=150,000 search results. With a set of approximately 500 known-spammer redirection domains and AdSense IDs at that time, the system identified 12,635 unique spam URLs, which accounted for 11.6% of all the top-50 appearances. (The actual redirection-spam density should be higher because some of the doorway pages had been deactivated and were no longer causing URL redirections when we scanned them.) We first give a brief analysis of per-category spam densities in Section 4.1 and then focus on the double-funnel analysis for the remainder of this section.

4.1 Spam Density Analysis
Figure 2 compares the per-category spam densities across the 10 spammer-targeted categories. The numbers range from 2.7% for Money to 30.8% for Drugs. Two categories, Drugs and Ringtone, are well above twice the average (shown on the far right). Three categories (Money, Cars, and Furniture) are well
below half the average. We also calculated DCG (Discounted Cumulated Gain) [10] spam densities, which give more weight to spam URLs appearing near the top of the search-result list, but found no significant difference from Figure 2.
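
As a rough illustration of the two density measures discussed above, the sketch below computes a plain spam density (spam appearances divided by all result appearances) and a rank-discounted variant. The function names, the boolean-list input format, and the 1/log2(rank + 1) discount are assumptions made for this example; the exact DCG formulation used in the analysis is the one defined in [10] and may differ.

```python
import math
from typing import List, Sequence


def spam_density(rankings: Sequence[Sequence[bool]]) -> float:
    """Fraction of all search-result appearances that were flagged as spam.

    Each inner sequence is one query's ranked result list; True marks a
    result that was classified as spam.
    """
    flags: List[bool] = [flag for ranking in rankings for flag in ranking]
    return sum(flags) / len(flags) if flags else 0.0


def dcg_spam_density(rankings: Sequence[Sequence[bool]]) -> float:
    """Rank-discounted spam density.

    Assumption for this sketch: a result at 1-based rank r gets weight
    1 / log2(r + 1), so spam near the top of a list counts more than
    spam near the bottom. The weighted spam mass is normalized by the
    total weight of all results.
    """
    spam_weight = 0.0
    total_weight = 0.0
    for ranking in rankings:
        for rank, is_spam in enumerate(ranking, start=1):
            weight = 1.0 / math.log2(rank + 1)
            total_weight += weight
            if is_spam:
                spam_weight += weight
    return spam_weight / total_weight if total_weight else 0.0


if __name__ == "__main__":
    # Two hypothetical queries with four results each.
    example = [
        [True, False, False, False],   # spam at rank 1
        [False, False, False, True],   # spam at rank 4
    ]
    print(f"plain density: {spam_density(example):.3f}")
    print(f"discounted density: {dcg_spam_density(example):.3f}")
```

When spam is spread evenly over ranks, the two measures stay close to each other, which is consistent with the observation that the DCG-based densities did not differ significantly from the plain ones.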
[Figure 2 bar chart, titled "Per-Category Spam Density": x-axis "Spammer-targeted Categories", with the average on the far right; y-axis from 0% to 35%.]
Figure 2: Per-category and average redirection-spam densities

4.2 Double-Funnel Analysis
We now analyze the five layers of the double-funnel, identify major domains involved at each layer, and categorize them to provide insights into the current trends of search spamming.

4.2.1 Layer #1: Doorway Domains
Figure 3 illustrates the top-15 primary domains/hosts by the occurrences of doorway URLs hosted on them. The first one is blogspot.com, with 3,882 appearances (of 2,244 unique doorway URLs), which is an order of magnitude higher than the others in the chart. This translates into a 2.6% spam density by blogspot URLs alone, which is around 22% of all detected spam appearances. (By comparison, the last one in the chart, blog.hlx.com, has 110 occurrences of 61 unique URLs.) Typically, spammers create spam blogs, such as http://PhentermineNoPrescriptionn.blogspot.com, and use these doorway URLs to spam the comment area of other forums. Since #2, #3, #4, and #7 in Figure 3 all belong to the same company, an alternative analysis would be to combine their numbers, resulting in 1,403 occurrences (0.9% density) of 948 unique URLs.

The top-15 domains can be divided into four categories: five are free blog/forum hosting sites, five are free web-hosting sites in English, three appear to be free web-hosting sites in foreign languages, and the remaining two (oas.org and usaid.gov) are Universal Redirectors, which take an arbitrary URL as an argument and redirect the browser to that URL [15]. For example, the known-spammer domain paysefeed.net, which appears to be exploiting tens of universal redirectors, was behind the following spam URLs: http://www.oas.org/main/main.asp?slang=s&slink=http://dir.kzn.ru/hydrocodone/ and http://www.usaid.gov/cgi-bin/goodbye?http://catalog-online.kzn.ru/free/verizon-ringtones/. We note that none of these 15 sites hosts only spam and therefore cannot simply be blacklisted by search engines. This confirms the anecdotal evidence that a significant portion of the web spam industry has moved towards setting up "throw-away" doorway pages on legitimate domains, which then redirect to their behind-the-scenes redirection domains, to be discussed in the next subsection.

[Figure 3 bar chart: spam appearance counts for the top-15 doorway domains, in descending order from 3,882 (blogspot.com) to 110 (blog.hlx.com).]
Figure 3: Layer #1: top-15 primary domains/sites by spam doorway appearance counts

Figure 3 is useful for search engines to identify spam-heavy sites to scrutinize their URLs. Figure 4 shows that 14 of the top-15 doorway domains have a spam percentage³ higher than 74%; that is, 3 out of 4 unique URLs on these domains (that appeared in our search results) were detected as spam. To demonstrate the need for scrutinizing these sites, we scanned the top-1000 results from two queries, "site:blogspot.com phentermine" and "site:hometown.aol.com ringtone", and easily identified more than half of the URLs as spam. It is in the interest of the owners of these legitimate websites to clean the heavy spam on their sites to avoid the reputation of spam magnets. We note that not all large, well-established web hosting sites are heavily abused by spammers. For example, in our data, each of tripod.com (#19), geocities.com (#32), and angelfire.com (#38) had fewer spam appearances than some newer, smaller web sites that rank among the top 15 in Figure 3.

[Figure 4 bar chart: y-axis "% URLs Detected as Spam" (0% to 100%) for the top doorway domains.]
Figure 4: Layer #1: top doorway domains and their spam percentages (among the search results in our data)

Spam Pages on .gov and .edu Domains
When a site within a non-commercial top-level domain, such as .gov and .edu, occurs prominently in the search results of spammer-targeted commercial search terms, it often indicates that the site has been spammed. Figure 5 illustrates the 15 .gov/.edu domains that host the largest number of spam URLs in our data. These URLs can be divided into three categories:
3
We note that "spam percentage" is calculated on a per-domain
basis and is defined as the number of unique spam URLs