Towards Quantification of Network-Based Information Leaks via HTTP
Kevin Borders
Web Tap Security, Inc.
White Lake, MI 48383
kborders@webtapsecurity.com

Atul Prakash
University of Michigan
Ann Arbor, MI 48109
aprakash@eecs.umich.edu

Abstract
As the Internet grows and network bandwidth continues to increase, administrators are faced with the task of
keeping confidential information from leaving their networks. Today's network traffic is so voluminous that manual
inspection would be unreasonably expensive. In response, researchers have created data loss prevention systems that
check outgoing traffic for known confidential information. These systems stop naïve adversaries from leaking data,
but are fundamentally unable to identify encrypted or obfuscated information leaks. What remains is a wide open
pipe for sending encrypted data to the Internet.
We present an approach for quantifying network-based information leaks. Instead of trying to detect the presence of
sensitive data--an impossible task in the general case--our goal is to measure and constrain its maximum volume.
We take advantage of the insight that most network traffic is repeated or determined by external information, such as
protocol specifications or messages sent by a server. By discounting this data, we can isolate and quantify true
information leakage. In this paper, we present leak measurement algorithms for the Hypertext Transfer Protocol
(HTTP), the main protocol for web browsing. When applied to real web traffic from different scenarios, the
algorithms show a reduction of 94–99.7% over a raw measurement and are able to effectively isolate true
information flow.
1. Introduction

Network-based information leaks pose a serious threat to confidentiality. They are the primary means by
which hackers extract data from compromised computers. The network can also serve as an avenue for
insider leaks, which, according to a 2007 CSI/FBI survey, are the most prevalent security threat for
organizations [5]. Because the volume of legitimate network traffic is so large, it is easy for attackers to
blend in with normal activity, making leak prevention difficult. In one experiment, a single computer
browsing a social networking site for 30 minutes generated over 1.3 MB of legitimate request data--the
equivalent of about 195,000 credit card numbers. Manually analyzing network traffic for leaks would be
unreasonably expensive and error-prone. Limiting network traffic based on the raw byte count would only
help detect large information leaks due to the volume of normal traffic.

In response to the threat of network-based information leaks, researchers have developed data-loss
prevention (DLP) systems [6, 8]. DLP systems work by searching through outbound network traffic for known
sensitive information, such as credit card and social security numbers. Some even catalog sensitive
documents and look for excerpts in outbound traffic. Although they are effective at stopping accidental
and plain-text leaks, DLP systems are fundamentally unable to detect encrypted information flows. They
leave an open channel for leaking encrypted data to the Internet.

In this paper, we introduce a new approach for precisely quantifying network-based information leaks.
Rather than searching for known sensitive data--an impossible task in the general case--we aim to measure
and constrain its maximum volume. We exploit the fact that a large portion of network traffic is repeated
or constrained by protocol specifications. By ignoring this fixed data, we can isolate true information
flows from a client to the Internet, regardless of encryption or obfuscation. The algorithms presented in
this paper yield results that are 94–99.7% smaller than raw request sizes for several common browsing
scenarios, and 75–97% smaller than a simple calculation from prior research [1]. They also measure
information content irrespective of data hiding techniques. The end result is a small, evasion-proof
bandwidth measurement that can precisely quantify and isolate network-based information leaks.

The leak measurement techniques in this paper focus on the Hypertext Transfer Protocol (HTTP), the main
protocol for web browsing. They take advantage of HTTP and its interaction with Hypertext Markup
Language (HTML) documents to identify information originating from the user. The basic idea is to compute
the content of expected HTTP requests using only externally available information, including previous
network requests, previous server responses, and protocol specifications. Then, the amount of
unconstrained outbound bandwidth is equal to the edit distance (edit distance is the size of the edit list
required to transform one string into another) between actual and expected requests, plus timing
information. This unconstrained bandwidth measurement represents the maximum amount of information that
could have been leaked by the client, while minimizing the influence of repeated or constrained data.

The reasons that we chose to focus on HTTP in this paper are twofold. First, it is the primary protocol
for web browsing. Many networks, particularly those in which confidentiality is a high priority, will only
allow outbound HTTP traffic through a proxy server and block other protocols. In this scenario, HTTP would
be the only option for an attacker to leak data over the network. Second, a large portion of information
in HTTP requests is constrained by protocol specifications. This is not the case for e-mail, where most
information is in the e-mail body--an unconstrained field. Although the concepts in this paper would not
apply well to all network traffic, we believe that they will yield much lower bandwidth measurements for
most protocols, including instant messaging and domain name system (DNS) queries. Applying techniques for
unconstrained bandwidth measurement to other protocols is future work.

The remainder of this paper discusses related work in Section 2, a formal problem description in
Section 3, our measurement techniques in Section 4, the evaluation in Section 5, and concluding remarks in
Section 6.


2. Related Work

Prior research on detecting covert web traffic has looked at measuring information flow via the HTTP
protocol [1]. Borders et al. introduce a method for computing bandwidth in outbound HTTP traffic that
involves discarding expected header fields. However, they use a stateless approach and therefore are
unable to discount information that is repeated or constrained from previous HTTP messages. In our
evaluation, we compare the leak measurement techniques presented in this paper with the simple methods
used by Web Tap [1] and demonstrate a 75–97% measurement reduction for legitimate traffic.

Research on limiting the capacity of channels for information leakage has traditionally been done
assuming that systems deploy mandatory access control (MAC) policies [2] to restrict information flow.
However, mandatory access control systems are rarely deployed because of their usability and management
overhead, yet organizations still have a strong interest in protecting confidential information.

One popular approach for protecting against network-based information leaks is to limit where hosts can
send data using a content filter, such as Websense [9]. Content filters may help in some cases, but they
do not prevent all information leaks. A smart attacker can post sensitive information on any website that
receives input and displays it to other clients, including useful sites such as www.wikipedia.org. We
consider content filters to be complementary to our measurement methods, as they reduce but do not
eliminate information leaks.

Though little work has been done on quantifying network-based information leaks, there has been a great
deal of research on methods for leaking data. Prior work on covert network channels includes embedding
data in IP fields [3], TCP fields [7], and HTTP protocol headers [4]. The methods presented in this paper
aim to quantify the maximum amount of information that an HTTP channel could contain, regardless of the
particular data hiding scheme employed.


3. Problem Description

In this paper, we address the problem of quantifying network-based information leaks by isolating
information from the client in network traffic. We will refer to information originating from the client
as UI-layer input. From a formal perspective, the problem can be broken down to quantifying the set U of
UI-layer input to a network application given the following information:

· I  The set of previous network inputs to an application.
· O  The set of current and previous network outputs from an application.
· A  The application representation, which is a mapping: U × I → O of UI-layer information
combined with network input to yield network output.

By definition, the set I cannot contain new information from the client because it is generated by the
server. In this paper, the application representation A is based on protocol specifications, but it could
also be derived from program analysis. In either case, it does not contain information from the client.
Therefore, the information content of set O can be reduced to the information in the set U. If the
application has been tampered with by malicious software yielding a different representation A', then the
information content of tampered output O' is equal to the information content of the closest expected
output O plus the edit distance between O and O'. Input supplied to an application from all sources other
than the network is considered part of U. This includes file uploads and system information. Timing
information is also part of the set U. Though we do not attempt to quantify timing bandwidth in this
paper, we plan to extend the methods presented by Cabuk et al. [3] in the future to measure the bandwidth
of active HTTP request timing channels.


4. Measurement Techniques

4.1 HTTP Request Overview

There are two main types of HTTP requests used by web browsers, GET and POST. GET is typically used to
obtain resources and POST is used to send data to a server. An example of an HTTP POST request can be
seen in Figure 1. This request consists of three distinct sections: the request line, headers, and the
request body. GET requests are very similar except that they do not have a request body. The request line
contains the path of the requested file on the server, and it may also have script parameters. The next
part of the HTTP request is the header field section, which consists of "<field>: <value>" pairs
separated by line breaks. Header fields relay information such as the browser version, preferred
language, and cookies. Finally, the HTTP request body follows the headers and may consist of arbitrary
data. In the example message, the body contains an encoded name and e-mail address that were entered into
a form.


4.2 HTTP Header Fields

The first type of HTTP header field that we examine is a fixed header field. Fixed headers should be the
same for each request in most cases. Examples include the preferred language and the browser version. We
only count the size of these headers for the first request from each client, and count the edit distance
from previous requests on subsequent changes. Here, we treat all HTTP headers except for Host, Referer,
and Cookie as fixed. Some of these header fields, such as Authorization, may actually contain information
from the user. When these fields contain new data, we again count the edit distance with respect to the
most recent request.

Next, we look at the Host and Referer header fields. The Host field, along with the request path,
specifies the request's uniform resource locator (URL). We only count the size of the Host field if the
request URL did not come from a link in another page, which we discuss more in the next section.
Similarly, we only count the Referer field's size if it does not contain the URL of a previous request.

Finally, we examine the Cookie header field to verify its consistency with expected browser behavior. The
Cookie field is supposed to contain key-value pairs from previous server responses. Cookies should never
contain UI-layer information from the client. If the Cookie differs from its expected value or we do not
have a record from a previous response (this could happen if a mobile computer is brought into an
enterprise network), then we count the edit distance between the expected and actual cookie values. At
least one known tunneling program, Cooking Channel [4], hides information inside of the Cookie header in
violation of standard browser behavior. The techniques presented here correctly measure outbound
bandwidth for the Cooking Channel program.


4.3 Standard GET Requests

HTTP GET requests are normally used to retrieve resources from a web server. Each GET request identifies
a resource by a URL that is composed of the server host name, stored in the Host header field, and the
resource path, stored in the request line. Looking at each HTTP request independently, one cannot
determine whether the URL contains UI-layer information or is the result of previous network input (i.e.,
a link from another page). If we consider the entire browsing session, however, then we can discount
request URLs that have been seen in previous server responses, thus significantly improving unconstrained
bandwidth measurements.

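As a rough illustration of the session-level discounting described in Section 4.3, the Python sketch below records every URL that a server response offers and then charges a request URL only when it was never offered. The class and method names (LinkCollector, Session, record_response, url_bytes) are hypothetical and not taken from the paper, and the sketch simplifies in one important way: a known URL costs nothing here, whereas the paper goes on to charge a small number of bits for the choice among links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs found in href/src attributes of one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative references against the page URL.
                self.urls.add(urljoin(self.base_url, value))


class Session:
    """Hypothetical per-client state: URLs already offered by servers."""

    def __init__(self):
        self.seen_urls = set()

    def record_response(self, page_url, html):
        collector = LinkCollector(page_url)
        collector.feed(html)
        self.seen_urls |= collector.urls

    def url_bytes(self, request_url):
        """Bytes of the request URL charged as UI-layer information."""
        if request_url in self.seen_urls:
            return 0  # constrained by earlier network input (simplification)
        return len(request_url.encode("utf-8"))  # unexplained: count in full


session = Session()
session.record_response(
    "http://www.example.com/",
    '<a href="/page">Click Here!</a><img src="logo.png">',
)
print(session.url_bytes("http://www.example.com/page"))       # offered earlier
print(session.url_bytes("http://www.example.com/?q=secret"))  # never offered
```

Keeping the set of offered URLs per client, rather than per request, is what makes the measurement stateful, in contrast to the stateless approach discussed in Section 2.
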
1 POST /download HTTP/1.1
2 Host: www.webtapsecurity.com
2 User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12
2 Keep-Alive: 300
2 Connection: keep-alive
2 Referer: http://www.webtapsecurity.com/download.html
2 Content-Type: application/x-www-form-urlencoded
2 Content-Length: 73
3 FirstName=John&LastName=Doe&Email=johndoe%40example.com&Submit=Download
(a) (b)
Figure 1. (a) A sample HTTP POST request for submitting contact information to download a file. Line 1 is the
HTTP request line. Lines marked 2 are request headers, and line 3 is the request body. Bytes counted by a
simple algorithm are highlighted in gray. UI-layer data is highlighted in black with white text.
(b) A sample HTML document at http://www.webtapsecurity.com/download.html that generated request (a).

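To make the structure in Figure 1(a) and the fixed-header rule of Section 4.2 concrete, the following Python sketch splits a raw request into its three sections and charges fixed headers in full on first sight and by edit distance afterwards. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (edit_distance, parse_request, fixed_header_bytes) are hypothetical, Levenshtein distance stands in for "the size of the edit list", and charging the name plus value length for a first-seen header is an illustrative choice.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]


def parse_request(raw):
    """Split a raw HTTP request into (request_line, headers dict, body)."""
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(": ")
        headers[name] = value
    return lines[0], headers, body


# Host, Referer, and Cookie are handled by separate rules in Section 4.2.
SPECIAL = {"Host", "Referer", "Cookie"}


def fixed_header_bytes(headers, previous):
    """Charge fixed headers: full size on first sight, edit distance afterwards."""
    total = 0
    for name, value in headers.items():
        if name in SPECIAL:
            continue
        if previous is None or name not in previous:
            total += len(name) + len(value)  # assumption: name + value length
        else:
            total += edit_distance(previous[name], value)
    return total


RAW = (
    "POST /download HTTP/1.1\r\n"
    "Host: www.webtapsecurity.com\r\n"
    "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; "
    "rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12\r\n"
    "Keep-Alive: 300\r\n"
    "Connection: keep-alive\r\n"
    "Referer: http://www.webtapsecurity.com/download.html\r\n"
    "Content-Type: application/x-www-form-urlencoded\r\n"
    "Content-Length: 73\r\n"
    "\r\n"
    "FirstName=John&LastName=Doe&Email=johndoe%40example.com&Submit=Download"
)

request_line, headers, body = parse_request(RAW)
first_cost = fixed_header_bytes(headers, previous=None)      # first request
repeat_cost = fixed_header_bytes(headers, previous=headers)  # identical repeat
changed = {**headers, "Keep-Alive": "301"}                   # one digit differs
changed_cost = fixed_header_bytes(changed, previous=headers)
print(first_cost, repeat_cost, changed_cost)
```

An identical repeat request costs nothing and a one-character change costs one unit, which is the behavior the fixed-header rule is meant to capture.
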
The first step in accurately measuring UI-layer information in request URLs is enumerating all of the
links on each web page. In this paper, we only use simple HTML parsing to extract link URLs. An example
of an HTML link is "<a href="http://www.example.com/page">Click Here!</a>" where the URL is
"http://www.example.com/page" and the link text is "Click Here!". In the future, we plan to handle
Javascript constructs such as "onclick = 'this.document.location = <url>'". We also plan to employ
program analysis techniques or run Javascript through a full-fledged interpreter to handle more advanced
Javascript code related to links. Although link extraction is undecidable in general, these approaches
are likely to be successful in extracting links for most real cases. Currently, we count URLs that we
cannot identify with HTML parsing as UI-layer information.

After the set of links has been determined for each page, we can measure the amount of UI-layer
information conveyed by GET requests for those URLs. The first step is dividing links up into two
categories: mandatory and voluntary. A mandatory link is one that should always be loaded. Examples
include images, scripts, etc. A voluntary link is selected by the user after the page has been loaded.
Loading URLs from mandatory links does not directly leak any information. Omission of some mandatory
links may directly leak up to one bit of information per link (one bit for each link, not just omitted
links). Reordering mandatory links (they have a predetermined order) may leak up to log2(n!) bits where
n is the number of mandatory links that were loaded. If we encounter missing or reordered mandatory
links, we count log2(m!) + n bits where n is the original number of links and m is the number of links
that were loaded. For voluntary links, we count log2(n) for each request where n is the number of
voluntary links on the parent page. We log2(n) bits of information. When the user has multiple pages
open, the number of available links may be greater than those on one page. In practice, we count log2(n)
bits for a link only if it came from the last page that was loaded. Otherwise, we count log2(n) +
log2(p) bits where n is the number of links on the parent page and p is the number of pages in the
user's browsing history.


4.4 Form Submission Requests

The primary method for transmitting information to a web server is form submission. Form submission
requests send information that the user enters into input controls such as text boxes and radio buttons.
They may also include information originating from the server in hidden or read-only fields. Form
submissions contain a sequence of delimited <name, value> pairs, which can be seen in the body of the
POST request in Figure 1a. The field names, field ordering, and delimiters between fields can be derived
from the page containing the form seen in Figure 1b and thus do not convey UI-layer information. Field
values may also be taken from the encapsulating page in some circumstances. Check boxes and radio buttons
can transmit up to one bit of information each, even though the value representing "on" is often several
bytes. Servers can store client-side state by setting data in "hidden" form fields, which are echoed back
by the client upon form submission. Visible form fields may also have large default values,
as is the case when editing a blog post or a social networking profile. For fields with default values,
we measure the edit distance between the default and submitted values. We measure the full size of any
unexpected form submissions or form fields, which may result from active Javascript.


4.5 Custom Web Requests and Edit Distance

Custom network applications, active Javascript, and malicious software may send arbitrary HTTP requests
that do not conform to the behavior of a web browser. This leads us to the problem of measuring UI-layer
information U for an unknown application representation A'. In this case, we reduce the effects of
repetition by counting the edit distance between each new request and recent requests to the same server
from the same client, rather than counting the entire size of each request. In the future, we plan to
explore the use of an incremental compression algorithm on custom web requests to further isolate true
outbound information flows. Analyzing active Javascript to derive its application representation may also
help to generate more accurate measurements for custom web applications.


5. Evaluation

We evaluated our leak quantification techniques on web traffic from several legitimate web browsing
scenarios. The scenarios were 30-minute browsing sessions that included web mail (Yahoo), social
networking (Facebook), news (New York Times), sports (ESPN), shopping (Amazon), and personal blog
websites. The results are shown in Table 1. The precise measurements show a major reduction compared to
the raw byte counts, ranging from 0.3–6.0% of the original values. The precise methods also perform much
better than simple bandwidth measurements from prior research [1], demonstrating a reduction to 3–25% of
the original values. In the first five scenarios, the precise measurements were still significantly
larger than the amount of UI-layer information, which we approximated by recording the number of link
traversals and size of form submissions. After examining the web pages in those scenarios, we found this
overestimate to be a direct result of Javascript code and Flash objects, particularly from
advertisements. For the Web Mail scenario, the median request size was 5 bytes. When compared to the
average of 73.3, this indicates that there were a few large requests whose URLs did not appear as links
in web pages. We believe that more advanced Javascript and Flash processing will allow us to correctly
extract many of these links in the future. Our goal is to approach an optimal case where we correctly
read all links in each document. The Blog reading scenario is representative of this best case because no
links came from Javascript code. We hope to refine the precise analysis techniques so that the average
count is only a few bytes across all browsing scenarios. In the future, we also plan on establishing
long-term byte count thresholds similar to those in Web Tap [1] for identifying clients that leak
suspiciously large amounts of data.

Scenario            Total Req. Count   Raw (bytes)   Simple (bytes / %)   Precise (bytes / %)   Avg. Precise Size
Web Mail            508                619,661       224,259 / 36.2%      37,218 / 6.01%        73.3 bytes
Sports News         911                1,187,579     199,119 / 16.8%      49,785 / 4.19%        54.6 bytes
News                547                502,208       74,497 / 14.8%       16,582 / 3.30%        30.3 bytes
Shopping            1,530              913,226       156,882 / 17.2%      26,390 / 2.89%        17.2 bytes
Social Networking   1,175              1,404,251     91,270 / 6.5%        15,453 / 1.10%        13.2 bytes
Blog                191                108,565       10,996 / 10.1%       351 / 0.32%           1.8 bytes

Table 1. Bandwidth measurement results for six web browsing scenarios using three different measurement
techniques, along with the average bytes/request for the precise technique.


6. Conclusions and Research Challenges

In this paper, we presented methods for precisely quantifying information leaks in outbound web traffic.
These methods exploit protocol knowledge to filter repeated and constrained message fields, thus
isolating true information flows from the client. The resulting measurements can help identify leaks from
spyware and malicious insiders. We evaluated the precise analysis techniques by applying them to web
traffic from several browsing scenarios, including web mail, online shopping, and social networking. They
produced request size measurements that were 94–99.7% smaller
than raw bandwidth values, demonstrating their ability to filter out constrained information.

The main research challenge we encountered was measuring web requests from pages with active Javascript
code or Flash objects. Correctly extracting links from Javascript and Flash is undecidable in general,
and may require running scripts in a full-fledged interpreter or performing complex static analysis, even
in common cases. Both of these approaches would have a significant impact on performance. Optimizing the
precise bandwidth measurement techniques to handle large traffic volumes will be another research
challenge. Caching parse results and storing hash values instead of full strings will reduce CPU and
memory overhead but hurt accuracy for dynamic content and edited responses. In the future, we plan to
quantify these performance tradeoffs and introduce more powerful Javascript analysis techniques.


7. References

[1] K. Borders and A. Prakash. Web Tap: Detecting Covert Web Traffic. In Proc. of the 11th ACM
    Conference on Computer and Communications Security (CCS), 2004.
[2] S. Brand. DoD 5200.28-STD Department of Defense Trusted Computer System Evaluation Criteria (Orange
    Book). National Computer Security Center, 1985.
[3] S. Cabuk, C. Brodley, and C. Shields. IP Covert Timing Channels: Design and Detection. In Proc. of
    the 11th ACM Conference on Computer and Communications Security (CCS), 2004.
[4] S. Castro. How to Cook a Covert Channel. hakin9,
    http://www.gray-world.net/projects/cooking_channels/hakin9_cooking_channels_en.pdf, 2006.
[5] R. Richardson. CSI Computer Crime and Security Survey.
    http://i.cmpnet.com/v2.gocsi.com/pdf/CSISurvey2007.pdf, 2007.
[6] RSA Security, Inc. RSA Data Loss Prevention Suite. RSA Solution Brief,
    http://www.rsa.com/products/EDS/sb/DLPST_SB_1207-lowres.pdf, 2007.
[7] S. Servetto and M. Vetterli. Communication Using Phantoms: Covert Channels in the Internet. In Proc.
    of the IEEE International Symposium on Information Theory, 2001.
[8] VONTU. Data Loss Prevention, Confidential Data Protection Protect Your Data Anywhere.
    http://www.vontu.com, 2008.
[9] Websense, Inc. Web Security, Internet Filtering, and Internet Security Software.
    http://www.websense.com/global/en/, 2008.