Information about http://www.usenix.org/event/hotsec08/tech/full_papers/borders/borders.pdf

Towards Quantification of Network-Based Information Leaks via HTTP…

Pages: 6
Language: english
Created: Wed Jul 9 12:38:37 2008
    Towards Quantification of Network-Based Information Leaks via HTTP
                Kevin Borders                                                   Atul Prakash
            Web Tap Security, Inc.                                          University of Michigan
            White Lake, MI 48383                                            Ann Arbor, MI 48109
        kborders@webtapsecurity.com                                       aprakash@eecs.umich.edu

                                                    Abstract

As the Internet grows and network bandwidth continues to increase, administrators are faced with the task of
keeping confidential information from leaving their networks. Today's network traffic is so voluminous that manual
inspection would be unreasonably expensive. In response, researchers have created data loss prevention systems that
check outgoing traffic for known confidential information. These systems stop naïve adversaries from leaking data,
but are fundamentally unable to identify encrypted or obfuscated information leaks. What remains is a wide open
pipe for sending encrypted data to the Internet.

We present an approach for quantifying network-based information leaks. Instead of trying to detect the presence of
sensitive data--an impossible task in the general case--our goal is to measure and constrain its maximum volume.
We take advantage of the insight that most network traffic is repeated or determined by external information, such as
protocol specifications or messages sent by a server. By discounting this data, we can isolate and quantify true
information leakage. In this paper, we present leak measurement algorithms for the Hypertext Transfer Protocol
(HTTP), the main protocol for web browsing. When applied to real web traffic from different scenarios, the
algorithms show a reduction of 94-99.7% over a raw measurement and are able to effectively isolate true
information flow.

1. Introduction

Network-based information leaks pose a serious threat to confidentiality. They are the primary means by
which hackers extract data from compromised computers. The network can also serve as an avenue for
insider leaks, which, according to a 2007 CSI/FBI survey, are the most prevalent security threat for
organizations [5]. Because the volume of legitimate network traffic is so large, it is easy for attackers to
blend in with normal activity, making leak prevention difficult. In one experiment, a single computer
browsing a social networking site for 30 minutes generated over 1.3 MB of legitimate request data--the
equivalent of about 195,000 credit card numbers. Manually analyzing network traffic for leaks would be
unreasonably expensive and error-prone. Because normal traffic is so voluminous, limiting network traffic
based on the raw byte count would only help detect large information leaks.

In response to the threat of network-based information leaks, researchers have developed data-loss
prevention (DLP) systems [6, 8]. DLP systems work by searching through outbound network traffic for known
sensitive information, such as credit card and social security numbers. Some even catalog sensitive
documents and look for excerpts in outbound traffic. Although they are effective at stopping accidental
and plain-text leaks, DLP systems are fundamentally unable to detect encrypted information flows. They
leave an open channel for leaking encrypted data to the Internet.

In this paper, we introduce a new approach for precisely quantifying network-based information leaks.
Rather than searching for known sensitive data--an impossible task in the general case--we aim to measure
and constrain its maximum volume. We exploit the fact that a large portion of network traffic is repeated
or constrained by protocol specifications. By ignoring this fixed data, we can isolate true information
flows from a client to the Internet, regardless of encryption or obfuscation. The algorithms presented in
this paper yield results that are 94-99.7% smaller than raw request sizes for several common browsing
scenarios, and 75-97% smaller than a simple calculation from prior research [1]. They also measure
information content irrespective of data hiding techniques. The end result is a small, evasion-proof
bandwidth measurement that can precisely quantify and isolate network-based information leaks.

The leak measurement techniques in this paper focus on the Hypertext Transfer Protocol (HTTP), the main
protocol for web browsing. They take advantage of HTTP and its interaction with Hypertext Markup
Language (HTML) documents to identify information originating from the user. The basic idea is to compute
the content of expected HTTP requests using only externally available information, including previous
network requests, previous server responses, and protocol specifications. Then, the amount of
unconstrained outbound bandwidth is equal to the edit distance (edit distance is the size of the edit list
required to transform one string into another) between actual and expected requests plus timing
information. This unconstrained bandwidth measurement represents the maximum amount of information that
could have been leaked by the client, while minimizing the influence of repeated or constrained data.

The reasons that we chose to focus on HTTP in this paper are twofold. First, it is the primary protocol for
web browsing. Many networks, particularly those in which confidentiality is a high priority, will only
allow outbound HTTP traffic through a proxy server and block other protocols. In this scenario, HTTP would
be the only option for an attacker to leak data over the network. Second, a large portion of information in
HTTP requests is constrained by protocol specifications. This is not the case for e-mail, where most
information is in the e-mail body--an unconstrained field. Although the concepts in this paper would not
apply well to all network traffic, we believe that they will yield much lower bandwidth measurements for
most protocols, including instant messaging and domain name system (DNS) queries. Applying techniques for
unconstrained bandwidth measurement to other protocols is future work.

The remainder of this paper discusses related work in Section 2, a formal problem description in
Section 3, our measurement techniques in Section 4, the evaluation in Section 5, and concluding remarks in
Section 6.

2. Related Work

Prior research on detecting covert web traffic has looked at measuring information flow via the HTTP
protocol [1]. Borders et al. introduce a method for computing bandwidth in outbound HTTP traffic that
involves discarding expected header fields. However, they use a stateless approach and therefore are
unable to discount information that is repeated or constrained from previous HTTP messages. In our
evaluation, we compare the leak measurement techniques presented in this paper with the simple methods
used by Web Tap [1] and demonstrate a 75-97% measurement reduction for legitimate traffic.

Research on limiting the capacity of channels for information leakage has traditionally been done
assuming that systems deploy mandatory access control (MAC) policies [2] to restrict information flow.
However, mandatory access control systems are rarely deployed because of their usability and management
overhead, yet organizations still have a strong interest in protecting confidential information.

One popular approach for protecting against network-based information leaks is to limit where hosts can
send data using a content filter, such as Websense [9]. Content filters may help in some cases, but they do
not prevent all information leaks. A smart attacker can post sensitive information on any website that
receives input and displays it to other clients, including useful sites such as www.wikipedia.org. We
consider content filters to be complementary to our measurement methods, as they reduce but do not
eliminate information leaks.

Though little work has been done on quantifying network-based information leaks, there has been a great
deal of research on methods for leaking data. Prior work on covert network channels includes embedding
data in IP fields [3], TCP fields [7], and HTTP protocol headers [4]. The methods presented in this paper
aim to quantify the maximum amount of information that an HTTP channel could contain, regardless of the
particular data hiding scheme employed.

3. Problem Description

In this paper, we address the problem of quantifying network-based information leaks by isolating
information from the client in network traffic. We will refer to information originating from the client
as UI-layer input. From a formal perspective, the problem can be broken down to quantifying the set U of
UI-layer input to a network application given the following information:

  · I - The set of previous network inputs to an application.
  · O - The set of current and previous network outputs from an application.
  · A - The application representation, which is a mapping: U × I → O of UI-layer information
    combined with network input to yield network output.

By definition, the set I cannot contain new information from the client because it is generated by the
server. In this paper, the application representation A is based on protocol specifications, but it could
also be derived from program analysis. In either case, it does not contain information from the client.
Therefore, the information content of set O can be reduced to the information in the set U. If the
application has been tampered with by malicious software yielding a different representation A', then the
information content of tampered output O' is equal to the information content of the closest expected
output O plus the edit distance between O and O'. Input supplied to an application from all sources other
than the network is considered part of U. This includes file uploads and system information. Timing
information is also part of the set U. Though we do not attempt to quantify timing bandwidth in this
paper, we plan to extend the methods presented by Cabuk et al. [3] in the future to measure the bandwidth
of active HTTP request timing channels.

4. Measurement Techniques

4.1 HTTP Request Overview

There are two main types of HTTP requests used by web browsers, GET and POST. GET is typically used to
obtain resources and POST is used to send data to a server. An example of an HTTP POST request can be seen
in Figure 1. This request is comprised of three distinct sections: the request line, headers, and the
request body. GET requests are very similar except that they do not have a request body. The request line
contains the path of the requested file on the server, and it may also have script parameters. The next
part of the HTTP request is the header field section, which consists of "<field>: <value>" pairs separated
by line breaks. Header fields relay information such as the browser version, preferred language, and
cookies. Finally, the HTTP request body follows the headers and may consist of arbitrary data. In the
example message, the body contains an encoded name and e-mail address that were entered into a form.

4.2 HTTP Header Fields

The first type of HTTP header field that we examine is a fixed header field. Fixed headers should be the
same for each request in most cases. Examples include the preferred language and the browser version. We
only count the size of these headers for the first request from each client, and count the edit distance
from previous requests on subsequent changes. Here, we treat all HTTP headers except for Host, Referer,
and Cookie as fixed. Some of these header fields, such as Authorization, may actually contain information
from the user. When these fields contain new data, we again count the edit distance with respect to the
most recent request.

Next, we look at the Host and Referer header fields. The Host field, along with the request path,
specifies the request's uniform resource locator (URL). We only count the size of the Host field if the
request URL did not come from a link in another page, which we discuss more in the next section.
Similarly, we only count the Referer field's size if it does not contain the URL of a previous request.

Finally, we examine the Cookie header field to verify its consistency with expected browser behavior. The
Cookie field is supposed to contain key-value pairs from previous server responses. Cookies should never
contain UI-layer information from the client. If the Cookie differs from its expected value or we do not
have a record from a previous response (this could happen if a mobile computer is brought into an
enterprise network), then we count the edit distance between the expected and actual cookie values. At
least one known tunneling program, Cooking Channel [4], hides information inside of the Cookie header in
violation of standard browser behavior. The techniques presented here correctly measure outbound
bandwidth for the Cooking Channel program.

4.3 Standard GET Requests

HTTP GET requests are normally used to retrieve resources from a web server. Each GET request identifies
a resource by a URL that is comprised of the server host name, stored in the Host header field, and the
resource path, stored in the request line. Looking at each HTTP request independently, one cannot
determine whether the URL contains UI-layer information or is the result of previous network input (i.e.,
a link from another page). If we consider the entire browsing session, however, then we can discount
request URLs that have been seen in previous server responses, thus significantly improving unconstrained
bandwidth measurements.
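
The session-based discounting of GET request URLs can be illustrated with a minimal Python sketch. It is
not the authors' implementation. The class names, the rule that "src" attributes mark mandatory links and
"<a href>" marks voluntary ones, and the use of the number of recorded responses for p are all assumptions
made for illustration. The per-link bit counts follow the log2(n) and log2(n) + log2(p) rules given later
in this section, and the accounting for omitted or reordered mandatory links is left out.

```python
import math
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Extracts links from one HTML response using simple parsing only."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.mandatory = set()  # assumption: resources loaded via src=
        self.voluntary = set()  # assumption: targets of <a href=...>

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if not value:
                continue
            url = urljoin(self.base_url, value)
            if name == "src":
                self.mandatory.add(url)
            elif name == "href" and tag == "a":
                self.voluntary.add(url)


class BrowsingSession:
    """Remembers the links offered by every server response in a session."""

    def __init__(self):
        self.pages = []  # one LinkCollector per response, oldest first

    def record_response(self, page_url, html):
        collector = LinkCollector(page_url)
        collector.feed(html)
        self.pages.append(collector)

    def url_cost_bits(self, request_url):
        """Bits charged for a GET request URL, given the session so far."""
        for age, page in enumerate(reversed(self.pages)):
            if request_url in page.mandatory:
                return 0.0  # loading a mandatory link leaks nothing directly
            if request_url in page.voluntary:
                bits = math.log2(len(page.voluntary))
                if age > 0:  # link did not come from the last page loaded
                    bits += math.log2(len(self.pages))
                return bits
        # URL never seen in a response: count it all as UI-layer information.
        return 8.0 * len(request_url.encode("utf-8"))
```

A caller would feed each response body to record_response as it arrives and call url_cost_bits for every
outgoing GET request. A URL that appears in no recorded response is charged at its full size, which
matches the treatment of links that simple HTML parsing cannot identify.
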
1 POST /download HTTP/1.1
2 Host: www.webtapsecurity.com
2 User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1;
  en-US; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12
2 Keep-Alive: 300
2 Connection: keep-alive
2 Referer: http://www.webtapsecurity.com/download.html
2 Content-Type: application/x-www-form-urlencoded
2 Content-Length: 73
3 FirstName=John&LastName=Doe&Email=johndoe%40example.
  com&Submit=Download
                            (a)                                                           (b)
  Figure 1. (a) A sample HTTP POST request for submitting contact information to download a file. Line 1 is the
    HTTP request line. Lines marked 2 are request headers, and line 3 is the request body. Bytes counted by a
          simple algorithm are highlighted in gray. UI-layer data is highlighted in black with white text.
   (b) A sample HTML document at http://www.webtapsecurity.com/download.html that generated request (a).
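
For a request like the one in Figure 1(a), the fixed-header rule of Section 4.2 and the form-field rule of
Section 4.4 can be sketched in Python as follows. This is an illustration under stated assumptions, not
the authors' code. It takes "edit distance" to be the unit-cost Levenshtein distance, charges a header
that was absent from the previous request at the size of its value, and measures form values after URL
decoding. The function names and the "defaults" argument (field name to default value, as read from the
page containing the form) are hypothetical. The one-bit rule for check boxes and radio buttons is omitted.

```python
from urllib.parse import parse_qsl


def edit_distance(a, b):
    """Unit-cost Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


# Host, Referer, and Cookie are not fixed headers; they follow separate rules.
SESSION_HEADERS = {"host", "referer", "cookie"}


def fixed_header_cost(headers, previous):
    """Bytes charged for the fixed headers of one request.

    headers:  dict of header name -> value for the current request.
    previous: the same dict for the client's most recent request,
              or None if this is the client's first request.
    """
    cost = 0
    for name, value in headers.items():
        if name.lower() in SESSION_HEADERS:
            continue
        if previous is None:
            cost += len(name) + len(": ") + len(value)  # first request: full size
        else:
            cost += edit_distance(previous.get(name, ""), value)
    return cost


def form_cost_bytes(body, defaults):
    """Bytes charged for a URL-encoded form body.

    Field names, ordering, and delimiters come from the form and cost nothing.
    Each value is compared with its default; unexpected fields cost full size.
    """
    cost = 0
    for name, value in parse_qsl(body, keep_blank_values=True):
        if name in defaults:
            cost += edit_distance(defaults[name], value)
        else:
            cost += len(name) + len(value)
    return cost
```

With empty defaults for FirstName, LastName, and Email and a default of "Download" for Submit, the body of
Figure 1(a) is charged only for the characters the user typed, and a repeated request with identical
fixed headers is charged nothing for them.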

The first step in accurately measuring UI-layer information in request URLs is enumerating all of the
links on each web page. In this paper, we only use simple HTML parsing to extract link URLs. An example of
an HTML link is "<a href="http://www.example.com/page">Click Here!</a>" where the URL is
"http://www.example.com/page" and the link text is "Click Here!". In the future, we plan to handle
Javascript constructs such as "onclick = 'this.document.location = <url>'". We also plan to employ program
analysis techniques or run Javascript through a full-fledged interpreter to handle more advanced
Javascript code related to links. Although link extraction is undecidable in general, these approaches are
likely to be successful in extracting links for most real cases. Currently, we count URLs that we cannot
identify with HTML parsing as UI-layer information.

After the set of links has been determined for each page, we can measure the amount of UI-layer
information conveyed by GET requests for those URLs. The first step is dividing links up into two
categories: mandatory and voluntary. A mandatory link is one that should always be loaded. Examples
include images, scripts, etc. A voluntary link is selected by the user after the page has been loaded.
Loading URLs from mandatory links does not directly leak any information. Omission of some mandatory links
may directly leak up to one bit of information per link (one bit for each link, not just omitted links).
Reordering mandatory links (they have a predetermined order) may leak up to log2(n!) bits where n is the
number of mandatory links that were loaded. If we encounter missing or reordered mandatory links, we count
log2(m!) + n bits where n is the original number of links and m is the number of links that were loaded.
For voluntary links, we count log2(n) for each request where n is the number of voluntary links on the
parent page; selecting one of n links conveys at most log2(n) bits of information. When the user has
multiple pages open, the number of available links may be greater than those on one page. In practice, we
count log2(n) bits for a link only if it came from the last page that was loaded. Otherwise, we count
log2(n) + log2(p) bits where n is the number of links on the parent page and p is the number of pages in
the user's browsing history.

4.4 Form Submission Requests

The primary method for transmitting information to a web server is form submission. Form submission
requests send information that the user enters into input controls such as text boxes and radio buttons.
They may also include information originating from the server in hidden or read-only fields. Form
submissions contain a sequence of delimited <name, value> pairs, which can be seen in the body of the POST
request in Figure 1a. The field names, field ordering, and delimiters between fields can be derived from
the page containing the form seen in Figure 1b and thus do not convey UI-layer information. Field values
may also be taken from the encapsulating page in some circumstances. Check boxes and radio buttons can
transmit up to one bit of information each, even though the value representing "on" is often several
bytes. Servers can store client-side state by setting data in "hidden" form fields, which are echoed back
by the client upon form submission. Visible form fields may also have large default values, as is the case
when editing a blog post or a social networking profile. For fields with default values, we measure the
edit distance between the default and submitted values. We measure the full size of any unexpected form
submissions or form fields, which may result from active Javascript.

4.5 Custom Web Requests and Edit Distance

Custom network applications, active Javascript, and malicious software may send arbitrary HTTP requests
that do not conform to the behavior of a web browser. This leads us to the problem of measuring UI-layer
information U for an unknown application representation A'. In this case, we reduce the effects of
repetition by counting the edit distance between each new request and recent requests to the same server
from the same client, rather than counting the entire size of each request. In the future, we plan to
explore the use of an incremental compression algorithm on custom web requests to further isolate true
outbound information flows. Analyzing active Javascript to derive its application representation may also
help to generate more accurate measurements for custom web applications.

5. Evaluation

We evaluated our leak quantification techniques on web traffic from several legitimate web browsing
scenarios. The scenarios were 30-minute browsing sessions that included web mail (Yahoo), social
networking (Facebook), news (New York Times), sports (ESPN), shopping (Amazon), and personal blog
websites. The results are shown in Table 1. The precise measurements show a major reduction compared to
the raw byte counts, ranging from 0.3-6.0% of the original values. The precise methods also perform much
better than simple bandwidth measurements from prior research [1], demonstrating a reduction to 3-25% of
the original values. In the first five scenarios, the precise measurements were still significantly larger
than the amount of UI-layer information, which we approximated by recording the number of link traversals
and size of form submissions. After examining the web pages in those scenarios, we found this overestimate
to be a direct result of Javascript code and Flash objects, particularly from advertisements. For the Web
Mail scenario, the median request size was 5 bytes. When compared to the average of 73.3, this indicates
that there were a few large requests whose URLs did not appear as links in web pages. We believe that more
advanced Javascript and Flash processing will allow us to correctly extract many of these links in the
future. Our goal is to approach an optimal case where we correctly read all links in each document. The
Blog reading scenario is representative of this best case because no links came from Javascript code. We
hope to refine the precise analysis techniques so that the average count is only a few bytes across all
browsing scenarios. In the future, we also plan on establishing long-term byte count thresholds similar to
those in Web Tap [1] for identifying clients that leak suspiciously large amounts of data.

Scenario            Total Req. Count   Raw (bytes)   Simple (bytes / %)   Precise (bytes / %)   Avg. Precise Size
Web Mail                         508       619,661      224,259 / 36.2%       37,218 / 6.01%          73.3 bytes
Sports News                      911     1,187,579      199,119 / 16.8%       49,785 / 4.19%          54.6 bytes
News                             547       502,208       74,497 / 14.8%       16,582 / 3.30%          30.3 bytes
Shopping                       1,530       913,226      156,882 / 17.2%       26,390 / 2.89%          17.2 bytes
Social Networking              1,175     1,404,251       91,270 /  6.5%       15,453 / 1.10%          13.2 bytes
Blog                             191       108,565       10,996 / 10.1%          351 / 0.32%           1.8 bytes
      Table 1. Bandwidth measurement results for six web browsing scenarios using three different measurement
                     techniques, along with the average bytes/request for the precise technique.

6. Conclusions and Research Challenges

In this paper, we presented methods for precisely quantifying information leaks in outbound web traffic.
These methods exploit protocol knowledge to filter repeated and constrained message fields, thus isolating
true information flows from the client. The resulting measurements can help identify leaks from spyware
and malicious insiders. We evaluated the precise analysis techniques by applying them to web traffic from
several browsing scenarios, including web mail, online shopping, and social networking. They produced
request size measurements that were 94-99.7% smaller
than raw bandwidth values, demonstrating their ability to filter out constrained information.

The main research challenge we encountered was measuring web requests from pages with active Javascript
code or Flash objects. Correctly extracting links from Javascript and Flash is undecidable in general, and
may require running scripts in a full-fledged interpreter or performing complex static analysis, even in
common cases. Both of these approaches would have a significant impact on performance. Optimizing the
precise bandwidth measurement techniques to handle large traffic volumes will be another research
challenge. Caching parse results and storing hash values instead of full strings will reduce CPU and
memory overhead but hurt accuracy for dynamic content and edited responses. In the future, we plan to
quantify these performance tradeoffs and introduce more powerful Javascript analysis techniques.

7. References

[1] K. Borders and A. Prakash. Web Tap: Detecting Covert Web Traffic. In Proc. of the 11th ACM
    Conference on Computer and Communications Security (CCS), 2004.
[2] S. Brand. DoD 5200.28-STD Department of Defense Trusted Computer System Evaluation Criteria
    (Orange Book). National Computer Security Center, 1985.
[3] S. Cabuk, C. Brodley, and C. Shields. IP Covert Timing Channels: Design and Detection. In Proc.
    of the 11th ACM Conference on Computer and Communications Security (CCS), 2004.
[4] S. Castro. How to Cook a Covert Channel. hakin9,
    http://www.gray-world.net/projects/cooking_channels/hakin9_cooking_channels_en.pdf, 2006.
[5] R. Richardson. CSI Computer Crime and Security Survey.
    http://i.cmpnet.com/v2.gocsi.com/pdf/CSISurvey2007.pdf, 2007.
[6] RSA Security, Inc. RSA Data Loss Prevention Suite. RSA Solution Brief,
    http://www.rsa.com/products/EDS/sb/DLPST_SB_1207-lowres.pdf, 2007.
[7] S. Servetto and M. Vetterli. Communication Using Phantoms: Covert Channels in the Internet. In
    Proc. of the IEEE International Symposium on Information Theory, 2001.
[8] VONTU. Data Loss Prevention, Confidential Data Protection - Protect Your Data Anywhere.
    http://www.vontu.com, 2008.
[9] Websense, Inc. Web Security, Internet Filtering, and Internet Security Software.
    http://www.websense.com/global/en/, 2008.