Dec 22

New tool expands tracking of personal data on the Web

Navigating the Web gets easier by the day as corporate monitoring of our emails and browsing habits fine-tunes the algorithms that serve us personalized ads and recommendations. But convenience comes at a cost. In the wrong hands, our personal information can be used against us, to discriminate on housing and health insurance and to overcharge for goods and services, among other risks.

“The Web is like the Wild West,” says Roxana Geambasu, a computer scientist at Columbia Engineering and the Data Science Institute. “There’s no oversight of how our data are being collected, exchanged and used.”

With computer scientists Augustin Chaintreau and Daniel Hsu, and graduate students Mathias Lecuyer, Riley Spahn and Yannis Spiliopoulos, Geambasu has designed a second-generation tool for bringing transparency to the Web. Called Sunlight, it builds on its predecessor, XRay, which linked ads shown to Gmail users with text in their emails, and recommendations on Amazon and YouTube with their shopping and viewing patterns. The researchers will present the new tool and a related study on Oct. 14 in Denver, at the Association for Computing Machinery’s annual conference on security.

Sunlight works at a wider scale than XRay, and more accurately matches user-tailored ads and recommendations to tidbits of information supplied by users, the researchers say. Prior researchers have traced specific ads, product recommendations and prices to specific inputs like location, search terms and gender, one by one. One tool, AdFisher, received attention earlier this year after showing that fake Web users thought to be male job seekers were more likely than female job seekers to be shown ads for executive jobs when later visiting a news site.

Sunlight, by contrast, is the first to analyze numerous inputs and outputs together to form hypotheses that are tested on a separate dataset carved out from the original. At the end, each hypothesis, and its linked input and output, is rated for statistical confidence. “We’re trying to strike a balance between statistical confidence and scale so that we can start to see what’s happening across the Web as a whole,” said Hsu.
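The two-stage idea described above, forming hypotheses on one part of the data and rating them on a held-out part, can be sketched in a few lines of Python. This is an illustration only, not Sunlight's actual algorithm: the account data is synthetic, the function names (`lift`, `propose`, `p_value`) are hypothetical, and the statistical test (a one-sided Fisher exact test) is a stand-in chosen for simplicity.

```python
from math import comb

def lift(accounts, inp, ad):
    """How much more often `ad` appears in accounts with `inp` than without it."""
    with_inp = [ad in a["ads"] for a in accounts if inp in a["inputs"]]
    without = [ad in a["ads"] for a in accounts if inp not in a["inputs"]]
    if not with_inp or not without:
        return 0.0
    return sum(with_inp) / len(with_inp) - sum(without) / len(without)

def propose(train):
    """Stage 1: for each ad, guess the single input most associated with it."""
    ads = sorted(set().union(*(a["ads"] for a in train)))
    inputs = sorted(set().union(*(a["inputs"] for a in train)))
    return {ad: max(inputs, key=lambda inp: lift(train, inp, ad)) for ad in ads}

def p_value(holdout, inp, ad):
    """Stage 2: one-sided Fisher exact test on accounts the guess never saw."""
    n = len(holdout)
    with_inp = sum(inp in a["inputs"] for a in holdout)
    saw_ad = sum(ad in a["ads"] for a in holdout)
    both = sum(inp in a["inputs"] and ad in a["ads"] for a in holdout)
    # Probability of seeing at least `both` co-occurrences by chance alone.
    tail = sum(comb(saw_ad, k) * comb(n - saw_ad, with_inp - k)
               for k in range(both, min(with_inp, saw_ad) + 1))
    return tail / comb(n, with_inp)

# Synthetic accounts (an assumption for illustration, not the study's data).
accounts = []
for i in range(40):
    inputs = {word for word, hit in (("unemployed", i % 2 == 0),
                                     ("depressed", i % 3 == 0)) if hit}
    ads = set()
    if "unemployed" in inputs:
        ads.add("auto financing")   # targeted on an input
    if i % 5 < 2:
        ads.add("shoes")            # shown regardless of inputs
    accounts.append({"inputs": inputs, "ads": ads})

train, holdout = accounts[:20], accounts[20:]
hypotheses = propose(train)
confidence = {(inp, ad): p_value(holdout, inp, ad)
              for ad, inp in hypotheses.items()}
```

In this toy data the targeted ad receives a very small p-value on the holdout accounts, while the guess made for the untargeted ad does not survive the test. Keeping the test data separate from the data used to form the guesses is what makes the final confidence rating meaningful.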

The researchers set up 119 Gmail accounts, and over a month last fall sent 300 messages with sensitive words in the subject line and body of the email. About 15 percent of the ads that followed appeared to be targeted; some seemed to contradict Google’s policy to not target ads based “on race, religion, sexual orientation, health or sensitive financial categories,” the researchers said. For example, three words typed into the subject line of a message (“unemployed,” “depressed” and “Jewish”) were found to trigger ads for “easy auto financing,” a service to find “cheating spouses,” and a “free ancestor” search, respectively.

The researchers also set up fake browsing profiles and surfed the 40 most popular sites on the Web to see what ads popped up. They found that just 5 percent of the ads appeared to be targeted, but some seemed to violate Google’s advertising ban on products and services facilitating drug use, they said. For example, a visit to “hightime.com” triggered an ad for bongs at AquaLab Technologies, researchers said. Interestingly, the algorithms also seemed to pick up on the political leanings of popular news sites, pitching Israeli bonds to Fox News readers, and an anti-Tea Party candidate to Huffington Post readers.

The researchers caution against inferring that Google and other companies are intentionally using sensitive information to target ads and recommendations. The flow of personal data on the Web has become so complex, they said, that companies themselves may not know how targeting is taking place.

On Nov. 10, 2014, the last day that Geambasu and her colleagues were able to collect data, Google abruptly shut down Gmail ads. The ads appear to have been replaced by so-called organic ads displayed in the promotions tab. Sunlight has the ability to detect targeting in those ads, too, said Geambasu, but the researchers haven’t yet given that a try.

Sunlight’s intended audience is regulators, consumer watchdogs and journalists. The tool lets them explore how personal information is being used and decide where closer investigation is needed, they said. “In many ways the Web has been a force for good, but there needs to be accountability if it’s going to remain that way,” said Chaintreau.

“Sunlight is distinctive in that it can examine multiple types of inputs simultaneously (e.g., gender, age, browsing activity) to develop hypotheses about which of these inputs impact certain outputs (e.g., ads on Gmail),” said Anupam Datta, a researcher at Carnegie Mellon who led the development of the AdFisher tool and was not involved in the current study. “This tool takes us closer to the critical goal of discovering personal data use effects at scale.”

A copy of the study “Sunlight: Fine-grained Targeting Detection at Scale with Statistical Confidence” is available online.

Dec 22

Taking the grunt work out of web development

A Web page today is the result of a number of interacting components — like cascading style sheets, XML code, ad hoc database queries, and JavaScript functions. For all but the most rudimentary sites, keeping track of how these different elements interact, refer to each other, and pass data back and forth can be a time-consuming chore.

In a paper being presented at the Association for Computing Machinery’s Symposium on Principles of Programming Languages, Adam Chlipala, the Douglas Ross Career Development Professor of Software Technology, describes a new programming language, called Ur/Web, that lets developers write Web applications as self-contained programs. The language’s compiler — the program that turns high-level instructions into machine-executable code — then automatically generates the corresponding XML code and style-sheet specifications and embeds the JavaScript and database code in the right places.

In addition to making Web applications easier to write, Ur/Web also makes them more secure. “Let’s say you want to have a calendar widget on your Web page, and you’re going to use a library that provides the calendar widget, and on the same page there’s also an advertisement box that’s based on code that’s provided by the ad network,” Chlipala says. “What you don’t want is for the ad network to be able to change how the calendar works or the author of the calendar code to be able to interfere with delivering the ads.” Ur/Web automatically prohibits that kind of unauthorized access between page elements.

Typing, scoping

Ur/Web’s ability to both provide security protection and coordinate disparate Web technologies stems from two properties it shares with most full-blown programming languages, like C++ or Java. One is that it is “strongly typed.” That means that any new variable that a programmer defines in Ur/Web is constrained to a particular data type. Similarly, any specification of a new function has to include the type of data the function acts on and the type of data it returns.

In computing the value to return, the function may need to create new variables. (A function that returned an average of values in a database, for instance, would first need to calculate their sum.) But those variables are inaccessible to the rest of the program. This is the second property, known as “variable scoping,” because it limits the scope — the breadth of accessibility — of variables defined within functions.

“You might want to write a library that has inside of it as private state the database table that records usernames and passwords,” Chlipala says. “You don’t want any other part of your application to be able to just read and overwrite passwords. Most Web frameworks don’t support that style. They assume that every part of your program has complete access to the database.”
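The kind of library Chlipala describes, one that keeps its password table as private state, can be approximated in Python with a closure. The sketch below is not Ur/Web code, and the names `make_credential_store`, `register` and `verify` are hypothetical. Python also does not enforce the boundary as strictly as a compiler would, since determined code can still reach closure variables through introspection.

```python
import hashlib
import hmac
import os

def make_credential_store():
    """Return two functions that share a password table nothing else can name."""
    table = {}  # username -> (salt, hash); visible only inside this call

    def digest(password, salt):
        return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)

    def register(username, password):
        if username in table:
            raise ValueError("username already registered")
        salt = os.urandom(16)
        table[username] = (salt, digest(password, salt))

    def verify(username, password):
        entry = table.get(username)
        if entry is None:
            return False
        salt, stored = entry
        # Constant-time comparison avoids leaking how many bytes matched.
        return hmac.compare_digest(stored, digest(password, salt))

    # Only these two operations escape; the table itself is never returned.
    return register, verify

register, verify = make_credential_store()
register("alice", "correct horse")
```

The rest of the program can register users and check passwords, but it has no variable through which to read or overwrite the table directly.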

Typing helps with security, too. Many Web development frameworks generate database queries in such a way that someone ostensibly logging into a website can type code into the username field that in fact overwrites data in the database. With Ur/Web, usernames would constitute their own data type, which would be handled much differently than database queries.
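The attack described here is commonly called SQL injection. A minimal Python sketch using the standard `sqlite3` module shows the difference between pasting input into query text and treating a username as its own type that is only ever passed as data. The `Username` class and both lookup functions are hypothetical, and the example uses an attack that leaks rows rather than one that overwrites them, but the mechanism is the same.

```python
import sqlite3
from dataclasses import dataclass

@dataclass(frozen=True)
class Username:
    """A username is data, never a fragment of a query."""
    value: str

def find_user_unsafe(conn, raw):
    # The input becomes part of the SQL text, so it can change the query.
    return conn.execute(
        f"SELECT name FROM users WHERE name = '{raw}'").fetchall()

def find_user(conn, username):
    if not isinstance(username, Username):
        raise TypeError("expected a Username")
    # The placeholder keeps the value separate from the SQL text.
    return conn.execute(
        "SELECT name FROM users WHERE name = ?", (username.value,)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.executemany("INSERT INTO users VALUES (?)", [("alice",), ("bob",)])

attack = "nobody' OR '1'='1"
leaked = find_user_unsafe(conn, attack)    # the condition is now always true
safe = find_user(conn, Username(attack))   # no user has that literal name
```

With the string-built query, the attacker's text rewrites the condition and every row comes back. With the typed version, the same text is simply a name that matches nothing.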

Meeting expectations

Typing is also what enables coordination across Web technologies. Suppose that a bit of JavaScript code is supposed to act on data fetched from a database and that the result is supposed to be displayed on a Web page at a location determined by some XML code. If an Ur/Web programmer wrote a database query that extracted data of a type the JavaScript wasn’t expecting, or if the JavaScript generated an output of a type that the XML page wasn’t expecting, the compiler would register the discrepancy and flag the code as containing an error.
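As a rough analogy in Python, the check can be pictured as comparing what each stage of a pipeline produces with what the next stage expects, before anything runs. The stage functions and `check_pipeline` below are hypothetical, and the comparison is simple type equality rather than the richer type system the Ur/Web compiler uses.

```python
from typing import get_type_hints

def query_scores() -> list[float]:
    """Stands in for a database query."""
    return [3.0, 4.0, 5.0]

def compute_average(values: list[float]) -> float:
    """Stands in for client-side code acting on the fetched data."""
    return sum(values) / len(values)

def render_html(value: float) -> str:
    """Stands in for the page markup that displays the result."""
    return f"<p>{value:.1f}</p>"

def check_pipeline(stages):
    """List every place where one stage's output type differs from the next stage's input type."""
    errors = []
    for producer, consumer in zip(stages, stages[1:]):
        produced = get_type_hints(producer)["return"]
        hints = get_type_hints(consumer)
        # The first annotated parameter is what the consumer expects to receive.
        expected = next(t for name, t in hints.items() if name != "return")
        if produced != expected:
            errors.append(f"{producer.__name__} -> {consumer.__name__}: "
                          f"produces {produced}, expects {expected}")
    return errors

ok = check_pipeline([query_scores, compute_average, render_html])
broken = check_pipeline([query_scores, render_html])  # skips the averaging step
```

The well-formed pipeline yields no errors, while the one that feeds a list of numbers straight into the rendering step is flagged without ever being executed.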

Often, code that isn’t explicitly typed still has implicit consistency rules. For instance, if you write a query in the SQL database language that asks for the average numerical value of a bunch of text fields, the database server will tell you that it can’t process your request. To enable Ur/Web to coordinate the flow of data between Web technologies, Chlipala had to create libraries of new data types for SQL, XML, and cascading style sheets (CSS) that embody these rules.
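The averaging example can be made concrete with a small Python sketch in which column types are written down and consulted before a query is built. The schema, the `QueryTypeError` class and `average_query` are all hypothetical, and the check here happens when the program runs, whereas Ur/Web's types let the compiler catch the same mistake earlier.

```python
# Column types for one table (an illustrative schema, not from the article).
SCHEMA = {"orders": {"customer": "text", "amount": "real", "quantity": "integer"}}
NUMERIC = {"integer", "real"}

class QueryTypeError(Exception):
    """Raised when a query breaks one of SQL's implicit consistency rules."""

def average_query(table, column):
    """Build an AVG query, rejecting unknown names and non-numeric columns."""
    columns = SCHEMA.get(table)
    if columns is None:
        raise QueryTypeError(f"unknown table {table!r}")
    col_type = columns.get(column)
    if col_type is None:
        raise QueryTypeError(f"unknown column {column!r} in {table!r}")
    if col_type not in NUMERIC:
        raise QueryTypeError(f"cannot average {column!r}: it has type {col_type}")
    # Both names were validated against the schema above, so they are safe to embed.
    return f"SELECT AVG({column}) FROM {table}"

query = average_query("orders", "amount")
```

Asking for the average of the `customer` column raises `QueryTypeError` before any request reaches the database, which mirrors the rule that the database server would otherwise enforce on its own.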

While the Ur/Web compiler does generate XML, JavaScript, and SQL code in its current version, it doesn’t produce style sheets automatically. But, Chlipala says, “One thing the compiler can do is analyze your full program and say, ‘Here is an exhaustive list of all the CSS classes that might be mentioned, and here is a description of the context in which each class might be used, which tells you what properties might be worth setting.’ So, for instance, some particular class might never be used in a position where table properties would have any meaning, so you don’t have to bother setting those.”