Seemingly innocuous bits of self-revelation can increasingly be collected and reassembled by computers to help create a complete picture of a person's identity.

If a stranger came up to you on the street, would you give him your name, Social Security number and e-mail address?

Probably not.

Yet people often dole out all kinds of personal information on the Internet that allows such identifying data to be deduced. Services like Facebook, Twitter and Flickr are oceans of personal minutiae: birthday greetings sent and received, school and work gossip, photos of family vacations, movies watched and books read.

Computer scientists and policy experts say that such small, seemingly innocuous bits of self-revelation can increasingly be collected and reassembled by computers to help create a complete picture of a person's identity, sometimes down to the Social Security number.

“Technology has rendered the conventional definition of personally identifiable information obsolete,” said Maneesha Mithal, associate director of the Federal Trade Commission's privacy division. “You can find out who an individual is without it.”

In a class project at the Massachusetts Institute of Technology that received some attention last year, Carter Jernigan and Behram Mistree analysed more than 4,000 Facebook profiles of MIT students, including links among online friends. The pair created software that predicted, with 78 per cent accuracy, whether a profile belonged to a gay male. The technique was verified using a group of students who had freely identified themselves as gay.

So far, this type of powerful data mining, which relies on sophisticated statistical correlations to build individual dossiers, is mostly in the realm of university researchers, not identity thieves and marketers.

But the FTC is worried that laws and regulations to protect privacy have not kept up with changing technology, and the agency is convening on Wednesday the third of three workshops on the issue. Its concerns are hardly far-fetched. Last fall, Netflix awarded $1 million to a team of statisticians and computer scientists who won a three-year contest to analyse the movie rental history of 500,000 subscribers and improve the predictive accuracy of Netflix's recommendation software by at least 10 per cent.

On Friday, Netflix said that it was shelving plans for a second contest — bowing to privacy concerns raised by the FTC and a private litigant. In 2008, a pair of researchers at the University of Texas showed that the customer data released for that first contest, despite being stripped of names and other direct identifying information, could often be “de-anonymised” by statistically analysing an individual's distinctive pattern of movie ratings and recommendations.
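To see why a stripped-down ratings table can still point to a person, consider a toy illustration. It is not the Texas researchers' method, only a minimal sketch of the general idea. It assumes an outsider already knows a few of someone's ratings, for instance from public reviews, and looks for the one anonymous record that fits them far better than any other. The user IDs, movie ratings, the function names and the `min_gap` threshold are all invented for the example.

```python
from typing import Dict, Optional

Ratings = Dict[str, int]  # movie title -> star rating (1 to 5)


def similarity(anon: Ratings, known: Ratings) -> float:
    """Fraction of the known ratings that the anonymous record reproduces exactly."""
    if not known:
        return 0.0
    matches = sum(1 for movie, stars in known.items() if anon.get(movie) == stars)
    return matches / len(known)


def best_match(known: Ratings,
               anonymised: Dict[str, Ratings],
               min_gap: float = 0.3) -> Optional[str]:
    """Return the anonymous ID that fits best, but only if it clearly beats the runner-up."""
    scored = sorted(
        ((similarity(record, known), anon_id) for anon_id, record in anonymised.items()),
        reverse=True,
    )
    if not scored:
        return None
    top_score, top_id = scored[0]
    runner_up = scored[1][0] if len(scored) > 1 else 0.0
    # A match counts only when it stands out; otherwise the record stays anonymous.
    if top_score - runner_up >= min_gap:
        return top_id
    return None


# Hypothetical "anonymised" release: names removed, ratings intact.
anonymised = {
    "user_17": {"Brazil": 5, "Heat": 2, "Alien": 4, "Big": 3},
    "user_42": {"Brazil": 1, "Heat": 5, "Titanic": 4},
    "user_99": {"Alien": 4, "Titanic": 2, "Big": 5},
}

# Hypothetical ratings the same person posted publicly under a real name.
public_reviews = {"Brazil": 5, "Heat": 2, "Alien": 4}

match = best_match(public_reviews, anonymised)
```

In this made-up data, only one record agrees with all three public ratings, so it stands out. The requirement that the winner clearly beat the runner-up reflects the point that it is a distinctive pattern, not any single rating, that gives a person away.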

Not enough

In social networks, people can increase their defences against identification by adopting tight privacy controls on information in personal profiles. Yet an individual's actions, researchers say, are rarely enough to protect privacy in the interconnected world of the Internet. You may not disclose personal information, but your online friends and colleagues may do it for you, referring to your school or employer, gender, location and interests. Patterns of social communication, researchers say, are revealing.

“Personal privacy is no longer an individual thing,” said Harold Abelson, a computer science professor at MIT. “In today's online world, what your mother told you is true, only more so: people really can judge you by your friends.”

When collected together, the pool of information about each individual can form a distinctive “social signature,” researchers say.

The power of computers to identify people from social patterns alone was demonstrated last year in a study by the same pair of researchers who cracked Netflix's anonymous database: Vitaly Shmatikov, an associate professor of computer science at the University of Texas, and Arvind Narayanan, who is now a postgraduate researcher at Stanford University.

By examining correlations between various online accounts, the scientists showed that they could identify more than 30 per cent of the users of both Twitter, the microblogging service, and Flickr, an online photo-sharing service, even though the accounts had been stripped of identifying information like account names and e-mail addresses.

“When you link these large data sets together, a small slice of our behaviour and the structure of our social networks can be identifying,” Shmatikov said.

A person's pattern of online communications, he explained, can be assembled into a distinctive social graph, almost like a digital-age fingerprint. In research, the computer-generated predictions are then verified against a data set that includes identifiers.
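The fingerprint idea can be illustrated with a small sketch. It is a simplification under stated assumptions, not the researchers' actual algorithm. It supposes that a handful of accounts in an anonymised network have already been matched to named accounts elsewhere (the `seeds`), and then guesses who an unnamed account is by comparing circles of friends. The graphs, the account names and the `match_node` function are all hypothetical.

```python
from typing import Dict, Optional, Set

Graph = Dict[str, Set[str]]  # account -> set of accounts it is linked to


def match_node(target: str,
               anon_graph: Graph,
               known_graph: Graph,
               seeds: Dict[str, str]) -> Optional[str]:
    """Guess which named account corresponds to `target` in the anonymised graph."""
    # Translate the target's already-identified neighbours into the named graph.
    mapped = {seeds[n] for n in anon_graph.get(target, set()) if n in seeds}
    if not mapped:
        return None  # nothing to compare against
    already_used = set(seeds.values())
    best, best_score = None, 0.0
    for candidate, neighbours in known_graph.items():
        if candidate in already_used or not neighbours:
            continue
        # Jaccard overlap: shared friends divided by all friends of either account.
        score = len(mapped & neighbours) / len(mapped | neighbours)
        if score > best_score:
            best, best_score = candidate, score
    return best


# Hypothetical anonymised network: account names replaced by codes.
anon_graph = {
    "a1": {"a2", "a3", "a4"},
    "a2": {"a1"},
    "a3": {"a1"},
    "a4": {"a1"},
}

# Hypothetical second network in which names are visible.
known_graph = {
    "alice": {"bob", "carol", "dave"},
    "bob": {"alice"},
    "carol": {"alice"},
    "dave": {"alice"},
    "erin": {"bob"},
}

# Accounts assumed to be matched already across the two networks.
seeds = {"a2": "bob", "a3": "carol", "a4": "dave"}

guess = match_node("a1", anon_graph, known_graph, seeds)
```

Here the unnamed account `a1` shares all three of its identified friends with one named account and only one with another, so the first is the stronger guess. As the article notes, such a guess would still need to be checked against data that includes identifiers.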

Even more unnerving to privacy advocates is the work of two researchers from Carnegie Mellon University. In a paper published last year, Alessandro Acquisti and Ralph Gross reported that they could accurately predict the full, nine-digit Social Security numbers for 8.5 per cent of the people born in the United States between 1989 and 2003 — nearly 5 million individuals.

Social Security numbers are especially prized by identity thieves because they are used both as identifiers and to authenticate banking, credit card and other transactions.

The Carnegie Mellon researchers used publicly available information from many sources, including profiles on social networks, to narrow their search for two pieces of data crucial to identifying people — birthdates and city or state of birth. That helped them figure out the first three digits of each Social Security number, which the government had assigned by location. The remaining six digits had been assigned through methods the government didn't disclose, although they were related to when the person applied for the number. The researchers used projections about those applications as well as other data, such as the Social Security numbers of dead people, and then ran repeated cycles of statistical correlation and inference to partly re-engineer the government's number-assignment system.
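A rough count shows why the first three digits matter so much. The sketch below does not model the researchers' statistical inference; it only counts nine-digit strings, including combinations that may never have been issued, and the function name and the example of a single known prefix are assumptions made for illustration.

```python
TOTAL_DIGITS = 9  # length of a full number, per the article
AREA_DIGITS = 3   # leading digits that were assigned by location


def candidates_remaining(possible_prefixes: int) -> int:
    """Count nine-digit strings still possible once the 3-digit prefix is narrowed down."""
    if not 1 <= possible_prefixes <= 10 ** AREA_DIGITS:
        raise ValueError("possible_prefixes must be between 1 and 1000")
    return possible_prefixes * 10 ** (TOTAL_DIGITS - AREA_DIGITS)


# With no knowledge of birthplace, every 3-digit prefix is possible.
no_knowledge = candidates_remaining(10 ** AREA_DIGITS)

# Hypothetical case: birthplace pins the prefix down to a single value.
one_prefix = candidates_remaining(1)

# How many times smaller the search becomes.
reduction = no_knowledge // one_prefix
```

Under these assumptions the field shrinks from a billion possibilities to a million before any of the remaining six digits are examined, which is where the researchers' projections about application dates came in.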

To be sure, the work by Acquisti and Gross suggests a potential, not actual, risk. But unpublished research by the two men explores how criminals could use similar techniques and clusters of compromised computers, called botnets, for large-scale identity-theft schemes.

“Online redlining”

More generally, privacy advocates worry that the new frontiers of data collection, brokering and mining are largely unregulated. Such sophisticated data analysis could open a door to “online redlining,” the offering of products and services like bank loans and health care to some consumers and not others based on statistical inferences and predictions about individuals and their behaviour.

The FTC and Congress are weighing a range of steps, from tighter industry requirements to alert consumers about data collection and use to the creation of a “do not track” list, similar to the federal “do not call” list, that would try to stop online monitoring of Internet users who opt out.

But Jon Kleinberg, a professor of computer science at Cornell University who studies social networks, is sceptical that rules will have much impact, given the powerful social pressure to share information online. His advice: “When you're doing stuff online, you should behave as if you're doing it in public — because increasingly, it is.” — ©2010 New York Times News Service
