UNIVERSITY PARK, Pa. — Twitter may have started out as a way to connect to other people and share news quickly, but the social media platform is also a powerful tool, with the data generated representing the largest publicly accessible archive of human behavior in existence.
Guangqing Chi, associate professor of rural sociology and demography and public health sciences in Penn State's Department of Agricultural Economics, Sociology, and Education and director of the Computational and Spatial Analysis (CSA) Core in the Social Science Research Institute, and his team have collected over 30 terabytes of geo-tagged tweets over the last four years.
“Our work has the potential to change the landscape of population research,” said Chi. “It could open the door for demographers to take advantage of rich geo-tagged Twitter data and strengthen research in many other disciplines that use demographic data.”
Geo-tagged tweets carry real-world geographic location information, which is derived via GPS and Wi-Fi positioning from devices with location-based services enabled, such as smartphones and tablets. “Each geo-located tweet is essentially a digital trace of the Twitter user, including information such as location, time, and the content of the message,” Chi said. “Twitter data can provide a significant amount of individual social, behavioral and emotional information for researchers of many disciplines.”
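As a rough illustration of what such a digital trace might look like in code, the sketch below models a tweet as a record holding location, time and message content, then filters records by a geographic bounding box. The `GeoTweet` class, its field names, the `in_bounding_box` helper and the sample records are all assumptions made for this example; they are not the team's actual schema or data.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class GeoTweet:
    """One geo-tagged tweet: who, where, when, and what (hypothetical schema)."""
    user_id: str
    latitude: float   # decimal degrees, positive north
    longitude: float  # decimal degrees, positive east
    created_at: datetime
    text: str


def in_bounding_box(tweet, south, west, north, east):
    """Return True if the tweet's coordinates fall inside the box."""
    return south <= tweet.latitude <= north and west <= tweet.longitude <= east


# Made-up sample records, for illustration only.
tweets = [
    GeoTweet("u1", 40.79, -77.86,
             datetime(2017, 5, 1, 14, 30, tzinfo=timezone.utc), "Lunch downtown"),
    GeoTweet("u2", 39.95, -75.17,
             datetime(2017, 5, 1, 15, 0, tzinfo=timezone.utc), "Stuck in traffic"),
]

# Keep only tweets inside an arbitrary box around central Pennsylvania.
nearby = [t for t in tweets if in_bounding_box(t, 40.5, -78.5, 41.0, -77.5)]
print([t.user_id for t in nearby])
```

A flat record like this is convenient for the kind of spatial and temporal filtering that mobility studies rely on, though a production pipeline would likely use a spatial index rather than a linear scan.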
Junjun Yin, CSA research associate on the project, and Chi have built an infrastructure to collect, manage, and analyze the data. “We’re storing the data in a high-performance computing cluster with large amounts of storage capacity and memory,” Yin explained. “In addition, a distributed computing environment with integrated machine learning and data-mining packages and toolsets is up and running to provide efficient parallel data processing, which includes data extraction, calculation and analysis. We’ve also developed data processing programs so the data can be useful to researchers from many disciplines.”
According to Chi, this digital trace is not a complete trajectory of a user's every movement over space and time, and the collection as a whole is not a representative sample of the population. Even so, the geo-located Twitter data offers unique qualities for interdisciplinary research.
“Geographically annotated social media is extremely valuable for modern information retrieval. The data offers large spatial coverage and multiple years of a large sample of the population, making it helpful in determining geographical uses of space, such as urban mobility and understanding functions of urban regions,” Chi explained. “The data can also be used to explore quality of life issues, such as health, education and income. Other uses include analysis of social ties and dissemination pattern of news and events, as well as enriching existing survey data.”
In one project, Chi and his team are developing a set of methods to accurately predict demographics in real time. Knowing the demographics of a group is usually the first step in population research. Previously, Twitter data revealed only a few demographic characteristics of its users, and both user demographics and language use changed frequently, which made prediction methods inaccurate.
Chi and his team are also developing algorithmic models to predict the composition of a group of Twitter users. “Our goal is to find a way to predict Twitter user demographics, so that we will know each Twitter user represents how many people with similar characteristics. When we can do that, we can develop weights and make the data representative.”
The approach is based on the premise that it is difficult to make predictions about an individual but is much easier to make predictions about large groups of individuals. The researchers compare their findings to U.S. Census data to determine how effective their models are.
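The weighting idea can be sketched with a simple post-stratification calculation: if each user's demographic group has been predicted, a group's weight is the number of people in that group in a reference population divided by the number of sampled users in it. This is a minimal sketch of one standard technique, not necessarily the method the researchers use; the function name, the age groups and all counts below are made up for illustration.

```python
from collections import Counter


def poststratification_weights(predicted_groups, reference_counts):
    """Weight per group = reference population count / sampled user count.

    predicted_groups: one predicted group label per sampled user.
    reference_counts: group label -> population count (e.g. from a census).
    Groups missing from the reference are skipped.
    """
    sample_counts = Counter(predicted_groups)
    return {
        group: reference_counts[group] / n
        for group, n in sample_counts.items()
        if group in reference_counts
    }


# Hypothetical sample: 10 users, skewed toward the youngest group.
predicted = ["18-29"] * 6 + ["30-49"] * 3 + ["50+"] * 1
# Hypothetical reference population counts (not real census figures).
reference = {"18-29": 2000, "30-49": 3000, "50+": 5000}

weights = poststratification_weights(predicted, reference)
for group in sorted(weights):
    print(group, round(weights[group], 1))

# After weighting, each group's total matches the reference population.
weighted_totals = {g: weights[g] * predicted.count(g) for g in weights}
```

In this toy example the single user aged 50 or older stands in for 5,000 people, which shows why weights for under-represented groups can become large and why group-level predictions need to be accurate.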
CSA plans to offer workshops starting this fall to promote the use of Big Data in social science research, and to package the Twitter data and computing capacity into a product for collaboration with Penn State researchers.
Additional researchers participating in this project include Daniel Kifer, associate professor of computer science; Jennifer Van Hook, professor of sociology and demography; and Lee Giles, professor of information sciences and technology and director of the Intelligent Information Systems Research Laboratory, all at Penn State; as well as Xiaopeng Li, assistant professor of civil engineering at the University of South Florida, and Tse-Chuan Yang, associate professor of sociology at the State University of New York at Albany.