UNIVERSITY PARK, Pa. — Seth Wilcox came to Penn State with three years of Java programming already under his belt. Though he doesn’t plan to pursue a career in the field, he did enroll in a first-semester introductory computer programming course in the College of Information Sciences and Technology (IST). Now, thanks to a $3,500 Erickson Discovery Grant, Wilcox can keep his love for computer programming alive through a partnership with an IST researcher that he formed in that class.
Wilcox, a first-year student majoring in secondary education with a focus in earth and space sciences, will spend his summer coding a program that will extract keyphrases from scientific publications and determine their usefulness through crowdsourcing. His code will analyze the text of more than 10 million academic documents and extract keyphrases, which are short combinations of words that describe a publication’s content.
“Currently, keyphrases are selected by the publication’s author, and there is very little ground truth we can use to evaluate them,” said Jian Wu, assistant teaching professor of IST and Wilcox’s faculty mentor on the project. “There’s no consensus that automatically extracted keyphrases represent information that is useful or meaningful to the reader.”
Their project, “Large Scale Evaluation Keyphrase Extraction Models Through Crowdsourcing,” will run the program on every document hosted on CiteSeerX, a digital library search engine. The program will take roughly a week to comb through the complete library, automatically extracting keyphrases from each document using four different models. The researchers will then display the keyphrases for each publication and ask users to cast a yes-or-no vote indicating whether a particular phrase is appropriate and useful for the document.
“Having the author determine these keyphrases leaves them open to bias and interpretation, and having an individual person analyze these results obviously takes a lot of time,” said Wilcox. “Crowdsourcing allows the user to label the data for you, letting a lot of people do the work in a shorter period of time.”
The initial round of voting will last four to six weeks. Once enough results are received, the researchers will analyze which terms were the most positively and negatively received. Finally, they will assess which keyphrases were selected by which extraction model to determine the effectiveness of each model — the higher the percentage of positive votes for a keyphrase, the more effective the model that extracted it. Users won’t know which models produced which keyphrases during the voting process.
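The comparison described above can be illustrated with a minimal Python sketch. It assumes each vote is stored as a (model, keyphrase, yes/no) record and that a model's score is its pooled share of "yes" votes; the record layout, the model names, and the pooling are all hypothetical and are not taken from the researchers' actual code.

```python
from collections import defaultdict


def positive_vote_share(votes):
    """Return each model's share of 'yes' votes across the keyphrases it extracted.

    `votes` is an iterable of (model_name, keyphrase, is_useful) tuples.
    This record layout is an assumption made for illustration only.
    """
    yes = defaultdict(int)
    total = defaultdict(int)
    for model, _keyphrase, is_useful in votes:
        total[model] += 1
        if is_useful:
            yes[model] += 1
    # Pool all votes per model; a higher share suggests a more effective model.
    return {model: yes[model] / total[model] for model in total}


# Made-up sample votes; model names and keyphrases are placeholders.
sample_votes = [
    ("model_a", "neural network", True),
    ("model_a", "in this paper", False),
    ("model_b", "keyphrase extraction", True),
    ("model_b", "digital library", True),
]

shares = positive_vote_share(sample_votes)
for model, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: {share:.0%} positive")
```

Because voters do not see which model produced a phrase, the model name would only be joined to each vote at analysis time, as in the tuples above.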
“The goal is to get objective comparisons of four state-of-the-art models for keyphrase extraction that can be used for future scientific papers,” said Wilcox.
Wu, the first faculty member in the College of IST to receive an Erickson Discovery Grant, is pursuing the research through a collaboration with Cornelia Caragea, associate professor of computer science and the Lloyd T. Smith Creativity in Engineering Chair at Kansas State University. He is eager not only to review the results but also to showcase the research opportunities available to undergraduate students.
“There has been no previous effort to complete such a large-scale evaluation with crowdsourcing on keyphrases, and we hope to present our findings at the Innovative Applications of Artificial Intelligence Conference this December,” Wu said. “I personally hope to engage more undergraduate students in the project to show them that it’s something that they can do and do well.”
Wilcox will monitor the extraction and voting process throughout the summer for major issues, with Wu serving as a consultant on challenges that arise and providing context for the various project milestones. Because of the grant, Wilcox can fully commit himself to the project.
“The grant pays for a few technical items we need like additional storage, but it mainly helps me cover living expenses so I can remain in State College over the summer to conduct this research,” said Wilcox.
“I started this project last fall and wanted to continue doing the research. We established project goals and I wanted to see that through,” he concluded. “Thanks to this grant, I can.”
The Erickson Discovery Grant program, coordinated by Penn State’s Office of Undergraduate Education, provides each student recipient with $3,500. The grant lets recipients immerse themselves in their projects by covering living expenses and project costs, including supplies, materials, books, specialized software, and travel for the purpose of data collection. To learn more about the program, visit the Erickson Discovery Grant webpage.