Tuesday, September 11, 2007

Trolling the arXiv for plagiarism

http://arstechnica.com/news.ars/post/20061206-8364.html

By John Timmer | Published: December 06, 2006 - 09:16AM CT

In a subscription-only report on an upcoming conference presentation, Nature spills the beans on what may be our best handle yet on plagiarism in the world of academic science. Most research into this area has been limited by the inaccessibility of many of the peer-reviewed journals, which require subscription access. As such, it's hard to build a global picture of the literature. In physics and astronomy, however, many publications appear in the arXiv database, which typically hosts them in advance of publication.

Researchers created an arXiv crawler and had it parse each paper into seven-word pieces. After throwing out common phrases (such as acknowledgments of support and affiliation), the program looked for pairs of papers that shared a high number of text fragments. Plagiarism was defined as cases with large amounts of shared text but no shared authors. Here, the news appears good: out of over 280,000 publications scanned, only 677 possible cases were identified, roughly 0.24 percent. A detailed examination of 20 of these showed that just three involved serious, paper-wide duplication. The rest included a few minor mistakes and a majority in which individual sections of a manuscript appeared problematic but the manuscript as a whole was okay. Because arXiv manuscripts are often not in their final form, the real rate of problems may be even lower than that seen by the researchers, as more citations may be added later in the preparation process.
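The fragment comparison described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not the researchers' actual program: the function names, the tokenization rule, and the way common phrases are supplied (as a set of fragments to ignore) are all assumptions made for the example.

```python
import re

def shingles(text, k=7):
    """Return the set of k-word fragments in text.

    Assumption: words are lowercased and punctuation is dropped;
    the real crawler's tokenization is not described in the article.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def shared_fragments(text_a, text_b, common=frozenset(), k=7):
    """Fragments present in both texts, minus known boilerplate.

    `common` is a hypothetical set of fragments to ignore, standing in
    for phrases such as acknowledgments of support and affiliation.
    """
    return (shingles(text_a, k) & shingles(text_b, k)) - common

def overlap_fraction(text_a, text_b, common=frozenset(), k=7):
    """Shared fragments as a fraction of the smaller paper's fragments."""
    a = shingles(text_a, k) - common
    b = shingles(text_b, k) - common
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0
```

A pair of papers would then be flagged when overlap_fraction exceeds some cutoff. The article does not say what cutoff the researchers used, so any value chosen here would be a guess.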

Duplication of text was defined as cases where similar overlaps were found, but at least one author appeared on both of the matching manuscripts. Here, the rate was much higher: approximately 10 percent. But the news wasn't as bad as it may seem. arXiv contains conference abstracts as well, and it's generally considered ethical for authors to use similar or identical language in these presentations and in later publications on the same work. The researchers performing the study indicated that the majority of these cases were of this sort.
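The distinction between the two categories comes down to whether the author lists intersect. A minimal sketch, assuming author names can be compared after simple normalization and that a text-overlap score between 0 and 1 has already been computed for the pair; the function name, the labels, and the 0.2 threshold are hypothetical, since the article gives no cutoff:

```python
def classify_pair(authors_a, authors_b, overlap, threshold=0.2):
    """Label a pair of manuscripts by text overlap and shared authorship.

    `overlap` is a precomputed fraction of shared text (0.0 to 1.0).
    `threshold` is an illustrative value, not one taken from the study.
    """
    if overlap < threshold:
        return "no flag"
    # Assumption: names match after trimming whitespace and lowercasing.
    names_a = {name.strip().lower() for name in authors_a}
    names_b = {name.strip().lower() for name in authors_b}
    if names_a & names_b:
        return "self-duplication"
    return "possible plagiarism"
```

Real author lists are messier than this (initials, transliterations, name changes), so exact matching would likely miss some shared authors and mislabel those pairs as possible plagiarism.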

The researchers suggest that making a similar screen part of arXiv's process for accepting manuscripts would provide a valuable resource to the fields covered by the database. It would, however, also highlight other limitations of the current situation. For one, there's no indication that this sort of screening would be made available to scientific journals. There's also a complete absence of resources such as arXiv for many other scientific fields, including the two that probably produce the majority of scientific publications: biology and medicine. Ironically, this is one situation where science and high-tech may lag behind old-school areas such as literature. As Google scans the world's books, it's becoming increasingly easy to spot books with passages lifted from earlier sources.