In a project that could unlock the world’s research papers for easier computerized analysis, an American technologist has released online a gigantic index of the words and short phrases contained in more than 100 million journal articles — including many paywalled papers. The catalogue, which was released on 7 October and is free to use, holds tables of more than 355 billion words and sentence fragments listed next to the articles in which they appear. It is an effort to help scientists use software to glean insights from published work even if they have no legal access to the underlying papers, says its creator, Carl Malamud. He released the files under the auspices of Public Resource, a non-profit corporation in Sebastopol, California, that he founded.
This is an interesting development. More on that later.
The index can be accessed here.
The General Index consists of three tables derived from 107,233,728 journal articles. The first is a table of n-grams, ranging from unigrams to 5-grams, extracted using spaCy; each of its 355,279,820,087 rows pairs an n-gram with a journal article id. A second table, built with the YAKE keyword extractor, consists of 19,740,906,314 rows, each pairing a keyword with an article id. A third table associates each article id with metadata.
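To make the n-gram table concrete, here is a minimal sketch of what extracting unigrams through 5-grams from a document looks like. This uses naive whitespace tokenization for brevity rather than spaCy's tokenizer, and the sample sentence and `article_id` are made up for illustration; in the actual index each row couples an n-gram with the id of the article it came from.

```python
def ngrams(tokens, n_max=5):
    """Yield every n-gram from unigrams up to n_max-grams."""
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i : i + n])

# Hypothetical example: one article's text reduced to (n-gram, article_id) rows.
article_id = "doi:10.0000/example"  # placeholder id, not a real DOI
tokens = "the index covers over one hundred million articles".split()
rows = [(gram, article_id) for gram in ngrams(tokens)]

print(len(rows))      # 8 unigrams + 7 bigrams + ... + 4 five-grams = 30
print(rows[0])        # ('the', 'doi:10.0000/example')
```

At the scale of 107 million articles this simple enumeration yields the hundreds of billions of rows the index reports.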
The only legal question, Carroll adds, is whether Malamud’s obtaining and copying of the underlying papers was done without breaching publishers’ terms. Malamud says that he did have to get copies of the 107 million articles referenced in the index to create it; he declined to say how, but emphasizes that researchers will not have access to the full texts of the papers, which are stored in a secured, undisclosed location in the United States.
They haven't declared their source of papers. Is it Sci-Hub? I have no idea.