The DOES Project on DOcument-element Extraction and Search
This webpage describes the DOES research project on document element extraction and search and the results from this project.
PI: Prasenjit Mitra
Post-doctoral Scholars: Dr. Cornelia Caragea, Dr. Lior Rokach
Ph.D. Students: Sumit Bhatia, Prakhar Biyani, Jing Fang
Undergraduate Researchers: Sunil Jain, Sujay Patel
Collaborators: Dr. Qi He, Prof. C. Lee Giles
The project aims to investigate the following problems:
- Extract information from a) tables, b) images, and c)
other document elements (such as algorithms) in the documents of a
digital library (usually PDF).
- Enable end-users to search for document elements efficiently without having to open and manually peruse full documents.
- Automatically identify as much of the semantics of the extracted data as possible.
- Automatically recommend citations for an input article, along with the locations in the text where they should be cited.
The following are the major research challenges that we are addressing:
- Text extractors are noisy. For example, none of the existing
PDF-to-text conversion tools, such as PDFBox and TET, is error-free.
Overcoming errors that creep in at this first stage is hard.
- Tables in documents have varying layouts with multiple levels of
hierarchical column headers, nested or fused cells, etc. Identifying
the layout automatically and extracting information is a challenge.
- Identifying the semantics of table columns automatically is a hard
problem. Finding passages in the document text that are relevant to
the document elements can help, but finding them accurately is a
challenge. Initial investigations have resulted in low accuracy
with respect to representing the semantics of table columns.
- Identifying which citations are
relevant given arbitrary text is a challenge. Finding what to
cite and where to cite it is hard. For references that can be
classified as "depth" citations, all of the relevant references are
required: a paper cites all prior work on the exact problem it
addresses. However, if there is a good survey, writers often cite
the survey and then cite only papers published after the survey
appeared. For references that are "breadth" citations, e.g.,
algorithms books, citing only one book is enough. Hence, when
recommending citations, an automatic recommendation engine should
first identify what type of citation is needed and then choose
what to cite. The problem, of course, is more complex than this
issue illustrates.
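To make the table-layout challenge above concrete, the following is a minimal Python sketch of one small sub-step: flattening hierarchical column headers with fused (spanning) cells into one label per column. It is an illustration only, not the method used in this project. The function name `flatten_headers`, the separator, and the convention that `None` marks the continuation of a fused cell are all assumptions made for this example, and the sketch presumes the header rows have already been located and aligned, which is itself the hard part.

```python
from typing import List, Optional


def flatten_headers(header_rows: List[List[Optional[str]]],
                    sep: str = " / ") -> List[str]:
    """Combine multi-level header rows into one label per column.

    Assumed convention (hypothetical, for this sketch only): a None cell
    continues the fused cell to its left; an empty string is a blank cell.
    """
    if not header_rows:
        return []
    width = len(header_rows[0])

    # Step 1: spread each fused cell across the columns it covers.
    filled = []
    for row in header_rows:
        if len(row) != width:
            raise ValueError("all header rows must have the same width")
        current = ""
        spread = []
        for cell in row:
            if cell is not None:
                current = cell.strip()
            spread.append(current)
        filled.append(spread)

    # Step 2: join the non-empty labels of each column from top to bottom.
    labels = []
    for col in range(width):
        parts = [row[col] for row in filled if row[col]]
        labels.append(sep.join(parts))
    return labels


# Toy two-level header: "Precision" and "Recall" each span two columns.
header = [
    ["Method", "Precision", None, "Recall", None],
    ["", "Train", "Test", "Train", "Test"],
]
print(flatten_headers(header))
```

Flattened labels of this kind give each data cell a single, self-describing column name, which is one possible starting point for indexing table contents for search.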
- We observed that heuristic solutions can detect table
boundaries with reasonable accuracy. Based on these findings, we have
implemented and fine-tuned a table extraction and search utility. We
have improved the accuracy of identifying and extracting hierarchical
headers. The source code has been released (see below for link).
- The ability to search for tables has been
integrated into the CiteSeerX system. The design and implementation of
the ability to search for figures and algorithms has been completed
in-house. Tables, figures, and algorithms have been extracted
from documents in the CiteSeerX digital library and end-users can
search for tables, figures, or algorithms directly and obtain relevant
articles where document elements of interest have been published.
- We have proposed a classification of algorithm influence types in
order to capture the influence of prior work on newer published
algorithms. We are currently working on automatically classifying
algorithm pairs into these classes. After this work is complete,
we can measure the influence of one work on another.
- We have designed and implemented citation recommendation algorithms
that recommend citations and suggest locations in text where the
citations should be inserted. We demonstrated that they can recommend
citations with reasonable efficacy. We have released this
utility, RefSeer, as part of the CiteSeerX system.
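As an illustration of the kind of heuristic mentioned in the first result above, the sketch below marks candidate table regions in text extracted from a page. It is not the TableSeer implementation. It assumes a simple, hypothetical rule: a line looks like a table row if runs of two or more spaces split it into at least `min_cols` chunks, and a table is a run of at least `min_rows` such lines. The function names and thresholds are invented for this example, and the sample page contains made-up toy values.

```python
import re
from typing import List, Tuple


def count_columns(line: str) -> int:
    """Count text chunks separated by runs of two or more whitespace chars."""
    stripped = line.strip()
    if not stripped:
        return 0
    return len(re.split(r"\s{2,}", stripped))


def find_table_regions(lines: List[str],
                       min_cols: int = 3,
                       min_rows: int = 2) -> List[Tuple[int, int]]:
    """Return (first_line, last_line) index pairs of candidate tables."""
    regions = []
    start = None  # index where the current run of row-like lines began
    for i, line in enumerate(lines):
        if count_columns(line) >= min_cols:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_rows:
                regions.append((start, i - 1))
            start = None
    # Close a run that extends to the end of the page.
    if start is not None and len(lines) - start >= min_rows:
        regions.append((start, len(lines) - 1))
    return regions


# Toy page text; the values are placeholders, not project results.
page = [
    "Table 1: A caption line written as ordinary prose.",
    "Tool      Precision    Recall",
    "A         0.91         0.88",
    "B         0.85         0.90",
    "",
    "As the table shows, the two tools behave similarly.",
]
print(find_table_regions(page))
```

A rule this simple breaks down on noisy extractor output, for example when column gaps collapse to single spaces, which is one reason the text-extraction errors listed among the challenges matter for every later stage.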
- Sumit Bhatia, Prasenjit Mitra, C. Lee Giles: Finding algorithms in scientific articles. WWW 2010: 1061-1062.
- He, Q., Pei, J., Kifer, D., Mitra, P., and Giles, C.L., "Context-aware
Citation Recommendation", International World Wide Web Conference (WWW
2010): 421-430.
- Saurabh Kataria, Prasenjit Mitra and Sumit Bhatia. Utilizing Context in Generative Bayesian Models for Linked Corpus. In AAAI 2010.
- Sumit Bhatia, Suppawong Tuarob, Prasenjit Mitra and C. Lee Giles. An Algorithm Search Engine For Software Developers. In SUITE '11: Proceedings of 2011 ICSE Workshop on Search-driven Development: Users, Infrastructure, Tools and Evaluation, 2011.
- Sumit Bhatia and Prasenjit Mitra. Summarizing Figures, Tables and Algorithms in Scientific Publications to Augment Search Results. Technical Report, College of Information Sciences and Technology, The Pennsylvania State University, June 2010.
- Sumit Bhatia and Prasenjit Mitra. Synopsis Generation for Specialized Document-Element Search Engines. In Workshop on Web Search Result Summarization and Presentation, Co-Located with WWW2009, 2009.
- Saurabh Kataria, P. Mitra, C. Caragea, C. Lee Giles. Context Sensitive Topic Models for Author Influence. In 22nd International Joint Conference on Artificial Intelligence (IJCAI-2011), Barcelona, Spain, July 16-22, 2011.
- Jing Fang, Prasenjit Mitra, Zhi Tang, and C. Lee Giles. Table Header Detection and Classification. In 26th Conference on Artificial Intelligence, AAAI'12, AAAI Press, 2012.
- Suppawong Tuarob, Prasenjit Mitra, and C. Lee Giles. Improving algorithm search using the algorithm co-citation network. In Proceedings of the 12th ACM/IEEE-CS
Joint Conference on Digital Libraries, JCDL '12, Eds. Karim B. Boughida, Barrie Howard, Michael L. Nelson, Herbert Van de
Sompel, and Ingeborg Solvberg, pp. 277-280, 2012.
- Sumit Bhatia and Prasenjit Mitra. Summarizing Figures, Tables and Algorithms in Scientific Publications to Augment Search Results. ACM Transactions on Information Systems (TOIS), 30(1) 2012.
Software Publication & Release
The TableSeer software for extracting and searching for
table data from PDF documents has been released on SourceForge. To
download, click here. The RefSeer utility has been released for use by the community. See here.
- Training a graduate student in state-of-the-art
technologies in document analysis and processing, information
retrieval, semantic web, and database technologies.
- The released software is useful for natural scientists who want to
extract data about experiments of interest that were published
in scholarly articles (available as PDF documents).
- Developed and taught two graduate courses that had resident and
online students enrolled: IST 552: Database Systems and Knowledge
Management, and IST 558: Data Mining II.
This material is based upon work supported by the National Science Foundation under Grant No. 0845487.
Any opinions, findings, and conclusions or recommendations expressed in
this material are those of the author and do not necessarily reflect
the views of the National Science Foundation.