The DOES Project on DOcument-element Extraction and Search

This webpage describes the DOES research project on document-element extraction and search, and reports its results.

People

PI: Prasenjit Mitra

Post-doctoral Scholars: Dr. Cornelia Caragea, Dr. Lior Rokach

Ph.D. Students: Sumit Bhatia, Prakhar Biyani, Jing Fang

Undergraduate Researchers: Sunil Jain, Sujay Patel

Collaborators: Dr. Qi He, Prof. C. Lee Giles

Project Goals

The project aims to investigate the following problems:

  1. Extract information from a) tables, b) images, and c) other document elements (such as algorithms) in documents from a digital library (usually PDF).
  2. Enable end-users to search for document elements efficiently without having to open and manually peruse full documents.
  3. Identify as much of the semantics of the extracted data as possible automatically.
  4. Automatically recommend citations for an input article, along with the locations in the text where they should be cited.

Research Challenges

The following are the major research challenges that we are addressing:

  1. Text extractors are noisy. For example, none of the existing PDF-to-text conversion tools, such as PDFBox and TET, is error-free. Overcoming errors that creep in at this first stage is hard.
  2. Tables in documents have varying layouts with multiple levels of hierarchical column headers, nested or fused cells, etc. Identifying the layout automatically and extracting information is a challenge.
  3. Identifying the semantics of table columns automatically is a hard problem. Information in the document text that is relevant to a document element can help, but finding it accurately is a challenge. Initial investigations have yielded low accuracy in representing the semantics of table columns automatically.
  4. Identifying which citations are relevant given arbitrary text is a challenge. Finding what to cite and where to cite it is hard. For references that can be classified as “depth” citations, all the relevant references are required: a paper cites all related work on the exact problem. However, if a good survey exists, writers often cite the survey and then only the papers published after it. For references that are “breadth” citations, e.g., algorithms books, citing a single book is enough. Hence, an automatic recommendation engine should first identify the type of citation and then recommend accordingly. The problem, of course, is more complex than this example illustrates.
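
The depth/breadth distinction in the last challenge can be made concrete with a small sketch. It is only an illustration of the rule as described above, not the project's recommendation algorithm: the Candidate record, its score field, and the recommend function are all hypothetical names, and the citation type is assumed to be known already.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A hypothetical candidate reference for one citation context."""
    title: str
    year: int
    score: float          # assumed relevance score; higher is better
    is_survey: bool = False


def recommend(candidates, kind):
    """Select references for a context whose citation type is already known.

    kind == "breadth": one representative reference is enough.
    kind == "depth":   cover the related work; if a survey exists, cite the
                       most recent survey plus the papers published after it.
    """
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    if not ranked:
        return []
    if kind == "breadth":
        return ranked[:1]
    if kind != "depth":
        raise ValueError(f"unknown citation type: {kind!r}")

    surveys = [c for c in ranked if c.is_survey]
    if not surveys:
        return ranked  # no survey available: cite all related work
    survey = max(surveys, key=lambda c: c.year)
    newer = [c for c in ranked if not c.is_survey and c.year > survey.year]
    return [survey] + newer


pool = [
    Candidate("Paper A", 2005, 0.9),
    Candidate("Survey S", 2008, 0.8, is_survey=True),
    Candidate("Paper B", 2010, 0.7),
    Candidate("Paper C", 2007, 0.6),
]
depth_titles = [c.title for c in recommend(pool, "depth")]
breadth_titles = [c.title for c in recommend(pool, "breadth")]
```

In this toy pool, a depth context selects the survey and the one paper published after it, while a breadth context selects only the top-ranked candidate. Deciding the citation type automatically is the hard part and is not attempted here.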

Current Results

  1. We observed that heuristic solutions can detect table boundaries with reasonable accuracy. Based on these findings, we have implemented and fine-tuned a table extraction and search utility. We have improved the accuracy of identifying and extracting hierarchical headers.  The source code has been released (see below for link).
  2. Table search has been integrated into the CiteSeerX system. Search for figures and algorithms has been designed and implemented in-house. Tables, figures, and algorithms have been extracted from documents in the CiteSeerX digital library, and end-users can search for them directly and obtain the relevant articles in which the document elements of interest were published.
  3. We have proposed a classification of algorithm influence types in order to capture the influence of prior work on newer published algorithms.  We are currently working on automatically classifying algorithm pairs into these classes.  After this work is complete, we can measure the influence of one work on another.
  4. We have designed and implemented citation recommendation algorithms that recommend citations and suggest locations in text where the citations should be inserted.  We demonstrated that they can recommend citations with reasonable efficacy.  We have released this utility, RefSeer, as part of the CiteSeerX system.
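
As an illustration of the kind of heuristic mentioned in the first result, the sketch below marks table boundaries in already-extracted text. It is not the TableSeer implementation: the rule (a row is "table-like" if wide whitespace gaps split it into at least three cells), the thresholds, and the function names are all assumptions chosen for the example.

```python
import re

# Assumption: columns in extracted text are separated by two or more spaces.
CELL_GAP = re.compile(r"\s{2,}")


def is_table_like(line, min_cells=3):
    """Return True if wide gaps split the line into at least min_cells cells."""
    cells = [c for c in CELL_GAP.split(line.strip()) if c]
    return len(cells) >= min_cells


def find_table_regions(lines, min_rows=2):
    """Return (start, end) index pairs, inclusive, of runs of table-like lines."""
    regions = []
    start = None
    for i, line in enumerate(lines):
        if is_table_like(line):
            if start is None:
                start = i  # a candidate table begins here
        else:
            if start is not None and i - start >= min_rows:
                regions.append((start, i - 1))
            start = None
    # Close a region that runs to the end of the input.
    if start is not None and len(lines) - start >= min_rows:
        regions.append((start, len(lines) - 1))
    return regions


page = [
    "Results are shown below.",
    "Method    Precision    Recall",
    "A         0.91         0.88",
    "B         0.85         0.90",
    "As the table shows, A wins.",
]
regions = find_table_regions(page)
```

For the sample page, the three aligned rows form one region spanning indices 1 to 3. A rule this simple breaks on noisy extractor output and on the nested or fused cells described under Research Challenges, which is why hierarchical header detection needs more than whitespace cues.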

Publications

  1. Sumit Bhatia, Prasenjit Mitra, C. Lee Giles: Finding algorithms in scientific articles. WWW 2010: 1061-1062. [PDF]
  2. He, Q., Pei, J., Kifer, D., Mitra, P., and Giles, C.L., "Context-aware Citation Recommendation", International World Wide Web Conference (WWW 2010): 421-430. [PDF]
  3. Saurabh Kataria, Prasenjit Mitra and Sumit Bhatia. Utilizing Context in Generative Bayesian Models for Linked Corpus. In AAAI 2010.
  4. Sumit Bhatia, Suppawong Tuarob, Prasenjit Mitra and C. Lee Giles. An Algorithm Search Engine For Software Developers. In SUITE '11: Proceedings of 2011 ICSE Workshop on Search-driven Development: Users, Infrastructure, Tools and Evaluation, 2011.
  5. Sumit Bhatia and Prasenjit Mitra. Summarizing Figures, Tables and Algorithms in Scientific Publications to Augment Search Results. Technical Report, College of Information Sciences and Technology, The Pennsylvania State University, June 2010.
  6. Sumit Bhatia and Prasenjit Mitra. Synopsis Generation for Specialized Document-Element Search Engines. In Workshop on Web Search Result Summarization and Presentation, Co-Located with WWW2009, 2009.
  7. Saurabh Kataria, P. Mitra, C. Caragea, C. Lee Giles. Context Sensitive Topic Models for Author Influence. In 22nd International Joint Conference on Artificial Intelligence (IJCAI-2011), Barcelona, Spain, July 16-22, 2011.
  8. Jing Fang, Prasenjit Mitra, Zhi Tang, and C. Lee Giles. Table Header Detection and Classification. In 26th Conference on Artificial Intelligence, AAAI'12, AAAI Press, 2012.
  9. Suppawong Tuarob, Prasenjit Mitra, and C. Lee Giles. Improving algorithm search using the algorithm co-citation network. In Proceedings of the 12th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '12, Eds. Karim B. Boughida, Barrie Howard, Michael L. Nelson, Herbert Van de Sompel, and Ingeborg Solvberg, pp. 277-280, 2012.
  10. Sumit Bhatia and Prasenjit Mitra. Summarizing Figures, Tables and Algorithms in Scientific Publications to Augment Search Results. ACM Transactions on Information Systems (TOIS), 30(1) 2012.

Software Publication & Release

The TableSeer software for extracting and searching table data from PDF documents has been released on SourceForge. To download, click here.  The RefSeer utility has been released for use by the community.  See here.

Broader Impacts

  1. Training a graduate student in state-of-the-art technologies in document analysis and processing, information retrieval, the semantic web, and databases.
  2. TableSeer is useful for natural scientists who want to extract data on experiments of interest that were published in scholarly articles (available as PDF documents).
  3. Developed and taught two graduate courses with both resident and online students enrolled: IST 552 (Database Systems and Knowledge Management) and IST 558 (Data Mining II).

Acknowledgement

This material is based upon work supported by the National Science Foundation under Grant No. 0845487.  Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the National Science Foundation.