The DOES Project on DOcument-element Extraction and Search

This webpage describes the DOES research project on document-element extraction and search, along with its results.

People

PI: Prasenjit Mitra

Post-doctoral Scholars: Dr. Cornelia Caragea, Dr. Lior Rokach

Ph.D. Students: Sumit Bhatia, Prakhar Biyani, Jing Fang, Sujatha Das G., Sagnik Ray Choudhury, Suppawong Tuarob, Dayu Yuan

Undergraduate Researchers: Sunil Jain, Sujay Patel, Denise Bartolome, BK Huang

Collaborators: Dr. Qi He, Prof. C. Lee Giles

Project Goals

The project aims to investigate the following problems:

  1. Extract information from a) tables, b) images, and c) other document elements (such as algorithms) in documents from a digital library (usually PDF).
  2. Enable end-users to search for document elements efficiently without having to open and manually peruse full documents.
  3. Identify as much of the semantics of the extracted data as possible automatically.
  4. For an input article, automatically recommend citations and the locations in the text where they should be cited.
  5. Find and download researcher homepages and extract their metadata.
  6. Index graph-structured data and enable efficient subgraph and supergraph queries.

Research Challenges

The following are the major research challenges that we are addressing:

  1. Text extractors are noisy. For example, none of the existing PDF-to-text conversion tools, such as PDFBox and TET, is error-free. Overcoming errors that creep in at this first stage is hard.
  2. Tables in documents have varying layouts with multiple levels of hierarchical column headers, nested or fused cells, etc. Identifying the layout automatically and extracting information is a challenge.
  3. Identifying the semantics of table columns automatically is a hard problem. Information in the document text that is relevant to a document element can help, but finding it accurately is a challenge. Initial investigations have yielded low accuracy in representing the semantics of table columns automatically.
  4. Identifying which citations are relevant given arbitrary text is a challenge. Finding what to cite and where to cite it is hard. For references that can be classified as “depth” citations, all relevant references are required: a paper cites all prior work on the exact problem. However, if a good survey exists, writers often cite the survey and then only the papers published after it. For references that are “breadth” citations, e.g., algorithms books, citing a single book is enough. Hence, an automatic recommendation engine should first identify the type of citation needed and then recommend accordingly. The problem, of course, is more complex than this example illustrates.
  5. Automatically finding researcher webpages is a challenging problem. Identifying metadata related to researchers is even harder. We seek to investigate whether we can classify researcher webpages, which are a significant minority class among all webpages, using semi-supervised machine learning algorithms such as co-training. We also seek to apply machine learning algorithms to extract metadata with high accuracy.
  6. Standard database indexes are not efficient for querying graph-structured data. We seek to design an efficient index for such data, together with algorithms that update it incrementally, so that it supports both subgraph and supergraph queries.
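
To make the two query types in challenge 6 concrete, the following is a minimal sketch of what subgraph and supergraph queries return. It is a naive brute-force baseline, not the project's index; the graph representation and the names `make_graph`, `is_subgraph`, `subgraph_query`, and `supergraph_query` are assumptions made for illustration only.

```python
from itertools import permutations

def make_graph(labels, edges):
    """Build a small undirected labeled graph: (node -> label, set of edges)."""
    return dict(labels), {frozenset(e) for e in edges}

def is_subgraph(small, large):
    """Return True if `small` is subgraph-isomorphic to `large`.

    Brute force over injective node mappings; only suitable for tiny graphs.
    """
    s_labels, s_edges = small
    l_labels, l_edges = large
    s_nodes = list(s_labels)
    for image in permutations(l_labels, len(s_nodes)):
        mapping = dict(zip(s_nodes, image))
        # Node labels must be preserved by the mapping.
        if any(s_labels[n] != l_labels[mapping[n]] for n in s_nodes):
            continue
        # Every edge of the small graph must map onto an edge of the large one.
        if all(frozenset(mapping[n] for n in e) in l_edges for e in s_edges):
            return True
    return False

def subgraph_query(db, q):
    """Names of database graphs that contain the query graph."""
    return [name for name, g in db.items() if is_subgraph(q, g)]

def supergraph_query(db, q):
    """Names of database graphs that are contained in the query graph."""
    return [name for name, g in db.items() if is_subgraph(g, q)]

# Hypothetical toy database of three labeled graphs.
edge = make_graph({0: "C", 1: "C"}, [(0, 1)])
path = make_graph({0: "C", 1: "C", 2: "O"}, [(0, 1), (1, 2)])
triangle = make_graph({0: "C", 1: "C", 2: "O"}, [(0, 1), (1, 2), (0, 2)])
db = {"edge": edge, "path": path, "triangle": triangle}

print(subgraph_query(db, path))    # ['path', 'triangle']
print(supergraph_query(db, path))  # ['edge', 'path']
```

This baseline runs an expensive isomorphism test against every graph in the database for every query; a graph index is meant to rule out most candidates before any such test is run.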

Current Results

  1. [Table Data Extraction] We observed that heuristic solutions can detect table boundaries with reasonable accuracy. Based on these findings, we have implemented and fine-tuned a table extraction and search utility. We have improved the accuracy of identifying and extracting hierarchical headers. The source code has been released (see below for link).
  2. [Table Search] The ability to search for tables has been integrated into the CiteSeerX system. The design and implementation of the ability to search for figures and algorithms has been completed in-house. Tables, figures, and algorithms have been extracted from documents in the CiteSeerX digital library and end-users can search for tables, figures, or algorithms directly and obtain relevant articles where document elements of interest have been published.
  3. [Figures] We have proposed algorithms to detect metadata (captions and mentions) related to figures and enable figure search using this metadata. We are currently working on algorithms to extract data from 2-D plots, histograms, pie-charts, etc.
  4. [Algorithms] We have proposed algorithms to detect pseudo-codes in PDF documents and identify their start and end lines. We have proposed a classification of algorithm influence types in order to capture the influence of prior work on newer published algorithms. We are currently working on automatically classifying algorithm pairs into these classes. After this work is complete, we can measure the influence of one work on another.
  5. [Citations] We have designed and implemented citation recommendation algorithms that recommend citations and suggest locations in text where the citations should be inserted. We demonstrated that they can recommend citations with reasonable efficacy. We have released this utility, RefSeer, as part of the CiteSeerX system.
  6. [Homepages] We have designed and implemented algorithms to identify researcher homepages and to extract metadata from them.
  7. [Graph Databases] We have designed and implemented Lindex, a lattice-based index for graph databases that enables both subgraph and supergraph querying and demonstrated that it is more efficient than existing graph databases.
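
As an illustration of the figure metadata mentioned in result 3 (captions and mentions), the sketch below collects both from lines of extracted text using regular expressions. It is a simplified stand-in, not the project's published algorithm; the patterns, the function name `figure_metadata`, and the assumption that each caption sits on its own line starting with "Figure N:" or "Fig. N." are all hypothetical.

```python
import re

# Assumed caption form: a line that starts with "Figure 3:" or "Fig. 3." plus text.
CAPTION_RE = re.compile(r"^(?:Figure|Fig\.)\s*(\d+)\s*[:.]\s*(.+)$", re.IGNORECASE)
# Assumed mention form: "Figure 3" or "Fig. 3" anywhere in running text.
MENTION_RE = re.compile(r"\b(?:Figure|Fig\.)\s*(\d+)\b", re.IGNORECASE)

def figure_metadata(lines):
    """Map figure number -> {"caption": str or None, "mentions": [lines]}."""
    meta = {}
    for line in lines:
        text = line.strip()
        caption = CAPTION_RE.match(text)
        if caption:
            entry = meta.setdefault(int(caption.group(1)),
                                    {"caption": None, "mentions": []})
            entry["caption"] = caption.group(2)
            continue
        # Not a caption line: record it as a mention of each figure it names.
        for match in MENTION_RE.finditer(text):
            entry = meta.setdefault(int(match.group(1)),
                                    {"caption": None, "mentions": []})
            entry["mentions"].append(text)
    return meta

sample = [
    "As shown in Figure 1, accuracy improves with more training data.",
    "Figure 1: Accuracy versus training-set size.",
    "Fig. 2 and Figure 1 use the same dataset.",
]
meta = figure_metadata(sample)
print(meta[1]["caption"])        # Accuracy versus training-set size.
print(len(meta[1]["mentions"]))  # 2
```

The resulting captions and mention sentences could then be indexed as the searchable text for each figure. Real PDF-extracted text is noisier than this sample (see research challenge 1), so fixed patterns like these would miss many cases.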

Publications

  1. Sumit Bhatia, Prasenjit Mitra, C. Lee Giles: Finding algorithms in scientific articles. WWW 2010: 1061-1062. [PDF]
  2. He Q., Pei, J., Kifer, D., Mitra, P., and Giles, C.L., "Context-aware Citation Recommendation", International World Wide Web Conference (WWW 2010): 421-430. [PDF]
  3. Saurabh Kataria, Prasenjit Mitra and Sumit Bhatia. Utilizing Context in Generative Bayesian Models for Linked Corpus. In AAAI 2010.
  4. Sumit Bhatia, Suppawong Tuarob, Prasenjit Mitra and C. Lee Giles. An Algorithm Search Engine For Software Developers. In SUITE '11: Proceedings of 2011 ICSE Workshop on Search-driven Development: Users, Infrastructure, Tools and Evaluation, 2011.
  5. Sumit Bhatia and Prasenjit Mitra. Summarizing Figures, Tables and Algorithms in Scientific Publications to Augment Search Results. Technical Report, College of Information Sciences and Technology, The Pennsylvania State University, June 2010.
  6. Sumit Bhatia and Prasenjit Mitra. Synopsis Generation for Specialized Document-Element Search Engines. In Workshop on Web Search Result Summarization and Presentation, Co-Located with WWW2009, 2009.
  7. Saurabh Kataria, P. Mitra, C. Caragea, C. Lee Giles. Context Sensitive Topic Models for Author Influence. In 22nd International Joint Conference on Artificial Intelligence (IJCAI-2011), Barcelona, Spain, July 16-22, 2011.
  8. Jing Fang, Prasenjit Mitra, Zhi Tang, and C. Lee Giles, Table Header Detection and Classification.  In 26th Conference on Artificial Intelligence, AAAI'12, AAAI Press, (2012).
  9. Suppawong Tuarob, Prasenjit Mitra, and C. Lee Giles, Improving algorithm search using the algorithm co-citation network. (2012). In Proceedings of the 12th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '12, Eds. Karim B. Boughida, Barrie Howard, Michael L. Nelson, Herbert Van de Sompel, and Ingeborg Solvberg, pp. 277-280.
  10. Sumit Bhatia and Prasenjit Mitra. Summarizing Figures, Tables and Algorithms in Scientific Publications to Augment Search Results. ACM Transactions on Information Systems (TOIS), 30(1) 2012.
  11. Sumit Bhatia, Cornelia Caragea, Hung-Hsuan Chen, Jian Wu, Pucktada Treeratpituk, Zhaohui Wu, Madian Khabsa, Prasenjit Mitra and C. Lee Giles. Specialized Research Datasets in the CiteSeerX Digital Library. D-Lib, 18(7/8) 2012.
  12. Suppawong Tuarob, Sumit Bhatia, Prasenjit Mitra and C. Lee Giles. Automatic Detection of Pseudo-codes in Scholarly Documents Using Machine Learning. In proceedings of The Twelfth International Conference on Document Analysis and Recognition (ICDAR 2013), Washington, DC, USA.
  13. Sagnik Ray Choudhury, Suppawong Tuarob, Prasenjit Mitra, Lior Rokach and C. Lee Giles. ChemXSeer Figure Search: A Chemical Literature Figure Search Engine. In proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL 2013), Indianapolis, IN, USA.
  14. Sagnik Ray Choudhury, Prasenjit Mitra, Lior Rokach and C. Lee Giles. Figure Metadata Extraction in Digital Documents. In proceedings of The Twelfth International Conference on Document Analysis and Recognition (ICDAR 2013), Washington, DC, USA.
  15. Sujatha Das Gollapalli, Cornelia Caragea, Prasenjit Mitra, C. Lee Giles. Researcher homepage classification using unlabeled data. WWW 2013: 471-482.
  16. Sooyoung Oh, Zhen Lei, Wang-Chien Lee, Prasenjit Mitra, John Yen. CV-PCR: a context-guided value-driven framework for patent citation recommendation. CIKM 2013: 2291-2296.
  17. Lior Rokach and Prasenjit Mitra. Parsimonious Citer-Based Measures: Artificial Intelligence Domain as a Case Study. JASIST 2013 64(9): 1951-1959.
  18. Dayu Yuan, Prasenjit Mitra: Lindex: a lattice-based index for graph databases. The VLDB Journal, 2013 22(2): 229-252.
  19. Dayu Yuan, Prasenjit Mitra, C. Lee Giles. Mining and Indexing Graphs for Supergraph Search. PVLDB 6(10): 829-840 (2013).
  20. Sujatha Das Gollapalli, Yanjun Qi, Prasenjit Mitra, C. Lee Giles: Extracting Researcher Metadata with Labeled Features. SDM 2014: 740-748.

Software Publication & Release

The TableSeer software for extracting and searching table data in PDF documents has been released on SourceForge. To download, click here. The RefSeer utility has been released for use by the community. See here.

Broader Impacts

  1. Training multiple post-doctoral, graduate, and undergraduate students in state-of-the-art technologies in document analysis and processing, information retrieval, semantic web and database technologies.
  2. TableSeer is useful for natural scientists who want to extract experimental data of interest that was published in scholarly articles (available as PDF documents).
  3. Developed and taught two graduate courses with both resident and online students enrolled: IST 552: Database Systems and Knowledge Management, and IST 558: Data Mining II.

Acknowledgement

This material is based upon work supported by the National Science Foundation under Grant No. 0845487.  Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the National Science Foundation.