The DOES Project on DOcument-element Extraction and Search
This webpage describes the DOES research project on document
element extraction and search and the results from this project.
People
PI: Prasenjit Mitra
Post-doctoral Scholar: Dr. Cornelia Caragea
Ph.D. Student: Sumit Bhatia
Undergraduate Researcher: Sunil Jain
Collaborators: Dr. Qi He, Prof. C. Lee Giles
Project Goals
The project aims to investigate the following problems:
- Extract information from a) tables, b) images, and c) other
document elements (such as algorithms) in documents held in a digital library
(usually PDF).
- Enable end-users to search for document elements efficiently
without having to open and manually peruse full documents.
- Identify as much of the semantics of the extracted data as possible
automatically.
- Automatically recommend citations for an input article, along with the
locations in its text where they should be cited.
Research Challenges
The following are the major research challenges that we are
addressing:
- Text extractors are noisy. For example, none of the existing PDF-to-text
conversion tools, such as PDFBox and TET, is error-free. Overcoming errors
that creep in at this first stage is hard.
- Tables in documents have varying layouts with multiple levels
of hierarchical column headers, nested or fused cells, etc. Identifying the
layout automatically and extracting information is a challenge.
- Identifying the semantics of table columns automatically is a
hard problem. Information in the document text that is relevant to a
document element can help, but finding it accurately is a challenge.
Initial investigations have resulted in low accuracy in representing the
semantics of table columns automatically.
- Identifying which citations are relevant given arbitrary text
is a challenge. Finding what to cite and where to cite it is hard.
For references that can be classified as depth citations, all the
references are required: a paper cites all prior work on the exact
problem. However, if there is a good survey, writers often cite the survey
and then cite the papers published after it. For references that are
breadth citations, e.g., algorithms books, citing only one book is
enough. Hence, while recommending citations, an automatic recommendation
engine should first identify the type of citation and then choose what to
cite accordingly. The problem, of course, is more complex than this issue
illustrates.
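The depth versus breadth distinction described above can be sketched in a few lines of Python. This is only an illustration of the selection rule as stated in the text, not the project's recommendation algorithm. The `Candidate` class, its `score` and `is_survey` fields, and the choice of the most recent survey as the "good" one are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """A candidate reference (hypothetical structure for illustration)."""
    title: str
    year: int
    score: float            # assumed relevance score, higher is better
    is_survey: bool = False


def select_depth(cands: List[Candidate]) -> List[Candidate]:
    """Depth citation: cite all work on the exact problem, unless a survey
    exists; then cite the survey plus the papers published after it."""
    surveys = [c for c in cands if c.is_survey]
    if not surveys:
        return sorted(cands, key=lambda c: c.year)
    # Assumption: the most recent survey stands in for "a good survey".
    survey = max(surveys, key=lambda c: c.year)
    later = [c for c in cands if not c.is_survey and c.year > survey.year]
    return [survey] + sorted(later, key=lambda c: c.year)


def select_breadth(cands: List[Candidate]) -> List[Candidate]:
    """Breadth citation: a single representative reference is enough."""
    return [max(cands, key=lambda c: c.score)] if cands else []


def recommend(cands: List[Candidate], citation_type: str) -> List[Candidate]:
    """Dispatch on the citation type, which is assumed to be known here."""
    if citation_type == "depth":
        return select_depth(cands)
    if citation_type == "breadth":
        return select_breadth(cands)
    raise ValueError(f"unknown citation type: {citation_type!r}")
```

In this sketch the citation type is passed in by the caller. In practice, inferring that type from the surrounding text is itself part of the hard problem described above.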
Current Results
- We observed that heuristic solutions can detect table boundaries with
reasonable accuracy. Based on these findings, we have implemented and
fine-tuned a table extraction and search utility. The source code has been
released (see below for link).
- The ability to search for tables has been integrated into the CiteSeerX
system. The design and implementation of figure and algorithm search have
been completed in-house. Tables, figures, and algorithms have been
extracted from documents in the CiteSeerX digital library, and end-users
can search for them directly and obtain the relevant articles in which the
document elements of interest were published.
- We have proposed a classification of algorithm influence types in order
to capture the influence of prior work on newer published algorithms. We
are currently working on automatically classifying algorithm pairs into
these classes. Once this work is complete, we will be able to measure the
influence of one work on another.
- We have designed and implemented citation recommendation algorithms that
recommend citations and suggest locations in the text where the citations
should be inserted. We demonstrated that they can recommend citations with
reasonable efficacy. We have released this utility, RefSeer, as part of
the CiteSeerX system.
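To give a flavor of what a heuristic for table boundary detection might look like, here is a minimal Python sketch that operates on lines of extracted text. It is not the method implemented in TableSeer. The rule used here is an assumption chosen for illustration: a line counts as tabular when runs of two or more spaces split it into at least three cells, and a table is a run of at least two such lines. The function names and thresholds are hypothetical.

```python
import re
from typing import List, Tuple


def looks_tabular(line: str, min_cells: int = 3) -> bool:
    """Return True if runs of 2+ whitespace characters split the line into
    at least `min_cells` non-empty cells (assumed heuristic)."""
    cells = [c for c in re.split(r"\s{2,}", line.strip()) if c]
    return len(cells) >= min_cells


def find_table_blocks(lines: List[str], min_rows: int = 2) -> List[Tuple[int, int]]:
    """Return (first, last) inclusive line indices for each run of at least
    `min_rows` consecutive tabular-looking lines."""
    blocks = []
    start = None
    for i, line in enumerate(lines):
        if looks_tabular(line):
            if start is None:
                start = i                      # a candidate table begins here
        else:
            if start is not None and i - start >= min_rows:
                blocks.append((start, i - 1))  # close the run before this line
            start = None
    # Handle a run that extends to the end of the input.
    if start is not None and len(lines) - start >= min_rows:
        blocks.append((start, len(lines) - 1))
    return blocks
```

A rule this simple fails on exactly the cases listed under Research Challenges, such as noisy extractor output that collapses column spacing, or nested and fused cells, which is why it should be read as a starting point only.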
Publications
- Sumit Bhatia, Prasenjit Mitra, and C. Lee Giles. Finding algorithms in
scientific articles. WWW 2010: 1061-1062. [PDF]
- He, Q., Pei, J., Kifer, D., Mitra, P., and Giles, C.L. Context-aware
Citation Recommendation. International World Wide Web Conference
(WWW 2010): 421-430. [PDF]
- Saurabh Kataria, Prasenjit Mitra, and Sumit Bhatia. Utilizing Context in
Generative Bayesian Models for Linked Corpus. In AAAI 2010.
- Sumit Bhatia, Suppawong Tuarob, Prasenjit Mitra, and C. Lee Giles. An
Algorithm Search Engine For Software Developers. In SUITE '11: Proceedings
of 2011 ICSE Workshop on Search-driven Development: Users, Infrastructure,
Tools and Evaluation, 2011.
- Sumit Bhatia and Prasenjit Mitra. Summarizing Figures, Tables and
Algorithms in Scientific Publications to Augment Search Results. Technical
Report, College of Information Sciences and Technology, The Pennsylvania
State University, June 2010.
- Sumit Bhatia and Prasenjit Mitra. Synopsis Generation for Specialized
Document-Element Search Engines. In Workshop on Web Search Result
Summarization and Presentation, co-located with WWW2009, 2009.
- Saurabh Kataria, P. Mitra, C. Caragea, and C. Lee Giles. Context Sensitive
Topic Models for Author Influence. To appear in 22nd International Joint
Conference on Artificial Intelligence (IJCAI-2011), Barcelona, Spain,
July 16-22, 2011.
Software Publication & Release
The TableSeer software for extracting and searching table data in PDF
documents has been released on SourceForge. To download, click here. The
RefSeer utility has been released for use by the community. See here.
Broader Impacts
- Training a graduate student in state-of-the-art technologies
in document analysis and processing, information retrieval, semantic web, and
database technologies.
- TableSeer is useful for natural scientists who want to extract data on
experiments of interest to them that were published in scholarly articles
(available as PDF documents).
- Developed and taught two graduate courses with both resident and online
students enrolled: IST 552: Database Systems and Knowledge Management, and
IST 558: Data Mining II.