Approaches & Emerging Software for Extracting Structure from Text Documents

This is a “work in progress” workshop led by DSI Director and Professor of Statistics Dr. Duncan Temple Lang. We will discuss approaches, and the related software we are developing, for extracting information from text documents such as scanned documents processed with Optical Character Recognition (OCR), PDF files, and DOCX files. This includes finding tables, images, lists, sections and section titles, bibliographic information, author affiliations, and similar elements. We’ll also discuss the software development process for this task. The workshop is intended as both an overview and a brainstorming session to get people started working with different types of non-standard text formats.

Prerequisites: intermediate R skills and a working R environment.