Antoine Amarilli (Télécom Paris)

We study the data management task of extracting structured information from unstructured documents, e.g., raw text or HTML pages. We use the framework of document spanners, where the pattern to extract is specified declaratively by the user (as a regular expression with capture variables) and is translated to an automaton that then is evaluated on the document to compute the pattern matches. We focus on the last step of this pipeline: our goal is to efficiently find the matches of the automaton on the document, even when there can be many of them. We do this by proposing an enumeration algorithm, which first preprocesses the automaton and document, and then enumerates all matches with a small delay between consecutive matches. Unlike previous work, our algorithm achieves the best possible bounds in the input document (namely, linear preprocessing and constant delay), while remaining tractable in the automaton. The guiding principle of the algorithm is to compute a factorized representation of all matches as a product of the automaton and document, and design efficient indexes based on the structure of this representation. We also present our ongoing follow-up work, e.g., how to extend our algorithm to the case of tree-shaped documents by efficiently enumerating the matches of tree automata, how to efficiently update the enumeration results when the input document changes, and other open research directions.