Automatically Linking Structured and Unstructured Data: Connecting Databases to Text

Presenter(s): Breck Baldwin, Alias-i
Where: Automatic Machine Translation and Natural Language Processing (February 2007 NY SemWeb Meetup)
When: February 21, 2008
Topics: NLP, Semantic Web, LingPipe
Download: PDF

Description

LingPipe is a state-of-the-art suite of natural language processing tools, written in Java, for text analytics, text data mining, and search. It performs tokenization, sentence detection, named entity detection, coreference resolution, classification, clustering, part-of-speech tagging, general chunking, and fuzzy dictionary matching. These general tools support a range of applications.

Breck will discuss the thorny problem of linking entities in a database to text mentions of those entities.

The challenges are:

  • The John Smith problem: You have a text mention of "John Smith" and many possible John Smiths in the database. How do you pick the right one?
  • The name variant problem: Your database has an incomplete list of aliases for a gene. Serpina3 is listed with the alias 'ACT', but the literature also calls it 'AACT', and you don't know that.
  • The new entity problem: You want to discover new performers when they show up in your entertainment text sources. Those new performers are not in your database yet, so how do you handle them?
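
To make the three challenges concrete, here is a minimal sketch in plain Java. It does not use the LingPipe API and is not the approach presented in the talk. It assumes a toy in-memory "database" in which every entity carries a few descriptive context words, and it links a mention by fuzzy name matching (edit distance) followed by context-word overlap. The class EntityLinker, its methods, the thresholds, and the sample records are all hypothetical; only the Serpina3/ACT/AACT example comes from the description above.

```java
import java.util.*;

// Hypothetical illustration only: not LingPipe code.
final class EntityLinker {

    /** A database record: id, canonical name, known aliases, descriptive context words. */
    static final class Entity {
        final String id, name;
        final Set<String> aliases, context;
        Entity(String id, String name, Set<String> aliases, Set<String> context) {
            this.id = id; this.name = name; this.aliases = aliases; this.context = context;
        }
    }

    static Set<String> setOf(String... words) {
        return new HashSet<String>(Arrays.asList(words));
    }

    /** Levenshtein edit distance (insertions, deletions, substitutions). */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int sub = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                cur[j] = Math.min(sub, Math.min(prev[j] + 1, cur[j - 1] + 1));
            }
            prev = cur;
        }
        return prev[b.length()];
    }

    /** Name variant problem: accept a mention within maxEdits of the name or any alias. */
    static boolean nameMatches(Entity e, String mention, int maxEdits) {
        String m = mention.toLowerCase();
        if (editDistance(e.name.toLowerCase(), m) <= maxEdits) return true;
        for (String alias : e.aliases) {
            if (editDistance(alias.toLowerCase(), m) <= maxEdits) return true;
        }
        return false;
    }

    /** John Smith problem: count context words shared with the text around the mention. */
    static int contextOverlap(Entity e, Set<String> textContext) {
        int n = 0;
        for (String w : e.context) if (textContext.contains(w)) n++;
        return n;
    }

    /** New entity problem: empty result when no candidate clears both thresholds. */
    static Optional<String> link(List<Entity> db, String mention, Set<String> textContext,
                                 int maxEdits, int minOverlap) {
        Entity best = null;
        int bestScore = -1;
        for (Entity e : db) {
            if (!nameMatches(e, mention, maxEdits)) continue;
            int score = contextOverlap(e, textContext);
            if (score > bestScore) { best = e; bestScore = score; }
        }
        return (best != null && bestScore >= minOverlap)
                ? Optional.of(best.id) : Optional.<String>empty();
    }

    /** Made-up sample records; the ids and context words are assumptions. */
    static List<Entity> sampleDb() {
        return Arrays.asList(
            new Entity("p1", "John Smith", setOf(), setOf("actor", "film")),
            new Entity("p2", "John Smith", setOf(), setOf("senator", "congress")),
            new Entity("g1", "Serpina3", setOf("ACT"), setOf("gene", "protein")));
    }

    public static void main(String[] args) {
        List<Entity> db = sampleDb();
        // Two John Smiths: the surrounding words select the actor record.
        System.out.println(link(db, "John Smith", setOf("film", "premiere", "actor"), 1, 1).orElse("NEW"));
        // 'AACT' is one edit away from the known alias 'ACT'.
        System.out.println(link(db, "AACT", setOf("gene", "expression"), 1, 1).orElse("NEW"));
        // No name in the database is close to "Jane Doe": flag as a possible new entity.
        System.out.println(link(db, "Jane Doe", setOf("film"), 1, 1).orElse("NEW"));
    }
}
```

A fixed edit-distance threshold is a crude stand-in for real fuzzy dictionary matching: on short names such as 'ACT' it will also accept unrelated strings, so a practical system would need stronger evidence from context before committing to a link.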

Breck will discuss how you can approach these problems using the LingPipe suite of tools in the context of entertainment news and bioinformatics.
