Idiap Research Institute
Probabilistic Tagging of Unstructured Genealogical Records
Type of publication: Idiap-RR
Citation: PerrowBarber05a
Number: Idiap-RR-86-2005
Year: 2005
Institution: IDIAP
Abstract: In this paper we present a method of parsing unstructured textual records that briefly describe a person and their direct relatives. The string 'Stephanus, brother of Johannes Magnin, from Saillon' is a typical example of a record. We wish to annotate every term (word or symbol) in our records with a label which describes whether the term is a name (e.g. 'Stephanus'), a place (e.g. 'Saillon'), or a relationship (e.g. 'brother'). We build upon work developed for the cleaning and standardization of names for record linkage corpora, adding several enhancements to deal with our more difficult data, which contains common name structures of French, Italian and Latin, spanning hundreds of years. We present an approach to this problem that works interactively with a user to annotate the data set accurately, greatly reducing the human effort required. We do this by learning a Hidden Markov Model representing a record structure, and finding structural patterns in new records.
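Illustration: the tagging step described in the abstract can be pictured as Viterbi decoding over a Hidden Markov Model whose hidden states are term labels. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the label set (including the OTHER label), the lexicon, and all probabilities are hypothetical and hand-set here, whereas the report learns the model interactively with a user.

```python
import math
import re

# Hypothetical label set: the abstract names name, place and relationship;
# OTHER (connectives, punctuation) is an assumption made for this sketch.
STATES = ["NAME", "REL", "PLACE", "OTHER"]

# Hypothetical hand-set probabilities, not values learned from the report's data.
START = {"NAME": 0.7, "REL": 0.1, "PLACE": 0.1, "OTHER": 0.1}
TRANS = {
    "NAME":  {"NAME": 0.4, "REL": 0.1, "PLACE": 0.1, "OTHER": 0.4},
    "REL":   {"NAME": 0.1, "REL": 0.1, "PLACE": 0.1, "OTHER": 0.7},
    "PLACE": {"NAME": 0.1, "REL": 0.1, "PLACE": 0.4, "OTHER": 0.4},
    "OTHER": {"NAME": 0.4, "REL": 0.3, "PLACE": 0.2, "OTHER": 0.1},
}

# Hypothetical toy lexicon standing in for learned emission statistics.
LEXICON = {
    "stephanus": "NAME", "johannes": "NAME", "magnin": "NAME",
    "brother": "REL", "saillon": "PLACE",
    "of": "OTHER", "from": "OTHER", ",": "OTHER",
}


def tokenize(record):
    """Split a record into terms: runs of word characters or single symbols."""
    return re.findall(r"\w+|[^\w\s]", record)


def emission(state, term):
    """Probability of seeing `term` in `state` under the toy lexicon."""
    known = LEXICON.get(term.lower())
    if known is None:
        return 1.0 / len(STATES)  # unknown term: every label equally likely
    return 0.97 if known == state else 0.01


def tag(record):
    """Return the most probable label sequence for a record (Viterbi, log space)."""
    terms = tokenize(record)
    if not terms:
        return []
    scores = {s: math.log(START[s]) + math.log(emission(s, terms[0])) for s in STATES}
    backpointers = []
    for term in terms[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best predecessor label for label s at this position.
            best = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (scores[best] + math.log(TRANS[best][s])
                             + math.log(emission(s, term)))
            pointers[s] = best
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final label.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return list(reversed(path))


if __name__ == "__main__":
    record = "Stephanus, brother of Johannes Magnin, from Saillon"
    for term, label in zip(tokenize(record), tag(record)):
        print(f"{term}\t{label}")
```

With these toy parameters, the example record from the abstract is labelled NAME, OTHER, REL, OTHER, NAME, NAME, OTHER, OTHER, PLACE. Log probabilities are used so that longer records do not underflow.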
Userfields: ipdmembership={learning},
Keywords:
Projects: Idiap
Authors: Perrow, Mike; Barber, David
Attachments:
  • rr05-86.pdf
  • rr05-86.ps.gz