Information about http://www.dcs.shef.ac.uk/~fabio/paperi/IJCAI01.pdf

Adaptive Information Extraction from Text …

Tags: computer science university, correct mistakes, current technology, department of computer science, document format, duction, generalisation, imprecision, infor, information extraction, portobello street, regent court, rule induction, s1 4dp, sgml tags, shef ac uk, sheffield uk, symbolic rules, two steps, university of sheffield,
Pages: 6
Language: english
Created: Mon Apr 9 17:56:46 2001
                           Adaptive Information Extraction from Text
                             by Rule Induction and Generalisation

                                           Fabio Ciravegna
                         Department of Computer Science, University of Sheffield
                                  Regent Court, 211 Portobello Street,
                                         S1 4DP Sheffield, UK
                                      F.Ciravegna@dcs.shef.ac.uk
Abstract

(LP)² is a covering algorithm for adaptive Information Extraction from text (IE). It induces symbolic rules that insert SGML tags into texts by learning from examples found in a user-defined tagged corpus. Training is performed in two steps: initially a set of tagging rules is learned; then additional rules are induced to correct mistakes and imprecision in tagging. Induction is performed by bottom-up generalization of examples in the training corpus. Shallow knowledge about Natural Language Processing (NLP) is used in the generalization process. The algorithm has a considerable success story. From a scientific point of view, experiments report excellent results with respect to the current state of the art on two publicly available corpora. From an application point of view, a successful industrial IE tool has been based on (LP)². Real world applications have been developed and licenses have been released to external companies for building other applications. This paper presents (LP)², experimental results and applications, and discusses the role of shallow NLP in rule induction.

1. Introduction

By general agreement the main barriers to wide use and commercialization of Information Extraction from text (IE) are the difficulties in adapting systems to new applications. Classical IE has focused on applications to free texts; therefore systems often rely on approaches based on Natural Language Processing (NLP) (e.g. using parsing) [Humphreys et al. 1997; Grishman 1997]. Most systems require the manual development of resources (e.g. grammars) by a user skilled in NLP [Ciravegna 2000]. There is an increasing interest in applying machine learning (ML) to IE in order to build adaptive systems. Up to now, the use of ML has been approached mainly from an NLP-oriented perspective, i.e. in order to reduce the amount of work to be done by the NLP experts in porting systems across free text based scenarios [Cardie 1997; Miller et al. 1998; Yangarber et al. 2000]. Given the current technology, IE experts are still necessary.

In recent years, the increasing importance of the Internet has stressed the central role of texts such as emails, Usenet posts and Web pages. In this context, extralinguistic structures (e.g. HTML tags, document formatting, and ungrammatical stereotypical language) are elements used to convey information. Linguistically intensive approaches are difficult or unnecessary in such cases. For this reason a new research stream on adaptive IE has arisen at the convergence of NLP, Information Integration and Machine Learning. The goal is to produce IE algorithms and systems adaptable to new Internet-related applications/scenarios by using only an analyst's knowledge (i.e. knowledge on the domain/scenario itself) [Kushmerick 1997; Califf 1998; Muslea et al. 1998; Freitag and McCallum 1999; Soderland 1999; Freitag and Kushmerick 2000]. Such algorithms are very effective when applied on highly structured HTML pages, but less effective on unstructured texts (e.g. free texts). In our opinion this is because most successful algorithms make scarce (or no) use of NLP, tending to avoid any generalization over the flat word sequence. When they are applied to unstructured texts, data sparseness becomes a problem.

This paper presents (LP)², an adaptive IE algorithm designed in this new stream of research that makes use of shallow NLP in order to overcome data sparseness when confronted with NL texts, while keeping effectiveness on highly structured texts. This paper first introduces the algorithm, discusses experimental results and shows how the algorithm compares successfully with the current state of the art. The role and importance of shallow NLP for overcoming data sparseness is then discussed. Finally a successful industrial system for adaptive IE built around (LP)² is presented and some conclusions and future work are drawn.

2. The Rule Induction Algorithm

(LP)² learns from a training corpus where a user has highlighted the information to be extracted with different SGML tags. It induces symbolic rules that insert SGML tags into texts in two steps:
 1. Sets of tagging rules are induced by bottom-up generalization of tag instances found in the training corpus. Shallow knowledge about NLP is used in the generalization process.
 2. Correction rules are induced that refine the tagging by correcting mistakes and imprecision.
This section presents and discusses these two steps.

2.1 Inducing Tagging Rules

A tagging rule is composed of a left hand side, containing a pattern of conditions on a connected sequence of words, and a right hand side that is an action inserting an SGML tag in the texts. Each rule inserts a single SGML tag, e.g. </speaker>. This makes (LP)² different from many adaptive IE algorithms, whose rules recognize whole slot fillers (i.e. insert both <speaker> and </speaker> [Califf 1998; Freitag 1998]) or even multi slots [Soderland 1999]. The tagging rule induction algorithm uses positive examples from the training corpus for learning rules. Positive examples are the SGML tags inserted by the user. All the rest of the corpus is considered a pool of negative examples. For each positive example the algorithm: (1) builds an initial rule, (2) generalizes the rule and (3) keeps the k best generalizations of the initial rule. In particular (LP)²'s main loop starts by selecting a tag in the training corpus and extracting from the text a window of w words to the left and w words to the right. Each piece of information stored in the 2*w word window is transformed into a condition in the initial rule pattern, e.g. if the third word in the window is "seminar", the condition on the third word in the pattern will be word="seminar". Each initial rule is then generalized. In the generalization process (LP)² uses generic shallow knowledge about Natural Language as provided by a morphological analyzer, a POS tagger and a user-defined dictionary (or a gazetteer). A lexical item (LexIt in the following) summarizes such knowledge for each word (e.g., companies) via: a lemma (company), a lexical category (noun), case information (lowercase) and a list of user defined classes as defined by a user-defined dictionary or a gazetteer (if available). An initial rule and the associated information in the LexIts are shown in Table 1.

            Condition   Associated information            Action
  word
  index     word        lemma     LexCat   case   SemCat   Tag
  1         the         the       Art      low
  2         seminar     seminar   Noun     low
  3         at          at        Prep     low             <stime>
  4         4           4         Digit    low
  5         pm          pm        Other    low    timeid
  6         will        will      Verb     low
 Table 1: Starting rule (with associated NLP knowledge) inserting <stime> in the sentence "the seminar at <stime> 4 pm will...".

Generalization consists in the production of a set of rules derived by relaxing constraints in the initial rule pattern. Conditions are relaxed both by reducing the pattern in length and by substituting constraints on words with constraints on some parts of the additional knowledge. Table 2 shows one of the many generalizations for the rule in Table 1.

            Condition                                      Action
  Word
  index     Word   Lemma   LexCat   Case   SemCat          Tag
  3         at                                             <stime>
  4                        digit
  5                                        timeid
 Table 2: A generalization for the rule in Table 1. The pattern is relaxed in length (conditions on words 1, 2 and 6 were removed) and conditions on the other words were substituted by other constraints.

The last step of the algorithm is the selection of the best generalizations. Each generalization is tested on the training corpus and an accuracy score L = wrong/matched is calculated (L is an error rate, so lower values are better). For each initial instance the k best generalizations are kept that: (1) report better accuracy; (2) cover more positive examples; (3) cover different parts of input¹; (4) have an error rate that is less than a specified threshold. The other generalizations are discarded. Retained rules become part of the best rules pool. When a rule enters the best rules pool, all the instances covered by the rule are removed from the positive examples pool, i.e. covered instances will no longer be used for rule induction ((LP)² is a covering algorithm). Rule induction continues by selecting new instances and learning rules until the pool of positive examples is empty.

2.1.1 Learning Contextual Rules

When applied on the test corpus, the best rules pool provides good results in terms of precision, but limited effectiveness in terms of recall. This means that such rules insert few tags (low recall), and that such tags are generally correct (high precision). Intuitively this is because the absolute reliability required for rule selection is strict, thus only some of the induced rules will match it. In order to reach acceptable effectiveness, it is necessary to identify additional rules able to raise recall without affecting precision. (LP)² recovers some of the rules not selected as best rules and tries to constrain their application to make them reliable. Constraints on rule application are derived by exploiting interdependencies among tags. As mentioned, (LP)² learns rules for inserting tags (e.g., <speaker>) independently from other tags (e.g., </speaker>). But tags are not independent. There are two ways in which they can influence each other: (1) tags represent slots, therefore <speaker> always requires </speaker>; (2) slots can be concatenated into linguistic patterns and therefore the presence of a slot can be a good indicator of the presence of another, e.g.

¹ Rules derived from the same seed cover the same portions of input when ineffective constraints are present.
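
As an illustration of the tagging rule induction described in Section 2.1, the following Python sketch builds an initial rule from a window around a tag, enumerates generalizations by dropping conditions or replacing them with LexIt attributes, and keeps the k generalizations with the lowest error rate L = wrong/matched. It is only a minimal sketch under stated assumptions, not the actual (LP)² implementation: the names LexIt, initial_rule, generalizations, matches and k_best, the brute-force enumeration, the tie-breaking order and the default threshold are all hypothetical, the tag is assumed to be inserted before the token at the given position, and criterion (3) (covering different parts of the input) is not modelled.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class LexIt:
    """Hypothetical lexical item: a word plus its shallow NLP attributes."""
    word: str
    lemma: str
    lexcat: str
    case: str
    semcat: str = ""


ATTRS = ("word", "lemma", "lexcat", "case", "semcat")


def initial_rule(tokens, pos, w):
    """Word conditions on the w tokens left and w tokens right of the tag at pos."""
    return {off: ("word", tokens[pos + off].word)
            for off in range(-w, w) if 0 <= pos + off < len(tokens)}


def generalizations(tokens, pos, rule):
    """Relax every condition: drop it or swap it for another LexIt attribute."""
    offsets = sorted(rule)
    options = []
    for off in offsets:
        tok = tokens[pos + off]
        alts = [None]  # None means the condition is removed
        alts += [(a, getattr(tok, a)) for a in ATTRS if getattr(tok, a)]
        options.append(alts)
    for combo in product(*options):  # brute force; fine for tiny windows only
        yield {off: c for off, c in zip(offsets, combo) if c is not None}


def matches(rule, tokens):
    """Positions i where the rule would insert its tag before tokens[i]."""
    return {i for i in range(len(tokens))
            if all(0 <= i + off < len(tokens)
                   and getattr(tokens[i + off], attr) == val
                   for off, (attr, val) in rule.items())}


def k_best(tokens, gold, pos, w=1, k=3, max_error=0.2):
    """Keep k generalizations: lowest error rate first, then widest coverage."""
    scored = []
    for rule in generalizations(tokens, pos, initial_rule(tokens, pos, w)):
        hits = matches(rule, tokens)          # never empty: always contains pos
        error = len(hits - gold) / len(hits)  # L = wrong / matched
        if error <= max_error:                # assumed threshold, criterion (4)
            scored.append((error, -len(hits & gold), rule))
    scored.sort(key=lambda s: s[:2])
    return [(rule, error, -neg) for error, neg, rule in scored[:k]]


def lex(word, lexcat, semcat=""):
    """Toy LexIt builder: lemma equals the word, case is always lowercase."""
    return LexIt(word, word, lexcat, "low", semcat)


# Two toy sentences; <stime> belongs before token 3 ("4") and token 9 ("5").
tokens = [lex("the", "Art"), lex("seminar", "Noun"), lex("at", "Prep"),
          lex("4", "Digit"), lex("pm", "Other", "timeid"), lex("will", "Verb"),
          lex("a", "Art"), lex("talk", "Noun"), lex("at", "Prep"),
          lex("5", "Digit"), lex("pm", "Other", "timeid"), lex("starts", "Verb")]
gold = {3, 9}
best = k_best(tokens, gold, pos=3)
for rule, error, covered in best:
    print(rule, error, covered)
```

In this toy corpus the initial rule (word="at" followed by word="4") covers only the first sentence, while relaxed rules such as a constraint on the lexical category Digit cover both time expressions without errors, which is the effect that generalization over shallow NLP information is meant to achieve.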
</speaker> can be used as "anchor tag" for inserting <stime>². In general it is possible to use <tagx> to introduce <tagy>. (LP)² is not able to use such contextual information, as it induces single tag rules. The context is reintroduced in (LP)² as an external constraint used to improve the reliability of unreliable rules. In particular, (LP)² reconsiders low precision non-best rules for application in the context of tags inserted by the best rules only. For example some rules will be used only to close slots when the best rules were able to open them, but not to close them (i.e., when the best rules are able to insert <speaker> but not </speaker>). Selected rules are called contextual rules. As an example, consider a rule inserting a </speaker> tag between a capitalized word and a lowercase word. This is not a best rule as it reports high recall/low precision on the corpus, but it is reliable if used only to close an open <speaker>. Thus it will only be applied when the best rules have already recognized an open <speaker>, but not the corresponding </speaker>. The area of application is the part of the text following a <speaker> and within a distance less than or equal to the maximum length allowed for the present slot³. "Anchor tags" used as contexts can be found either to the right of the rule space application (as in the case above when the anchor tag is <speaker>), or to the left as in the opposite case (anchor tag is </speaker>). A detailed description of this process can be found in [Ciravegna 2000a]. Reliability for contextual rules is computed by using the same error rate used for best rules, but only matches in controlled contexts are counted.

In conclusion the sets of tagging rules (LP)² induces are both the best rule pool and the contextual rules. Figure 3 shows the whole algorithm for tagging rule induction.

 loop for instance in initial-instances
  unless already-covered(instance)
  loop for rule in generalise(instance)
   test(rule)
   if best-rule?(rule)
    then insert(rule, bestrules)
         cover(rule, initial-instances)
    else loop for tag in tag-list
          if test-in-context(rule,tag,:right)
           then select-contxtl(rule,tag,:right)
          if test-in-context(rule,tag,:left)
           then select-contxtl(rule,tag,:left)
 Figure 3: The final algorithm for tagging rule induction.

2.2 Inducing Correction Rules

Tagging rules when applied on the test corpus report some imprecision in slot filler boundary detection. A typical mistake is for example "at <stime> 4 </stime> pm", where "pm" should have been part of the time expression. For this reason (LP)² induces rules for shifting wrongly positioned tags to the correct position. It learns from the mistakes made in applying tagging rules on the training corpus. Shift rules consider tags misplaced within a distance d from the correct position. Correction rules are identical to tagging rules, but (1) their patterns match also the tags inserted by the tagging rules and (2) their actions shift misplaced tags rather than adding new ones. An example of an initial correction rule for shifting </stime> in "at <stime> 4 </stime> pm" is shown in Table 4.

  Condition                           Additional information
  word   wrong tag   correct tag      lemma   LexCat   case   SemCat
  at                                  at      prep     low
  4      </stime>                     4       digit    low
  pm                 </stime>         pm      other    low    timeid
 Table 4: A correction rule. The action (not shown) shifts the tag from the wrong to the correct position.

The induction algorithm used for the best tagging rules is also used for shift rules: initial instance identification, generalization, test and selection. "Wrong Tag" and "Correct Tag" conditions are never relaxed. Positive matches (correct shifts) and negative matches (wrong shifts of correctly assigned tags) are counted. Shift rules are accepted only if they report an acceptable error rate.

3. Extracting Information

In the testing phase information is extracted from the test corpus in four steps: initial tagging, contextual tagging, correction and validation. The best rule pool is initially used to tag the texts. Then contextual rules are applied in the context of the introduced tags. They are applied repeatedly as long as new tags are inserted, i.e. some contextual rules can match also tags inserted by other contextual rules. Then correction rules correct some imprecision. Finally each tag inserted by the algorithm is validated. There is no meaning in producing a start tag (e.g. <speaker>) without its corresponding closing tag (</speaker>) and vice versa, therefore uncoupled tags are removed in the validation phase.

4. Experimental Results

(LP)² was tested in a number of tasks in two languages: English and Italian. In each experiment (LP)² was trained on a subset of the corpus (some hundreds of texts, depending on the corpus) and the induced rules were tested on unseen texts. Here we report results on two standard tasks for adaptive IE: the CMU seminar announcements and the Austin job announcements⁴. The first task consists of uniquely identifying speaker name, starting time, ending time and location in 485 seminar announcements [Freitag 1998]. Table 5 shows the overall accuracy obtained by (LP)², and compares it with that obtained by other state of the art algorithms. (LP)² scores the best results in the task. It definitely outperforms other symbolic approaches (+8.7% wrt Rapier [Califf

² In the following we just make examples related to the first case as it is more intuitive.
³ The training corpus is used for computing the maximum filler length for each slot.
⁴ Corpora available at www.isi.edu/muslea/RISE/index.html
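
The validation step of Section 3 can be illustrated with a short Python sketch that removes uncoupled tags from a tagged token sequence. This is an illustration under stated assumptions, not the paper's procedure: the function name validate, the token-list representation and the pairing policy (a closing tag pairs with the most recent open start tag of the same slot) are hypothetical choices, since the text only states that uncoupled tags are removed.

```python
import re

# A token is treated as an SGML tag if it looks like <name> or </name>.
TAG = re.compile(r"</?(\w+)>")


def validate(tokens):
    """Drop start tags that are never closed and closing tags that were never opened."""
    keep = [True] * len(tokens)
    open_at = {}  # slot name -> index of the currently open start tag
    for i, tok in enumerate(tokens):
        m = TAG.fullmatch(tok)
        if not m:
            continue  # ordinary word, always kept
        name = m.group(1)
        if tok.startswith("</"):
            if name in open_at:
                del open_at[name]            # properly closed pair
            else:
                keep[i] = False              # closing tag without a start tag
        else:
            if name in open_at:
                keep[open_at[name]] = False  # earlier start tag was never closed
            open_at[name] = i
    for i in open_at.values():
        keep[i] = False                      # start tags still open at end of text
    return [t for t, k in zip(tokens, keep) if k]


tagged = ("the seminar at <stime> 4 pm </stime> by <speaker> John Smith "
          "in </location> room 5").split()
cleaned = validate(tagged)
print(" ".join(cleaned))
```

In the usage example the coupled <stime> ... </stime> pair survives, while the unclosed <speaker> and the unopened </location> are removed.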
1998], +21% wrt Whisk [Soderland 1999]), but it also outperforms statistical approaches (+2.1% wrt BWI [Freitag and Kushmerick 2000] and +4% wrt HMM [Freitag and McCallum 1999]). Moreover (LP)² is the only algorithm whose results never go below 75% on any slot (second best is BWI: 67.7%).

              (LP)²   BWI    HMM    SRV    Rapier   Whisk
  speaker     77.6    67.7   76.6   56.3   53.0     18.3
  location    75.0    76.7   78.6   72.3   72.7     66.4
  stime       99.0    99.6   98.5   98.5   93.4     92.6
  etime       95.5    93.9   62.1   77.9   96.2     86.0
  All Slots   86.0    83.9   82.0   77.1   77.3     64.9
 Table 5: F-measure (β=1) obtained on CMU seminars. Results for algorithms other than (LP)² are taken from [Freitag and Kushmerick 2000]. We added the comprehensive ALL SLOTS figure, as it allows better comparison among algorithms. It was computed by:
   \frac{\sum_{slot} (\text{F-measure} \times \text{number of possible slot fillers})}{\sum_{slot} \text{number of possible slot fillers}} \times 100
 The (LP)² results come from a 10 cross-folder experiment using half of the corpus for training. F-measure calculated via the MUC scorer [Douthat 1998]. Average training time per run: 56 min on a 450MHz computer. Window size w=4.

A second task concerned IE from 300 Job Announcements taken from misc.jobs.offered [Califf 1998]. The task consists of identifying for each announcement: message id, job title, salary offered, company offering the job, recruiter, state, city and country where the job is offered, programming language, platform, application area, required and desired years of experience, required and desired degree, and posting date. The results obtained on such a task are reported in Table 6. (LP)² outperforms both Rapier and Whisk (Whisk obtained lower accuracy than Rapier [Califf 1998]). We cannot compare (LP)² with BWI as the latter was tested on a very limited subset of slots. In summary, (LP)² reaches the best results on both tasks.

  Slot          (LP)²   Rapier   BWI
  id            100     97.5     100
  title         43.9    40.5     50.1
  company       71.9    69.5     78.2
  salary        62.8    67.4
  recruiter     80.6    68.4
  state         84.7    90.2
  city          93.0    90.4
  country       81.0    93.2
  language      91.0    80.6
  platform      80.5    72.5
  application   78.4    69.3
  area          66.9    42.4
  req-years-e   68.8    67.1
  des-years-e   60.4    87.5
  req-degree    84.7    81.5
  des-degree    65.1    72.2
  post date     99.5    99.5
  All Slots     84.1    75.1
 Table 6: F-measure (β=1) obtained on the Jobs domain using half of the corpus for training.

5. Discussion

The main features of (LP)² that are most likely to contribute to the excellence in the experiments are: (1) the induction of symbolic rules (see the conclusions), (2) rule generalization via shallow NLP, (3) the use of single tag rules and (4) the use of correction.

(LP)² induces rules by instance generalization. Generalization is also used in SRV, Rapier and Whisk. It allows reducing data sparseness by capturing some general aspects beyond the simple flat word structure. Shallow NLP is the basis for generalization in (LP)². Morphology allows overcoming data sparseness due to number/gender word realizations, while POS tagging information allows generalization over lexical categories. In principle this type of generalization produces rules of better quality than those matching the flat word sequence, rules that tend to report better effectiveness on unseen cases. This is because both morphology and POS tagging are generic NLP processes performing equally well on unseen cases; therefore rules relying on their results apply successfully on unseen cases. This intuition was confirmed experimentally: (LP)² with generalization ((LP)²G) definitely outperforms a version without generalization ((LP)²NG) on the test corpus, while having comparable results on the training corpus (+57% on the speaker field, +28% on the location field, +11% overall on the CMU task). Moreover in (LP)²G the covering algorithm converges more rapidly than in (LP)²NG, because its rules tend to cover more cases. This means that (LP)²G needs fewer examples in order to be trained, i.e., rule generalization also allows reducing the training corpus size. Not surprisingly the role of shallow NLP in the reduction of data sparseness is more relevant on semi-structured or free texts (such as the CMU seminars) than on documents with highly standardized language (e.g. HTML pages, or the job announcement task). During the rule selection phase (LP)² is able to adopt the right level of NLP information for the task at hand: in an experiment on texts written in mixed Italian/English we used an English POS tagger that was completely unreliable on the Italian part of the input. (LP)²G reached the same effectiveness as (LP)²NG, because the rules using the unreliable NLP information were automatically discarded. This shows that the use of NLP is always a plus, never a minus.

The separate recognition of tags is an aspect shared by BWI, while HMM, Rapier and SRV recognize whole slots and Whisk recognizes multislots. Separate tag identification allows further reduction of data sparseness, as it better generalizes over the coupling of slot start/end conditions. For example in order to learn patterns equivalent to the regular expression ('at'|'starting from')DIGIT('pm'|'am'), (LP)² just needs two examples, e.g., 'at'+'pm' and 'starting from'+'am', because the algorithm induces two independent rules for <stime> ('at' + 'starting from') and two for </stime> ('am' + 'pm'). In a slot-oriented rule learning strategy four examples (and four rules) will be needed, i.e. 'at'+'pm', 'at'+'am', 'starting from'+'pm', 'starting from'+'am'. In a multislot approach the problem is worse and the number of training examples needed increases drastically [Ciravegna
2000a].

Another reason for the good experimental results lies in the use of a correction step. Correction is useful in recognizing slots with fillers with a high degree of variability (such as the speaker in the CMU experiment), while it does not pay on slots with highly standardized fillers (such as many slots in the Jobs task). (LP)² using correction rules reports 7% more in terms of accuracy on <speaker> than (LP)² without correction. Imprecision in tagging was also reported by [Califf 1998] who noted up to 5% imprecision on some slots (but she did not introduce any correction steps in Rapier).

6. Developing real world applications

(LP)² was developed as a research prototype, but it quickly turned out to be suitable for real world applications. An industrial system based on (LP)², LearningPinocchio, was developed. Recently LearningPinocchio has been used in a number of industrial applications. Moreover licenses have been released to external companies for further application development. This section reports on some industrial applications we have directly developed. The system is used for extracting information from professional resumes written in English. It is used on the results of a spider that surfs the Web to retrieve professional resumes. The spider classifies resumes by topics (e.g. computer science). LearningPinocchio extracts the relevant information and its output is used to populate a database. Table 7 shows some results obtained in such a task. Application development time for the IE task required about 24 person hours for scenario definition and revision (the scenario was refined by tagging some texts in different ways and discussing among annotators). A further 10 person hours were needed for tagging about 250 texts. The rule induction process took 72 hours on a 450MHz machine, with window size w=4. Finally system results validation required four person hours.

  Slot     PRE   REC   F-measure     Slot    PRE   REC   F-measure
  Name     97    82    88.9          Email   92    71    80.1
  Street   96    71    81.6          Tel.    93    75    83.0
  City     90    90    90            Fax     100   50    66.6
  Prov.    97    92    94.4          Zip     100   90    94.7
  Zip      100   90    94.7
 Table 7: Results of a blind test on 50 resumes. This is not a simple named entity recognition task. A resume may contain many names and addresses (e.g. previous work addresses, name of referees or thesis supervisors and their addresses). The system had to recognize the correct ones.

Two other applications were developed for Kataweb, a major Italian Internet portal. The goal was to extract information from both financial news and classified ads written in Italian and published on the portal pages. LearningPinocchio is used both to generate hyperlinks for cross-referencing texts and to retrieve texts querying the content. The application is currently under final test at the customer's site. Table 8 shows experimental results on financial news.

  TAG               F(β=1)     TAG               F(β=1)
  Geograph Area     0.70       Organiz. Name     0.86
  Currency          0.85       Company Share
  Stock Exchange                 Name            0.85
    Name            0.91         Type            0.92
    Index           0.97         Category        0.86

7. Conclusions and Future Work

(LP)² is a successful algorithm. On the one hand it outperforms the other state of the art algorithms on two very popular IE tasks. It is important to stress the fact that (LP)² also outperforms statistical approaches, because in recent years the latter largely outperformed symbolic approaches. There is a clear advantage in using symbolic rules in real world applications. It is possible to inspect the final system results and manually add/modify/remove rules for squeezing additional accuracy (it was not done in the scientific experiments, but it was in the applications).

On the other hand (LP)² was the basis for building LearningPinocchio, a tool for building adaptive IE applications that is having considerable commercial success. This shows that adaptive IE is able to produce tools suitable for building real world applications by a final user by using only an analyst's knowledge.

Future work on (LP)² will involve both the improvement of rule formalism expressiveness and the further use of shallow NLP for generalization. Concerning the improvement in rule formalism expressiveness we plan to include some forms of Kleene-star and optionality operators. Such improvement has been shown to be very effective in both BWI and Rapier. Concerning the use of shallow NLP for generalization (i.e., one of the keys of the success in (LP)²) there are two possible improvements. On the one hand (LP)² will be used in cascade with a Named Entity Recognizer (also implemented by using (LP)²). This will allow further generalization over named entity classes (e.g., the speaker is a person, so it is possible to generalize over such a class in the rules). On the other hand (LP)² is compatible with forms of shallow parsing such as chunking. It is then possible to preprocess the texts with a chunker and to insert tags only at the chunk borders. This is likely to improve precision in border identification.

An interesting question concerns the limits of the tagging-based IE approach used by many adaptive systems, (LP)² included. Classic MUC-like IE is based on template filling. Template filling is more complex than tagging, as it requires deciding about both coreference of expressions (are "John A. Smith" and "J. Smith" the same person? Two seminars have been identified in a text: are they separate events or are they coreferring?),
                                                                   and slot pairing (two seminars and two speakers have
                      ALL SLOTS 0.87
                                                                   been identified: which is the speaker of the first semi-
 Table 8: Results of blind test on financial news (300   texts).
nar?). (LP)2 is able to apply default strategies for template merging that
solve simple cases of coreference and slot pairing. Such strategies are
powerful enough to cope with many real world tasks, but in some other cases
they are not effective enough. For example, in the resumes application the
customer was also interested in retrieving the triples
degree/university/year. LearningPinocchio was able to correctly highlight
such information, but often it was not able to pair the items correctly;
therefore they were not used to populate the database, but only to index
texts. Even if some ad hoc strategies would probably have solved the problem
in this specific case, it is quite clear that this is a major limitation of
the approach. Classical MUC-like IE systems use sophisticated strategies for
coreference resolution and template merging, very often based on a mix of
NLP knowledge and domain knowledge [Humphreys et al. 1998]. Two problems
prevent the use of such techniques. On the one hand, deep NLP is not
effective in many applications in the Internet realm (e.g. how can you parse
an e-mail?). On the other hand, it is not clear how to elicit the domain
knowledge for coreference from an analyst (adaptability via analyst's
knowledge is a strong constraint, as mentioned above). Adaptive template
filling is an issue worth exploring that we are currently investigating. We
are working in the direction of further using shallow NLP for improving
template filling and merging. To some extent this is also a step in the
direction of bridging the gap between classical NLP based IE systems and
fully adaptive systems.

Acknowledgments
I developed (LP)2 and LearningPinocchio at ITC-Irst, Centro per la Ricerca
Scientifica e Tecnologica, Trento, Italy. LearningPinocchio is property of
ITC-Irst, see http://ecate.itc.it:1025/cirave/LEARNING/home.html.
The financial application mentioned above was jointly developed with Alberto
Lavelli. Thanks to Daniela Petrelli for revising this paper. Errors, if any,
are mine.

References

[Califf 1998] Mary E. Califf, Relational Learning Techniques for Natural
 Language IE, Ph.D. thesis, Univ. Texas, Austin,
 www.cs.utexas.edu/users/mecaliff

[Cardie 1997] Claire Cardie, `Empirical methods in information extraction',
 AI Journal, 18(4), 65-79, 1997.

[Ciravegna et al. 2000] Fabio Ciravegna, Alberto Lavelli, and Giorgio Satta,
 `Bringing information extraction out of the labs: the Pinocchio
 Environment', in ECAI2000, Proc. of the 14th European Conference on
 Artificial Intelligence, ed., W. Horn, Amsterdam, 2000. IOS Press.

[Ciravegna 2000a] Fabio Ciravegna, `Learning to Tag for Information
 Extraction from Text', in F. Ciravegna, R. Basili, R. Gaizauskas (eds.)
 ECAI Workshop on Machine Learning for Information Extraction, Berlin,
 August 2000 (www.dcs.shef.ac.uk/~fabio/ecai-workshop.html).

[Douthat 1998] Aaron Douthat, `The message understanding conference scoring
 software user's manual', in the 7th Message Understanding Conf.,
 www.muc.saic.com

[Freitag 1998] Dayne Freitag, `Information Extraction from HTML: Application
 of a general learning approach', Proc. of the 15th National Conference on
 Artificial Intelligence (AAAI-98), 1998.

[Freitag and McCallum 1999] Dayne Freitag and Andrew McCallum, `Information
 Extraction with HMMs and Shrinkage', AAAI-99 Workshop on Machine Learning
 for Information Extraction, Orlando, FL, 1999,
 www.isi.edu/~muslea/RISE/ML4IE/

[Freitag and Kushmerick 2000] Dayne Freitag and Nicholas Kushmerick,
 `Boosted wrapper induction', in F. Ciravegna, R. Basili, R. Gaizauskas
 (eds.) ECAI2000 Workshop on Machine Learning for Information Extraction,
 Berlin, 2000 (www.dcs.shef.ac.uk/~fabio/ecai-workshop.html).

[Grishman 1997] Ralph Grishman, `Information Extraction: Techniques and
 Challenges', in M.T. Pazienza (ed.), Information Extraction: A
 Multidisciplinary Approach to an Emerging Information Technology,
 Springer, 1997.

[Humphreys et al. 1998] K. Humphreys, R. Gaizauskas, S. Azzam, C. Huyck,
 B. Mitchell, H. Cunningham, Y. Wilks, `Description of the University of
 Sheffield LaSIE-II System as used for MUC-7', in Proc. of the 7th Message
 Understanding Conference, 1998 (www.muc.saic.com).

[Kushmerick et al. 1997] N. Kushmerick, D. Weld, and R. Doorenbos, `Wrapper
 induction for information extraction', Proc. of 15th International
 Conference on Artificial Intelligence, IJCAI-97, 1997.

[Miller et al. 1998] S. Miller, M. Crystal, H. Fox, L. Ramshaw, R. Schwartz,
 R. Stone and R. Weischedel, `BBN: Description of the SIFT system as used
 for MUC-7', in Proc. of the 7th Message Understanding Conference, 1998
 (www.muc.saic.com).

[Muslea et al. 1998] I. Muslea, S. Minton, and C. Knoblock, `Wrapper
 induction for semi-structured, web-based information sources', in Proc. of
 the Conference on Autonomous Learning and Discovery CONALD-98, 1998.

[Soderland 1999] Steven Soderland, `Learning information extraction rules
 for semi-structured and free text', Machine Learning, (1), 1-44, 1999.

[Yangarber et al. 2000] Roman Yangarber, Ralph Grishman, Pasi Tapanainen and
 Silja Huttunen, `Automatic Acquisition of Domain Knowledge for Information
 Extraction', in Proc. of COLING 2000, 18th Intern. Conference on
 Computational Linguistics, Saarbrücken, 2000.