Overview: the user
What AKIRA can understand
concept class Conference
Most IE tools are domain- and format-dependent. AKIRA's conference
extractor is designed to extract conference names from a source consisting
of Calls for Papers. The first step, recognition, consists
in finding in a document the strings of characters that correspond
to conference names and tagging them in the document. The second step is
a co-reference step: different strings of characters recognized
as conference names may refer to the same conference. For instance,
the strings
Int. Conf. on Very Large DataBases and VLDB refer
to the same conference. The identification step follows: to each
set of co-referent strings corresponds an instance of class Conference.
A canonical representation is chosen for the names of conferences.
An instance of concept class Conference is
described by its canonical representation, which consists of the
acronym of the conference followed by the corresponding year, for instance
"SIGMOD'98", "SIGMOD'99", etc. Moreover, each document (call for papers)
is associated with a conference (or split into several documents when it
is the call for papers of several conferences). The mapping from documents
to instances of class Conference is used by
other extractors, under the assumption that all information extracted from
a document refers to the instance of Conference
associated with that document.
The canonical representation is thus a key for the class.
In the conceptual representation of extraction capabilities, the concept
class
Conference is defined with at least
two attributes:
fragment, which associates with each instance the
set of all its
fragments (the pairs [file,span]
of all its occurrences in the documents), and name, its
key.
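The co-reference and identification steps described above can be sketched as follows. This is only an illustration, not AKIRA's implementation: the alias table, the `Conference` dataclass, the `identify` function and the file names are all hypothetical, and a real co-reference step would not rely on a hand-written lookup table.

```python
from dataclasses import dataclass, field

# Hypothetical alias table mapping recognized strings to an acronym.
# This stands in for the co-reference step and is an assumption.
ALIASES = {
    "int. conf. on very large databases": "VLDB",
    "vldb": "VLDB",
    "sigmod": "SIGMOD",
}

@dataclass
class Conference:
    name: str                                   # canonical key, e.g. "SIGMOD'98"
    fragment: set = field(default_factory=set)  # pairs (file, span)

def canonical_name(mention: str, year: int) -> str:
    """Build the canonical name: acronym followed by the two-digit year."""
    acronym = ALIASES[mention.strip().lower()]
    return f"{acronym}'{year % 100:02d}"

def identify(mentions):
    """Group (mention, year, file, span) tuples into Conference instances."""
    instances = {}
    for mention, year, file, span in mentions:
        key = canonical_name(mention, year)
        conf = instances.setdefault(key, Conference(name=key))
        conf.fragment.add((file, span))
    return instances

# Illustrative input: two co-referent mentions and one unrelated mention.
mentions = [
    ("Int. Conf. on Very Large DataBases", 1998, "cfp1.txt", (10, 44)),
    ("VLDB", 1998, "cfp1.txt", (120, 124)),
    ("SIGMOD", 1999, "cfp2.txt", (0, 6)),
]
conferences = identify(mentions)
```

Here the two strings referring to VLDB end up as two fragments of a single instance whose key is the canonical name.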
concept class Date
The date extractor follows the same successive steps (recognition,
co-reference and identification), but is not restricted to a source of
Calls for Papers (it extracts dates from any
English textual document). The canonical representation of dates consists
of three attributes:
month, day and year, which constitute
the key of the class.
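As an illustration of how different date strings can share one canonical key, here is a minimal sketch. The two surface patterns and the `normalize` function are assumptions made for this example; the actual date extractor presumably handles many more forms.

```python
import re
from typing import NamedTuple, Optional

class Date(NamedTuple):
    """Canonical representation: the three attributes forming the key."""
    month: int
    day: int
    year: int

MONTHS = {name: number for number, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}

# Two hypothetical surface forms: "March 5, 1999" and "5 March 1999".
MONTH_FIRST = re.compile(r"([A-Za-z]+)\s+(\d{1,2}),?\s+(\d{4})")
DAY_FIRST = re.compile(r"(\d{1,2})\s+([A-Za-z]+),?\s+(\d{4})")

def normalize(text: str) -> Optional[Date]:
    """Map a recognized date string to its canonical key, or None."""
    text = text.strip()
    m = MONTH_FIRST.fullmatch(text)
    if m:
        month, day, year = m.groups()
    else:
        m = DAY_FIRST.fullmatch(text)
        if not m:
            return None
        day, month, year = m.groups()
    number = MONTHS.get(month.lower())
    if number is None:
        return None
    return Date(number, int(day), int(year))
```

Two strings that normalize to the same `Date` tuple can then be treated as co-referent.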
concept class Location
Like the date extractor, the location extractor is not restricted to a given
source, but is based on a recognizer (many thanks to Breck Baldwin who
let us use his recognizer) which uses a list of all cities, states and countries
in the world. The chosen canonical representation is composed of city,
state and
country, which constitute the key of the class.
In our conceptual representation, a concept class may be specialized by either valued attributes or abstract attributes (meta-concepts). Today, AKIRA provides extractors for the following information about conferences:
How to express queries
select x.name from x in Conference where x.location.city="Paris"
However, within the limits of the system's knowledge, some fuzziness is allowed. Users can express a query using their own vocabulary. For instance, the following expression is equivalent to the previous one (and thus produces the same output).
select x.name from x in Conf where x.location="Paris"
AKIRA uses a primitive thesaurus and some regular expressions to map class attributes from the user's query to attributes understood at the level of components. As an illustration, the attribute submission_deadline can be recognized by the system from any string matching the regular expression:
/^(conf(.*?))*(date(.*?))*sub(.*?)(date|deadline)*$/i
More advanced strategies are used for more sophisticated meta-concepts. When necessary, tools that combine regular patterns with a syntactic description of the terms, including their part of speech (such as SuperTags), can be used.
If the user's expression does not match the pattern, the system will not be able to understand the query.
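The mapping from the user's vocabulary to the system's attributes can be sketched as below. The regular expression is the one quoted above, transcribed to Python with `re.IGNORECASE` standing in for the `/i` flag. The thesaurus entries and the function names `resolve_class` and `resolve_attribute` are hypothetical and only illustrate the idea.

```python
import re

# Pattern quoted in the text; re.IGNORECASE replaces the /i flag.
SUBMISSION_DEADLINE = re.compile(
    r"^(conf(.*?))*(date(.*?))*sub(.*?)(date|deadline)*$", re.IGNORECASE)

# Hypothetical thesaurus entries for class names (an assumption).
CLASS_THESAURUS = {"conf": "Conference", "conference": "Conference"}

def resolve_class(user_term: str):
    """Return the concept class for a user term, or None if unknown."""
    return CLASS_THESAURUS.get(user_term.lower())

def resolve_attribute(user_term: str):
    """Return the system attribute for a user term, or None if unknown."""
    if SUBMISSION_DEADLINE.search(user_term):
        return "submission_deadline"
    return None

for term in ["submission_deadline", "Conference submission date",
             "date of submission", "deadline"]:
    print(term, "->", resolve_attribute(term))
```

With this pattern, the first three terms resolve to submission_deadline, while "deadline" alone does not, because the pattern requires the substring "sub". Since the trailing group is optional, the pattern is also fairly permissive: any string beginning with "sub" matches, "subject" included.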