The Agentive Knowledge-based and Information Retrieval Architecture
(AKIRA) project aims to combine techniques from database technology
and information processing to give the user a powerful tool for
querying the Web.
AKIRA's story
Three Characters
Our user wants to query the Web in a flexible
and transparent way. He thus expects the system to provide information
retrieval, information extraction and data manipulation. The Web,
the most heterogeneous network one may think of, consists of highly
structured sources (databases), providing a query language (wrapper),
as well as poorly structured sources (HTML pages). The system AKIRA
is user-oriented, agent-based and Web-aware.
The plot
Our user wants information about upcoming conferences, expressed for
example as query Q1: ``Conferences with a submission deadline after
July 31, 1998?'' or query Q2: ``Conferences located in the USA?''
A sad story...
With the usual tools, our user must browse all pages ``by hand'' (or send the
query ``Call for papers'' to a search engine such as Altavista to
obtain a list of thousands of potentially relevant webpages about
conferences), read them, and extract the relevant information from their
content, still ``by hand''.
... with a happy ending
In AKIRA, the target structure consisting of the classes and attributes
invoked in the query is extracted and sent to the View Factory. The target
structure is first analyzed to determine a suitable schema to represent
information in the smart-cache. The system is user-oriented in the sense
that the cache is structured with respect to the given target structure,
and optionally completed by several classes in order to call the right
Information Extraction (IE) tools to populate the database. The View
Factory sends three queries: (1) a view definition that creates the
new classes and attributes, (2) an Information Retrieval (IR) query, sent
to the IR component to retrieve data from the Web, and (3) a query sent
to the database to populate it.
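
As a rough illustration of the first step, the sketch below extracts a target structure (classes and the attributes used on them) from an OQL-like query. The query syntax, the class name TargetStructure, the attribute names and the regular expressions are all assumptions made for this example; AKIRA's own OQL parser is not shown on this page and certainly does more.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of AKIRA.
class TargetStructure {
    // "c in Conference": variable c ranges over class Conference.
    private static final Pattern RANGE = Pattern.compile("(\\w+)\\s+in\\s+(\\w+)");
    // "c.name": attribute name accessed through variable c.
    private static final Pattern PATH = Pattern.compile("\\b(\\w+)\\.(\\w+)");

    /** Returns a map from class name to the attribute names used in the query. */
    static Map<String, Set<String>> extract(String query) {
        Map<String, String> varToClass = new LinkedHashMap<>();
        Map<String, Set<String>> structure = new LinkedHashMap<>();

        // First pass: bind each range variable to its class.
        Matcher range = RANGE.matcher(query);
        while (range.find()) {
            varToClass.put(range.group(1), range.group(2));
            structure.putIfAbsent(range.group(2), new LinkedHashSet<>());
        }

        // Second pass: record every attribute reached through a bound variable.
        Matcher path = PATH.matcher(query);
        while (path.find()) {
            String cls = varToClass.get(path.group(1));
            if (cls != null) {
                structure.get(cls).add(path.group(2));
            }
        }
        return structure;
    }

    public static void main(String[] args) {
        // Q1 written in an assumed OQL-like form.
        String q1 = "select c.name from c in Conference "
                  + "where c.submission_deadline > \"1998-07-31\"";
        System.out.println(extract(q1));
    }
}
```

In this sketch the resulting map plays the role of the target structure handed to the View Factory: one class (Conference) with the two attributes the query mentions.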
AKIRA's benefits
A unified framework for IR, IE and data-manipulation
AKIRA is an attempt to offer a unified interface for information retrieval
(IR), information extraction (IE) and data manipulation.
Most Web services today provide only information retrieval. The
user has to write a boolean expression of keywords, which must appear explicitly
in the documents he is interested in. The output of a query to a
search engine is a list of URLs corresponding to documents satisfying the
criteria. The user has to load and read the returned documents to extract
information. Further data manipulation also has to be done by hand.
In AKIRA, the user expresses a query (today an OQL-like query, tomorrow
a natural language query). Evaluating the query consists in retrieving relevant
documents from the Web, then extracting information and storing it in an
object-oriented database (the smart-cache), which replaces
the usual unstructured cache. The information processing tasks (IR
and IE) and data manipulation are thus performed automatically and
transparently for the user.
A user-oriented approach
The organisation of extracted data should not be seen as a constraint the
system imposes on the user. AKIRA's smart-cache must be structured according
to the user's understanding. The AKIRA system does not provide a rigid
global schema but small schema components
that represent its extraction capabilities and can be combined
to match the structure expressed by the user.
When the user sends a query, the system first computes the target
structure (i.e. the schema that is expressed by the user in his query)
and uses it as a schema for the cache. Of course, there is no magic:
the system can answer a query only when the structure expressed by the
user matches its extraction capabilities. However, the system allows some
flexibility through a fuzzy-matching module in charge of mapping the user's
vocabulary to its own.
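
The page does not say how the fuzzy-matching module works. One simple way to map a user's term onto the system's vocabulary is to pick the closest known name by edit distance, as in the sketch below. The class FuzzyMatcher, the choice of Levenshtein distance, the threshold and the sample vocabulary are assumptions for illustration, not AKIRA's implementation.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of AKIRA.
class FuzzyMatcher {
    /** Levenshtein edit distance computed with two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns the closest known term, or null if none is within maxDistance. */
    static String closest(String userTerm, List<String> known, int maxDistance) {
        String best = null;
        int bestDistance = maxDistance + 1;
        String needle = userTerm.toLowerCase();
        for (String candidate : known) {
            int d = editDistance(needle, candidate.toLowerCase());
            if (d < bestDistance) {
                bestDistance = d;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Assumed names of schema components.
        List<String> vocabulary = Arrays.asList("deadline", "location", "name");
        System.out.println(closest("deadlin", vocabulary, 2));
        System.out.println(closest("budget", vocabulary, 2));
    }
}
```

A term with no close counterpart yields null, which corresponds to the case where the structure expressed by the user does not match the system's extraction capabilities.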
Possible future extensions
A Natural Language (NL) interface would allow
the user to express himself with only an approximate knowledge of the available
schema components, and thus to express fuzzy queries that will be translated
into OQL queries with general path expressions.
Query evaluation can be optimized with
query rewriting. First, the last step of AKIRA query evaluation, which takes
place once the cache is populated, can be optimized at the level of the
database system; this corresponds to usual query optimization based on query
rewriting. But AKIRA offers several other promising avenues for optimization
that should be investigated.
Tomorrow the system should be able to mimic human
browsing when processing documents. To perform such a task, IE tools
should first be extended to HTML (and upcoming XML) syntax, then to the
hyperstructure of Web documents (as opposed to the linear structure of
usual documents). Browsing is a non-deterministic process that can
be seen as a loop: (1) type a URL (or click), (2) analyse the retrieved
document, and go back to (1). Heuristics that mimic human behavior should take
advantage of explicit structure, such as the tags or the structure
of URLs, as well as implicit structure extracted from raw data.
The conceptual representation of extracted information is format
and medium independent. AKIRA's architecture makes it easy to plug
in tools that process other formats or media and extract information from
documents.
Quality of service could be improved
by providing recall and precision figures as
a measure of confidence together with the extracted information.
Lastly, the user should be able to define the output
format of the result. AKIRA could be combined with an HTML (or XML)
restructuring tool.
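
The recall and precision measures mentioned among the extensions above have standard definitions: precision is the fraction of extracted items that are correct, and recall is the fraction of correct items that were extracted. The sketch below computes both. It assumes a hand-labelled reference set is available; the class name ExtractionQuality and the sample data are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of AKIRA.
class ExtractionQuality {
    /** Fraction of extracted items that are relevant; 0 if nothing was extracted. */
    static double precision(Set<String> extracted, Set<String> relevant) {
        if (extracted.isEmpty()) {
            return 0.0;
        }
        return (double) overlap(extracted, relevant) / extracted.size();
    }

    /** Fraction of relevant items that were extracted; 0 if nothing is relevant. */
    static double recall(Set<String> extracted, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        return (double) overlap(extracted, relevant) / relevant.size();
    }

    /** Number of items present in both sets. */
    private static int overlap(Set<String> a, Set<String> b) {
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size();
    }

    public static void main(String[] args) {
        // Made-up conference identifiers, for illustration only.
        Set<String> extracted = new HashSet<>(Arrays.asList("A", "B", "C", "D"));
        Set<String> relevant = new HashSet<>(Arrays.asList("A", "B", "E"));
        System.out.println("precision = " + precision(extracted, relevant));
        System.out.println("recall    = " + recall(extracted, relevant));
    }
}
```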
AKIRA's features
Multiplatform
The AKIRA system is written entirely in Perl and Java and can run on
any platform supporting these languages.
Small footprint
AKIRA's cache is implemented using the PSE embedded database from ObjectStore
and requires very little memory. The breakdown is as follows:
    OQL parser    ~   30kb
    AKIRA         ~   50kb
    PSE           ~  400kb
    Regular Exp   ~  100kb
    JGL           ~  560kb (*)
    ------------------------
    Total         ~ 1200kb
    (*) we are only using a limited set of classes
Modular and extensible
AKIRA is modular. Services can be plugged directly into the system.
Since every component inside the system implements the same interface,
the system knows how to build complex classes out of their sub-components.
AKIRA provides a template to write new services. Most services
are composed of a Perl script (used for extraction) and its corresponding
Java wrapper (used for object construction and database storage). The system
offers some generic tools for extraction as well as object construction.
As an illustration, here are some pieces of the Java and Perl
programs for the component "Date of Submission".
Java program
    import java.io.BufferedReader;
    import java.io.InputStream;
    import java.io.InputStreamReader;

    class SubmissionDate extends DateComponent {

        public SubmissionDate(String confName) throws Exception {
            try {
                // create the system call
                String[] cmd = buildScript(confName);
                // launch the call and process the result
                process(confName, getProcStream(cmd), getCorefTable());
            } catch (Exception e) {
                this.date = new BuggyDate();
            }
        }

        private void process(String conf, InputStream is, CoreferenceTable corefTable)
                throws Exception {
            BufferedReader br = new BufferedReader(new InputStreamReader(is));
            this.date = parseDate(br.readLine()); // date value
            br.readLine();                        // empty line
            String str = br.readLine();
            while (str != null) {
                Span span = extractSpan(str);     // doc start end
                corefTable.addSpan(conf, span.doc, span);
                str = br.readLine();              // date
                br.readLine();                    // empty line
                str = br.readLine();
            }
            return;
        }
    }
Perl program
    #!/usr/bin/perl

    sub initFilter {
        print "Building filter...\n" if ($DEBUG);
        return sub {
            my $line = shift;
            my $context = " (subm .*? | paper .*? (due|rece) .*?) ";
            my $pattern1 = $context . $Date::dateFilter;
            my $pattern2 = $Date::dateFilter . $context;
            return ( ( ($line =~ /$pattern1/oxi) || ($line =~ /$pattern2/oxi) )
                     && ($line !~ /camera|abstract/oxi) );
        };
    }
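
For readers less familiar with Perl, the sketch below restates the filter's logic in Java. A line is kept when a submission context ("subm...", or "paper ... due/rece...") appears directly before or directly after a date, and the line mentions neither "camera" nor "abstract". Because the Perl patterns use the x modifier, the blanks inside $context are ignored, so they are dropped here. The DATE pattern is only a hypothetical stand-in for $Date::dateFilter, whose definition is not shown on this page, and the class name SubmissionLineFilter is likewise an assumption.

```java
import java.util.regex.Pattern;

// Hypothetical Java restatement of the Perl filter, not part of AKIRA.
class SubmissionLineFilter {
    // Assumed stand-in for $Date::dateFilter: a month name followed by a day number.
    static final String DATE =
        "\\b(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*\\.?\\s+\\d{1,2}";

    // Same context as the Perl $context string, without the ignored blanks.
    static final String CONTEXT = "(?:subm.*?|paper.*?(?:due|rece).*?)";

    static final Pattern CONTEXT_THEN_DATE =
        Pattern.compile(CONTEXT + DATE, Pattern.CASE_INSENSITIVE);
    static final Pattern DATE_THEN_CONTEXT =
        Pattern.compile(DATE + CONTEXT, Pattern.CASE_INSENSITIVE);
    static final Pattern EXCLUDED =
        Pattern.compile("camera|abstract", Pattern.CASE_INSENSITIVE);

    /** True if the line looks like it states a paper submission date. */
    static boolean accepts(String line) {
        boolean inContext = CONTEXT_THEN_DATE.matcher(line).find()
                         || DATE_THEN_CONTEXT.matcher(line).find();
        return inContext && !EXCLUDED.matcher(line).find();
    }

    public static void main(String[] args) {
        String[] lines = {
            "Submission deadline: July 31, 1998",
            "Camera-ready papers due August 15, 1998",
            "The conference will be held in Philadelphia"
        };
        for (String line : lines) {
            System.out.println(accepts(line) + "  " + line);
        }
    }
}
```

Compiling the patterns once as static fields plays the same role as the o modifier in the Perl version.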
Future Work
Future work will probably include the following topics:
- cross-optimization between IR, IE and database query execution
- improving system performance
- moving from CGI to CORBA/ILU
- looking at mobile agent technology
- applying AKIRA to an area other than "Call for Papers" to check its
  genericity
- looking at visual interfaces to help the user formulate her query





Copyright AKIRA Project 1998, University of Pennsylvania
Department of Computer and Information Science, Institute for Research in Cognitive Science