Create relevancy from disparate sources of
information. Integrate the Extractor Engine into your
software applications to summarize documents into lists
of keywords and key phrases, with contextual links back
to the originating document(s).
What is text summarization? By definition, text
summarization is: to comprise in, or reduce to, a
summary; to present briefly; quickly executed.
In terms of computer-automated text summarization,
there are many definitions and implementations, including
Bayesian, heuristic and linguistic approaches. Extractor
uses a genetic approach, which in itself provides a
learning process. This matters when the summarization
utility has to move from one domain to another: other
approaches are traditionally domain specific and
therefore require greater human intervention to adjust
to a new domain.
The Extractor APIs have been designed for maximum
flexibility, allowing a wide variety of applications to
take advantage of this unparalleled technology.
Supported development languages include:
- C (C, C++, VC++)
- Java
- Visual Basic
- Python
- Perl
There are 26 primary API function calls that provide
the development team with full control of the Extractor
DLL and presentation of the extracted results.
Extractor supports the Windows, Solaris and Linux
computing platforms. Other computing platforms, such as
HP-UX, AIX or Mac OS, can be custom compiled. (Once the
computing platform is confirmed and the custom
compilation is engaged, the process can take from one to
two weeks for final testing and release.)
Multiple Threads with the Extractor API: The API
for Extractor allows several documents to be processed
simultaneously, using a separate thread for each
document. This is useful, for example, when processing
web pages. A major bottleneck when downloading web
pages is waiting for web servers to respond to requests
for pages. One way around this bottleneck is to
download several pages simultaneously, using a separate
thread to process each page.
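The idea of overlapping slow downloads by giving each page its own thread can be sketched as follows. This is a minimal illustration in Python, not Extractor code: `fetch_page`, `summarize` and the example URLs are hypothetical stand-ins, and the server delay is simulated with a sleep so the sketch runs without a network connection.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a slow web server. A real application would
# issue an HTTP request here instead of sleeping.
def fetch_page(url):
    time.sleep(0.1)  # simulated server latency
    return f"contents of {url}"

# Hypothetical stand-in for keyphrase extraction. It does not call Extractor.
def summarize(text):
    return text.split()[:3]

def process(url):
    # Each worker thread downloads one page and then processes it.
    return url, summarize(fetch_page(url))

urls = [f"http://example.com/page{i}" for i in range(5)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(urls)) as pool:
    results = dict(pool.map(process, urls))
elapsed = time.perf_counter() - start

print(f"{len(results)} pages processed in {elapsed:.2f}s")
```

Because the waits overlap, the total time should be close to the latency of a single page rather than the sum of all five.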
Extractor is fully reentrant, to allow multithreading
without the use of Win32 services such as semaphores and
the EnterCriticalSection and LeaveCriticalSection
functions. There should be a one-to-one relationship
between threads and DocumentMemory values, so only one
thread reads or writes to a given DocumentMemory. On the
other hand, there may be a many-to-one relationship
between threads and StopMemory values. That is, many
threads may simultaneously read one StopMemory.
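The ownership rule above (one DocumentMemory per thread, one StopMemory read by many threads) can be illustrated with a small Python sketch. The `StopMemory` and `DocumentMemory` classes here are hypothetical stand-ins that only borrow the names from the text; they are not the Extractor data structures, and no Extractor function is called.

```python
import threading

class StopMemory:
    """Hypothetical stand-in: a stop-word set that worker threads only read."""
    def __init__(self, words):
        self.words = frozenset(words)

class DocumentMemory:
    """Hypothetical stand-in: per-document state owned by exactly one thread."""
    def __init__(self):
        self.tokens = []

def worker(text, stop_memory, out, index):
    doc_memory = DocumentMemory()          # created and used by this thread only
    for word in text.lower().split():
        if word not in stop_memory.words:  # read-only access to shared state
            doc_memory.tokens.append(word)
    out[index] = doc_memory.tokens         # each thread writes only its own slot

stop_memory = StopMemory({"the", "a", "of"})
documents = ["The history of Canada", "A theory of learning", "The art of the deal"]
out = [None] * len(documents)

threads = [threading.Thread(target=worker, args=(doc, stop_memory, out, i))
           for i, doc in enumerate(documents)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(out)
```

No lock is needed in this sketch because no two threads ever write to the same object: the shared stop list is immutable and each result slot has a single writer.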
Most functions that take StopMemory as an argument
only read StopMemory; they do not write. This is why
many threads can safely access the same StopMemory.
However, the functions ExtrAddStopWord and
ExtrAddStopPhrase write StopMemory. These two functions
should be called (one after the other; not at the same
time) before any other threads access StopMemory. If one
thread calls ExtrAddStopWord or ExtrAddStopPhrase with a
given value of StopMemory while a second thread calls
any function with the same value of StopMemory, the
memory may become corrupted.
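One way an application might enforce this "write first, then share" discipline is to separate a single-threaded setup phase from the multithreaded read phase. The sketch below is an assumption about how calling code could be organized, written in Python; `GuardedStopList` is a hypothetical wrapper, not part of the Extractor API, and it does not call ExtrAddStopWord or ExtrAddStopPhrase.

```python
import threading

class GuardedStopList:
    """Hypothetical wrapper: permits writes only during single-threaded setup."""
    def __init__(self):
        self._words = set()
        self._frozen = False

    def add(self, word):
        if self._frozen:
            raise RuntimeError("stop list is already shared; no further writes")
        self._words.add(word.lower())

    def freeze(self):
        self._frozen = True

    def contains(self, word):
        return word.lower() in self._words

stops = GuardedStopList()
for word in ("the", "and", "of"):   # setup: one call after another, one thread
    stops.add(word)
stops.freeze()                      # from here on the list is read-only

hits = []
lock = threading.Lock()             # protects the results list, not the stop list

def worker(word):
    found = stops.contains(word)    # concurrent reads are safe after freeze()
    with lock:
        hits.append((word, found))

threads = [threading.Thread(target=worker, args=(w,)) for w in ("The", "engine", "of")]
for t in threads:
    t.start()
for t in threads:
    t.join()

# A late write is rejected instead of silently racing with readers.
try:
    stops.add("late")
    late_write_rejected = False
except RuntimeError:
    late_write_rejected = True

print(sorted(hits), late_write_rejected)
```

The guard turns an ordering mistake into an immediate error, which is usually easier to diagnose than corrupted memory.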
Applications of Text Summarization Concepts: Text
summarization is used in many applications, most notably
for:
- Content review - defining document suitability.
- Pre-sorting document summaries for cataloging.
- Creating document indexes.
- Providing interactive query refinement.
- Defining document trends - performing document
trend analysis.
- Assisting in web page content analysis.
- Determining web page content accuracy.
- Enhancing Document Management systems.
Version History: The Extractor technology started as
a machine learning and artificial intelligence research
project at the National Research Council of Canada in
the mid-1990s. In January 1997, that R&D effort resulted
in the release of the first version of Extractor. To
this day, research and development is ongoing through
the exceptional efforts of Dr. Peter Turney at the
Interactive Information Technology Group at the National
Research Council of Canada and DBI Technologies Inc. For
the full product version history, please see
Extractor7History.htm.
Credits: Extractor is provided under a worldwide
distribution license granted to DBI Technologies Inc. by
the National Research Council of Canada. Extractor is a
patented technology held by the National Research
Council of Canada. All copyrights and intellectual
property are under the sole ownership of the National
Research Council of Canada.