Hercules Dalianis
NADA-KTH,
SE-100 44
Stockholm, Sweden
ph: +46 8 790 91 05
mob ph +46 70 568 13 59
fax: +46 8 10 24 77
email: hercules@nada.kth.se
August 2000
This paper describes state-of-the-art techniques in the area of automatic text summarization and how these techniques are applied in SweSum, the first text summarizer for Swedish. SweSum is built on statistical, linguistic and heuristic methods. SweSum uses a dictionary of 700,000 word entries which tells whether a word belongs to an open word class and also gives the stem of the word. SweSum has been evaluated and its performance is estimated to be as good as state-of-the-art systems for English, i.e. summarizing a 2-3 page news text to an average of 30% of its original length gives a good summary.
Text summarization, or automatic text summarization, is the technique of automatically creating an abstract or summary of a text. The technique has been under development for many years (Luhn 1959; Edmundson 1969; Salton 1989), but in recent years, with the increased use of the Internet, there has been a renewed interest in summarization techniques.
According to Hovy and Lin (1997) there are two ways to view text summarization: either as text extraction or as text abstraction. Text extraction means extracting pieces of an original text on a statistical basis or with heuristic methods and putting them together into a new, shorter text with the same information content. Text abstraction means parsing the original text in a linguistic way, interpreting the text, finding new concepts to describe it, and then generating a new, shorter text with the same information content.
The parsing and interpretation of text is a long-standing research area. It offers a spectrum of techniques and methods ranging from word-by-word parsing to rhetorical discourse parsing, as well as more statistical methods, or a mixture of all of these.
In (Dalianis & Hovy, 1996) a method called lexical aggregation is described, where two concepts in a text can be replaced by another concept to make the text shorter and sometimes less redundant; for example, the concepts selling and buying become business. Yet another method, called syntactic aggregation, is described in (Dalianis & Hovy, 1996), where coordination is performed to make a text less redundant; for example, Mary walks and John walks becomes Mary and John walk. Lexical aggregation is related to the so-called concept fusion described in (Hovy & Lin, 1997).
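As a toy illustration of syntactic aggregation, the sketch below (in Perl, the language later used for SweSum) coordinates clauses that share a predicate. It is not the implementation of Dalianis & Hovy (1996); the clause representation and the English agreement rule (stripping a final -s) are simplifying assumptions.

    use strict;
    use warnings;

    # Toy syntactic aggregation: "Mary walks" + "John walks"
    # => "Mary and John walk".
    my @clauses = ( [ 'Mary', 'walks' ], [ 'John', 'walks' ] );

    # Group subjects by their shared verb.
    my %by_verb;
    push @{ $by_verb{ $_->[1] } }, $_->[0] for @clauses;

    for my $verb (sort keys %by_verb) {
        my @subjects = @{ $by_verb{$verb} };
        if (@subjects > 1) {
            (my $plural = $verb) =~ s/s$//;   # crude English agreement rule
            print join(' and ', @subjects), " $plural.\n";
        }
        else {
            print "$subjects[0] $verb.\n";
        }
    }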
Text summarization can thus be divided into three steps: first, understanding the topic of the text, so-called topic identification (Hovy & Lin, 1997; Lin & Hovy, 1997); secondly, the interpretation of the text; and finally, the generation of the new text. As mentioned above, the generation can be carried out in two different ways, namely extraction and abstraction. Abstraction must make use of a natural language generator, as for example (Dalianis, 1999).
Topic identification is also used in information retrieval when one wants to find keywords for categorizing a text in, for example, a library.
There are many methods for performing topic identification, see (Lin & Hovy, 1997): word counting at the concept level, which is more advanced than simple word counting, and identification of cue phrases that signal the topic.
Another method for identifying topics is to perform rhetorical parsing and build an RST tree in which the nuclei are identified as the topics (Marcu, 1997). Summarization of multiple texts has been investigated in (McKeown & Radev, 1995).
Lin (1999) describes a set of summarization methods and algorithms based on extraction:
Baseline: The order of the sentences in the text determines their importance: the first sentence gets the highest ranking, the last sentence the lowest.
Title: Sentences containing words that also occur in the title are given a high score.
Term frequency (tf): Open-class terms which are frequent in the text are more important than less frequent ones. The open word classes are those that accept new words over time (e.g. nouns, verbs and adjectives).
Position score: The assumption is that certain genres put important sentences in fixed positions. For example, newspaper articles carry the most important information in the first four paragraphs.
Query signature: The user's query affects the summary in that the extract will contain these query words.
Sentence length: The length of a sentence indicates how important it is.
Average lexical connectivity: The number of terms shared with other sentences. The assumption is that a sentence sharing more terms with other sentences is more important.
Numerical data: Sentences containing numerical data obtain boolean value 1, i.e. are scored higher than sentences without numerical values.
Proper name: Ditto for proper names in sentences.
Pronoun and adjective: Ditto for pronouns and adjectives in sentences; pronouns reflect coreference connectivity.
Weekdays and months: Ditto for weekdays and months.
Quotation: Sentences containing quotations might be important for certain user queries.
First sentence: The first sentence of each paragraph is among the most important sentences.
Decision tree combination function: All the above parameters are put into a decision tree trained on a set of texts and their manual summaries.
Simple combination function: All the above parameters are normalized and put into a combination function with no special weighting (a minimal sketch follows after this list).
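To make the last item concrete, the following is a minimal sketch of such a simple combination function in Perl. It is not Lin's code; the feature values are invented, and only three of the features above are included.

    use strict;
    use warnings;

    # Invented raw feature values per sentence (baseline, numerical data, tf).
    my @sentences = (
        { text => 'Stock prices rose 4% on Monday.', baseline => 1.00, numeric => 1, tf => 3 },
        { text => 'Analysts were not surprised.',    baseline => 0.50, numeric => 0, tf => 1 },
        { text => 'Trading closes at five.',         baseline => 0.33, numeric => 0, tf => 2 },
    );

    # Normalize every feature to [0,1] over the whole text.
    for my $feat (qw(baseline numeric tf)) {
        my ($max) = sort { $b <=> $a } map { $_->{$feat} } @sentences;
        $_->{$feat} = $max ? $_->{$feat} / $max : 0 for @sentences;
    }

    # Simple combination function: plain sum, no special weighting.
    $_->{score} = $_->{baseline} + $_->{numeric} + $_->{tf} for @sentences;

    # The highest-scoring sentences would be extracted into the summary.
    printf "%.2f  %s\n", $_->{score}, $_->{text}
        for sort { $b->{score} <=> $a->{score} } @sentences;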
The domain of SweSum is Swedish HTML-tagged newspaper text. SweSum ignores the HTML tags which control the format of the page but processes the HTML tags which control the format of the text. The summarizer is written in Perl (Wall et al., 1996), which is an excellent string-processing language.
The idea is that high-scoring sentences in the original text are kept in the summary; the scores are calculated according to the criteria below.
Since the processed text is newspaper text, we have the genre newspaper text and therefore use the so-called position score: sentences at the beginning of the text are given higher scores than those at the end. The formula is 1/n, where n is the line number, the so-called baseline.
Sentences in bold text, as indicated by HTML tags, are given a higher score than sentences without bold tagging; ditto for title tagging. Bold text also indicates the beginning of a new paragraph in some of the Swedish newspaper texts.
Sentences containing numerical data are given a higher score than
the ones without numerical values.
Sentences which contain keywords are scored high, the so-called term frequency (tf).
To find the keywords one needs a dictionary of all open word classes, that is, of the meaning-carrying words. Since Swedish is an inflecting language, it is very important to demorph each word, i.e. to find its stem. Both the stemming and the identification of open word classes are carried out with a Swedish roottable (morphological lexicon) containing 700,000 words or entries (Carlberger & Kann, 1999). All the above parameters are normalized and put into a naïve combination function with no special weighting to obtain the total score of each sentence.
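A minimal sketch of this scoring scheme is given below. The hash %root is a three-word stand-in for the 700,000-entry roottable, the tag matching is simplified, and the normalization step is omitted for brevity, so this illustrates the idea rather than the actual SweSum code.

    use strict;
    use warnings;

    # Toy stand-in for the Swedish roottable: inflected form => stem,
    # listing open-class (content) words only.
    my %root = ( bilen => 'bil', bilar => 'bil', huset => 'hus' );

    my @sentences = (
        '<b>Bilen och bilar</b>',
        'Huset kostade 2 miljoner.',
        'Det var allt.',
    );

    # Pass 1: term frequency over stems of open-class words.
    my %tf;
    for my $s (@sentences) {
        $tf{ $root{lc $_} }++ for grep { exists $root{lc $_} } split /\W+/, $s;
    }

    # Pass 2: position score (1/n) plus boosts for bold tags,
    # numerical data and frequent keywords.
    my $n = 0;
    for my $s (@sentences) {
        $n++;
        my $score = 1 / $n;                   # baseline/position score
        $score += 1 if $s =~ m{<b>}i;         # bold text boost
        $score += 1 if $s =~ /\d/;            # numerical data boost
        $score += $tf{ $root{lc $_} }         # keyword (tf) contribution
            for grep { exists $root{lc $_} } split /\W+/, $s;
        printf "%.2f  %s\n", $score, $s;
    }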
The user of SweSum can also enter his/her own keywords into the system and will then obtain a more user-centered summarization. The length of the summary, i.e. the compression rate, is of course selected by the user. Finally, the naïve combination function, in which all the above parameters are normalized and combined with no special weighting, follows (Lin, 1999).
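The sketch below illustrates the user-supplied keywords and the selected compression rate. The pre-computed sentence scores and the keyword boost value are invented; after the boost, the top-scoring sentences are kept and printed in their original order.

    use strict;
    use warnings;

    # Invented pre-computed sentence scores and texts.
    my @text   = ('First sentence.', 'Background detail.', 'Sweden won 2-1.',
                  'More detail.', 'Closing remark.');
    my @scores = (1.00, 0.40, 0.90, 0.30, 0.20);

    my %keywords = map { lc $_ => 1 } qw(Sweden);   # user-supplied keywords
    my $rate     = 0.4;                             # user-selected: keep 40%

    # Boost sentences containing a user keyword (boost value is a guess).
    for my $i (0 .. $#text) {
        $scores[$i] += 2 for grep { $keywords{lc $_} } split /\W+/, $text[$i];
    }

    # Keep the top-scoring sentences, restoring original order so the
    # summary reads as running text.
    my $keep = int(@text * $rate + 0.5) || 1;
    my @top  = (sort { $scores[$b] <=> $scores[$a] } 0 .. $#text)[0 .. $keep - 1];
    print "$text[$_]\n" for sort { $a <=> $b } @top;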
SweSum was used in a field test within the framework of 2D1418 Språkteknologi
(Human Language Technology), a 4-credit course at NADA/KTH, Stockholm. The
students were given the task of automatically summarising 10 texts consisting of news articles and movie reviews. The purpose was to see how much a text can be summarised without losing coherence or important information.
The nine students carried out the test by first reading the text to be summarised and then gradually lowering the size of the summary, giving SweSum the proportion of the original text they would like in the summary, and noting in a questionnaire when coherence was broken and when important information went missing. This procedure was repeated for each of the 10 texts.
As can be seen in the Appendix, not all of the students completed the whole questionnaire, leaving the field test inconclusive. Despite this, one can conclude that most of the time the students came to fairly similar conclusions. There are naturally exceptions, which only serve to exemplify the subjective nature of the test.
There are no corpora with manual extracts available for Swedish, as there are for English. Therefore it is difficult to make a proper evaluation of automatic summarization in Swedish. We are planning to create such corpora using the technique proposed by Marcu (1999). Since we had very few participants in our field test, we decided to use the median as the statistical measure of our results. We first calculated the total amount of summarised text (given in percent).
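For reference, a minimal sketch of the median (and average) computation over such percentages; the sample values here are illustrative, not the field-test data.

    use strict;
    use warnings;
    use List::Util qw(sum);

    # Illustrative percentages for one text (not the actual test data).
    my @pct = (37, 25, 40, 15, 35, 20, 30, 35, 49);

    sub median {
        my @s   = sort { $a <=> $b } @_;
        my $mid = int(@s / 2);
        return @s % 2 ? $s[$mid] : ($s[$mid - 1] + $s[$mid]) / 2;
    }

    printf "median: %.0f%%  average: %.0f%%\n", median(@pct), sum(@pct) / @pct;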
                Information   Coherence
Total median:       30%          24%
Total average:      31%          26%

Table 1. Results from the field test.
From the field test we can conclude that the Swedish text summariser SweSum performs on a par with the English state-of-the-art text summarisers. According to Lin (1999), around 30% summarisation gives an ideal summarised text for English. Lin (1999) also states that one can expect 70-80% accuracy on a 30% summary of a 2-3 page news article, measured by F-score or tf-idf. Compare this to the summarization algorithm of MS Word 97, which gives its best summarization at around 35% of a text (Lin, 1999).
SweSum is available for testing at (Dalianis, 2000). There is also an English version of the summarizer, in which the Swedish roottable is replaced by an English one; the text summarization engine is identical.
Future extensions of SweSum can be the following:
We are currently working on pronoun resolution to make the summarized text more coherent. Incoherence occurs specifically when the summaries are below 30% of the original texts, e.g. when a pronoun reference hangs free with no referent in the text. Pronoun resolution will resolve the pronouns in the text and replace them with the original noun when necessary. Some early results are described in (Hassel & Dalianis, 2000).
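A deliberately naive sketch of the idea, not the method of Hassel & Dalianis (2000): resolve a third-person pronoun to the most recently seen proper name.

    use strict;
    use warnings;

    # Toy heuristic: replace "he"/"she" with the last proper name seen.
    my $text = 'Maria wrote the report. She sent it on Friday.';
    my $last_name;

    my @out;
    for my $tok (split /(\s+)/, $text) {
        $last_name = $tok
            if $tok =~ /^[A-Z][a-z]+$/ && $tok !~ /^(?:He|She|The)$/;
        $tok = $last_name
            if defined $last_name && lc($tok) =~ /^(?:he|she)$/;
        push @out, $tok;
    }
    print join('', @out), "\n";   # "Maria wrote the report. Maria sent ..."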
One possibility when doing topic identification or keyword extraction is to use synonym terms by means of some sort of Swedish ontology similar to WordNet, namely the Swedish WordNet, Swordnet, currently being developed at the Department of Linguistics at Lund University. Compare selling and buying becoming business above.
Before performing the summarization, one could analyze the text to be summarized using the RIX readability index, or let the user select different profiles for the summarization depending on the type of text: newspaper text, academic articles, business reports, technical reports, social science reports, etc. Studies on how to recognize text genres by using an advanced RIX have been carried out by Karlgren & Cutting (1994) and Karlgren (2000).
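As an illustration, a RIX computation might look like the sketch below, assuming Anderson's definition of RIX as the number of long words (seven or more letters) divided by the number of sentences; the genre remark in the comment is an assumption, not taken from the cited studies.

    use strict;
    use warnings;

    # RIX readability index: long words (7+ letters) per sentence.
    sub rix {
        my ($text)    = @_;
        my @sentences = grep { /\S/ } split /[.!?]+/, $text;
        my $long      = grep { length($_) >= 7 } $text =~ /(\w+)/g;
        return @sentences ? $long / @sentences : 0;
    }

    # Example: 7 long words over 2 sentences gives RIX 3.5; a higher RIX
    # would suggest heavier text such as academic articles or reports.
    my $sample = 'The experimental summarizer processed lengthy newspaper '
               . 'articles. It worked surprisingly well.';
    printf "RIX: %.1f\n", rix($sample);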
Regarding text summarization in other languages, it is easy to replace the Swedish roottable with the roottable of another language while keeping the same text summarization engine, and hence accomplish automatic text summarization in other languages.
I would like to thank Johan Carlberger at NADA/KTH for his help in acquiring the Swedish morphological lexicon for SweSum. I would also like to thank Martin Hassel at DSV, Stockholm University/KTH, for interpreting the results of the evaluation of SweSum, and finally I would like to thank the students of the course 2D1418 for their willingness to participate in our field studies.
J. Carlberger and V. Kann. 1999. Implementing an efficient
part-of-speech tagger. Software Practice and Experience, 29, 815-832.
H. Dalianis. 2000. SweSum - A Swedish Text Summarizer. http://www.nada.kth.se/~hercules/Textsumsummary.html (this report); the summarizer is available at http://www.nada.kth.se/~xmartin/swesum/index.html and http://www.nada.kth.se/~hercules/textsum/textsum8.html
H. Dalianis. 1999. ASTROGEN - Aggregated deep and Surface naTuRal language GENerator. http://www.dsv.su.se/~hercules/ASTROGEN/ASTROGEN.html
H. Dalianis and E. Hovy. 1996. Aggregation in Natural Language Generation. In Adorni, G. & Zock, M. (Eds.), Trends in Natural Language Generation: an Artificial Intelligence Perspective, EWNLG'93, Fourth European Workshop, Lecture Notes in Artificial Intelligence, No. 1036, pp. 88-105, Springer Verlag.
H. Dalianis & E. Hovy. 1997. On Lexical Aggregation and
Ordering. In Proceedings of the 6th European Workshop on Natural Language
Generation, pp. 17-27, March 24 - 26, 1997, Gerhard-Mercator University,
Duisburg, Germany.
H.P. Edmundson. 1969. New Methods in Automatic Extraction. Journal of the ACM 16(2), pp. 264-285.
M. Hassel and H. Dalianis. 2000. Pronominal Resolution in Text Summarisation. Submitted to the ACL'2000 Workshop on Recent Advances in Natural Language Processing and Information Retrieval, October 7-8, 2000, Hong Kong.
E. Hovy and C-Y Lin. 1997. Automated Text Summarization in SUMMARIST. In Proceedings of the Workshop on Intelligent Scalable Text Summarization, July.
J. Karlgren. 2000. Stylistic Experiments for Information Retrieval. Ph.D. Thesis (Filosofie Doktorsavhandling), Department of Linguistics, Stockholm University.
J. Karlgren and D. Cutting. 1994. Recognizing Text Genres
with Simple Metrics Using Discriminant Analysis, Proceedings of COLING 94,
Kyoto, Japan.
C-Y Lin and E. Hovy. 1997. Identifying Topics by Position. In Proceedings of the 5th Conference on Applied Natural Language Processing, March.
C-Y Lin. 1999. Training a Selection Function for Extraction. Submitted to SIGIR 99.
H.P. Luhn. 1959. The Automatic Creation of Literature Abstracts. IBM Journal of Research and Development, pp. 159-165.
D. Marcu. 1999. The construction of large-scale corpora for summarization research. In Proceedings of the International Conference on Research and Development in Information Retrieval (SIGIR-99), pp. 137-144.
D. Marcu. 1997. From Discourse Structures to Text Summaries. In Proceedings of the ACL'97/EACL'97 Workshop on Intelligent Scalable Text Summarization, pp. 82-88, Madrid, Spain, July.
K. McKeown and D. Radev. 1995. Generating summaries of
multiple news articles. In Proceedings, 18th Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval, pages 74-82,
Seattle, Washington, July.
G. Salton. 1989. Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer. Addison-Wesley Publishing Company.
L. Wall, T. Christiansen, and R.L. Schwartz. 1996. Programming Perl. O'Reilly & Associates Inc.
In the tables below, P1-P9 on the horizontal axis stands for persons 1-9 in the test group. The numbers on the vertical axis each represent a text in the field test. The percentages given represent how much of the original text was extracted into the summary. A dash marks a case where the student gave no answer.
Important information is missing at:

Text\Person    P1    P2    P3    P4    P5    P6    P7    P8    P9
 1             37%   25%   40%   15%   35%   20%   30%   35%   49%
 2             40%   25%   55%   45%   45%   20%   40%   25%   53%
 3             30%   30%   45%   25%   50%   25%   20%   30%   55%
 4             20%   15%   20%   15%   35%   35%   50%   25%   29%
 5             60%   20%   10%   15%   20%   15%   20%   45%   15%
 6             37%   30%   99%   35%   40%   40%   30%   25%   -
 7             44%   25%   30%   30%   25%   50%   50%   41%   -
 8             30%   30%   30%   25%   35%   40%   -     -     -
 9             20%   15%   20%   25%   15%   30%   19%   -     -
 10            15%   10%   10%   35%   10%   20%   -     -     -
Coherence is broken at:

Text\Person    P1    P2    P3    P4    P5    P6    P7    P8    P9
 1             25%   20%   35%   30%   35%   25%   20%   20%   -
 2             20%   30%   55%   30%   60%   20%   15%   53%   -
 3             27%   25%   25%   25%   70%   25%   15%   25%   55%
 4             15%   15%   15%   10%    5%   -     -     -     -
 5             30%   20%   10%   20%   20%   15%   25%   -     -
 6             30%   10%   30%   35%   35%   10%   15%   15%   -
 7             20%   10%   25%   30%   40%   30%   35%   41%   -
 8             20%   60%   25%   35%   35%   20%   -     -     -
 9              5%   20%   40%   10%   10%   44%   -     -     -
 10             5%   10%   35%   10%   10%   -     -     -     -