Introduction

It has long been known that a bag-of-words document representation can achieve good results in text categorization. Attempts with more advanced approaches using proper nouns, complex nominals, and word senses have not resulted in any conclusive improvements.

In my thesis, I examine the potential benefits of using features taken from the output of syntactic and semantic parsers to enhance a bag-of-words representation. These new representations are then evaluated on Reuters' large corpus of journalistic articles with an SVM classifier against the bag-of-words baseline.
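As a rough illustration of what enhancing a bag-of-words representation can mean in practice, the sketch below merges word counts with extra parser-derived features in one sparse feature map. It is only a sketch under stated assumptions: the function names, the "w=" and "sem=" prefixes, and the example feature string are hypothetical and are not taken from the thesis.

```python
from collections import Counter


def bag_of_words(tokens):
    """Count lowercased tokens; the 'w=' prefix (an assumption) marks word features."""
    return Counter("w=" + token.lower() for token in tokens)


def enhance(bow, extra_features, prefix):
    """Return a new feature map: the word counts plus prefixed extra features.

    extra_features is any iterable of feature strings, for example strings
    derived from a parser's output. The prefix keeps them in their own
    namespace so they cannot collide with plain word features.
    """
    combined = Counter(bow)  # copy, so the baseline representation is untouched
    combined.update(prefix + feature for feature in extra_features)
    return combined


# Hypothetical example: one document and one made-up predicate-argument feature.
tokens = ["Acme", "buys", "Acme"]
baseline = bag_of_words(tokens)
enhanced = enhance(baseline, ["buy:agent=acme"], prefix="sem=")
```

Keeping the baseline map unchanged makes it straightforward to train one classifier on the baseline features and one on the enhanced features and compare them on the same documents.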

Results

This is a subset of the results from the experiments I have run. A more in-depth analysis will be given in my thesis.

I ran my experiments on the RCV1-v2 corpus, which is a corrected version of RCV1 (Reuters Corpus Volume 1).

Please note that this is preliminary data.

Semantic data / Bag-of-words

Comparison between macro-averaged F1 optimized bag-of-words and semantically enhanced bag-of-words.

Comparison between micro-averaged F1 optimized bag-of-words and semantically enhanced bag-of-words.

Micro-averaged F1

bag-of-words: 81.72
bag-of-words+semantic: 81.92

Macro-averaged F1

bag-of-words: 62.17
bag-of-words+semantic: 62.38
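
For readers unfamiliar with the two averages reported above, the following minimal sketch shows how micro- and macro-averaged F1 are commonly computed for multi-label categorization. It is not the evaluation code used for these experiments; the function names and the toy data are assumptions made for illustration, and the macro average here is taken over every category that occurs in either the gold or the predicted labels.

```python
from collections import defaultdict


def f1(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def micro_macro_f1(gold, predicted):
    """gold and predicted are lists of label sets, one set per document."""
    counts = defaultdict(lambda: [0, 0, 0])  # category -> [tp, fp, fn]
    for gold_labels, predicted_labels in zip(gold, predicted):
        for label in gold_labels & predicted_labels:
            counts[label][0] += 1
        for label in predicted_labels - gold_labels:
            counts[label][1] += 1
        for label in gold_labels - predicted_labels:
            counts[label][2] += 1

    # Micro average: pool the counts of all categories, then compute one F1.
    total_tp = sum(c[0] for c in counts.values())
    total_fp = sum(c[1] for c in counts.values())
    total_fn = sum(c[2] for c in counts.values())
    micro = f1(total_tp, total_fp, total_fn)

    # Macro average: compute F1 per category, then take the unweighted mean.
    macro = sum(f1(*c) for c in counts.values()) / len(counts) if counts else 0.0
    return micro, macro


# Toy data (made up): category "A" is frequent, category "B" is rare and missed.
gold = [{"A"}, {"A"}, {"A", "B"}, {"B"}]
predicted = [{"A"}, {"A"}, {"A"}, {"A"}]
micro, macro = micro_macro_f1(gold, predicted)
```

Because the micro average pools decisions across all categories, frequent categories dominate it, while the macro average weights every category equally. On corpora with many rare categories this typically makes the macro figure lower than the micro figure.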

News

No news yet.