Introduction
It has long been known that a bag-of-words document representation can achieve good results in text categorization. Attempts with more advanced approaches using proper nouns, complex nominals and word senses have not resulted in any conclusive improvements.
In my thesis, I examine the potential benefits of using features taken from the output of syntactic and semantic parsers to enhance a bag-of-words representation. These new representations are then evaluated against the bag-of-words baseline on Reuters' large corpus of news articles, using an SVM classifier.
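As a rough illustration of what "enhancing" a bag-of-words can mean, the sketch below adds parser-derived features to a plain word-count representation. It is only a minimal sketch under stated assumptions: the feature strings, the `SEM=` prefix and the function names are hypothetical and are not taken from the thesis, whose actual feature encoding is not described here.

```python
from collections import Counter


def bag_of_words(tokens):
    """Count each lowercased token; word order is discarded."""
    return Counter(t.lower() for t in tokens)


def add_parser_features(bow, parser_features, prefix="SEM="):
    """Return a new bag with parser-derived features added next to the words.

    The prefix (a hypothetical convention) keeps the extra features from
    colliding with ordinary word tokens.
    """
    enhanced = Counter(bow)
    enhanced.update(prefix + f for f in parser_features)
    return enhanced


tokens = "Acme buys Widget Corp and Acme sells bonds".split()

# Hypothetical predicate-argument features, written by hand for illustration.
# A real system would derive these from a semantic parser's output.
semantic = ["buy.A0:acme", "buy.A1:widget_corp", "sell.A0:acme"]

baseline = bag_of_words(tokens)
enhanced = add_parser_features(baseline, semantic)
```

Both `baseline` and `enhanced` are sparse count vectors over feature names, so the same classifier can be trained on either one and the two results compared directly.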
Results
This is a subset of the results from the experiments I have run. A more in-depth analysis will be given in my thesis.
I ran my experiments on the RCV1-v2 corpus, which is a corrected version of RCV1 (Reuters Corpus Volume 1).
Please note that this is preliminary data.
Semantic data / Bag-of-words
Comparison between the bag-of-words and the semantically enhanced bag-of-words representations, optimized for macro-averaged F1.
Comparison between the bag-of-words and the semantically enhanced bag-of-words representations, optimized for micro-averaged F1.
Micro-averaged F1
bag-of-words: 81.72
bag-of-words+semantic: 81.92
Macro-averaged F1
bag-of-words: 62.17
bag-of-words+semantic: 62.38
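For readers unfamiliar with the two averages, the sketch below shows how micro- and macro-averaged F1 are computed from per-category counts. The category names and counts are made-up toy values, not data from my experiments, and the function names are my own. Micro-averaging pools the counts over all categories before computing F1, so frequent categories dominate; macro-averaging computes F1 per category and then takes the unweighted mean, so rare categories weigh as much as frequent ones.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(counts):
    """counts maps category name -> (tp, fp, fn). Returns (micro, macro)."""
    # Macro: average the per-category F1 scores with equal weight.
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    # Micro: pool the counts over all categories, then compute one F1.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f1(tp, fp, fn)
    return micro, macro


# Toy counts (illustrative only): one frequent, well-classified category
# and one rare, poorly classified category.
counts = {"frequent": (90, 10, 10), "rare": (1, 3, 3)}
micro, macro = micro_macro_f1(counts)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

In this toy case the macro average falls well below the micro average because the rare category is classified poorly. A gap in the same direction appears in the figures above, which would be consistent with weaker performance on less frequent categories, although the numbers here do not establish that on their own.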