ABSTRAKSI: Text categorization (also known as text classification) is the task of automatically assigning the documents of a collection D to a set of predefined categories C. One of the stages in text categorization is text preprocessing, which includes feature selection and term weighting.
One well-known weighting method is TF-IDF. In this method, the frequency of each term/word within a document (term frequency) is combined with the frequency with which the term occurs across the document collection (inverse document frequency). A term that appears often in a document but rarely in the rest of the collection receives a high weight: the TF-IDF value increases with the number of occurrences of the term in a document and decreases with the number of documents in the collection that contain the term.
However, since text categorization is a supervised task, with the dataset split into a training set and a testing set, a weighting method that exploits this supervision is needed. In the standard Information Retrieval setting the IDF assumption is reasonable, because a term that occurs in too many documents is a poor discriminator. But when training data are available, a better approach is to prefer terms that are distributed most differently between the positive and the negative training examples. Training data are not available for queries in standard IR, but they are usually available for categories in text categorization, where the notion of "relevance to a query" is replaced by "membership in a category". Category-based term evaluation functions such as Chi-square, Information Gain (IG), and Gain Ratio (GR) are therefore used in place of the IDF component of TF-IDF. This approach is called Supervised Term Weighting, and it is the method used in this thesis.
In this method, the term evaluation function value of each selected term is combined with its term frequency in every document; the resulting representation is then classified with SVM and evaluated in terms of precision, recall, F-measure, and accuracy. The results show that Supervised Term Weighting performs better than TF-IDF, especially under the local threshold policy.
Keywords: text categorization, term frequency, inverse document frequency, term weighting, supervised term weighting.

ABSTRACT: Text categorization (also known as text classification) is the task of automatically sorting a collection of documents D into a set of predefined categories C, i.e. deciding, through an assignment function Φ defined on D × C, which documents belong to which categories. In text categorization, one of the processes is text preprocessing, which includes feature selection and term weighting.
One well-known term weighting method is TF-IDF. In this method, the frequency of each term/word within a document (term frequency) is combined with the frequency with which the term occurs across the document collection (inverse document frequency). A term that appears often in a document but rarely in the rest of the collection is given a high weight: the TF-IDF value increases with the number of occurrences of a term in a document and decreases with the number of documents in the collection that contain the term.
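As an illustration only (not code from the thesis), the following minimal Python sketch computes the TF-IDF weights just described, assuming the usual formulation tf(t, d) · log(N / df(t)); the function name and the toy tokenized documents are made up for the example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Weight every term of every tokenized document with tf * log(N / df).

    tf(t, d) is the raw count of term t in document d, N is the number of
    documents, and df(t) is the number of documents that contain t.
    """
    n_docs = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)               # term frequency inside this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Toy tokenized corpus: a term confined to one document gets a high weight,
# while a term present in every document gets weight 0.
docs = [["saham", "bank", "naik", "naik"],
        ["saham", "bank", "turun"],
        ["cuaca", "hari", "ini", "bank"]]
print(tf_idf(docs)[0])
```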
Since text categorization is supervised, with the dataset divided into training and testing sets, we need a weighting method that exploits this supervision. In standard IR contexts the IDF assumption is reasonable, since it encodes the quite plausible intuition that a term tk occurring in too many documents is not a good discriminator, i.e. when it occurs in a query q it is not sufficiently helpful in discriminating the documents relevant to q from the irrelevant ones. However, if training data for the query were available (i.e. documents whose relevance or irrelevance to q is known), an even stronger intuition should be brought to bear: the best discriminators are the terms that are distributed most differently in the sets of positive and negative training examples. Training data are not available for queries in standard IR contexts, but are usually available for categories in TC contexts, where the notion of "relevance to a query" is replaced by the notion of "membership in a category". In these contexts, category-based term evaluation functions such as Chi-square, Information Gain (IG), and Gain Ratio (GR), which score terms according to how differently they are distributed in the sets of positive and negative training examples, are better substitutes for IDF-like functions. This approach is known as supervised term weighting and is the method used in this thesis.
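To illustrate how a category-based function can stand in for IDF, the sketch below, an assumption-laden example rather than the thesis implementation, scores each term with the Chi-square statistic estimated from labeled training documents and multiplies it by the term frequency (tf × χ²); IG or GR could be plugged in the same way. The two-class setup, function names, and toy data are all illustrative.

```python
from collections import Counter

def chi_square_scores(docs, labels, positive_label):
    """Chi-square score of each term with respect to one category.

    For term t: A = positive docs containing t, B = negative docs containing t,
    C = positive docs without t, D = negative docs without t, N = A+B+C+D.
    chi2(t) = N * (A*D - C*B)^2 / ((A+C) * (B+D) * (A+B) * (C+D))
    """
    n = len(docs)
    n_pos = sum(1 for y in labels if y == positive_label)
    n_neg = n - n_pos
    df_pos, df_all = Counter(), Counter()
    for doc, y in zip(docs, labels):
        terms = set(doc)
        df_all.update(terms)
        if y == positive_label:
            df_pos.update(terms)
    scores = {}
    for t, df in df_all.items():
        a = df_pos[t]
        b = df - a
        c = n_pos - a
        d = n_neg - b
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        scores[t] = n * (a * d - c * b) ** 2 / denom if denom else 0.0
    return scores

def tf_chi2(doc, chi2):
    """Supervised weight: term frequency times the term's chi-square score."""
    tf = Counter(doc)
    return {t: tf[t] * chi2.get(t, 0.0) for t in tf}

# Toy training set with two categories ("ekonomi" as the positive class).
train = [["saham", "bank", "naik"], ["saham", "turun"],
         ["cuaca", "cerah"], ["cuaca", "hujan", "bank"]]
labels = ["ekonomi", "ekonomi", "lainnya", "lainnya"]
chi2 = chi_square_scores(train, labels, "ekonomi")
print(tf_chi2(["saham", "saham", "bank"], chi2))
```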
In this method, the term evaluation function value of each selected term is combined with its term frequency in each document; classification is then performed with SVM and evaluated using precision, recall, F-measure, and accuracy as parameters. The results show that the supervised term weighting method gives better performance than TF-IDF, especially under the local threshold policy.
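A minimal sketch of this classification and evaluation step, assuming scikit-learn's LinearSVC and its precision/recall/F1/accuracy metric functions; the feature matrices, labels, and numbers are placeholder values, not results from the thesis.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# Placeholder document-term matrices: rows are documents, columns are selected
# terms, and entries are supervised weights (e.g. tf * chi-square). A real
# experiment would build these from the training/testing splits of the corpus.
X_train = np.array([[8.0, 0.0, 1.2],
                    [5.5, 0.3, 0.0],
                    [0.0, 6.1, 0.9],
                    [0.4, 7.3, 0.0]])
y_train = np.array([1, 1, 0, 0])        # category labels
X_test = np.array([[6.2, 0.2, 0.7],
                   [0.1, 5.8, 0.3]])
y_test = np.array([1, 0])

clf = LinearSVC()                       # linear SVM classifier
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("f-measure:", f1_score(y_test, y_pred))
print("accuracy :", accuracy_score(y_test, y_pred))
```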
Keywords: text categorization, term frequency, inverse document frequency, term weighting, supervised term weighting.