2017-06-25

python-textmining

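Splitting text on multiple delimiters with re.split
import re
# raw string so that \( and \) reach the regex engine unchanged; iti is the input text
split_word = r"，|。|、|●|◎|◆|★|：|／|\(|\)|《|》|－|！|～|『|』|「|」|【|】|XXX"
words = re.split(split_word, iti)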
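
A minimal runnable sketch of the same idea. The sample string `text` and the shortened delimiter pattern are made up for illustration and are not from the original data. It also shows that adjacent or trailing delimiters produce empty strings, which usually need to be filtered out.

```python
import re

# Hypothetical sample text; the delimiter set is a shortened version of split_word.
text = "台北一日遊：故宮、101（夜景）！"
pattern = r"，|。|、|：|！|（|）"

parts = re.split(pattern, text)
# Adjacent delimiters and a trailing delimiter yield empty strings; drop them.
tokens = [p for p in parts if p]
print(tokens)
```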

N-gram text segmentation

# skip_dic (the characters to exclude) must be defined before this function is declared
def ngram(sentence, n=2, skip_word=skip_dic):
    word_dic = {}
    for i in range(0, len(sentence) - n + 1):
        gram = sentence[i:i+n]
        # ignore any n-gram that contains a character listed in skip_word
        if any(ch in skip_word for ch in gram):
            continue
        word_dic[gram] = word_dic.get(gram, 0) + 1
    return word_dic
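
As a quick check of what bigram counting produces, here is a self-contained sketch built on collections.Counter. The name `count_ngrams`, the sample sentence, and the `skip` set are illustrative and are not from the original code.

```python
from collections import Counter

def count_ngrams(sentence, n=2, skip=frozenset()):
    """Count character n-grams, ignoring any n-gram that contains a skipped character."""
    grams = (sentence[i:i + n] for i in range(len(sentence) - n + 1))
    return Counter(g for g in grams if not any(ch in skip for ch in g))

# Hypothetical input: every bigram containing "下" is dropped.
counts = count_ngrams("上車參觀下車參觀", n=2, skip={"下"})
print(counts.most_common(2))
```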

Jieba

import jieba
jwords = jieba.cut(' '.join(words), cut_all=False)
jieba.cut returns a generator, not a list. To get a list directly, use jieba.lcut instead.

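If the elements of a list can be computed by some algorithm, can we compute each subsequent element as the loop proceeds? That way there is no need to build the complete list, which saves a lot of memory. In Python, this compute-as-you-iterate mechanism is called a generator.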
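
A small sketch of generator behaviour, using a made-up `squares` function rather than jieba itself. It shows that values are produced on demand and that a generator can be consumed only once, which is why the result of jieba.cut cannot be looped over twice.

```python
def squares(limit):
    """Yield square numbers one at a time instead of building a list."""
    for i in range(limit):
        yield i * i

gen = squares(5)
first = next(gen)       # computes only the first element
rest = list(gen)        # consumes the remaining elements
leftover = list(gen)    # the generator is now exhausted
print(first, rest, leftover)
```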

Counting word frequencies

dic = {}
jwords = jieba.cut(sentence)
for word in jwords:
    if word not in dic:
        dic[word] = 1
    else:
        dic[word] += 1
sorted(dic.items(), key=lambda x: x[1], reverse=True)  # sort by value, largest first, returning (key, value) pairs
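
The same counting and sorting can be done with collections.Counter from the standard library. In this sketch the `tokens` list is a hypothetical stand-in for the output of jieba.cut.

```python
from collections import Counter

# Hypothetical pre-segmented tokens standing in for the output of jieba.cut.
tokens = ["參觀", "下車", "參觀", "上車", "參觀", "下車"]

freq = Counter(tokens)
top = freq.most_common()   # (word, count) pairs, highest count first
print(top)
```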

CountVectorizer

Reference: http://www.gegugu.com/2017/04/11/24869.html

Feature extraction

Building the document-term matrix

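(X, Y) -> X indexes each passage (document), Y indexes the segmented terms.
Example:
With corpus = ["下車參觀", "上車參觀"] processed by CountVectorizer, take X to be "下車參觀" and Y to be [下車, 上車, 參觀]; the term vector for "下車參觀" is then 1,0,1. CountVectorizer does not segment Chinese itself, so each document should already be joined into space-separated terms (for example with ' '.join on the jieba output).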
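
The following is a minimal pure-Python sketch of how a document-term matrix is built, assuming the documents are already segmented into space-separated terms. It is not scikit-learn's implementation. Columns are sorted by code point here, so they come out as 上車, 下車, 參觀 rather than in the order used in the example above.

```python
# Hypothetical corpus, already segmented into space-separated terms.
corpus = ["下車 參觀", "上車 參觀"]

docs = [doc.split() for doc in corpus]
# One column per distinct term, in sorted order.
vocab = sorted({tok for doc in docs for tok in doc})
# One row per document; each cell is the term's count in that document.
matrix = [[doc.count(term) for term in vocab] for doc in docs]

print(vocab)
for row in matrix:
    print(row)
```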

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print("shape", "->", X.shape)  # (num_samples, num_features)
word = vectorizer.get_feature_names_out()  # the extracted feature terms; get_feature_names() was removed in scikit-learn 1.2
print("***", word)
X.toarray()  # the document-term matrix
vectorizer.vocabulary_.get("參觀")  # look up a term's column index in the vocabulary

Building TF-IDF feature vectors for the documents

from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer().fit_transform(X)
weight = tfidf.toarray()    
print (weight)

Gram matrix (Gramian matrix, Gramian)

sklearn.metrics.pairwise.linear_kernel(X, Y)
Computes the relationship between X (one passage) and each row of Y (all passages).

from sklearn.metrics.pairwise import linear_kernel
print(linear_kernel(tfidf[2], tfidf).flatten())  # cosine similarities
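
linear_kernel computes plain dot products. These coincide with cosine similarities here because TfidfTransformer L2-normalises each row by default (norm='l2'). The pure-Python sketch below checks that equivalence on two made-up weight vectors; the helper names are illustrative.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Made-up weights for two documents over a three-term vocabulary.
u, v = [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]

linear = dot(l2_normalize(u), l2_normalize(v))  # what a linear kernel sees after normalisation
print(linear, cosine(u, v))
```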

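tf-idf (term frequency–inverse document frequency) is a weighting technique commonly used in information retrieval and text mining.
Term frequency (tf) is how often a given term appears in a particular document.
Inverse document frequency (idf) measures how generally important a term is. The idf of a term is obtained by dividing the total number of documents by the number of documents containing that term, then taking the logarithm of the quotient.
The importance of a term increases in proportion to the number of times it appears in a document, but decreases in inverse proportion to how frequently it appears across the corpus.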

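tf is how often a term appears in a particular document; idf is based on how many of the documents contain the term. If a term appears in most documents it is not distinctive, so even a high tf is of little use.
tf-idf weighting is often used together with cosine similarity in the vector space model to judge how similar two documents are.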
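
The definitions above can be written compactly as follows. The symbols are introduced here for illustration: n_{t,d} is the number of times term t occurs in document d, and D is the set of all documents. The base of the logarithm is a convention; the worked code later in these notes uses the natural log.

```latex
\mathrm{tf}(t,d) = \frac{n_{t,d}}{\sum_{k} n_{k,d}}, \qquad
\mathrm{idf}(t) = \log\frac{|D|}{|\{\, d \in D : t \in d \,\}|}, \qquad
\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)
```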

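Rationale and limitations of tf-idf
In essence, idf is a weighting that tries to suppress noise. It simply assumes that terms with a low document frequency are more important and terms with a high document frequency are less useful, which is clearly not entirely correct. The simple structure of idf cannot effectively reflect a term's importance or the distribution of feature terms, so it cannot adjust the weights well, and the precision of tf-idf is therefore not very high.
In addition, tf-idf does not capture where a term appears. For Web documents, the weighting should reflect the structure of the HTML: a feature term reflects the content of the article to different degrees depending on which tag it appears in, so its weight should be computed differently. Feature terms in different parts of a page should therefore be given different coefficients, which are then multiplied by the term frequency, to improve the text representation.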
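
One way the position-dependent coefficients described above might look in code. This is only a sketch: the `POSITION_WEIGHT` values, the `weighted_tf` helper, and the sample page are all hypothetical and do not come from any library or from the original notes.

```python
# Hypothetical coefficients per HTML location; the values are illustrative only.
POSITION_WEIGHT = {"title": 3.0, "h1": 2.0, "body": 1.0}

def weighted_tf(term, sections):
    """Sum the term's count in each section, scaled by that section's coefficient."""
    return sum(
        POSITION_WEIGHT.get(tag, 1.0) * tokens.count(term)
        for tag, tokens in sections.items()
    )

# A made-up page, already segmented into terms per HTML section.
page = {
    "title": ["台北", "旅遊"],
    "body": ["台北", "夜市", "台北", "101"],
}
score = weighted_tf("台北", page)
print(score)
```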

Reference: 文字探勘之前處理與TF-IDF介紹 (text-mining preprocessing and an introduction to TF-IDF)
The preprocessing steps for text mining are:
1. Part-of-Speech Tagging
2. Stemming
3. Feature Selection

import math  # scipy.log is a deprecated NumPy alias; math.log (or numpy.log) is the natural log, base e rather than base 10
a, abb, abc = ["a"], ["a", "b", "b"], ["a", "b", "c"]
D = [a, abb, abc]
term = "b"  # use the same term for both tf and idf
tf = abb.count(term) / len(abb)  # 2/3 = 0.667
idf = math.log(len(D) / len([doc for doc in D if term in doc]))  # 3/2 = 1.5, ln(1.5) = 0.405465108108
print(tf * idf)  # about 0.2703

The natural logarithm is the logarithm to base e, written ln(x) or log_e(x); its inverse is the exponential function e^x.

Grouping articles by their content

from xml.etree import ElementTree
import jieba.analyse
# read the input file; the with-statement closes it automatically
with open('E:\\Data\\Documents\\Python\\pytextmining-master\\1435449602.xml', 'r', encoding='utf8') as f:
    events = ElementTree.fromstring(f.read())
# use jieba.analyse.extract_tags to pull the top 10 keywords (nouns, person names, place names) from each article
ary = []
corpus=[]
for elem in events.findall('./channel/item'):
    title = elem.find('title').text
    description = elem.find('description').text
    source = elem.find('source').text
    ary.append(title)
    corpus.append(' '.join(jieba.analyse.extract_tags(description, 10, allowPOS = ['n', 'nr', 'ns'])))
# build the keyword vocabulary with sklearn's CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer() 
X = vectorizer.fit_transform(corpus)
word = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
# compute the tf-idf weight of each keyword
from sklearn.feature_extraction.text import TfidfTransformer
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(X)
weight = tfidf.toarray()
# cluster the articles with sklearn's cluster module
from sklearn import cluster
c = cluster.KMeans(n_clusters=5)
k_data = c.fit_predict(weight)
for index, g in  enumerate(k_data):
    if g == 4:
        print (ary[index])
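
The cluster numbers that KMeans assigns can change between runs unless random_state is fixed, so inspecting a single label such as 4 is not stable. A small sketch that lists every cluster instead; `titles` and `labels` are hypothetical stand-ins for ary and k_data.

```python
from collections import defaultdict

# Hypothetical titles and cluster labels, standing in for ary and k_data.
titles = ["news A", "news B", "news C", "news D"]
labels = [0, 2, 0, 1]

# Collect the titles that share a cluster label.
groups = defaultdict(list)
for title, label in zip(titles, labels):
    groups[label].append(title)

for label in sorted(groups):
    print(label, groups[label])
```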