python-wordcloud

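Text-mining packages: I had only ever learned Jieba, but lately I've found BosonNLP quite handy.
BosonNLP part-of-speech tag reference:
http://docs.bosonnlp.com/tag.html
Sentiment analysis: nlp.sentiment(s)
Named entity recognition: nlp.ner(s)
Dependency parsing: nlp.depparser(s)
Keyword extraction: nlp.extract_keywords(s, top_k=10)
News classification: nlp.classify(s)
Semantic suggestion: nlp.suggest(term, top_k=10)
Word segmentation and POS tagging: nlp.tag(s)
Time conversion: nlp.convert_time("今天晚上8点到明天下午3点", datetime.datetime(2015, 9, 1))
News summarization: nlp.summary(title, content)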
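
As a small illustration of working with the output of nlp.tag(s), the sketch below pairs each word with its part-of-speech tag and keeps only noun-like words. It assumes each result entry is a dict with parallel "word" and "tag" lists and that noun tags start with "n"; the helper names and the sample data are hypothetical and no request is sent to the service.

```python
def words_with_tags(tag_result):
    """Pair each word with its POS tag from one nlp.tag() result entry.

    Assumes the entry looks like {"word": [...], "tag": [...]}.
    """
    return list(zip(tag_result["word"], tag_result["tag"]))


def keep_tags(tag_result, prefixes=("n",)):
    """Keep only words whose tag starts with one of the given prefixes."""
    return [w for w, t in words_with_tags(tag_result) if t.startswith(prefixes)]


# Hand-written sample in the assumed response shape (not real API output).
sample = {"word": ["台北", "出發", "飯店"], "tag": ["ns", "v", "n"]}
nouns = keep_tags(sample)
```

Filtering by tag before drawing a word cloud is one way to keep verbs and particles from crowding out place names and other nouns.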

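I also wanted to present the text as a word cloud. I first tried pytagcloud, but with both English and Chinese text I ran into the same error message (below), so I gave up on it for now and switched to wordcloud.
'cp950' codec can't decode byte 0xc3 in position 223: illegal multibyte sequence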
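
An error like this generally means that some bytes were decoded with cp950 (the default codec on Traditional Chinese Windows) even though they were not cp950-encoded. Whether that is exactly what happens inside pytagcloud is an assumption. For files you read yourself, a minimal sketch of avoiding the problem is to name the codec explicitly; read_text is a hypothetical helper, not part of any library mentioned here.

```python
from pathlib import Path


def read_text(path, encoding="utf-8"):
    """Read a text file with an explicit codec instead of the platform default."""
    return Path(path).read_text(encoding=encoding)
```

Usage would be something like text = read_text("itinerary.txt"), where the file name is only an example.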

import re
from bosonnlp import BosonNLP

# group_data_pd is assumed to be a pandas DataFrame loaded earlier;
# take the first itinerary text from the '行程內容' column
iti = group_data_pd['行程內容'].iloc[0]
# Delimiters to split on (raw string so \( and \) stay regex escapes)
split_word = r",|。|、|●|◎|◆|★|:|/|\(|\)|《|》|-|!|~|『|』|「|」|【|】|XXX"
iti_s = re.split(split_word, iti)
iti_ss = " ".join(iti_s)
bosonnlp_token = "MufqZvMn.17180.1TXDWsWO0_qk"
nlp = BosonNLP(bosonnlp_token)
nlp_tag = nlp.tag(iti_ss)
word_join = ' '.join(nlp_tag[0]['word'])
result = nlp.extract_keywords(iti_ss)

# Mask image for the word cloud
from PIL import Image
import numpy as np
abel_mask = np.array(Image.open("lion_icon2.png"))

# Generate the word cloud with wordcloud
import matplotlib.pyplot as plt
from wordcloud import WordCloud
my_wordcloud = WordCloud(background_color="white", mask=abel_mask).generate(word_join)
plt.figure(figsize=(15,10))
plt.imshow(my_wordcloud)
plt.axis("off")
plt.show()
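
The splitting step in the script above can be tried on its own without BosonNLP. The sketch below uses the same delimiter alternation (minus the XXX placeholder) and, as a small variation on the script, drops empty pieces so the result has single spaces only. The name normalize is hypothetical.

```python
import re

# Delimiter alternation taken from the script above, written as a raw string.
SPLIT_PATTERN = r",|。|、|●|◎|◆|★|:|/|\(|\)|《|》|-|!|~|『|』|「|」|【|】"


def normalize(text):
    """Split on the delimiters, drop empty pieces, and join with single spaces."""
    pieces = re.split(SPLIT_PATTERN, text)
    return " ".join(p.strip() for p in pieces if p.strip())


cleaned = normalize("第一天:台北、東京(成田)")
```

Dropping the empty pieces matters when two delimiters sit next to each other, since re.split returns an empty string between them.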

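Words kept showing up more than once in the cloud. I later found the following note in wordcloud.py, and after adding collocations=False to the WordCloud(...) call the output was normal: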

The input "text" is expected to be a natural text. If you pass a sorted list of words, words will appear in your output twice. To remove this duplication, set collocations=False.
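
A minimal sketch of the fix described in the note above, plus one alternative: counting the tokens yourself and passing the counts to generate_from_frequencies, which takes a word-to-frequency mapping and so skips WordCloud's own text processing. The function names are hypothetical, and wordcloud is imported lazily so the counting helper runs without it installed.

```python
from collections import Counter


def token_frequencies(tokens):
    """Return a word -> count mapping, skipping empty or whitespace-only tokens."""
    return Counter(t for t in tokens if t.strip())


def cloud_from_text(tokens):
    """Join the tokens into text and disable collocations, as the note suggests."""
    from wordcloud import WordCloud  # requires the wordcloud package
    wc = WordCloud(background_color="white", collocations=False)
    return wc.generate(" ".join(tokens))


def cloud_from_counts(tokens):
    """Alternative: hand WordCloud precomputed counts instead of raw text."""
    from wordcloud import WordCloud  # requires the wordcloud package
    wc = WordCloud(background_color="white")
    return wc.generate_from_frequencies(token_frequencies(tokens))
```

With the script's variables, either cloud_from_text(nlp_tag[0]['word']) or cloud_from_counts(nlp_tag[0]['word']) could replace the WordCloud(...).generate(...) line; a mask can be passed to WordCloud in the same way as before.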