Reading the data

First, import the necessary Python libraries:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sys
import nltk
# nltk.download('stopwords')
from nltk.corpus import stopwords
# from bs4 import BeautifulSoup as Soup
import json

My sample data:

The following Python code loads the data and stores it in a list of strings. This part depends entirely on the format of your data:

def parseLog(file):
    # Each line of the input file is expected to hold one JSON object
    with open(file) as f:
        data = [json.loads(line.strip()) for line in f]
    # print(data)

    # preprocessing ////////////////////////////////
    content_list = []
    for i in data:
        string_content = ''
        if 'contents' in i:
            for item in i['contents']:
                if 'content' in item:
                    # print(str(item['content']))
                    string_content = string_content + str(item['content'])
        # one string per article
        content_list.append(string_content)
    # the snippets that follow continue inside this function

content_list holds the complete data as a list of strings, one string per article. So if there are 45,000 articles, content_list has 45,000 strings.
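To make the expected input shape concrete, here is a minimal sketch of the same extraction on two in-memory lines. The field names id and type and the sample text are assumptions made for illustration; only contents and content come from the code above.

```python
import json

# Two hypothetical log lines (illustrative only; real records may carry other fields).
sample_lines = [
    '{"id": "a1", "contents": [{"content": "First paragraph. "}, '
    '{"type": "image"}, {"content": "Second paragraph."}]}',
    '{"id": "b2"}',
]

def extract_text(record):
    """Concatenate every 'content' value found under the 'contents' list."""
    parts = []
    for item in record.get("contents", []):
        if isinstance(item, dict) and "content" in item:
            parts.append(str(item["content"]))
    return "".join(parts)

# One string per article, empty when the article has no text content.
content_list = [extract_text(json.loads(line)) for line in sample_lines]
print(content_list)
```

A record without a contents field yields an empty string here, so the list stays aligned with the articles.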

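Data preprocessing

Now we will use pandas to apply some common preprocessing techniques. First, we clean the text as much as possible. The idea is to use the regex replacement replace('[^a-zA-Z#]', ' ') to remove punctuation, digits and special characters in one pass: it replaces every character that is not a letter (or '#') with a space. Then we remove short words (three characters or fewer), because they usually carry little useful information. Finally, we convert all text to lowercase.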
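The three cleaning steps can be tried on a single string with the standard re module. This is a sketch of the same logic outside pandas; the function name clean_text, the min_len parameter and the sample sentence are hypothetical.

```python
import re

def clean_text(text, min_len=4):
    """Keep letters and '#', drop words shorter than min_len, lowercase the rest."""
    letters_only = re.sub(r"[^a-zA-Z#]", " ", text)
    kept = [w for w in letters_only.split() if len(w) >= min_len]
    return " ".join(kept).lower()

sample = "In 2019, the U.S. Senate passed 3 bills - see #politics!"
print(clean_text(sample))
```

Note that abbreviations such as "U.S." are split into single letters and then dropped by the length filter, which is one side effect of this simple approach.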

    news_df = pd.DataFrame({'document': content_list})
    # remove everything except letters (and '#');
    # regex=True is required because pandas 2.0+ treats the pattern literally by default
    news_df['clean_doc'] = news_df['document'].str.replace('[^a-zA-Z#]', ' ', regex=True)
    # remove null fields
    news_df = news_df[news_df['clean_doc'].notnull()]
    # remove short words (three characters or fewer)
    news_df['clean_doc'] = news_df['clean_doc'].apply(lambda x: ' '.join([w for w in x.split() if len(w) > 3]))
    # make all text lowercase
    news_df['clean_doc'] = news_df['clean_doc'].apply(lambda x: x.lower())

Now we remove stopwords from the data. First, I load NLTK's English stopword list. Stopwords are words such as "a", "the" or "in" that carry little meaning on their own.
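The effect of stopword removal is easy to see on a toy example. The sketch below uses a small hand-written stop list as a stand-in for NLTK's, so it runs without downloading any corpus; the list contents and the sample documents are assumptions.

```python
# Hand-written stand-in for the NLTK list plus a few markup leftovers (hypothetical).
stop_words = {"the", "this", "that", "with", "from", "have", "span", "class", "href"}

docs = [
    "senate passed this bill with span class markup",
    "markets fell from record highs",
]

def remove_stop_words(doc, stop_words):
    """Tokenize on whitespace, drop stopwords, and join the tokens back together."""
    return " ".join(w for w in doc.split() if w not in stop_words)

cleaned = [remove_stop_words(d, stop_words) for d in docs]
print(cleaned)
```

A set is used here because membership tests on a set are faster than on a list, which matters once there are tens of thousands of documents.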

    stop_words = stopwords.words('english')
    # extra corpus-specific words (mostly HTML leftovers)
    stop_words.extend(['span', 'class', 'spacing', 'href', 'html', 'http', 'title', 'stats', 'washingtonpost'])
    # tokenization
    tokenized_doc = news_df['clean_doc'].apply(lambda x: x.split())
    # remove stop-words
    tokenized_doc = tokenized_doc.apply(lambda x: [item for item in x if item not in stop_words])
    # print(tokenized_doc)
    # de-tokenization: join each token list back into one string per document
    detokenized_doc = [' '.join(tokens) for tokens in tokenized_doc]
    # print(detokenized_doc)

Building the document-term matrix with TF-IDF

The data is now ready. We will use sklearn's TfidfVectorizer to build a document-term matrix with at most 10,000 terms.

    from sklearn.feature_extraction.text import TfidfVectorizer
    # tfidf vectorizer of scikit-learn
    vectorizer = TfidfVectorizer(stop_words=stop_words, max_features=10000, max_df=0.5, use_idf=True, ngram_range=(1, 3))
    X = vectorizer.fit_transform(detokenized_doc)
    print(X.shape)  # check shape of the document-term matrix
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    terms = vectorizer.get_feature_names_out()
    # print(terms)

ngram_range=(1, 3): unigrams, bigrams and trigrams are all used as terms.
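To see what this range produces, the sketch below enumerates word n-grams for one short token list. It is a simplified illustration of the idea, not scikit-learn's implementation; the function name word_ngrams and the sample phrase are hypothetical.

```python
def word_ngrams(tokens, ngram_range=(1, 3)):
    """List every contiguous run of n tokens for each n in the inclusive range."""
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

tokens = "white house press briefing".split()
grams = word_ngrams(tokens)
for g in grams:
    print(g)
```

Four tokens give four unigrams, three bigrams and two trigrams, so the candidate vocabulary grows quickly; max_features=10000 is what keeps the matrix at a manageable width.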

This document-term matrix is used both for LSA and for clustering the documents with k-means.

Clustering the text documents with k-means

In this step, we cluster the text documents with the k-means algorithm.

    from sklearn.cluster import KMeans
    num_clusters = 10
    km = KMeans(n_clusters=num_clusters)
    km.fit(X)
    clusters = km.labels_.tolist()

clusters will be used for plotting. It is a list with one integer label per document, from 0 to 9, assigning each document to one of the 10 clusters.
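Before plotting, it can help to check how many documents landed in each cluster. The sketch below does this with collections.Counter on a short hypothetical label list; with the real data, clusters would come from km.labels_ and scikit-learn numbers the labels from zero.

```python
from collections import Counter

# Hypothetical labels for seven documents (stand-in for km.labels_.tolist()).
clusters = [0, 3, 3, 9, 0, 3, 1]

sizes = Counter(clusters)
for label in sorted(sizes):
    print(f"cluster {label}: {sizes[label]} documents")
```

A very uneven distribution (one huge cluster and several tiny ones) is often a hint that the number of clusters or the preprocessing deserves another look.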

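Topic modeling

The next step is to represent each term and each document as a vector. We take the document-term matrix and decompose it into several matrices.

We use sklearn's randomized_svd to perform the matrix decomposition. Some familiarity with LSA and singular value decomposition (SVD) is needed to follow the next part.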

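In the definition of SVD, the original matrix A ≈ UΣV*, where U and V have orthonormal columns and Σ is a diagonal matrix with non-negative entries.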
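These properties can be checked numerically on a small dense matrix. The sketch below uses numpy.linalg.svd (an exact SVD, swapped in here for the randomized solver used in this article) on a made-up 4x3 matrix, then keeps only the two largest singular values to mimic the truncation LSA relies on.

```python
import numpy as np

# Made-up 4x3 "document-term" matrix (values are illustrative only).
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 3.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 2.0, 2.0]])

# Thin SVD: U is (4, 3), s holds the 3 singular values, Vt is (3, 3).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Full reconstruction: A = U @ diag(s) @ Vt
reconstructed = U @ np.diag(s) @ Vt
print(np.allclose(A, reconstructed))

# Rank-k approximation: keep only the k largest singular values.
k = 2
A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
print(np.linalg.norm(A - A_k))  # Frobenius error of the truncation
```

The singular values come back as a 1-D array rather than a diagonal matrix, which is why U * s (broadcast over columns) is enough to scale each concept.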

    from sklearn.decomposition import TruncatedSVD
    from sklearn.utils.extmath import randomized_svd
    # SVD represents documents and terms as vectors
    U, Sigma, VT = randomized_svd(X, n_components=10, n_iter=100, random_state=122)
    # svd_model = TruncatedSVD(n_components=2, algorithm='randomized', n_iter=100, random_state=122)
    # svd_model.fit(X)
    # print(U.shape)
    for i, comp in enumerate(VT):
        terms_comp = zip(terms, comp)
        # keep the 7 terms with the largest weight in this concept
        sorted_terms = sorted(terms_comp, key=lambda x: x[1], reverse=True)[:7]
        print('Concept ' + str(i) + ': ')
        for t in sorted_terms:
            print(t[0])
        print(' ')

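Here U, Sigma and VT are the three factors obtained by decomposing the matrix X. VT is the term-concept matrix, U is the document-concept matrix, and Sigma holds the singular values, that is, the diagonal of the concept-concept matrix.

In the code above, 10 concepts/topics are extracted (n_components=10) and then printed. Sample concepts look like this: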
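The ranking step is easier to follow on a toy term-concept matrix. In the sketch below, the vocabulary and the weights in VT are invented for illustration and do not come from the news data; the helper name top_terms is hypothetical.

```python
# Invented vocabulary and term-concept weights (2 concepts x 5 terms).
terms = ["senate", "vote", "game", "team", "season"]
VT = [
    [0.70, 0.65, 0.05, 0.10, 0.02],
    [0.03, -0.08, 0.60, 0.55, 0.50],
]

def top_terms(terms, component, n=3):
    """Return the n terms with the largest weight in one concept row."""
    ranked = sorted(zip(terms, component), key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in ranked[:n]]

for i, component in enumerate(VT):
    print("Concept " + str(i) + ": " + ", ".join(top_terms(terms, component)))
```

Sorting by raw weight ignores terms with large negative weights; whether to rank by absolute value instead is a design choice that depends on how you want to read the concepts.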

Topic visualization

To find out how distinct our topics are, we should visualize them. We cannot plot more than three dimensions, but techniques such as PCA and t-SNE help project high-dimensional data into a lower-dimensional space. Here we will use a relatively new technique called UMAP (Uniform Manifold Approximation and Projection).

    import umap
    # scale each document-concept column by its singular value
    X_topics = U * Sigma
    embedding = umap.UMAP(n_neighbors=100, min_dist=0.5, random_state=12).fit_transform(X_topics)
    plt.figure(figsize=(7, 5))
    plt.scatter(embedding[:, 0], embedding[:, 1],
                c=clusters,
                s=10,  # size
                edgecolor='none')
    plt.show()

if __name__ == '__main__':
    parseLog(sys.argv[1])

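Here I used c = clusters, which colors each document's point according to its k-means cluster.

Here is the output for 2,500 news articles:

And for 10,000 news articles:

Each point represents a document, and the colors represent the clusters found by k-means. Our LSA model seems to have done a good job. Feel free to change UMAP's parameters to see how the shape of the plot changes.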
