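Content classification is any process that enriches data so that it is organized in a way that makes it easier to search, archive, manage, and integrate into other processes. Generating this kind of metadata lets you get more value from your existing content.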

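Frequently used acronyms

API: Application programming interface
FLWOR: For, Let, Where, Order by, and Return
HTML: Hypertext Markup Language
HTTP: Hypertext Transfer Protocol
NASA: National Aeronautics and Space Administration
SQL: Structured Query Language
URL: Uniform Resource Locator
XML: Extensible Markup Language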

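One of the main problems with classification is that people classify according to their own logic, which introduces errors and inconsistent categories. When you define a classification system, you try to take every stakeholder's view into account and find a consistent way to organize the data, but this is hard to do. For example, people in one department might not understand which metadata matters to people in another department. In addition, training people to understand a classification and apply it consistently is time-consuming.

With massive amounts of data (which some call digital landfill) being produced continuously, classifying data by hand has become almost impossible. You must turn to automated methods that can analyze a variety of formats and inputs.

Automated classification has several advantages:

It saves money.
It saves time.
It keeps classification consistent by providing a common mechanism to follow when adding metadata.
The organization gets more value from its existing content.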

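Installing and running the code examples

The code examples I wrote for this article are intended to be used with either the eXist XML Database or the Zorba XQuery processor. To use them with the eXist XML Database, you need to have the database installed; otherwise, use the Zorba XQuery processor, which is available through an online sandbox.

Installing the examples in eXist XML Database

To run the examples in the eXist XML Database, follow these steps:

Download and unzip the sample code.
Upload the unzipped code directory to a database collection, for example /db/content-classification.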
If you have browser access to the database, run the code examples in the browser.

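Using the Zorba XQuery processor

Alternatively, you can run the code examples with the online version of the Zorba XQuery processor by following these steps:

Download and unzip the sample code.
Cut and paste a code example into the Zorba XQuery processor online sandbox at http://try.zorba-xquery.com/.
Click Execute to run the code.

Note that the differences between the eXist and Zorba examples are minor. Sharp-eyed readers will notice, however, that they differ slightly in how they use the EXPath HTTP Client library: Zorba ships with the library built in, whereas the eXist database does not, so I provide a separate http-client.xqm XQuery library designed specifically for use with eXist. The examples in this article use the EXPath HTTP Client library to access remote data and web services. In the second part of the article, you integrate more advanced processing using the Yahoo! Query Language (YQL) and AlchemyAPI tools.

Note: You might need to sign up for an API key before you can use these services.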

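Simple classification with XQuery

The first part of this article shows how to classify content using pure XQuery.

Text analysis: Defining word frequency in an unstructured context

The term text analytics (or text mining) describes a set of machine-learning and linguistic techniques for extracting and modeling informational metadata from text sources. Text analytics applies natural language processing (NLP) and analytical methods to text content and extracts useful metadata, such as:

Language. Analysis of character encoding, words, and content style can readily identify the language in which the text is written.
Keywords. Text analysis can extract a set of keywords that characterize the document.
Common entities. Algorithms that scan text for common patterns (such as email addresses, telephone numbers, and the names of people and places) are very useful for named-entity extraction.
Semantic relationships. A variety of methods can scan content in the hope of gaining deeper insight.

One example of text mining is determining the frequency of the words in a document, on the assumption that the more often a word is used, the more relevant it is to the document as a whole.

The most common words are then treated as the document's keywords. Be aware, though, that the term keyword usually applies to the output of algorithms that are far more sophisticated than a word-frequency count. For example, keyword analysis often checks common words against a synonym lookup table, and might also analyze the distance between words to help determine a word's importance in the context of the whole document.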
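
The synonym lookup mentioned above can be sketched in a few lines of XQuery. This is only an illustration under stated assumptions: the synonym table, its element and attribute names, and the sample text are all hypothetical, and a real table would be far larger.

```xquery
xquery version "1.0";

(: Hypothetical synonym table: each variant maps to one canonical term. :)
let $synonyms :=
    <synonyms>
        <entry variant="asteroids" canonical="asteroid"/>
        <entry variant="neo"       canonical="asteroid"/>
        <entry variant="impacts"   canonical="impact"/>
    </synonyms>

(: Illustrative text standing in for real document content. :)
let $text := 'Asteroids and impacts: an asteroid impact, and one more NEO impact'

(: Lowercase tokens, dropping the empty strings tokenize() can produce. :)
let $tokens := for $t in tokenize($text, '\W+')[. ne ''] return lower-case($t)

(: Replace each token by its canonical form when the table has one. :)
let $corpus :=
    for $t in $tokens
    let $hit := $synonyms/entry[@variant eq $t]
    return if ($hit) then string($hit[1]/@canonical) else $t

return
<words> {
    for $w in distinct-values($corpus)
    let $freq := count($corpus[. eq $w])
    order by $freq descending, $w
    return <word word="{$w}" frequency="{$freq}"/>
}</words>
```

With the sample text, the plural and abbreviated forms are folded into asteroid and impact before counting, so related variants no longer split the frequency between them.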

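The first step in any text analysis is to generate a corpus from the text content; subsequent analysis is applied to this corpus. One reason for generating a corpus is to normalize the text and remove any irrelevant content.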

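Listing 1 shows an XQuery program that fetches an HTML page (using the EXPath HTTP Client library) and extracts its text content. Because the case of the words does not matter here, the corpus you create from the content is all lowercase.

Listing 1. XQuery program that generates a word frequency list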

xquery version "1.0";

import module namespace http = "http://expath.org/ns/http-client";

(: Fetch the page; item 2 of the response sequence is the body. :)
let $content-url     := 'http://en.wikipedia.org/wiki/Asteroid_impact_avoidance'
let $content-request :=
    <http:request href="{$content-url}" method="get" follow-redirect="true"/>
let $content         :=
    fn:string-join(http:send-request($content-request)[2], ' ')

(: Build a lowercase corpus, then count each distinct word. :)
let $corpus   := for $w in tokenize($content, '\W+') return lower-case($w)
let $wordList := distinct-values($corpus)
return
<words> {
    for $w in $wordList
    let $freq := count($corpus[. eq $w])
    order by $freq descending
    return <word word="{$w}" frequency="{$freq}"/>
}</words>

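The next step is to derive all the unique words from the corpus. To do that, you use a FLWOR expression that processes each word, generates a word count (by referring back to the corpus, which contains every word occurrence), and then outputs a <word/> element.

Note: For all the examples in this article, I use the same web URL (http://en.wikipedia.org/wiki/Asteroid_impact_avoidance) as the text source, to demonstrate the effect of each method.

Running the program in Listing 1 produces an XML document of <word/> elements, each holding a word and its frequency, ordered so that the most common words in the Wikipedia page on asteroid impact avoidance come first. Listing 2 shows the list.

Listing 2. Word frequency list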

<words>
    <word word="the" frequency="377"/>
    <word word="of" frequency="236"/>
    <word word="a" frequency="193"/>
    <word word="to" frequency="167"/>
    <word word="and" frequency="141"/>
    <word word="in" frequency="124"/>
    <word word="earth" frequency="121"/>
    <word word="a" frequency="109"/>
    <word word="asteroid" frequency="102"/>
    ....
</words>

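As you can see, the analysis returns a lot of words, and many of the most frequent ones are irrelevant because they are simply common in English. You can address this by defining a few simple rules that reduce the noise, such as removing all words of three letters or fewer and all words with a frequency of 3 or less.

Listing 3 shows the same code with added logic that tests each word's string length and frequency.

Listing 3. Modified XQuery program that generates a word frequency list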

xquery version "1.0";

import module namespace http = "http://expath.org/ns/http-client";

let $content-url     := 'http://en.wikipedia.org/wiki/Asteroid_impact_avoidance'
let $content-request :=
    <http:request href="{$content-url}" method="get" follow-redirect="true"/>
let $response := http:send-request($content-request)[2]
let $content  := fn:string-join($response, ' ')
let $corpus   := for $w in tokenize($content, '\W+') return lower-case($w)
let $wordList := distinct-values($corpus)
return
<words> {
    for $w in $wordList
    let $freq := count($corpus[. eq $w])
    (: Keep only words longer than three letters that occur more than three times. :)
    where string-length($w) gt 3 and $freq gt 3
    order by $freq descending
    return <word word="{$w}" frequency="{$freq}"/>
}</words>

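For different data sets, you might have to adjust or strengthen these settings to exclude longer or more frequent words, but as Listing 4 shows, even these minimal settings remove a lot of noise and yield a more relevant set of terms.

Listing 4. Revised word frequency list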

<words>
    <word word="earth" frequency="121"/>
    <word word="asteroid" frequency="102"/>
    <word word="impact" frequency="58"/>
    <word word="near" frequency="56"/>
    <word word="with" frequency="55"/>
    <word word="that" frequency="53"/>
    <word word="space" frequency="49"/>
    <word word="nasa" frequency="43"/>
    <word word="object" frequency="36"/>
    <word word="from" frequency="34"/>
    <word word="this" frequency="32"/>
    ...
</words>

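This approach certainly has its limitations. But it is a good start, and it shows that with very little XQuery code you can obtain a basic set of keywords that characterize the text content of a document.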
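
One way to go beyond length and frequency thresholds is an explicit stop-word list, which removes common words such as with, that, and from regardless of their length. The following is a minimal sketch, not part of the original sample code: the stop-word list and the sample text are assumptions chosen for illustration.

```xquery
xquery version "1.0";

(: Hypothetical stop-word list; a real one would be much longer. :)
let $stop-words := ('the', 'of', 'a', 'an', 'to', 'and', 'in', 'with', 'that', 'from', 'this')

(: Illustrative text standing in for the fetched page content. :)
let $text := 'The impact of an asteroid with the Earth, and the deflection of that asteroid'

(: Lowercase each token and keep it only if it is not a stop word. :)
let $corpus :=
    for $t in tokenize($text, '\W+')[. ne '']
    let $w := lower-case($t)
    where not($w = $stop-words)
    return $w

return
<words> {
    for $w in distinct-values($corpus)
    let $freq := count($corpus[. eq $w])
    order by $freq descending, $w
    return <word word="{$w}" frequency="{$freq}"/>
}</words>
```

The general comparison `$w = $stop-words` is true when the word equals any item in the sequence, which keeps the filter to a single clause.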

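Adding structure to word frequency

Metadata decisions

One of the first decisions you usually need to make when generating metadata is whether to keep the metadata inside the content XML or store it in a separate metadata document. Once you have decided where to store the metadata, you also need to decide which format to encode it in. There are many ways to mark up metadata, for example:

Darwin Information Typing Architecture. Define a keyword element that best matches your purpose.
Microformats. You can use rel-tag to annotate keyword elements directly in the document.
Others. Many semantic web (semweb) markup formats are available, such as Resource Description Framework and Web Ontology Language.

Adopt an existing, purpose-built markup language rather than designing your own. But make sure that the format you choose is simple and still gives the metadata as much flexibility as possible.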
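
As a rough illustration of the separate-document option, the sketch below wraps a word list in a stand-alone metadata document. The <metadata>, <keywords>, and <keyword> element names and the weight and source attributes are hypothetical placeholders, not taken from any of the formats listed above; in practice you would map them onto whichever existing vocabulary you adopt.

```xquery
xquery version "1.0";

(: Word list in the shape produced by the frequency listings. :)
let $words :=
    <words>
        <word word="earth" frequency="121"/>
        <word word="asteroid" frequency="102"/>
        <word word="impact" frequency="58"/>
    </words>

let $source := 'http://en.wikipedia.org/wiki/Asteroid_impact_avoidance'

(: Hypothetical stand-alone metadata document; the element names are
   illustrative only. :)
return
<metadata source="{$source}">
    <keywords> {
        for $w in $words/word
        return <keyword weight="{$w/@frequency}">{string($w/@word)}</keyword>
    }</keywords>
</metadata>
```

Keeping the source URL on the metadata document is one simple way to link it back to the content it describes when the two are stored apart.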

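Text analysis applied to semi-structured documents such as HTML or XML yields limited insight if it ignores structure entirely. But what if you tied the content to its element structure and weighted the importance of the text accordingly, so that you could draw deeper inferences?

In HTML terms, wouldn't it be useful if you could score a word based, at least partly, on where it appears in the nested structure? For example:

Words that appear in the <title> element are more important.
Words that appear in <noscript> or <script> elements are less important.
Words that appear in <h1> and <h2> elements are more important.

To capture this structure, add a fitness attribute to each word. The attribute is computed by checking which of these elements, if any, the word appears in. Listing 5 shows the added logic, which checks whether the word is contained in any of the elements considered significant.

Listing 5. Adding fitness to the XQuery program that generates the word frequency list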

xquery version "1.0";

import module namespace http = "http://expath.org/ns/http-client";

let $content-url     := 'http://en.wikipedia.org/wiki/Asteroid_impact_avoidance'
let $content-request :=
    <http:request href="{$content-url}" method="get" follow-redirect="true"/>
let $response := http:send-request($content-request)[2]
let $content  := fn:string-join($response, ' ')
let $corpus   := for $w in tokenize($content, '\W+') return lower-case($w)
let $wordList := distinct-values($corpus)
return
<words> {
    for $w in $wordList
    (: Score the word by the first element in this chain whose text contains it. :)
    let $fitness :=
        if      ($response//*:title[contains(lower-case(.), $w)])    then 5
        else if ($response//*:h1[contains(lower-case(.), $w)])       then 4
        else if ($response//*:h2[contains(lower-case(.), $w)])       then 3
        else if ($response//*:h3[contains(lower-case(.), $w)])       then 2
        else if ($response//*:noscript[contains(lower-case(.), $w)]) then -2
        else if ($response//*:script[contains(lower-case(.), $w)])   then -1
        else 1
    let $freq := count($corpus[. eq $w])
    where $freq gt 4 and string-length($w) gt 3
    order by $freq descending
    return <word word="{$w}" frequency="{$freq}" fitness="{$fitness}"/>
}</words>

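Now you have a second metric that you can use to gather more information about a word's importance:

<word word="asteroid" frequency="102" fitness="5"/> appears in the <title> element.
<word word="deflect" frequency="11" fitness="3"/> appears in an <h2> element.
<word word="false" frequency="7" fitness="-1"/> appears in a <script> element, so you give it a negative fitness.

This fitness metric is overly simple: an important word might also happen to appear in a <script> section, or a word in the <title> element might not be as important to the body of the document as you assumed. You could refine the scoring further to produce more appropriate keywords, but instead the rest of this article integrates some heavier-weight tools for text analysis.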
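
One possible refinement is to fold the two metrics into a single score and rank by it. The sketch below assumes a deliberately naive rule, frequency multiplied by fitness, which is an illustration rather than a method from the sample code; the three entries are the ones discussed above.

```xquery
xquery version "1.0";

(: Sample entries taken from the results discussed above. :)
let $words :=
    <words>
        <word word="asteroid" frequency="102" fitness="5"/>
        <word word="deflect"  frequency="11"  fitness="3"/>
        <word word="false"    frequency="7"   fitness="-1"/>
    </words>

return
<words> {
    for $w in $words/word
    (: Assumed scoring rule: frequency multiplied by fitness. :)
    let $score := xs:integer($w/@frequency) * xs:integer($w/@fitness)
    (: Words with a negative fitness end up with a negative score and are dropped. :)
    where $score gt 0
    order by $score descending
    return <word word="{$w/@word}" score="{$score}"/>
}</words>
```

Under this rule the word false drops out because its negative fitness makes its score negative, while asteroid stays well ahead of deflect.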

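Text analysis using web services

Many commercial and open source tools are available for natural language processing (NLP). Here are some of the most popular open source packages:

GATE. A natural language processing and text engineering tool.
Apache Unstructured Information Management Architecture. Originally developed by IBM.
RapidMiner. Data and text mining software.
Carrot2. A text and search results clustering framework.

In addition, several web services provide useful text analysis. The second half of this article focuses on how to use these services from XQuery. You use the EXPath HTTP Client library to access them.

Keyword extraction with YQL

YQL is an SQL-like language that lets you query data across various Yahoo! web services. Yahoo! used to expose its data and services through a collection of separate web services, each with its own endpoint and methods; now it provides access to them through a single interface, YQL.

With YQL, you can access data across the Internet through one simple language, without having to learn how to call different APIs. One such service is search.termextract, which extracts common terms from a piece of text content. You can try it in your browser with the online YQL console: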

http://developer.yahoo.com/yql/console/?q=select%20*%20from%20search.termextract%20where%20context%3D%22Italian%20sculptors%20and%20painters%20of%20the%20renaissance%20favored%20the%20Virgin%20Mary%20for%20inspiration%22

The operative YQL statement selects from a table named search.termextract, using the text supplied in the context variable.

select * from search.termextract where context=

Click Test to generate the result XML, which contains a <query/> element with the results and some diagnostics, as Listing 6 shows.

Listing 6. YQL results

<query xmlns:yahoo="http://www.yahooapis.com/v1/base.rng"
    yahoo:count="5" yahoo:created="2010-12-05T14:36:25Z" yahoo:lang="en-US">
    <diagnostics>
        <publiclyCallable>true</publiclyCallable>
        <user-time>14</user-time>
        <service-time>11</service-time>
        <build-version>9962</build-version>
    </diagnostics>
    <results>
        <Result xmlns="urn:yahoo:cate">italian sculptors</Result>
        <Result xmlns="urn:yahoo:cate">virgin mary</Result>
        <Result xmlns="urn:yahoo:cate">painters</Result>
        <Result xmlns="urn:yahoo:cate">renaissance</Result>
        <Result xmlns="urn:yahoo:cate">inspiration</Result>
    </results>
</query>

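Because the EXPath HTTP Client library is easy to use from XQuery, you can use it to call the YQL web service in your own content classification process. Listing 7 shows how to call this web service from XQuery.

Listing 7. Accessing the YQL web service from XQuery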

xquery version "1.0";

import module namespace http = "http://expath.org/ns/http-client";

let $content-url     := 'http://en.wikipedia.org/wiki/Asteroid_impact_avoidance'
let $content-request :=
    <http:request href="{$content-url}" method="get" follow-redirect="true"/>
let $response := http:send-request($content-request)[2]

(: Send only the title and the first few paragraphs to keep the request small. :)
let $content :=
    fn:string-join(subsequence(($response//*:title, $response//*:p), 1, 10), ' ')

(: Build the YQL statement, then URI-encode it for use in a GET request. :)
let $query :=
    fn:encode-for-uri(
        fn:concat("select * from search.termextract where context='", $content, "'")
    )

(: An ampersand must be written as an entity reference inside an XQuery string literal. :)
let $yahoo-url := 'http://query.yahooapis.com/v1/public/yql?diagnostics=true&amp;q='
let $term-extraction-url     := fn:concat($yahoo-url, $query)
let $term-extraction-request := <http:request href="{$term-extraction-url}" method="get"/>
return
    http:send-request($term-extraction-request)[2]

The XQuery code above uses the fn:encode-for-uri() function to encode the query string.

The YQL analysis produces a higher-quality set of terms, as Listing 8 shows.

Listing 8. YQL term results

<query xmlns:yahoo="http://www.yahooapis.com/v1/base.rng" yahoo:count="20"
    yahoo:created="2010-12-05T20:14:37Z" yahoo:lang="en-US">
    <diagnostics>
        <publiclyCallable>true</publiclyCallable>
        <url execution-time="433">http://search.yahooapis.com/ContentAnalysisService/V1/termExtraction</url>
        <javascript execution-time="436" instructions-used="0"
            table-name="search.termextract"/>
        <user-time>437</user-time>
        <service-time>433</service-time>
        <build-version>9962</build-version>
    </diagnostics>
    <results>
        <Result xmlns="urn:yahoo:cate">tertiary extinction event</Result>
        <Result xmlns="urn:yahoo:cate">shoemaker levy 9</Result>
        <Result xmlns="urn:yahoo:cate">spaceguard survey</Result>
        <Result xmlns="urn:yahoo:cate">near earth objects</Result>
        <Result xmlns="urn:yahoo:cate">period comet</Result>
        <Result xmlns="urn:yahoo:cate">nasa report</Result>
        <Result xmlns="urn:yahoo:cate">extinction level event</Result>
        <Result xmlns="urn:yahoo:cate">deep impact probe</Result>
        <Result xmlns="urn:yahoo:cate">inner solar system</Result>
        <Result xmlns="urn:yahoo:cate">mitigation strategies</Result>
        <Result xmlns="urn:yahoo:cate">65 million years</Result>
        <Result xmlns="urn:yahoo:cate">material composition</Result>
        <Result xmlns="urn:yahoo:cate">impact winter</Result>
        <Result xmlns="urn:yahoo:cate">chicxulub crater</Result>
        <Result xmlns="urn:yahoo:cate">impact speed</Result>
        <Result xmlns="urn:yahoo:cate">catastrophic impact</Result>
        <Result xmlns="urn:yahoo:cate">catastrophic damage</Result>
        <Result xmlns="urn:yahoo:cate">planetary defense</Result>
        <Result xmlns="urn:yahoo:cate">impact events</Result>
        <Result xmlns="urn:yahoo:cate">astronomical events</Result>
    </results>
</query>

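YQL has limitations too. For example, you must make sure that the content you pass to YQL does not exceed the request limits. And because the requests are sent as HTTP GET requests, they must be properly encoded.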
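
A minimal sketch of guarding against both limitations follows. The 2000-character cap is an assumed value, not a documented YQL limit, and the sample sentence is illustrative; the sketch only builds the request URL and does not call the service.

```xquery
xquery version "1.0";

(: Assumed limit on the length of the text sent to the service; the
   real limit is whatever the service documents. :)
let $max-chars := 2000

(: Illustrative content standing in for the extracted page text. :)
let $content := "Asteroid impact avoidance comprises the methods by which near-Earth objects could be diverted"

(: Truncate first, then replace single quotes so they cannot end
   the YQL string literal early. :)
let $safe := translate(substring($content, 1, $max-chars), "'", " ")

let $statement := concat("select * from search.termextract where context='", $safe, "'")

(: URI-encode the whole statement before appending it to the endpoint. :)
return
<request length="{string-length($safe)}">{
    concat('http://query.yahooapis.com/v1/public/yql?q=', encode-for-uri($statement))
}</request>
```

Because URI encoding can lengthen the text, a cap applied before encoding is only an approximation; if the service limits the total URL length, measure the encoded string instead.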

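Text analysis with AlchemyAPI

AlchemyAPI is a company that offers an interesting set of content analysis tools (see Resources). All of the company's tools are available as a suite of web services. In this article, you use their term extraction and named-entity extraction services to perform text analysis.

Keyword extraction with AlchemyAPI

AlchemyAPI provides a web service that extracts topic keywords from any publicly accessible web page. With a simple HTTP GET request, you call the AlchemyAPI web service and instruct it to retrieve a particular URL and extract topic keywords. The AlchemyAPI URL-processing call automatically fetches the web page, normalizes and cleans it (removing ads, navigation links, and other unimportant content), and extracts the topic keywords. Listing 9 shows how this is done.

Listing 9. URL for accessing the AlchemyAPI topic extraction web service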

http://access.alchemyapi.com/calls/url/URLGetRankedKeywords?
    apikey=PLACE_YOUR_APIKEY_HERE&
    url=http://en.wikipedia.org/wiki/Asteroid_impact_avoidance

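AlchemyAPI requires two URL parameters:

url: the URL to analyze
apikey: a key that is required for every call to the web service

You can obtain an AlchemyAPI apikey through a registration form on the AlchemyAPI site.

Because AlchemyAPI fetches the URL for you, calling the web service from XQuery is slightly simpler than in the earlier YQL example. Listing 10 shows the code.

Listing 10. XQuery code that uses AlchemyAPI to generate keywords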

xquery version "1.0";

import module namespace http = "http://expath.org/ns/http-client";

let $url    := 'http://en.wikipedia.org/wiki/Asteroid_impact_avoidance'
let $apikey := 'PLACE_YOUR_APIKEY_HERE'
let $alchemy-uri := 'http://access.alchemyapi.com/calls/url/URLGetRankedKeywords'

(: Assemble the query string; the target URL is URI-encoded, and the
   ampersand is written as an entity reference inside the string literal. :)
let $href := fn:concat($alchemy-uri, '?apikey=', $apikey, '&amp;url=', fn:encode-for-uri($url))
let $content-request := <http:request href="{$href}" method="get" follow-redirect="true"/>
return
    http:send-request($content-request)[2]

Listing 11 shows the result, which contains the keywords for the web page under test.

Listing 11. Results from the topic extraction web service

<results>
    <status>OK</status>
    <usage>By accessing AlchemyAPI or using information
    generated by AlchemyAPI, you are agreeing to be bound by
    the AlchemyAPI Terms of Use: http://www.alchemyapi.com/company/terms.html</usage>
    <url>http://en.wikipedia.org/wiki/Asteroid_impact_avoidance</url>
    <language>english</language>
    <keywords>
        <keyword>
            <text>asteroid</text>
            <relevance>0.983321</relevance>
        </keyword>
        <keyword>
            <text>NASA</text>
            <relevance>0.376168</relevance>
        </keyword>
        <keyword>
            <text>comet</text>
            <relevance>0.370371</relevance>
        </keyword>
        <keyword>
            <text>near-earth object</text>
            <relevance>0.363529</relevance>
        </keyword>
        <keyword>
            <text>survey program</text>
            <relevance>0.3417</relevance>
        </keyword>
        .... more keywords ....
    </keywords>
</results>

Because the keywords come with relevance scores (and many of the results are more relevant), the output from the AlchemyAPI web service is qualitatively somewhat better than the YQL output.
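
The relevance scores make it easy to keep only the strongest keywords. The sketch below works on an inline excerpt of the Listing 11 results rather than a live response; the 0.37 cut-off is an assumed value that you would tune for your own content.

```xquery
xquery version "1.0";

(: Excerpt of the keyword results shown in Listing 11. :)
let $results :=
    <results>
        <keywords>
            <keyword><text>asteroid</text><relevance>0.983321</relevance></keyword>
            <keyword><text>NASA</text><relevance>0.376168</relevance></keyword>
            <keyword><text>comet</text><relevance>0.370371</relevance></keyword>
            <keyword><text>near-earth object</text><relevance>0.363529</relevance></keyword>
            <keyword><text>survey program</text><relevance>0.3417</relevance></keyword>
        </keywords>
    </results>

(: Assumed cut-off; tune it for your own content. :)
let $threshold := 0.37

return
<keywords> {
    for $k in $results/keywords/keyword
    (: Convert the score text to a number so the comparison is numeric. :)
    let $relevance := xs:decimal($k/relevance)
    where $relevance ge $threshold
    order by $relevance descending
    return <keyword relevance="{$relevance}">{string($k/text)}</keyword>
}</keywords>
```

Casting the score with xs:decimal matters here: the element content is untyped text, so without the cast the comparison against a numeric threshold would depend on implicit conversion rules.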

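Entity extraction with AlchemyAPI

You can go a step further in sophistication by using the AlchemyAPI named-entity extraction web service, which can identify people, companies, organizations, cities, geographic features, and other typed entities in the content. Some heavy-duty NLP goes on here to extract meaningful entities.

As with the topic keyword web service, all you have to do is supply an apikey and the URL that contains the content you want to analyze, as Listing 12 shows.

Listing 12. URL for accessing the AlchemyAPI named-entity extraction web service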

http://access.alchemyapi.com/calls/url/URLGetRankedNamedEntities?
    apikey=PLACE_YOUR_APIKEY_HERE&
    url=http://en.wikipedia.org/wiki/Asteroid_impact_avoidance

Calling the web service from XQuery works in exactly the same way, as Listing 13 shows.

Listing 13. XQuery that uses AlchemyAPI to generate entities

xquery version "1.0";

import module namespace http = "http://expath.org/ns/http-client";

let $url    := 'http://en.wikipedia.org/wiki/Asteroid_impact_avoidance'
let $apikey := 'PLACE_YOUR_APIKEY_HERE'
let $alchemy-uri := 'http://access.alchemyapi.com/calls/url/URLGetRankedNamedEntities'

(: Assemble the query string; the target URL is URI-encoded, and the
   ampersand is written as an entity reference inside the string literal. :)
let $href := fn:concat($alchemy-uri, '?apikey=', $apikey, '&amp;url=', fn:encode-for-uri($url))
let $content-request := <http:request href="{$href}" method="get" follow-redirect="true"/>
return
    http:send-request($content-request)[2]

The results of the text analysis are fairly verbose and, as Listing 14 shows, precise.

Listing 14. Results from the named-entity extraction web service

<results>
    <status>OK</status>
    <usage>By accessing AlchemyAPI or using information generated by AlchemyAPI,
    you are agreeing to be bound by the AlchemyAPI Terms of
    Use: http://www.alchemyapi.com/company/terms.html</usage>
    <url>http://en.wikipedia.org/wiki/Asteroid_impact_avoidance</url>
    <language>english</language>
    <entities>
        <entity>
            <type>GeographicFeature</type>
            <relevance>0.667231</relevance>
            <count>44</count>
            <text>Earth</text>
        </entity>
        <entity>
            <type>Organization</type>
            <relevance>0.472053</relevance>
            <count>25</count>
            <text>NASA</text>
            <disambiguated>
                <name>NASA</name>
                <subType>Company</subType>
                <subType>GovernmentAgency</subType>
                <subType>AirportOperator</subType>
                <subType>AwardPresentingOrganization</subType>
                <subType>SoftwareDeveloper</subType>
                <subType>SpaceAgency</subType>
                <subType>SpacecraftManufacturer</subType>
                <geo>38.88305555555556 -77.01638888888888</geo>
                <website>http://www.nasa.gov/home/index.html</website>
                <dbpedia>http://dbpedia.org/resource/NASA</dbpedia>
                <umbel>http://umbel.org/umbel/ne/wikipedia/NASA</umbel>
                <yago>http://mpii.de/yago/resource/NASA</yago>
            </disambiguated>
        </entity>
        .... entities ....
    </entities>
</results>

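The AlchemyAPI named-entity extraction web service identifies all kinds of entities. For example, it knows that:

Earth is a geographic feature.
NASA is an organization, and it supplies several related links.
United States is a country.
Representative George E. Brown is a person, and it identifies him as a politician.

In this sense, the ability of text mining to glean information from content is almost uncanny, but it is a good idea to keep an eye on the relevance scores. No system is 100 percent accurate, and you will find that some content responds to text analysis better than other content.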
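
One way to act on the relevance scores is to discard low-scoring entities and group the rest by type. The sketch below uses the two entities from Listing 14 as inline data; the 0.5 cut-off and the <group> wrapper element are assumptions made for illustration.

```xquery
xquery version "1.0";

(: The two entities shown in Listing 14, without the disambiguation block. :)
let $results :=
    <results>
        <entities>
            <entity>
                <type>GeographicFeature</type>
                <relevance>0.667231</relevance>
                <count>44</count>
                <text>Earth</text>
            </entity>
            <entity>
                <type>Organization</type>
                <relevance>0.472053</relevance>
                <count>25</count>
                <text>NASA</text>
            </entity>
        </entities>
    </results>

(: Assumed cut-off; tune it for your own content. :)
let $threshold := 0.5

return
<entities> {
    for $type in distinct-values($results/entities/entity/type)
    (: Entities of this type whose relevance meets the cut-off. :)
    let $kept := $results/entities/entity[type eq $type]
                                         [xs:decimal(relevance) ge $threshold]
    where exists($kept)
    order by $type
    return
        <group type="{$type}"> {
            for $e in $kept
            order by xs:decimal($e/relevance) descending
            return <entity relevance="{$e/relevance}">{string($e/text)}</entity>
        }</group>
}</entities>
```

XQuery 1.0 has no group by clause, so the grouping is done by iterating over the distinct type values and selecting the matching entities for each one.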

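Conclusion

This article introduced several techniques to get you started classifying your own documents. The first approach focused on building your own XQuery text mining technique based on word frequency. I then showed how to integrate powerful external web services from Yahoo! and AlchemyAPI for text analysis.

The web services certainly provide higher-quality text analysis, but as the basic word-frequency example shows, you can also draw useful inferences from your data with pure XQuery alone.

All of the approaches presented have limitations. For example, each analyzes only a single document. Running text analysis across a set of related documents can lead to higher-quality classification, because you can cross-reference a larger corpus and uncover deeper relationships between documents. In any case, I hope this article has shown you how powerful XQuery can be for automating content classification, and I would be glad to hear about your own attempts to apply XQuery in the same way.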
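
As a rough sketch of what cross-document analysis could look like in pure XQuery, the following counts, for each word, how many documents it appears in and how often it occurs overall. The three inline documents are invented stand-ins for a set of related documents, and the <docs> and <doc> element names are hypothetical.

```xquery
xquery version "1.0";

(: Three tiny illustrative documents standing in for a related set. :)
let $docs :=
    <docs>
        <doc id="a">asteroid impact avoidance and deflection</doc>
        <doc id="b">asteroid mining and space survey</doc>
        <doc id="c">impact winter after an asteroid impact</doc>
    </docs>

(: All distinct lowercase words across the whole set. :)
let $vocabulary :=
    distinct-values(
        for $d in $docs/doc
        return tokenize(lower-case($d), '\W+')[. ne '']
    )

return
<words> {
    for $w in $vocabulary
    (: Documents in which the word occurs at least once. :)
    let $in-docs := $docs/doc[tokenize(lower-case(.), '\W+') = $w]
    (: Total occurrences across all documents. :)
    let $total := count(
        for $d in $docs/doc
        return tokenize(lower-case($d), '\W+')[. eq $w]
    )
    order by count($in-docs) descending, $total descending, $w
    return <word word="{$w}" documents="{count($in-docs)}" total="{$total}"/>
}</words>
```

A word that appears in every document of a set is often less useful for telling those documents apart than one concentrated in a few of them, so the documents count is a natural input to a more refined weighting.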

Download

Description	Name	Size
Sample scripts for this article	content_catigorisation_src.zip	20KB
