Analysis of the Document Package
Understanding Document
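Lucene does not define data sources. Instead it defines one generic document structure: the Document class in Lucene's document package (org.apache.lucene.document).
A Document corresponds to one item you fetched while crawling: an MS Word file, a PDF, an HTML page, a plain text file, and so on. This design makes
very flexible applications possible, because all you need is a front-end converter that turns each data source into the Document structure.
Internally, a Document maintains a Vector of Field objects.
Let's look at the core source of Document (declarations only, implementations omitted):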
public final class Document implements java.io.Serializable {
List fields = new Vector(); // member variable holding the Field objects
// boost expresses how important this document is; it defaults to 1.0 and applies to every field in the document
private float boost = 1.0f;
public Document() {}
public void setBoost(float boost) {this.boost = boost;}
public float getBoost() {return boost;}
public final void add(Field field)
public final void removeField(String name)
public final void removeFields(String name)
public final Field getField(String name)
public final String get(String name)
public final Enumeration fields()
public final Field[] getFields(String name)
public final String[] getValues(String name)
public final String toString()
}
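To make the accessor semantics concrete, here is a minimal stand-alone sketch of the same idea: an ordered list of named values. ToyDocument and its String-only fields are hypothetical stand-ins, not Lucene's classes. The behavior it models (get returns the first match, getValues returns every match, removeField removes only the first match, removeFields removes all of them) is my understanding of how the Document methods listed above behave.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for Lucene's Document: an ordered list of (name, value) pairs.
public class ToyDocument {
    private final List<String[]> fields = new ArrayList<>();

    public void add(String name, String value) {
        fields.add(new String[] { name, value });
    }

    // Value of the first field with this name, or null if there is none.
    public String get(String name) {
        for (String[] f : fields) {
            if (f[0].equals(name)) return f[1];
        }
        return null;
    }

    // Values of all fields with this name, in insertion order.
    public List<String> getValues(String name) {
        List<String> out = new ArrayList<>();
        for (String[] f : fields) {
            if (f[0].equals(name)) out.add(f[1]);
        }
        return out;
    }

    // Removes only the first field with this name.
    public void removeField(String name) {
        for (Iterator<String[]> it = fields.iterator(); it.hasNext();) {
            if (it.next()[0].equals(name)) {
                it.remove();
                return;
            }
        }
    }

    // Removes every field with this name.
    public void removeFields(String name) {
        for (Iterator<String[]> it = fields.iterator(); it.hasNext();) {
            if (it.next()[0].equals(name)) it.remove();
        }
    }

    public static void main(String[] args) {
        ToyDocument doc = new ToyDocument();
        doc.add("author", "alice");
        doc.add("author", "bob");
        doc.add("title", "notes");
        System.out.println(doc.get("author"));
        System.out.println(doc.getValues("author"));
        doc.removeField("author");
        System.out.println(doc.getValues("author"));
    }
}
```

Allowing several fields with the same name is the reason both a single-value accessor and a multi-value accessor exist.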
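Understanding Field
As mentioned above, a Document holds a Vector that stores Field objects. So what is a Field? As a first approximation, a Field is a <name,value> pair:
name is the name of the field, for example title, body, subject, data, and so on; value is the text. The source definition makes this clear.
(Field is a very important concept in Lucene, so it is worth reading the source.)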
public final class Field implements java.io.Serializable {
private String name = "body";
private String stringValue = null;
private boolean storeTermVector = false;
private Reader readerValue = null;
private boolean isStored = false;
private boolean isIndexed = true;
private boolean isTokenized = true;
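/* What is boost for? It is used later, when results are ranked by relevance. At query time
 * every term belongs to a field, and the same term carries different weight depending on
 * which field it belongs to. For example, a term in field:<title> should count for more than
 * the same term in field:<content>. That difference in importance is expressed through
 * boost: give field:<title> a larger boost value.
 * Note that Document also has a boost that applies to the document as a whole.
 */
private float boost = 1.0f;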
public void setBoost(float boost) {this.boost = boost;}
public float getBoost() { return boost;}
public static final Field Keyword(String name, String value) {return new Field(name, value, true, true, false);}
public static final Field UnIndexed(String name, String value) {return new Field(name, value, true, false, false);}
public static final Field Text(String name, String value) {return Text(name, value, false);}
public static final Field Keyword(String name, Date value) {return new Field(name, DateField.dateToString(value), true, true, false);}
public static final Field Text(String name, String value, boolean storeTermVector) {
return new Field(name, value, true, true, true, storeTermVector);}
public static final Field UnStored(String name, String value) {
return UnStored(name, value, false);}
public static final Field UnStored(String name, String value, boolean storeTermVector) {
return new Field(name, value, false, true, true, storeTermVector); }
public static final Field Text(String name, Reader value) {
return Text(name, value, false);}
public static final Field Text(String name, Reader value, boolean storeTermVector) {
Field f = new Field(name, value);
f.storeTermVector = storeTermVector;
return f;
}
public String name() { return name; }
public String stringValue() { return stringValue; }
public Reader readerValue() { return readerValue; }
public Field(String name, String string,
boolean store, boolean index, boolean token) {
this(name, string, store, index, token, false);
}
// the lowest-level constructor
public Field(String name, String string,
boolean store, boolean index, boolean token, boolean storeTermVector)
Field(String name, Reader reader)
public final boolean isStored() { return isStored; }
public final boolean isIndexed() { return isIndexed; }
public final boolean isTokenized() { return isTokenized; }
public final boolean isTermVectorStored() { return storeTermVector; }
public final String toString()
public final String toString2() // added by me (not part of stock Lucene) to return the six-tuple
}
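How the document boost and the field boost interact can be sketched as simple arithmetic. The sketch below assumes that the weight recorded for a field at index time is the document boost times the field boost times a length normalization, and that the default length normalization is 1/sqrt(number of terms in the field). Both are my understanding of the default behavior rather than something shown in the listing above, and the class and method names are hypothetical. It also ignores any lossy encoding Lucene applies when it stores the value.

```java
// Hypothetical arithmetic sketch; not Lucene code.
public class BoostSketch {
    // Assumed default length normalization: 1 / sqrt(number of terms in the field).
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Assumed index-time weight: documentBoost * fieldBoost * lengthNorm.
    static double fieldNorm(double documentBoost, double fieldBoost, int numTerms) {
        return documentBoost * fieldBoost * lengthNorm(numTerms);
    }

    public static void main(String[] args) {
        double title = fieldNorm(1.0, 2.0, 4);   // title field boosted to 2.0, 4 terms
        double content = fieldNorm(1.0, 1.0, 4); // content field with default boost, 4 terms
        System.out.println("title=" + title + " content=" + content);
    }
}
```

Under these assumptions, a match in the boosted title field weighs twice as much as the same match in an equally long content field, which is the effect the boost comment in the Field source describes.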
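The Field listing may look long, but a quick read shows that a Field is really a six-tuple; describing it earlier as a <name,value> pair was a simplification.
The six-tuple form of a Field is <name,stringValue,isStored,isIndexed,isTokenized,isTermVectorStored>. Field provides several static factory methods.
The main ones are: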
Method | Tokenized | Indexed | Stored | Use |
Field.Text(String name, String value) | Yes | Yes | Yes | Tokenized, indexed, and stored; e.g. title, subject |
Field.Text(String name, Reader value) | Yes | Yes | No | Tokenized and indexed; a Reader-valued field keeps the default isStored = false, and no term vector is stored for it |
Field.Text(String name, String value, boolean storeTermVector) | Yes | Yes | Yes | Same as Field.Text(String, String), plus a flag that controls whether a term vector is stored |
Field.Text(String name, Reader value, boolean storeTermVector) | Yes | Yes | No | Same as Field.Text(String, Reader), plus a flag that controls whether a term vector is stored |
Field.Keyword(String name, String value) | No | Yes | Yes | Not tokenized, indexed, and stored; e.g. date, url |
Field.Keyword(String name, Date value) | No | Yes | Yes | Not tokenized, indexed, and stored; the date is converted with DateField.dateToString and is returned with hits |
Field.UnIndexed(String name, String value) | No | No | Yes | Not tokenized, not indexed, stored; e.g. a file path |
Field.UnStored(String name, String value) | Yes | Yes | No | Full-text indexed only, not stored |
Field.UnStored(String name, String value, boolean storeTermVector) | Yes | Yes | No | Same as Field.UnStored(String, String), plus a flag that controls whether a term vector is stored |
Overall, Field's factory methods come in only four forms: Text, Keyword, UnIndexed, and UnStored. Each of them simply has several overloads.
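The four String-valued forms can be summarized as a small model of the flags each one sets. FieldFlags is a hypothetical class written for this illustration, not Lucene's Field; the flag values are copied from the new Field(name, value, store, index, token) calls in the Field source shown earlier.

```java
// Hypothetical model of the flags set by the four String-valued factory methods.
public class FieldFlags {
    final String name;
    final String value;
    final boolean stored;
    final boolean indexed;
    final boolean tokenized;

    FieldFlags(String name, String value, boolean stored, boolean indexed, boolean tokenized) {
        this.name = name;
        this.value = value;
        this.stored = stored;
        this.indexed = indexed;
        this.tokenized = tokenized;
    }

    // Stored, indexed, tokenized: searchable full text that is also returned with hits.
    static FieldFlags text(String name, String value) {
        return new FieldFlags(name, value, true, true, true);
    }

    // Stored and indexed as a single term, not tokenized: dates, URLs.
    static FieldFlags keyword(String name, String value) {
        return new FieldFlags(name, value, true, true, false);
    }

    // Stored only: returned with hits but not searchable.
    static FieldFlags unIndexed(String name, String value) {
        return new FieldFlags(name, value, true, false, false);
    }

    // Indexed and tokenized only: searchable but not returned with hits.
    static FieldFlags unStored(String name, String value) {
        return new FieldFlags(name, value, false, true, true);
    }

    String describe() {
        return "<" + name + "," + value + "," + stored + "," + indexed + "," + tokenized + ">";
    }

    public static void main(String[] args) {
        System.out.println(text("title", "title").describe());
        System.out.println(keyword("url", "www.tju.edu.cn").describe());
        System.out.println(unIndexed("filepath", "D:\\Lucene").describe());
        System.out.println(unStored("unstored", "This field is unstored").describe());
    }
}
```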
Let's write a small program to exercise the Document and Field classes:
import java.io.IOException;
import java.util.Enumeration;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.RAMDirectory;

public class TestDocument
{
    // Builds one Document that uses each kind of factory method, printing every field as it goes.
    private Document makeDocumentWithFields() throws IOException
    {
        Document doc = new Document();
        doc.add(Field.Text("title","title"));
        doc.add(Field.Text("subject","ubject"));
        doc.add(Field.Keyword("date","2005.11.12"));
        doc.add(Field.Keyword("url","www.tju.edu.cn"));
        doc.add(Field.UnIndexed("filepath","D:\\Lucene"));
        doc.add(Field.UnStored("unstored","This field is unstored"));
        // Iterate through the public fields() accessor; the fields list itself is package-private.
        for (Enumeration e = doc.fields(); e.hasMoreElements();)
        {
            Field field = (Field) e.nextElement();
            System.out.println(field.toString());
            System.out.println("The corresponding six-tuple is");
            // toString2() is the helper added to Field above; it is not part of stock Lucene.
            System.out.println(field.toString2());
        }
        return doc;
    }
    // Indexes the document in memory, searches for it, and prints the Document that comes back.
    public void GetValuesForIndexedDocument() throws IOException
    {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,new StandardAnalyzer(),true);
        writer.addDocument(makeDocumentWithFields());
        writer.close();
        Searcher searcher = new IndexSearcher(dir);
        Query query = new TermQuery(new Term("title","title"));
        // Hits is made up of the matching Documents.
        Hits hits = searcher.search(query);
        System.out.println("Structure of the Document");
        System.out.println(hits.doc(0));
    }
    public static void main(String [] args)
    {
        TestDocument testDocument = new TestDocument();
        try
        {
            testDocument.GetValuesForIndexedDocument();
        }
        catch (IOException ioe)
        {
            ioe.printStackTrace();
        }
    }
}
The output is:
Text<title:title>
The corresponding six-tuple is
Field:<name,stringValue,isStored,isIndexed,isTokenized,isTermVectorStored> is:<title,title,true,true,true,false>
Text<subject:ubject>
The corresponding six-tuple is
Field:<name,stringValue,isStored,isIndexed,isTokenized,isTermVectorStored> is:<subject,ubject,true,true,true,false>
Keyword<date:2005.11.12>
The corresponding six-tuple is
Field:<name,stringValue,isStored,isIndexed,isTokenized,isTermVectorStored> is:<date,2005.11.12,true,true,false,false>
Keyword<url:www.tju.edu.cn>
The corresponding six-tuple is
Field:<name,stringValue,isStored,isIndexed,isTokenized,isTermVectorStored> is:<url,www.tju.edu.cn,true,true,false,false>
Unindexed<filepath:D:\Lucene>
The corresponding six-tuple is
Field:<name,stringValue,isStored,isIndexed,isTokenized,isTermVectorStored> is:<filepath,D:\Lucene,true,false,false,false>
UnStored<unstored>
The corresponding six-tuple is
Field:<name,stringValue,isStored,isIndexed,isTokenized,isTermVectorStored> is:<unstored,This field is unstored,false,true,true,false>
Structure of the Document
Document<Text<title:title> Text<subject:ubject> Keyword<date:2005.11.12> Keyword<url:www.tju.edu.cn> Unindexed<filepath:D:\Lucene>>
Reading through this output should give you a much better feel for the Document and Field classes. Document and Field are fundamental concepts in Lucene indexing, so they are worth understanding well.
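One detail in the output deserves a closer look: the Document that comes back from the search no longer contains the UnStored field, even though that field was indexed. A toy model makes the reason visible. ToyIndex below is a hypothetical sketch, not how Lucene is implemented: it keeps one map for searching and a separate map of stored values, and it tokenizes by lowercasing and splitting on whitespace, which is much cruder than StandardAnalyzer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy index that separates "indexed" (searchable) from "stored" (retrievable).
public class ToyIndex {
    // "field:term" -> ids of the documents containing that term
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // document id -> stored fields (name -> value)
    private final List<Map<String, String>> storedFields = new ArrayList<>();

    int newDocument() {
        storedFields.add(new LinkedHashMap<String, String>());
        return storedFields.size() - 1;
    }

    void addField(int docId, String name, String value,
                  boolean store, boolean index, boolean tokenize) {
        if (store) {
            storedFields.get(docId).put(name, value);
        }
        if (index) {
            // Tokenized fields are split into lowercase words; others become one term.
            String[] terms = tokenize ? value.toLowerCase().split("\\s+") : new String[] { value };
            for (String term : terms) {
                postings.computeIfAbsent(name + ":" + term, k -> new TreeSet<Integer>()).add(docId);
            }
        }
    }

    Set<Integer> search(String field, String term) {
        return postings.getOrDefault(field + ":" + term, Collections.<Integer>emptySet());
    }

    // Only stored fields can be returned for a hit.
    Map<String, String> doc(int docId) {
        return storedFields.get(docId);
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        int id = index.newDocument();
        index.addField(id, "title", "title", true, true, true);                    // like Text
        index.addField(id, "filepath", "D:\\Lucene", true, false, false);           // like UnIndexed
        index.addField(id, "unstored", "This field is unstored", false, true, true); // like UnStored
        System.out.println(index.search("unstored", "field")); // searchable
        System.out.println(index.doc(id).keySet());             // but not retrievable
    }
}
```

In this model the unstored field can be found by a search but is absent from the retrieved document, while the filepath field is the opposite: retrievable but not searchable.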