HFile: A Block-Indexed File Format to Store Sorted Key-Value Pairs
1. Introduction
HFile is modeled after Google's SSTable and is available as of Hadoop HBase-0.20.0. Previous releases of HBase temporarily used an alternative file format, MapFile [4], which is a common file format in the Hadoop IO package. I think HFile should also become a common file format once it matures, and should be moved into Hadoop's common IO package in the future.
The following description of SSTable is from section 4 of Google's Bigtable paper:
The Google SSTable file format is used internally to store Bigtable data. An SSTable provides a persistent, ordered immutable map from keys to values, where both keys and values are arbitrary byte strings. Operations are provided to look up the value associated with a specified key, and to iterate over all key/value pairs in a specified key range. Internally, each SSTable contains a sequence of blocks (typically each block is 64KB in size, but this is configurable). A block index (stored at the end of the SSTable) is used to locate blocks; the index is loaded into memory when the SSTable is opened. A lookup can be performed with a single disk seek: we first find the appropriate block by performing a binary search in the in-memory index, and then reading the appropriate block from disk. Optionally, an SSTable can be completely mapped into memory, which allows us to perform lookups and scans without touching disk. [1]
HFile implements the same features as SSTable, though it may provide somewhat more or less functionality.
2. File Format
Data Block Size
Whenever we say Block Size, it means the uncompressed size.
The size of each data block is 64KB by default and is configurable in HFile.Writer. A data block will not exceed this size by more than one key/value pair: HFile.Writer starts a new data block for subsequent key/value pairs once the block currently being written is equal to or larger than this size. The 64KB default is the same as Google's [1].
To achieve better performance, we should choose an appropriate block size. If the average key/value size is very small (e.g. 100 bytes), we should choose small blocks (e.g. 16KB) to avoid packing too many key/value pairs into each block, which would increase the latency of in-block seeks, because a seek always scans forward from the first key/value pair of a block.
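To illustrate, here is a minimal writer sketch in Java. It assumes the 0.20.0-era org.apache.hadoop.hbase.io.hfile.HFile API; the class name, the /tmp/example.hfile path, and the exact constructor arguments (block size, compression name, null comparator) are my assumptions, not taken from this article, so they may need adjusting against the actual release.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.io.hfile.HFile;
import org.apache.hadoop.hbase.util.Bytes;

public class HFileWriteExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/tmp/example.hfile");  // hypothetical output path

    // 16KB data blocks for small key/value pairs; the compression name may be
    // "none" (default), "gz" or "lzo" (see the Compression Algorithm subsection);
    // a null comparator is assumed to mean default raw byte-order comparison.
    HFile.Writer writer = new HFile.Writer(fs, path, 16 * 1024, "none", null);
    try {
      // keys must be appended in ascending order
      writer.append(Bytes.toBytes("key-000001"), Bytes.toBytes("value-1"));
      writer.append(Bytes.toBytes("key-000002"), Bytes.toBytes("value-2"));
      // optional small user metadata, stored in the File Info segment
      writer.appendFileInfo(Bytes.toBytes("creator"), Bytes.toBytes("example"));
    } finally {
      writer.close();
    }
  }
}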
Maximum Key Length
The key of each key/value pair is currently limited to 64KB in size. Usually, 10-100 bytes is a typical size for most of our applications. Even in the HBase data model, the key (rowkey + column family:qualifier + timestamp) should not be too long.
Maximum File Size
The trailer, file info, and all data block indexes (and optionally the meta block indexes) are kept in memory while writing and reading an HFile. So a larger HFile (with more data blocks) requires more memory. For example, a 1GB uncompressed HFile would have about 15600 (1GB/64KB) data blocks, and correspondingly about 15600 index entries. Supposing the average key size is 64 bytes, we need about 1.2MB of RAM (15600 x 80) to hold these indexes in memory.
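As a back-of-the-envelope check of the numbers above (a sketch; the 80 bytes per index entry simply restates the article's 64-byte key plus some offset/size bookkeeping):

// ~15,258 data blocks in a 1GB (10^9 byte) file with 64KB blocks; the article rounds to ~15600
long numBlocks = 1000L * 1000 * 1000 / (64 * 1024);
// ~80 bytes per index entry: a ~64-byte key plus offset/size bookkeeping
long indexBytes = numBlocks * 80;              // ~1.2MB of indexes held in memory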
Compression Algorithm
- Compression reduces the number of bytes written to/read from HDFS.
- Compression effectively improves the efficiency of network bandwidth and disk space.
- Compression reduces the amount of data that needs to be read when issuing a read.
To keep friction as low as possible, a real-time compression library is preferred. Currently, HFile supports the following three algorithms:
(1) NONE (default, uncompressed, string name "none")
(2) GZ (Gzip, string name "gz")
Out of the box, HFile ships with only Gzip compression, which is fairly slow.
(3) LZO (Lempel-Ziv-Oberhumer, preferred, string name "lzo")
To achieve maximal performance and benefit, you must enable LZO, a lossless data compression algorithm focused on decompression speed.
The following figures show the format of an HFile.
As shown in the figures above, an HFile is divided into multiple segments; from beginning to end, they are:
- Data Block segment
Stores key/value pairs; may be compressed.
- Meta Block segment (Optional)
Stores user-defined large metadata; may be compressed.
- File Info segment
Small metadata about the HFile, never compressed. Users can add their own small metadata (name/value pairs) here.
- Data Block Index segment
Indexes the data blocks' offsets in the HFile. The key of each index entry is the key of the first key/value pair in the block.
- Meta Block Index segment (Optional)
Indexes the meta blocks' offsets in the HFile. The key of each index entry is the user-defined unique name of the meta block.
- Trailer
Fixed-size metadata that holds the offset of each segment, among other things. To read an HFile, we should always read the Trailer first.
The current implementation of HFile does not include a Bloom filter; one should be added in the future.
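To make the read path concrete, here is a minimal lookup sketch. It again assumes the 0.20.0-era API (HFile.Reader, loadFileInfo(), getScanner() and seekTo() reflect my reading of that version and are not quoted from this article); the path and class name are hypothetical.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.io.hfile.HFile;
import org.apache.hadoop.hbase.io.hfile.HFileScanner;
import org.apache.hadoop.hbase.util.Bytes;

public class HFileReadExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/tmp/example.hfile");   // hypothetical input path

    // Opening the reader reads the Trailer; loadFileInfo() then pulls the File Info
    // segment and the block indexes into memory (no block cache, not in-memory).
    HFile.Reader reader = new HFile.Reader(fs, path, null, false);
    try {
      reader.loadFileInfo();
      HFileScanner scanner = reader.getScanner();

      // A lookup costs one block read: binary-search the in-memory block index,
      // then scan within the located data block.
      byte[] key = Bytes.toBytes("key-000002");
      if (scanner.seekTo(key) == 0) {             // 0 means an exact match was found
        System.out.println("found: " + scanner.getValueString());
      }
    } finally {
      reader.close();
    }
  }
}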
3. LZO Compression
LZO is now removed from Hadoop and HBase 0.20+ because of GPL restrictions. To enable it, we must first install the native libraries as follows. [6][7][8][9]
(1) Download LZO from http://www.oberhumer.com/ and build it.
# ./configure --build=x86_64-redhat-linux-gnu --enable-shared --disable-asm
# make
# make install
The libraries are then installed under /usr/local/lib.
(2) Download the native connector library from http://code.google.com/p/hadoop-gpl-compression/ and build it.
Copy hadoop-0.20.0-core.jar to ./lib.
# ant compile-native
# ant jar
(3) Copy the native library (build/native/Linux-amd64-64) and hadoop-gpl-compression-0.1.0-dev.jar to your application's lib directory. If your application is a MapReduce job, copy them to Hadoop's lib directory. Your application should follow the $HADOOP_HOME/bin/hadoop script to ensure that the native Hadoop library is on the library path via the system property -Djava.library.path=. [9]
4. Performance Evaluation
Testbed
4 slaves + 1 master
Machine: 4 CPU cores (2.0GHz), 2x500GB 7200RPM SATA disks, 8GB RAM.
Linux: RedHat 5.1 (2.6.18-53.el5), ext3, no RAID, noatime
1Gbps network, all nodes under the same switch.
Hadoop-0.20.0 (1GB heap), lzo-2.0.3
Several MapReduce-based benchmarks were designed to evaluate the performance of operations on HFiles in parallel.
Total key/value entries: 30,000,000.
Key/Value size: 1000 bytes (10 for key, and 990 for value). We have totally 30GB of data.
Sequential key ranges: 60, i.e. each range has 500,000 entries.
Use default block size.
The entry value is a string in which each consecutive run of 8 bytes is filled with the same letter (A-Z), e.g. "BBBBBBBBXXXXXXXXGGGGGGGG..." (see the sketch below).
We set mapred.tasktracker.map.tasks.maximum=3 to avoid client side bottleneck.
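The value layout described above could be produced by a helper like the following (a sketch; makeValue is a hypothetical name, not code from the benchmark):

// Builds a value such as "BBBBBBBBXXXXXXXXGGGGGGGG...": every consecutive
// run of 8 bytes is filled with one letter drawn from A-Z.
static byte[] makeValue(int length, java.util.Random rnd) {
  byte[] value = new byte[length];
  for (int i = 0; i < length; i += 8) {
    byte letter = (byte) ('A' + rnd.nextInt(26));
    for (int j = i; j < Math.min(i + 8, length); j++) {
      value[j] = letter;
    }
  }
  return value;
}
// e.g. makeValue(990, new java.util.Random()) yields one 990-byte benchmark value.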
(1) Write
One MapTask per key range; each writes a separate HFile containing 500,000 key/value entries.
(2) Full Scan
Each MapTask scans a separate HFile from beginning to end.
(3) Random seek to a specified key
Each MapTask opens a separate HFile and seeks to randomly selected keys within that file. Each MapTask performs 50,000 (1/10 of the entries) random seeks.
(4) Random Short Scan
Each MapTask opens a separate HFile and selects a random key within that file as the starting point for scanning 30 entries. Each MapTask performs 50,000 such scans, i.e. it scans 50,000 * 30 = 1,500,000 entries.
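For clarity, the core of one random short scan might look like this (a sketch reusing the reader and HFileScanner from the read example in section 2; pickRandomKey is a hypothetical helper returning a random key known to exist in the file):

HFileScanner scanner = reader.getScanner();
byte[] startKey = pickRandomKey();             // hypothetical helper: a random key within this HFile
if (scanner.seekTo(startKey) != -1) {          // -1 means startKey sorts before the file's first entry
  int scanned = 0;
  do {
    scanner.getKey();                          // touch the current key and value,
    scanner.getValue();                        // as the benchmark's scan would
    scanned++;
  } while (scanned < 30 && scanner.next());    // read 30 entries forward
}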
The table below shows the average number of entries written/sought/scanned per second, per node.
In this evaluation, the compression ratio is about 7:1 for gz (Gzip) and about 4:1 for lzo. Even though lzo's compression ratio is only moderate, the lzo column shows the best performance, especially for writes.
The full-scan performance is much better than SequenceFile's, so HFile may provide better performance for MapReduce-based analytical applications.
Random seeks in HFiles are slow, especially in uncompressed HFiles. But the numbers above already show 6X-10X better performance than a raw disk seek (10ms). The Ganglia charts that follow show the load, CPU, and network overhead. The random short scan exhibits similar behavior.
References
[1] Google, Bigtable: A Distributed Storage System for Structured Data. http://labs.google.com/papers/bigtable.html
[2] HBase 0.20.0 Documentation. http://hadoop.apache.org/hbase/docs/r0.20.0/
[3] HFile code review and refinement. http://issues.apache.org/jira/browse/HBASE-1818
[4] MapFile API. http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/MapFile.html
[5] Parallel LZO: Splittable Compression for Hadoop. http://www.cloudera.com/blog/2009/06/24/parallel-lzo-splittable-compression-for-hadoop/ and http://blog.chrisgoffinet.com/2009/06/parallel-lzo-splittable-on-hadoop-using-cloudera/
[6] Using LZO Compression. http://wiki.apache.org/hadoop/UsingLzoCompression
[7] LZO. http://www.oberhumer.com
[8] Hadoop LZO native connector library. http://code.google.com/p/hadoop-gpl-compression/
[9] Hadoop Native Libraries. http://hadoop.apache.org/common/docs/r0.20.0/native_libraries.html
Posted by Schubert Zhang at 9/12/2009 02:53:00 AM
Labels: BigTable, Hadoop, HBase, HFile, SSTable