本文討論Web應用中實現(xiàn)數(shù)據(jù)分頁功能,不同的技術實現(xiàn)方式的性能方區(qū)別。
上圖功能的技術實現(xiàn)方法拿MySQL來舉例就是
select * from msgs where thread_id = ? limit page * count, count
不過在看Twitter API的時候,我們卻發(fā)現(xiàn)不少接口使用cursor的方法,而不用page, count這樣直觀的形式,如followers ids 接口
URL:
http://twitter.com/followers/ids.format
Returns an array of numeric IDs for every user following thespecified user.
Parameters:
* cursor. Required. Breaks the results into pages. Provide a value of -1to begin paging. Provide values as returned to in the response body’snext_cursor and previous_cursor attributes to page back and forth in thelist.
o Example: http://twitter.com/followers/ids/barackobama.xml?cursor=-1
o Example:http://twitter.com/followers/ids/barackobama.xml?cursor=-1300794057949944903
從上面描述可以看到,http://twitter.com/followers/ids.xml這個調用需要傳cursor參數(shù)來進行分頁,而不是傳統(tǒng)的url?page=n&count=n的形式。這樣做有什么優(yōu)點呢?是否讓每個cursor保持一個當時數(shù)據(jù)集的鏡像?防止由于結果集實時改變而產生查詢結果有重復內容?
在Google Groups這篇CursorExpiration討論中Twitter的架構師JohnKalucki提到
A cursor is an opaque deletion-tolerant index into aBtree keyed by source
userid and modification time. It brings you to a point in time in the
reverse chron sorted list. So, since you can’t change the past, otherthan
erasing it, it’s effectively stable. (Modifications bubble to the top.)But
you have to deal with additions at the list head and also blockshrinkage
due to deletions, so your blocks begin to overlap quite a bit as thedata
ages. (If you cache cursors and read much later, you’ll see the firstfew
rows of cursor[n+1]’s block as duplicates of the last rows ofcursor[n]’s
block. The intersection cardinality is equal to the number of deletionsin
cursor[n]’s block). Still, there may be value in caching these cursorsand
then heuristically rebalancing them when the overlap proportion crossessome
threshold.
在另外一篇newcursor-based pagination not multithread-friendly中John又提到
The page based approach does not scale with large sets.We can no
longer support this kind of API without throwing a painful number of
503s.Working with row-counts forces the data store to recount rows in an O
(n^2) manner. Cursors avoid this issue by allowing practically
constant time access to the next block. The cost becomes O(n/
block_size) which, yes, is O(n), but a graceful one given n < 10^7and
a block_size of 5000. The cursor approach provides a more complete and
consistent result set.Proportionally, very few users require multiple page fetches with a
page size of 5,000.Also, scraping the social graph repeatedly at high speed is could
often be considered a low-value, borderline abusive use of the social
graph API.
通過這兩段文字我們已經很清楚了,對于大結果集的數(shù)據(jù),使用cursor方式的目的主要是為了極大地提高性能。還是拿MySQL為例說明,比如翻頁到100,000條時,不用cursor,對應的SQL為
select * from msgs limit 100000, 100
在一個百萬記錄的表上,第一次執(zhí)行這條SQL需要5秒以上。
假定我們使用表的主鍵的值作為cursor_id, 使用cursor分頁方式對應的SQL可以優(yōu)化為
select * from msgs where id > cursor_id limit 100;
同樣的表中,通常只需要100ms以下, 效率會提高幾十倍。MySQL limit性能差別也可參看我3年前寫的一篇不成熟的文章 MySQLLIMIT 的性能問題。
結論
建議Web應用中大數(shù)據(jù)集翻頁可以采用這種cursor方式,不過此方法缺點是翻頁時必須連續(xù),不能跳頁。
實際應用中問題一般是出在where和limit之間的status = pass或者其他篩選條件上,數(shù)據(jù)不連續(xù)cursor也就不那么靈光了
ls應該指order by吧,當不是以id為序時
跳頁可以加一次運算,取出合適cursor就可以了吧,相對可能還是簡單了
可以看一下我的一篇博客http://www.fuchaoqun.com/2009/04/efficient- pagination-using-mysql/
既可用到cursor,亦可隨意翻頁。
請問如果不是以主鍵id為排序,應該怎么做呢?