
A Deeper Look at Duplicate Rows in PostgreSQL Pagination

Updated: 2019-04-05 09:00:32  Author: 月圖靈

This article looks at why paginated PostgreSQL queries can return duplicate rows, and walks through the problem with worked examples.

Background

Many developers and testers have seen a list show rows from the previous page after moving to the next page. In other words, paging returns duplicate rows.

How to handle it?

The problem appears because the chosen sort column contains duplicate values. The usual fix is to add a unique column to the ORDER BY, so that rows no longer repeat across pages. The documentation also covers this and explains that it is not a bug: the sort needs a unique column, otherwise the order of the returned rows is not deterministic.

What is the root cause of the duplicated rows?

Anyone who tunes SQL regularly may have noticed the Sort Method keyword in execution plans. It names the sort method that was chosen. Sorting in abase falls into three kinds:

quicksort                       quicksort
top-N heapsort Memory          heapsort
external merge Disk            merge sort

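Hypothesis

The duplicate rows across pages are related to the stability of the sort algorithm chosen by the execution plan.

A brief overview of when each of the three sort methods is used:

With an index, the sort can read rows directly in index order. Without an index, quicksort is chosen when the table is small (the memory needed for the sort is less than work_mem); heapsort is chosen when the sort has a LIMIT and the memory it needs does not exceed work_mem; merge sort is chosen when work_mem is not enough.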
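The selection rule described above can be sketched as a small decision function. This is only a simplified model of that description, not PostgreSQL's actual logic: the name `choose_sort_method` and its parameters are hypothetical, and the kB figures in the usage lines are borrowed from the plans shown in this article as rough stand-ins for how much memory each sort would need.

```python
from typing import Optional


def choose_sort_method(work_mem_kb: int,
                       full_sort_kb: int,
                       top_n_kb: Optional[int] = None,
                       has_index: bool = False) -> str:
    """Simplified model of the rule described above (hypothetical helper)."""
    if has_index:
        # An index on the sort column already yields rows in order.
        return "index scan"
    if top_n_kb is not None and top_n_kb <= work_mem_kb:
        # A LIMIT is present and the bounded heap fits in work_mem.
        return "top-N heapsort"
    if full_sort_kb <= work_mem_kb:
        # The whole input fits in memory.
        return "quicksort"
    # Not enough work_mem: the sort spills to disk.
    return "external merge"


print(choose_sort_method(4096, 27))                # small table, no LIMIT
print(choose_sort_method(4096, 784, top_n_kb=95))  # LIMIT 1001, work_mem 4MB
print(choose_sort_method(64, 784, top_n_kb=95))    # same query, work_mem 64kB
```

The three calls mirror the three situations tested below: a small unbounded sort, a bounded sort with plenty of memory, and the same bounded sort after work_mem is lowered.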

Verifying the hypothesis

1. Create the table and load the initial data

abase=# create table t_sort(n_int int,c_id varchar(300));
CREATE TABLE
abase=# insert into t_sort(n_int,c_id) select 100,generate_series(1,9);
INSERT 0 9
abase=# insert into t_sort(n_int,c_id) select 200,generate_series(1,9);
INSERT 0 9
abase=# insert into t_sort(n_int,c_id) select 300,generate_series(1,9);
INSERT 0 9
abase=# insert into t_sort(n_int,c_id) select 400,generate_series(1,9);
INSERT 0 9
abase=# insert into t_sort(n_int,c_id) select 500,generate_series(1,9);
INSERT 0 9
abase=# insert into t_sort(n_int,c_id) select 600,generate_series(1,9);
INSERT 0 9

The three sort methods

-- quicksort
abase=# explain analyze select ctid,n_int,c_id from t_sort order by n_int asc;
            QUERY PLAN            
------------------------------------------------------------
 Sort (cost=3.09..3.23 rows=54 width=12) (actual time=0.058..0.061 rows=54 loops=1)
 Sort Key: n_int
 Sort Method: quicksort Memory: 27kB
 -> Seq Scan on t_sort (cost=0.00..1.54 rows=54 width=12) (actual time=0.021..0.032 rows=54 loops=1)
 Planning time: 0.161 ms
 Execution time: 0.104 ms
(6 rows)
-- top-N heapsort
abase=# explain analyze select ctid,n_int,c_id from t_sort order by n_int asc limit 10;
             QUERY PLAN             
 
------------------------------------------------------------
 Limit (cost=2.71..2.73 rows=10 width=12) (actual time=0.066..0.068 rows=10 loops=1)
 -> Sort (cost=2.71..2.84 rows=54 width=12) (actual time=0.065..0.066 rows=10 loops=1)
   Sort Key: n_int
   Sort Method: top-N heapsort Memory: 25kB
   -> Seq Scan on t_sort (cost=0.00..1.54 rows=54 width=12) (actual time=0.022..0.031 rows=54 loops=1)
 Planning time: 0.215 ms
 Execution time: 0.124 ms
(7 rows)
-- merge sort: external sort Disk
-- insert a large number of rows whose c_id is 'a'
abase=# insert into t_sort(n_int,c_id) select generate_series(1000,2000),'a';
INSERT 0 1001
abase=# set work_mem = '64kB';
SET
abase=# explain analyze select ctid,n_int,c_id from t_sort order by n_int asc;
             QUERY PLAN             
-------------------------------------------------------------
 Sort (cost=18.60..19.28 rows=270 width=12) (actual time=1.235..1.386 rows=1055 loops=1)
 Sort Key: n_int
 Sort Method: external sort Disk: 32kB
 -> Seq Scan on t_sort (cost=0.00..7.70 rows=270 width=12) (actual time=0.030..0.247 rows=1055 loops=1)
 Planning time: 0.198 ms
 Execution time: 1.663 ms
(6 rows)

Quicksort

-- quicksort
abase=# explain analyze select ctid,n_int,c_id from t_sort order by n_int asc;
            QUERY PLAN            
------------------------------------------------------------
 Sort (cost=3.09..3.23 rows=54 width=12) (actual time=0.058..0.061 rows=54 loops=1)
 Sort Key: n_int
 Sort Method: quicksort Memory: 27kB
 -> Seq Scan on t_sort (cost=0.00..1.54 rows=54 width=12) (actual time=0.021..0.032 rows=54 loops=1)
 Planning time: 0.161 ms
 Execution time: 0.104 ms
(6 rows) 

-- fetch the first 20 rows
 abase=# select ctid,n_int,c_id from t_sort order by n_int asc limit 20;
  ctid | n_int | c_id 
 --------+-------+------
  (0,7) | 100 | 7
  (0,2) | 100 | 2
  (0,4) | 100 | 4
  (0,8) | 100 | 8
  (0,3) | 100 | 3
  (0,6) | 100 | 6
  (0,5) | 100 | 5
  (0,9) | 100 | 9
  (0,1) | 100 | 1
  (0,14) | 200 | 5
  (0,13) | 200 | 4
  (0,12) | 200 | 3
  (0,10) | 200 | 1
  (0,15) | 200 | 6
  (0,16) | 200 | 7
  (0,17) | 200 | 8
  (0,11) | 200 | 2
  (0,18) | 200 | 9
  (0,20) | 300 | 2
  (0,19) | 300 | 1
 (20 rows)

 -- paged query: fetch the first 10 rows
 abase=# select ctid,n_int,c_id from t_sort order by n_int asc limit 10 offset 0;
  ctid | n_int | c_id 
 --------+-------+------
  (0,1) | 100 | 1
  (0,3) | 100 | 3
  (0,4) | 100 | 4
  (0,2) | 100 | 2
  (0,6) | 100 | 6
  (0,7) | 100 | 7
  (0,8) | 100 | 8
  (0,9) | 100 | 9
  (0,5) | 100 | 5
  (0,10) | 200 | 1
 (10 rows)
 -- paged query: fetch 10 rows starting at offset 10
 abase=# select ctid,n_int,c_id from t_sort order by n_int asc limit 10 offset 10;
  ctid | n_int | c_id 
 --------+-------+------
  (0,13) | 200 | 4
  (0,12) | 200 | 3
  (0,10) | 200 | 1
  (0,15) | 200 | 6
  (0,16) | 200 | 7
  (0,17) | 200 | 8
  (0,11) | 200 | 2
  (0,18) | 200 | 9
  (0,20) | 300 | 2
  (0,19) | 300 | 1
 (10 rows)

limit 10 offset 0 and limit 10 offset 10 fetch two consecutive pages.

Here the third row of the limit 10 offset 10 result repeats the last row of the first page: (0,10) | 200 | 1

In addition, the row with n_int = 200 and c_id = 5 has been "missed": it appears on neither page.
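How a row can repeat on one page and vanish from both can be illustrated with a toy model. The sketch below is not PostgreSQL's heap implementation: `top_n` and `page` are hypothetical helpers that keep a bounded buffer of limit + offset rows and overwrite the current worst row in place. That alone is enough to make the order among equal keys depend on the bound, which is the effect seen above.

```python
from typing import Any, Callable, List, Tuple

Row = Tuple[int, str]  # (n_int, c_id)


def top_n(rows: List[Row], n: int, key: Callable[[Row], Any]) -> List[Row]:
    """Toy bounded top-N: keep at most n rows, overwriting the worst in place."""
    buf: List[Row] = []
    for row in rows:
        if len(buf) < n:
            buf.append(row)
            continue
        # Position of the first buffered row with the largest sort key.
        worst = max(range(len(buf)), key=lambda i: key(buf[i]))
        if key(row) < key(buf[worst]):
            buf[worst] = row  # the overwrite changes the order among ties
    # Stable sort: rows with equal keys keep their buffer order.
    return sorted(buf, key=key)


def page(rows: List[Row], limit: int, offset: int,
         key: Callable[[Row], Any]) -> List[Row]:
    """Emulate ORDER BY key LIMIT limit OFFSET offset."""
    return top_n(rows, limit + offset, key)[offset:]


rows = [(300, "y"), (100, "a"), (100, "b"), (200, "x"), (100, "c")]

by_n_int = lambda r: r[0]  # non-unique sort key
by_both = lambda r: r      # (n_int, c_id): unique, so the order is total

print(page(rows, 2, 0, by_n_int))  # [(100, 'b'), (100, 'a')]
print(page(rows, 2, 2, by_n_int))  # [(100, 'b'), (200, 'x')]
print(page(rows, 2, 0, by_both))   # [(100, 'a'), (100, 'b')]
print(page(rows, 2, 2, by_both))   # [(100, 'c'), (200, 'x')]
```

With the non-unique key, (100, 'b') shows up on both pages and (100, 'c') is never returned. With the unique key the two pages are disjoint and together cover the four smallest rows.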

Heapsort

abase=# select count(*) from t_sort;
 count 
-------
 1055
(1 row)
-- work_mem is set to 4MB
abase=# show work_mem ;
 work_mem 
----------
 4MB
(1 row)

--top-N heapsort
abase=# explain analyze select * from ( select ctid,n_int,c_id from test order by n_int asc limit 1001 offset 0) td limit 10;
              QUERY PLAN           
    
-------------------------------------------------------------------------------------------------------------
 Limit (cost=2061.33..2061.45 rows=10 width=13) (actual time=15.247..15.251 rows=10 loops=1)
 -> Limit (cost=2061.33..2063.83 rows=1001 width=13) (actual time=15.245..15.247 rows=10 loops=1)
   -> Sort (cost=2061.33..2135.72 rows=29757 width=13) (actual time=15.244..15.245 rows=10 loops=1)
    Sort Key: test.n_int
    Sort Method: top-N heapsort Memory: 95kB
    -> Seq Scan on test (cost=0.00..429.57 rows=29757 width=13) (actual time=0.042..7.627 rows=29757 loops=1)
 Planning time: 0.376 ms
 Execution time: 15.415 ms
(8 rows)

-- take limit 1001 offset 0, then fetch the first 10 rows
abase=# select * from ( select ctid,n_int,c_id from test order by n_int asc limit 1001 offset 0) td limit 10;
 ctid | n_int | c_id 
----------+-------+------
 (0,6) | 100 | 6
 (0,2) | 100 | 2
 (0,5) | 100 | 5
 (87,195) | 100 | 888
 (0,3) | 100 | 3
 (0,1) | 100 | 1
 (0,8) | 100 | 8
 (0,55) | 100 | 888
 (44,12) | 100 | 888
 (0,9) | 100 | 9
(10 rows)
-- take limit 1001 offset 1, then fetch the first 10 rows
abase=# select * from ( select ctid,n_int,c_id from test order by n_int asc limit 1001 offset 1) td limit 10;
 ctid | n_int | c_id 
----------+-------+------
 (44,12) | 100 | 888
 (0,8) | 100 | 8
 (0,1) | 100 | 1
 (0,5) | 100 | 5
 (0,9) | 100 | 9
 (87,195) | 100 | 888
 (0,7) | 100 | 7
 (0,6) | 100 | 6
 (0,3) | 100 | 3
 (0,4) | 100 | 4
(10 rows)

-- take limit 1001 offset 2, then fetch the first 10 rows
abase=# select * from ( select ctid,n_int,c_id from test order by n_int asc limit 1001 offset 2) td limit 10;
 ctid | n_int | c_id 
----------+-------+------
 (0,5) | 100 | 5
 (0,55) | 100 | 888
 (0,1) | 100 | 1
 (0,9) | 100 | 9
 (0,2) | 100 | 2
 (0,3) | 100 | 3
 (44,12) | 100 | 888
 (0,7) | 100 | 7
 (87,195) | 100 | 888
 (0,4) | 100 | 4
(10 rows)

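Memory used by the heapsort: Sort Method: top-N heapsort Memory: 95kB

When offset changes from 0 to 1, and then to 2, the 10 rows returned do not follow a consistent order: each result is reshuffled instead of simply shifting by one row.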

Merge sort

-- set work_mem to 64kB so that the query uses a merge sort
abase=# set work_mem ='64kB';
SET
abase=# show work_mem;
 work_mem 
----------
 64kB
(1 row)

-- external merge Disk
abase=# explain analyze select * from ( select ctid,n_int,c_id from test order by n_int asc limit 1001 offset 0) td limit 10;
              QUERY PLAN               
---------------------------------------------------------------------------------------------------------------------------
 Limit (cost=2061.33..2061.45 rows=10 width=13) (actual time=27.912..27.916 rows=10 loops=1)
 -> Limit (cost=2061.33..2063.83 rows=1001 width=13) (actual time=27.910..27.913 rows=10 loops=1)
   -> Sort (cost=2061.33..2135.72 rows=29757 width=13) (actual time=27.909..27.911 rows=10 loops=1)
    Sort Key: test.n_int
    Sort Method: external merge Disk: 784kB
    -> Seq Scan on test (cost=0.00..429.57 rows=29757 width=13) (actual time=0.024..6.730 rows=29757 loops=1)
 Planning time: 0.218 ms
 Execution time: 28.358 ms
(8 rows)

-- as with heapsort: take limit 1001 offset 0, then fetch the first 10 rows
abase=# select * from ( select ctid,n_int,c_id from test order by n_int asc limit 1001 offset 0) td limit 10;
 ctid | n_int | c_id 
--------+-------+------
 (0,1) | 100 | 1
 (0,2) | 100 | 2
 (0,4) | 100 | 4
 (0,8) | 100 | 8
 (0,9) | 100 | 9
 (0,5) | 100 | 5
 (0,3) | 100 | 3
 (0,6) | 100 | 6
 (0,55) | 100 | 888
 (0,7) | 100 | 7
(10 rows)

-- as with heapsort: take limit 1001 offset 1, then fetch the first 10 rows
abase=# select * from ( select ctid,n_int,c_id from test order by n_int asc limit 1001 offset 1) td limit 10;
 ctid | n_int | c_id 
----------+-------+------
 (0,2) | 100 | 2
 (0,4) | 100 | 4
 (0,8) | 100 | 8
 (0,9) | 100 | 9
 (0,5) | 100 | 5
 (0,3) | 100 | 3
 (0,6) | 100 | 6
 (0,55) | 100 | 888
 (0,7) | 100 | 7
 (87,195) | 100 | 888
(10 rows)

-- as with heapsort: take limit 1001 offset 2, then fetch the first 10 rows
abase=# select * from ( select ctid,n_int,c_id from test order by n_int asc limit 1001 offset 2) td limit 10;
 ctid | n_int | c_id 
----------+-------+------
 (0,4) | 100 | 4
 (0,8) | 100 | 8
 (0,9) | 100 | 9
 (0,5) | 100 | 5
 (0,3) | 100 | 3
 (0,6) | 100 | 6
 (0,55) | 100 | 888
 (0,7) | 100 | 7
 (87,195) | 100 | 888
 (44,12) | 100 | 888
(10 rows)

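With work_mem reduced so that a merge sort is used, the results stay in a consistent order as offset changes from 0 to 1 and then to 2.

There is one more case: duplicates appear on the first few pages, but stop appearing further in. This can now be explained as well.

With 10 rows per page, a small offset needs little memory. As offset keeps growing, the sort needs more and more memory.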

-- work_mem is set to 64kB
abase=# show work_mem;
 work_mem 
----------
 64kB
(1 row)
-- query with limit 10 offset 10
abase=# explain analyze select * from ( select ctid,n_int,c_id from test order by n_int asc limit 10 offset 10) td limit 10;
              QUERY PLAN               
---------------------------------------------------------------------------------------------------------------------------
 Limit (cost=1221.42..1221.54 rows=10 width=13) (actual time=12.881..12.884 rows=10 loops=1)
 -> Limit (cost=1221.42..1221.44 rows=10 width=13) (actual time=12.879..12.881 rows=10 loops=1)
   -> Sort (cost=1221.39..1295.79 rows=29757 width=13) (actual time=12.877..12.879 rows=20 loops=1)
    Sort Key: test.n_int
    Sort Method: top-N heapsort Memory: 25kB
    -> Seq Scan on test (cost=0.00..429.57 rows=29757 width=13) (actual time=0.058..6.363 rows=29757 loops=1)
 Planning time: 0.230 ms
 Execution time: 12.976 ms
(8 rows)

-- query with limit 10 offset 1000
abase=# explain analyze select * from ( select ctid,n_int,c_id from test order by n_int asc limit 10 offset 1000) td limit 10;
              QUERY PLAN               
---------------------------------------------------------------------------------------------------------------------------
 Limit (cost=2065.75..2065.88 rows=10 width=13) (actual time=27.188..27.192 rows=10 loops=1)
 -> Limit (cost=2065.75..2065.78 rows=10 width=13) (actual time=27.186..27.188 rows=10 loops=1)
   -> Sort (cost=2063.25..2137.64 rows=29757 width=13) (actual time=26.940..27.138 rows=1010 loops=1)
    Sort Key: test.n_int
    Sort Method: external merge Disk: 784kB
    -> Seq Scan on test (cost=0.00..429.57 rows=29757 width=13) (actual time=0.026..6.374 rows=29757 loops=1)
 Planning time: 0.207 ms
 Execution time: 27.718 ms
(8 rows)

When offset grows from 10 to 1000, the memory needed increases and the sort method changes from heapsort to merge sort. Merge sort is a stable sort, so later pages no longer show rows from the previous page.

Reference: PostgreSQL - repeating rows from LIMIT OFFSET

Reference: LIMIT and OFFSET

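Conclusions

1. Duplicate rows across pages are caused mainly by a non-unique sort column combined with an execution plan that uses quicksort or heapsort.

2. When the sort column has duplicate values but the query uses a merge sort, the duplicate-row problem does not occur.

3. When sorting on a column with duplicate values, the early pages repeat rows. Once the growing offset makes work_mem insufficient and a merge sort is used, the duplicates disappear.

4. The order of results depends on the stability of the sort algorithm: when the execution plan picks a different algorithm, the returned rows differ.

5. The usual way to handle duplicate rows is to add the unique column c_bh after the sort column d_larq (case filing date) in the ORDER BY.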

order by d_larq,c_bh;
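To see the effect of the extra column, the sketch below pages through the same 54 rows as t_sort with ORDER BY n_int, c_id and checks that no row repeats or goes missing. It uses Python's built-in sqlite3 module purely so that it runs without a server; that substitution and the helper `fetch_page` are assumptions of this sketch, and n_int and c_id play the roles of d_larq and c_bh from the line above.

```python
import sqlite3

# In-memory SQLite database standing in for the PostgreSQL table t_sort.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_sort (n_int INTEGER, c_id TEXT)")

# Same shape as the article's data: six n_int values, c_id 1..9 for each.
rows = [(n, str(i)) for n in (100, 200, 300, 400, 500, 600)
        for i in range(1, 10)]
conn.executemany("INSERT INTO t_sort (n_int, c_id) VALUES (?, ?)", rows)


def fetch_page(limit, offset):
    """Fetch one page, ordered by the sort column plus a tie-breaker."""
    cur = conn.execute(
        "SELECT n_int, c_id FROM t_sort "
        "ORDER BY n_int, c_id LIMIT ? OFFSET ?",
        (limit, offset),
    )
    return cur.fetchall()


# Walk through every page of 10 rows and collect what was returned.
pages = [fetch_page(10, offset) for offset in range(0, 60, 10)]
seen = [row for p in pages for row in p]

print(len(seen), len(set(seen)))  # equal counts mean no row was repeated
```

Because (n_int, c_id) is unique in this data, the ORDER BY defines a single total order, so each page is a fixed slice of it no matter which sort method is used.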
