How to avoid OOM when sorting data in Spark

Updated: 2020-05-21 11:00:38 Author: Sheep Sun

This article explains how to avoid OOM (out-of-memory) errors when sorting data in Spark, with detailed sample code. It may be a useful reference for study or work.

The wrong approach

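For example, suppose we have an RDD of type RDD[(Long, (String, Int))] and want to group it by the Long key, then sort each group by the Int value in descending order. The most obvious idea is to group first, convert each Iterable to a List, and call sortBy on it. This has a fatal flaw: the Iterable is only a handle for traversing the values, whereas toList copies every element into a new in-memory container. If an Iterable holds too many elements, this can easily trigger an OOM.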

    val cidAndSidCountGrouped: RDD[(Long, Iterable[(String, Int)])] = cidAndSidCount.groupByKey()
    // 4. Sort and take the top N (5 here)
    val result: RDD[(Long, List[(String, Int)])] = cidAndSidCountGrouped.map {
      case (cid, sidCountIt) =>
        // Sort sidCountIt and keep the first 5.
        // Converting the Iterable to a strict collection can easily cause an OOM when the group is large.
        (cid, sidCountIt.toList.sortBy(-_._2).take(5))
    }

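First, note that sorting an RDD requires a shuffle, and the sort is done using both memory and disk, which effectively avoids the OOM risk. However, an RDD sort orders the whole data set, so we need to filter the RDD by key first and then sort each filtered subset.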

Method 1: use the RDD's own sort

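    import scala.collection.mutable.ListBuffer

    // Extract the Long keys (the category ids)
    val cids: List[Long] = categoryCountList.map(_.cid.toLong)
    val buffer: ListBuffer[(Long, List[(String, Int)])] = ListBuffer[(Long, List[(String, Int)])]()
    // Filter the RDD once per key, then sort and take within that subset
    for (cid <- cids) {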
      /*
      List((15,(632972a4-f811-4000-b920-dc12ea803a41,10)), (15,(f34878b8-1784-4d81-a4d1-0c93ce53e942,8)), (15,(5e3545a0-1521-4ad6-91fe-e792c20c46da,8)), (15,(66a421b0-839d-49ae-a386-5fa3ed75226f,8)), (15,(9fa653ec-5a22-4938-83c5-21521d083cd0,8)))
      Target:
      (9,List((199f8e1d-db1a-4174-b0c2-ef095aaef3ee,9), (329b966c-d61b-46ad-949a-7e37142d384a,8), (5e3545a0-1521-4ad6-91fe-e792c20c46da,8), (e306c00b-a6c5-44c2-9c77-15e919340324,7), (bed60a57-3f81-4616-9e8b-067445695a77,7)))
       */
      val arr: Array[(String, Int)] = cidAndSidCount.filter(cid == _._1)
        .sortBy(-_._2._2)
        .take(5)
        .map(_._2)
      buffer += ((cid, arr.toList))
    }
    buffer.foreach(println)

This approach has a drawback too: it launches at least one job per key, which consumes cluster resources.

Method 2: use the automatic ordering of TreeSet

 def statCategoryTop10Session_3(sc: SparkContext,
                  categoryCountList: List[CategroyCount],
                  userVisitActionRDD: RDD[UserVisitAction]) = {
    // 1. Keep only the click records that belong to the top-10 categories
    // 1.1 First map out the ids of the top-10 categories
    val cids = categoryCountList.map(_.cid.toLong)
    val topCategoryActionRDD: RDD[UserVisitAction] = userVisitActionRDD.filter(action => cids.contains(action.click_category_id))


    // 2. Count the clicks of each session within each category, starting from ((cid, sid), 1)
    val cidAndSidCount: RDD[(Long, (String, Int))] = topCategoryActionRDD
      .map(action => ((action.click_category_id, action.session_id), 1))
      // Use a custom partitioner (key point: understand how the partitioner works)
      .reduceByKey(new CategoryPartitioner(cids), _ + _)
      .map {
        case ((cid, sid), count) => (cid, (sid, count))
      }
    
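    // 3. Sort and take the top 10
    // The data is already partitioned by category id, so mapPartitions can build one TreeSet per partition.
    // Requires: import scala.collection.mutable
    // cidAndSidCount carries (sid, count) tuples, so the set stores them directly.
    val result: RDD[(Long, List[(String, Int)])] = cidAndSidCount.mapPartitions { iter =>
      // Create the TreeSet with an explicit ordering: click count descending, then
      // session id, so sessions with equal counts are kept as distinct elements.
      var treeSet = mutable.TreeSet.empty[(String, Int)](
        Ordering.by[(String, Int), (Int, String)] { case (sid, count) => (-count, sid) })
      var id = 0L
      iter.foreach { case (cid, session) =>
        id = cid
        treeSet += session
        // Never hold more than 10 elements per partition
        if (treeSet.size > 10) treeSet = treeSet.take(10)
      }
      // Emit nothing for an empty partition instead of a placeholder (0, Nil)
      if (treeSet.isEmpty) Iterator.empty else Iterator((id, treeSet.toList))
    }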
  
    result.collect.foreach(println)
    
    // Keeps the driver alive for a while (e.g. to inspect the Spark UI); remove outside of debugging
    Thread.sleep(1000000)
  }
}
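
The bounded-TreeSet idea used in Method 2 can be tried without a cluster. The following is a minimal sketch in plain Scala, not the author's code: the names BoundedTopN and topN are hypothetical, and breaking ties by session id is an assumption made here so that sessions with equal counts are not collapsed into one element.

```scala
import scala.collection.mutable

object BoundedTopN {
  // Hypothetical helper: streams over (sessionId, count) pairs and never
  // holds more than n + 1 of them at once.
  def topN(it: Iterator[(String, Int)], n: Int): List[(String, Int)] = {
    // Highest count first; ties broken by session id (an assumption of this sketch).
    val ord = Ordering.by[(String, Int), (Int, String)] { case (sid, count) => (-count, sid) }
    val set = mutable.TreeSet.empty[(String, Int)](ord)
    it.foreach { pair =>
      set += pair
      // Evict the element that currently sorts last, i.e. the smallest count.
      if (set.size > n) set -= set.last
    }
    set.toList
  }
}
```

Because the set is trimmed after every insertion, memory use depends on n rather than on the number of values per key, which is the property the toList.sortBy approach lacks.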

/*
Choose the partition number from the incoming key so that identical category ids
land in the same partition, which avoids additional shuffles.
 */
class CategoryPartitioner(cids: List[Long]) extends Partitioner {
  // Use the position of each cid in the list as its partition index.
  private val cidWithIndex: Map[Long, Int] = cids.zipWithIndex.toMap
  
  // One partition per category id
  override def numPartitions: Int = cids.length
  
  // Return the partition index for a key
  override def getPartition(key: Any): Int = {
    key match {
      // Look up the partition index by category id (0-9 for ten categories)
      case (cid: Long, _) =>
        cidWithIndex(cid)
    }
  }
}
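
To see why every (cid, sid) key of one category ends up in the same partition, the lookup logic can be reproduced in plain Scala. This is only an illustrative sketch: PartitionIndexSketch, indexFor and partitionOf are hypothetical names, and no Spark classes are involved.

```scala
object PartitionIndexSketch {
  // Same lookup the partitioner builds: category id -> position in the list.
  def indexFor(cids: List[Long]): Map[Long, Int] = cids.zipWithIndex.toMap

  // Mirrors getPartition for a (cid, sid) key: only the cid decides the index.
  def partitionOf(index: Map[Long, Int], key: Any): Int = key match {
    case (cid: Long, _) => index(cid)
  }
}
```

For example, with category ids List(15L, 9L, 2L), the keys (9L, "s1") and (9L, "s2") both map to the index of 9L, so all sessions of that category are reduced and then ranked inside one partition.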
