"Roblox Cuts Spark Join Query Costs With Machine Learning Optimized Bloom Filters" [23/11/29]

Oyaji 2023/11/29 3:43 Last updated: 2023/11/29 3:43

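Roblox has a huge data lake and uses machine learning optimized Bloom Filters to optimize its data joins.
These Learned Bloom Filters use ML to predict whether a key is present, which efficiently shrinks the data that has to be joined.
Improvements to the model architecture also brought substantial reductions in memory and CPU time.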

How Roblox Reduces Spark Join Query Costs With Machine Learning Optimized Bloom Filters


November 28, 2023
by Aditya Mangal, Sourashis Roy, Sandeep Akinapelli, Anupam Singh, Jerome Boulon
Product & Tech

Abstract

Every day on Roblox, 65.5 million users¹ engage with millions of experiences, totaling 14.0 billion hours² quarterly. This interaction generates a petabyte-scale data lake, which is enriched for analytics and machine learning (ML) purposes. It’s resource-intensive to join fact and dimension tables in our data lake, so to optimize this and reduce data shuffling, we embraced Learned Bloom Filters [1], smart data structures that use ML. By predicting presence, these filters considerably trim join data, enhancing efficiency and reducing costs. Along the way, we also improved our model architectures and demonstrated the substantial benefits they offer for reducing memory and CPU hours for processing, as well as increasing operational stability.

Introduction

In our data lake, fact tables and data cubes are temporally partitioned for efficient access, while dimension tables lack such partitions, and joining them with fact tables during updates is resource-intensive. The key space of the join is driven by the temporal partition of the fact table being joined. The dimension entities present in that temporal partition are a small subset of those present in the entire dimension dataset. As a result, the majority of the shuffled dimension data in these joins is eventually discarded.

To optimize this process and reduce unnecessary shuffling, we considered using Bloom Filters on distinct join keys but faced filter size and memory footprint issues. To address them, we explored Learned Bloom Filters, an ML-based solution that reduces Bloom Filter size while maintaining low false positive rates. This innovation enhances the efficiency of join operations by reducing computational costs and improving system stability.
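For a sense of why a plain Bloom Filter over the distinct join keys can become large, the standard sizing formulas for a Bloom Filter can be evaluated directly. The sketch below is only an illustration: the function names are made up for this example, and the key count is hypothetical rather than a figure from the article.

```python
import math


def bloom_filter_bits(num_keys: int, false_positive_rate: float) -> int:
    """Optimal bit-array size m = -n * ln(p) / (ln 2)^2 for a standard Bloom filter."""
    return math.ceil(-num_keys * math.log(false_positive_rate) / (math.log(2) ** 2))


def optimal_hash_count(num_bits: int, num_keys: int) -> int:
    """Optimal number of hash functions k = (m / n) * ln 2, rounded to an integer."""
    return max(1, round(num_bits / num_keys * math.log(2)))


# Hypothetical key count, for illustration only (not taken from the article).
num_keys = 100_000_000
bits = bloom_filter_bits(num_keys, 0.01)
hashes = optimal_hash_count(bits, num_keys)
print(f"{bits / num_keys:.2f} bits per key, {bits / 8 / 2**20:.0f} MiB, {hashes} hash functions")
```

At a 1% false positive rate the formula gives a little under 10 bits per key, and the size grows linearly with the number of distinct keys. Every worker that loads the filter pays that cost, which is the memory footprint issue the text refers to.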
The following schematic illustrates the conventional and optimized join processes in our distributed computing environment.

Enhancing Join Efficiency with Learned Bloom Filters

To optimize the join between fact and dimension tables, we adopted the Learned Bloom Filter implementation. We constructed an index from the keys present in the fact table and subsequently deployed the index to pre-filter dimension data before the join operation.

Evolution from Traditional Bloom Filters to Learned Bloom Filters

While a traditional Bloom Filter is efficient, reaching our desired false positive rate adds 15-25% of additional memory on every worker node that needs to load it. By harnessing Learned Bloom Filters, we achieved a considerably smaller index at the same false positive rate. This works because the Bloom Filter is recast as a binary classification problem: positive labels indicate that a value is present in the index, and negative labels mean it is absent. An ML model performs the initial check for a value, and a backup Bloom Filter then catches the keys the model misses, eliminating false negatives. The reduced size stems from the model’s compressed representation and from the smaller number of keys the backup Bloom Filter has to hold. This distinguishes it from the conventional Bloom Filter approach.

As part of this work, we established two metrics for evaluating our Learned Bloom Filter approach: the index’s final serialized object size and CPU consumption during the execution of join queries.

Navigating Implementation Challenges

Our initial challenge was addressing a highly biased training dataset with few dimension table keys in the fact table. In doing so, we observed an overlap of approximately one-in-three keys between the tables. To tackle this, we leveraged the Sandwich Learned Bloom Filter approach [2].
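The model-plus-backup-filter structure described above (before any sandwiching) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Roblox's implementation: `BloomFilter`, `LearnedBloomFilter`, `toy_score`, the 0.5 threshold, and the key prefixes are all hypothetical, and `toy_score` merely stands in for a trained classifier.

```python
import hashlib
import math
from typing import Callable, Iterable, Iterator


class BloomFilter:
    """Small standard Bloom filter sized from the expected key count and target FPR."""

    def __init__(self, expected_keys: int, false_positive_rate: float) -> None:
        n = max(1, expected_keys)
        self.num_bits = max(8, math.ceil(-n * math.log(false_positive_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / n * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, key: str) -> Iterator[int]:
        # Double hashing: derive all bit positions from two halves of one digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return ((h1 + i * h2) % self.num_bits for i in range(self.num_hashes))

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


class LearnedBloomFilter:
    """Classifier first, then a backup Bloom filter holding the keys the classifier misses."""

    def __init__(self, keys: Iterable[str], score: Callable[[str], float],
                 threshold: float, backup_fpr: float = 0.01) -> None:
        self.score = score
        self.threshold = threshold
        # Only the model's false negatives go into the backup filter, so it stays small.
        missed = [k for k in keys if score(k) < threshold]
        self.backup = BloomFilter(len(missed), backup_fpr)
        for k in missed:
            self.backup.add(k)

    def __contains__(self, key: str) -> bool:
        return self.score(key) >= self.threshold or key in self.backup


def toy_score(key: str) -> float:
    # Hypothetical stand-in for a trained model: favors keys with a "jp-" prefix.
    return 0.9 if key.startswith("jp-") else 0.1


keys = [f"jp-{i}" for i in range(1000)] + [f"us-{i}" for i in range(50)]
lbf = LearnedBloomFilter(keys, toy_score, threshold=0.5)
print(all(k in lbf for k in keys), lbf.backup.num_bits)
```

Every indexed key is either accepted by the scoring function or stored in the backup filter, so the structure has no false negatives. In this toy setup the backup filter only needs to hold the 50 keys the scoring function rejects, which is where the size reduction comes from.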
The Sandwich approach places an initial traditional Bloom Filter in front of the model to rebalance the dataset distribution: it removes the majority of keys that are missing from the fact table, effectively eliminating negative samples from the dataset. Subsequently, only the keys included in the initial Bloom Filter, along with the false positives, were forwarded to the ML model, often referred to as the “learned oracle.” This approach resulted in a well-balanced training dataset for the learned oracle, overcoming the bias issue effectively.

The second challenge centered on model architecture and training features. Unlike the classic problem of phishing URLs [1], our join keys (which in most cases are unique identifiers for users/experiences) weren’t inherently informative. This led us to explore dimension attributes as potential model features that can help predict if a dimension entity is present in the fact table. For example, imagine a fact table that contains user session information for experiences in a particular language. The geographic location or the language preference attribute of the user dimension would be good indicators of whether an individual user is present in the fact table or not.

The third challenge, inference latency, required models that both minimized false negatives and provided rapid responses. A gradient-boosted tree model was the optimal choice for these key metrics, and we pruned its feature set to balance precision and speed.

Our updated join query using Learned Bloom Filters is shown below:

Results

Here are the results of our experiments with Learned Bloom Filters in our data lake. We integrated them into five production workloads, each of which had different data characteristics. The most computationally expensive part of these workloads is the join between a fact table and a dimension table. The key space of the fact tables is approximately 30% of that of the dimension table.
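As a back-of-the-envelope check on what pre-filtering can remove at this overlap, the fraction of dimension keys that pass the filter can be estimated from the overlap and the false positive rate. The function name is hypothetical, and the 1% rate is assumed here for illustration. The estimate counts keys rather than bytes, so it is not expected to match the measured shuffle and CPU figures reported in the article.

```python
def surviving_fraction(overlap: float, false_positive_rate: float) -> float:
    """Fraction of dimension keys that pass the pre-filter.

    Every key that is truly in the fact table passes (no false negatives),
    plus a false_positive_rate share of the keys that are not in it.
    """
    return overlap + false_positive_rate * (1.0 - overlap)


overlap = 0.30  # approximate key-space overlap stated in the text
fpr = 0.01      # assumed total false positive rate, for illustration
kept = surviving_fraction(overlap, fpr)
print(f"kept {kept:.1%} of dimension keys, dropped {1 - kept:.1%} before the shuffle")
```

Under these assumptions roughly 31% of the dimension keys reach the join and the rest are discarded before any shuffling takes place.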
To begin with, we discuss how the Learned Bloom Filter outperformed traditional Bloom Filters in terms of final serialized object size. Next, we show the performance improvements that we observed by integrating Learned Bloom Filters into our workload processing pipelines.

Learned Bloom Filter Size Comparison

As shown below, at a given false positive rate, the two variants of the Learned Bloom Filter reduce total object size by 17-42% compared to traditional Bloom Filters. In addition, by using a smaller subset of features in our gradient-boosted tree model, we lost only a small percentage of the optimization while making inference faster.

Learned Bloom Filter Usage Results

In this section, we compare the performance of Bloom Filter-based joins to that of regular joins across several metrics. The table below compares the performance of workloads with and without the use of Learned Bloom Filters. The comparison uses a Learned Bloom Filter with a 1% total false positive probability and keeps the same cluster configuration for both join types.

First, we found that the Bloom Filter implementation outperformed the regular join by as much as 60% in CPU hours. We saw an increase in CPU usage in the scan step for the Learned Bloom Filter approach due to the additional compute spent evaluating the Bloom Filter. However, the pre-filtering done in this step reduced the size of the data being shuffled, which reduced the CPU used by the downstream steps and therefore the total CPU hours.

Second, joins using Learned Bloom Filters had about 80% less total data size and about 80% fewer total shuffle bytes written than a regular join. This leads to more stable join performance, as discussed below.

We also saw reduced resource usage in our other production workloads under experimentation.
Over a period of two weeks across all five workloads, the Learned Bloom Filter approach generated an average daily cost savings of 25%, which also accounts for model training and index creation. Due to the reduced amount of data shuffled while performing the join, we were able to significantly reduce the operational costs of our analytics pipeline while also making it more stable.

The following chart shows variability (using a coefficient of variation) in run durations (wall clock time) for a regular join workload and a Learned Bloom Filter based workload over a two-week period for the five workloads we experimented with. The runs using Learned Bloom Filters were more stable (more consistent in duration), which opens up the possibility of moving them to cheaper, transient, unreliable compute resources.

References

[1] T. Kraska, A. Beutel, E. H. Chi, J. Dean, and N. Polyzotis. The Case for Learned Index Structures. https://arxiv.org/abs/1712.01208, 2017.
[2] M. Mitzenmacher. Optimizing Learned Bloom Filters by Sandwiching. https://arxiv.org/abs/1803.01474, 2018.

¹ As of the three months ended June 30, 2023
² As of the three months ended June 30, 2023


Source: https://blog.roblox.com/2023/11/roblox-reduces-spark-join-query-costs-machine-learning-optimized-bloom-filters/


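Sales ranking trend

November 28 (day before publication)	Game sales: unranked / Overall sales: unranked / Free ranking: 51st
Wednesday, November 29, 2023 (article publication date) *highest rank of the day	Game sales: unranked / Overall sales: unranked / Free ranking: 60th
Rising fast in the free ranking	Attention is growing. This is a good time to start the game.
High in the free ranking *recommended for beginners	The game is getting buzz and has many new or active users. It may also be a good time for rerolling, and many games in this position have just launched or are in a campaign period.
Service launch date	December 11, 2012
Time in service	4,410 days (12 years)
Next anniversary	December 11, 2025 (13th anniversary)
Time until the anniversary	338 days
Projected half anniversary	June 11, 2025 (12.5 years), 155 days away
Operator	Roblox Corporation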

ROBLOX information


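Roblox is a create-and-connect virtual space platform that is free and easy to download. Have fun hanging out with friends in the vast number of virtual spaces created and published by creators around the world.

Millions of virtual spaces
Roblox's virtual spaces offer millions of ways to have fun. For example:
・Explore official worlds for popular movies and TV shows
・Watch virtual concerts
・Try on apparel from top fashion brands
・Test your courage in horror spaces
・Compete with everyone in esports, or take on fighting games and obstacle courses
・Go sightseeing in cities around the world
・Interact with musicians and celebrities who appear as avatars
...and much more.

Connect with everyone
・Runs on most devices, including PC, mobile, Xbox One, and VR headsets.
・Play together with friends even if you are not using the same device.
・Includes text chat, voice chat, and private messaging.
・Use groups to connect with worldwide communities that share your hobbies and favorites.
・Includes Roblox Connect, a communication feature for calling as your avatar.

Be who you want to be
・Customize your avatar and style it your way.
・Buy virtual goods inside the virtual spaces of brands and companies.
・Build skills in virtual spaces focused on languages and education.
・Plenty of tools for learning programming and design.
・Create items and virtual spaces and make your debut as a creator.

Create your own virtual space: https://www.roblox.com/develop
Support: https://help.roblox.com
Contact: https://corp.roblox.com/contact/
Privacy policy: https://en.help.roblox.com/hc/ja/articles/115004630823
Parents' guide: https://corp.roblox.com/parents/
Terms of use: https://www.roblox.com/info/terms

Note: A network connection is required to participate. Roblox works best over Wi-Fi. Some features and content on Roblox may not be available in your country or region, or in your language.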


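It's still fun, but I don't like that there are griefers, and occasionally cheaters, in Fling Things and People (物や人を飛ばす). Even so, I really, really love Roblox! Please add cute faces to the free category so that free players can use them too, like in 2023! (★5)(24/12/29)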

Hate the hackers I mind my business but hackers suddenly disturb me I also hate scammers but love the games and events (★4)(24/12/29)

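Simply the best. I could play this until I die. There are lots of famous games, and I've never seen anything else with hundreds of millions of fun games. You almost never lose out by spending money. Lately, though, so many people pay that free players frequently get bullied, which is a problem. I'm a paying player myself, and I see it happen fairly often. In games like Fling Things and People (物や人を飛ばす), the discrimination against free players is severe. And the Japanese community is probably the most unruly in the world. I'm not telling you to stay away, but it really is rough, so be careful. Still, if you land on a good server it's a fantastic game. Rough community, but fun, I'd say. Anyway, Roblox is the best. If you're thinking about trying it, definitely do. Here are the games I especially recommend (personal opinion): ・Blox Fruits ・Brookhaven ・Slap Battles ・PLS DONATE (a game for donating Robux, Roblox's in-game currency) ・DOORS ・ひみつのおるすばん ・The Strongest Battlegrounds (最強の戦場) ・Rivals ・Murder Mystery 2 ・むありあ放送局. That's about it. By the way, むありあ放送局 is worth a visit. There's a YouTuber called むありあ who often runs Robux giveaways, so join the group and play while keeping an eye on the announcements. Robux is also given away during livestreams, so watch the group announcements and wait! (★3)(24/12/29)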

Sometimes I need to use real money to buy something special but I still love Roblox because I can meet new friends and there are many different games (★5)(24/12/29)

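1. Frankly pay-to-win. It's so pay-to-win that it isn't funny. Roblox is fun, but for me it isn't much fun without spending money. I pay now, but I started out as a free player too. Everyone does. So I'd like it to be a little kinder to free players. 2. People looking for dates. There are far too many of them. In particular, some people in Fling Things and People (物や人を飛ばす) and 海鼠の湯 say things like "I want a boyfriend." It's unpleasant to see. 3. Lewd behavior. As mentioned in 2, there are men and women behaving lewdly in the hot spring. This is also unpleasant to see. 4. "□ times" scams. Roblox has all kinds of games; take 「おねがい寄付して」 (a donation game) as an example. It's a game where you donate and receive donations, but scams of the form "Give me 〇 Robux (Roblox's in-game money) and I'll give back □ times as much" are rampant 😅 Here's how it works: say person A falls for the scam and donates. Call the person who promised to return □ times as much person B. As soon as B receives the donation, B leaves the world. Honestly, I wonder whether that isn't a crime. 5. The moderators are too strict. Far too strict. You get banned for a day over a minor remark. (★4)(24/12/29)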

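There was a Roblox update recently, right? I heard about it from a friend too. It disabled chat. One of my friends can no longer chat. Losing private chat is the biggest drawback. From here on I'll call private chat "プラチャ". The reason is that I sometimes go to public servers, and without プラチャ other people butt into the conversation, so please fix just that. Good points: ・The games are fun ・There are free skins. That's all. This is a personal rating, so please refrain from questions. Bad points: ・There are kids who hurl abuse ・You get banned by mistake ・When you join, you get sent to the server your friend is in (the bad point above applies when the server is full) ・The community is rough. Things I'd like fixed: ・When a server is full, send me to a different server. ・If a cheater cheats, ban them indefinitely. ⚠️ Note: these are things I personally want fixed. ・Games often have glitch exploits (except the wall clip in 揺るぎない魂). That's all. Afterword: often, when I'm about to report someone, they get away. Personally I think it's quite good. Just... only one thing, though... If anything else changes, I'll write another review. Well then, good-bye! (★5)(24/12/29)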


One thought on "Roblox Cuts Spark Join Query Costs With Machine Learning Optimized Bloom Filters"

ROBLOX says:

2023年11月29日 3:45 AM

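Roblox has a huge data lake and uses machine learning to join that data efficiently. The approach appears to be effective: it reduces memory and CPU usage and shortens processing time. I found it an interesting initiative.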
