• Roblox's technical infrastructure has grown over 16+ years and now comprises roughly 145,000 machines.
  • More than 1,000 internal services are deployed and managed to deliver always-on experiences.
  • A custom-built hybrid private cloud infrastructure is used to control costs and network latency.
  • Roblox's infrastructure currently supports more than 70 million daily active users worldwide.

How We’re Making Roblox’s Infrastructure More Efficient and Resilient


December 7, 2023, by Daniel Sturman, Chief Technology Officer; Max Ross, Vice President, Engineering; and Michael Wolf, Technical Director

As Roblox has grown over the past 16+ years, so has the scale and complexity of the technical infrastructure that supports millions of immersive 3D co-experiences. The number of machines we support has more than tripled over the past two years, from approximately 36,000 as of June 30, 2021 to nearly 145,000 today. Supporting these always-on experiences for people all over the world requires more than 1,000 internal services. To help us control costs and network latency, we deploy and manage these machines as part of a custom-built, hybrid private cloud infrastructure that runs primarily on premises.

Our infrastructure currently supports more than 70 million daily active users around the world, including the creators who rely on Roblox's economy for their businesses. All of these millions of people expect a very high level of reliability. Given the immersive nature of our experiences, there is an extremely low tolerance for lags or latency, let alone outages. Roblox is a platform for communication and connection, where people come together in immersive 3D experiences. When people are communicating as their avatars in an immersive space, even minor delays or glitches are more noticeable than they are on a text thread or a conference call.

In October 2021, we experienced a system-wide outage. It started small, with an issue in one component in one data center. But it spread quickly as we were investigating, and ultimately resulted in a 73-hour outage. At the time, we shared both details about what happened and some of our early learnings from the issue.
Since then, we've been studying those learnings and working to increase the resilience of our infrastructure to the types of failures that occur in all large-scale systems due to factors like extreme traffic spikes, weather, hardware failure, software bugs, or simply human error. When these failures occur, how do we ensure that an issue in a single component, or group of components, does not spread to the full system? This question has been our focus for the past two years, and while the work is ongoing, what we've done so far is already paying off. For example, in the first half of 2023, we saved 125 million engagement hours per month compared to the first half of 2022. Today, we're sharing the work we've already done, as well as our longer-term vision for building a more resilient infrastructure system.

Building a Backstop

Within large-scale infrastructure systems, small-scale failures happen many times a day. If one machine has an issue and has to be taken out of service, that's manageable because most companies maintain multiple instances of their back-end services. So when a single instance fails, others pick up the workload. To address these frequent failures, requests are generally set to automatically retry if they get an error. This becomes challenging when a system or person retries too aggressively, which can become a way for those small-scale failures to propagate throughout the infrastructure to other services and systems. If the network or a user retries persistently enough, it will eventually overload every instance of that service, and potentially other systems, globally. Our 2021 outage was the result of something that's fairly common in large-scale systems: a failure starts small, then propagates through the system, getting big so quickly that it's hard to resolve before everything goes down.

At the time of our outage, we had one active data center (with components within it acting as backup).
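Retry amplification of the kind described above is commonly damped with capped exponential backoff, jitter, and a bounded attempt count. The following is a minimal sketch, not Roblox's implementation; the parameter values are illustrative:

```python
import random

def backoff_delays(base=0.1, cap=5.0, max_attempts=4):
    """Yield one jittered delay per retry attempt. Capping the window
    and bounding the attempt count keeps a failing dependency from
    being hammered by an ever-growing retry storm."""
    for attempt in range(max_attempts):
        window = min(cap, base * (2 ** attempt))  # exponential, capped
        yield random.uniform(0, window)           # "full jitter"
```

Jitter spreads retries out in time, so that many clients recovering at once do not synchronize into a thundering herd against the same service instance.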
We needed the ability to fail over manually to a new data center when an issue brought the existing one down. Our first priority was to ensure we had a backup deployment of Roblox, so we built that backup in a new data center, located in a different geographic region. That added protection for the worst-case scenario: an outage spreading to enough components within a data center that it becomes entirely inoperable. We now have one data center handling workloads (active) and one on standby, serving as backup (passive). Our long-term goal is to move from this active-passive configuration to an active-active configuration, in which both data centers handle workloads, with a load balancer distributing requests between them based on latency, capacity, and health. Once this is in place, we expect to have even higher reliability for all of Roblox and be able to fail over nearly instantaneously rather than over several hours.

Moving to a Cellular Infrastructure

Our next priority was to create strong blast walls inside each data center to reduce the possibility of an entire data center failing. Cells (some companies call them clusters) are essentially a set of machines and are how we're creating these walls. We replicate services both within and across cells for added redundancy. Ultimately, we want all services at Roblox to run in cells so they can benefit from both strong blast walls and redundancy. If a cell is no longer functional, it can safely be deactivated. Replication across cells enables the service to keep running while the cell is repaired. In some cases, cell repair might mean a complete reprovisioning of the cell. Across the industry, wiping and reprovisioning an individual machine, or a small set of machines, is fairly common, but doing this for an entire cell, which contains ~1,400 machines, is not.

For this to work, these cells need to be largely uniform, so we can quickly and efficiently move workloads from one cell to another.
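A load balancer routing on latency, capacity, and health, as described above, could look roughly like this. This is a sketch under assumptions of our own making: the `Site` fields and the 90 percent utilization threshold are illustrative, not Roblox's actual policy:

```python
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    healthy: bool
    latency_ms: float
    utilization: float  # fraction of capacity in use, 0.0-1.0

def pick_site(sites, max_utilization=0.9):
    """Prefer healthy sites with spare capacity, then lowest latency;
    degrade to any healthy site before declaring total failure."""
    candidates = [s for s in sites if s.healthy and s.utilization < max_utilization]
    if not candidates:
        candidates = [s for s in sites if s.healthy]
    if not candidates:
        raise RuntimeError("no healthy site available")
    return min(candidates, key=lambda s: s.latency_ms)
```

With two active sites, traffic flows to the nearer one until it becomes unhealthy or saturated, at which point requests shift to the other without manual intervention.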
We have set certain requirements that services need to meet before they run in a cell. For example, services must be containerized, which makes them much more portable and prevents anyone from making configuration changes at the OS level. We've adopted an infrastructure-as-code philosophy for cells: in our source code repository, we include the definition of everything that's in a cell so we can rebuild it quickly from scratch using automated tools.

Not all services currently meet these requirements, so we've worked to help service owners meet them where possible, and we've built new tools to make it easy to migrate services into cells when ready. For example, our new deployment tool automatically "stripes" a service deployment across cells, so service owners don't have to think about the replication strategy. This level of rigor makes the migration process much more challenging and time consuming, but the long-term payoff will be a system where:

• It's far easier to contain a failure and prevent it from spreading to other cells;
• Our infrastructure engineers can be more efficient and move more quickly; and
• The engineers who build the product-level services that are ultimately deployed in cells don't need to know or worry about which cells their services are running in.

Solving Bigger Challenges

Similar to the way fire doors are used to contain flames, cells act as strong blast walls within our infrastructure to help contain whatever issue is triggering a failure within a single cell. Eventually, all of the services that make up Roblox will be redundantly deployed inside of and across cells. Once this work is complete, issues could still propagate wide enough to make an entire cell inoperable, but it would be extremely difficult for an issue to propagate beyond that cell. And if we succeed in making cells interchangeable, recovery will be significantly faster because we'll be able to fail over to a different cell and keep the issue from impacting end users.
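The deployment "striping" mentioned above can be pictured as round-robin placement of a service's replicas across cells. A minimal sketch, with hypothetical replica and cell names:

```python
def stripe(replicas, cells):
    """Round-robin replicas across cells so no service concentrates
    behind a single blast wall."""
    if not cells:
        raise ValueError("need at least one cell")
    return {replica: cells[i % len(cells)] for i, replica in enumerate(replicas)}
```

With this placement, losing any one cell removes only a fraction of a service's replicas, and the survivors in other cells absorb the load.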
Where this gets tricky is separating these cells enough to reduce the opportunity to propagate errors, while keeping things performant and functional. In a complex infrastructure system, services need to communicate with each other to share queries, information, workloads, etc. As we replicate these services into cells, we need to be thoughtful about how we manage cross-communication. In an ideal world, we redirect traffic from one unhealthy cell to other healthy cells. But how do we manage a "query of death" (one that's causing a cell to be unhealthy)? If we redirect that query to another cell, it can cause that cell to become unhealthy in just the way we're trying to avoid. We need to find mechanisms to shift "good" traffic away from unhealthy cells while detecting and squelching the traffic that's causing cells to become unhealthy.

In the short term, we have deployed copies of computing services to each compute cell, so that most requests to the data center can be served by a single cell. We are also load balancing traffic across cells. Looking further out, we've begun building a next-generation service discovery process that will be leveraged by a service mesh, which we hope to complete in 2024. This will allow us to implement sophisticated policies that permit cross-cell communication only when it won't negatively impact the failover cells. Also coming in 2024 will be a method for directing dependent requests to a service version in the same cell, which will minimize cross-cell traffic and thereby reduce the risk of cross-cell propagation of failures.

At peak, more than 70 percent of our back-end service traffic is being served out of cells, and we've learned a lot about how to create cells, but we anticipate more research and testing as we continue to migrate our services through 2024 and beyond. As we progress, these blast walls will become increasingly strong.
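One way to squelch a "query of death" before failover replays it into a healthy cell is to track failures per request fingerprint and quarantine fingerprints that repeatedly precede a crash. This is a simplified sketch of the idea, not Roblox's mechanism; the class, threshold, and fingerprinting scheme are assumptions:

```python
from collections import Counter

class DeathQueryFilter:
    """Count failures per request fingerprint; once a fingerprint
    crosses the threshold, reject it rather than rerouting it
    into another (still healthy) cell."""
    def __init__(self, threshold=3):
        self.failures = Counter()
        self.threshold = threshold

    def record_failure(self, fingerprint):
        self.failures[fingerprint] += 1

    def allow(self, fingerprint):
        return self.failures[fingerprint] < self.threshold
```

A real system would also expire counts over time and fingerprint requests carefully (by shape, not by user), so that one noisy client doesn't block legitimate traffic.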
Migrating an Always-On Infrastructure

Roblox is a global platform supporting users all over the world, so we can't move services during off-peak or "down time," which further complicates the process of migrating all of our machines into cells and our services to run in those cells. We have millions of always-on experiences that need to continue to be supported, even as we move the machines they run on and the services that support them. When we started this process, we didn't have tens of thousands of machines just sitting around unused and available to migrate these workloads onto.

We did, however, have a small number of additional machines that were purchased in anticipation of future growth. To start, we built new cells using those machines, then migrated workloads to them. We value efficiency as well as reliability, so rather than buying more machines once we ran out of "spare" machines, we built more cells by wiping and reprovisioning the machines we'd migrated off of. We then migrated workloads onto those reprovisioned machines and started the process all over again. This process is complex: as machines are replaced and freed up to be built into cells, they are not freed up in an ideal, orderly fashion. They are physically fragmented across data halls, leaving us to provision them in a piecemeal fashion, which requires a hardware-level defragmentation process to keep the hardware locations aligned with large-scale physical failure domains.

A portion of our infrastructure engineering team is focused on migrating existing workloads from our legacy, or "pre-cell," environment into cells. This work will continue until we've migrated thousands of different infrastructure services and thousands of back-end services into newly built cells. We expect this will take all of next year and possibly into 2025, due to some complicating factors. First, this work requires robust tooling to be built.
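The wipe-and-reprovision rotation described above can be simulated in a few lines. This is an illustrative model, assuming uniform cell sizes and ignoring the physical fragmentation the text mentions:

```python
def rolling_migration(legacy, spare, cell_size=2):
    """Build a cell from spare machines, migrate workloads off an
    equal number of legacy machines, wipe those and return them to
    the spare pool, and repeat until the legacy fleet is drained."""
    cells = []
    legacy, spare = list(legacy), list(spare)
    while legacy:
        if len(spare) < cell_size:
            break  # not enough machines to form the next cell
        cell, spare = spare[:cell_size], spare[cell_size:]
        cells.append(cell)
        # Machines whose workloads moved onto the new cell are wiped
        # and become the spares for the next round.
        freed, legacy = legacy[:cell_size], legacy[cell_size:]
        spare.extend(freed)
    return cells, spare
```

The point of the model is that a small seed of spare capacity can churn through an arbitrarily large legacy fleet, one cell-sized batch at a time, without ever taking workloads offline.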
For example, we need tooling to automatically rebalance large numbers of services when we deploy a new cell, without impacting our users. We've also seen services that were built with assumptions about our infrastructure. We need to revise these services so they do not depend upon things that could change in the future as we move into cells. We've also implemented both a way to search for known design patterns that won't work well with a cellular architecture and a methodical testing process for each service that's migrated. These processes help us head off any user-facing issues caused by a service being incompatible with cells.

Today, close to 30,000 machines are being managed by cells. That's only a fraction of our total fleet, but it's been a very smooth transition so far, with no negative player impact. Our ultimate goal is for our systems to achieve 99.99 percent user uptime every month, meaning we would disrupt no more than 0.01 percent of engagement hours. Industry-wide, downtime cannot be completely eliminated, but our goal is to reduce any Roblox downtime to a degree that it's nearly unnoticeable.

Future-Proofing as We Scale

While our early efforts are proving successful, our work on cells is far from done. As Roblox continues to scale, we will keep working to improve the efficiency and resiliency of our systems through this and other technologies. As we go, the platform will become increasingly resilient to issues, and any issues that occur should become progressively less visible and disruptive to the people on our platform.

In summary, to date, we have:

• Built a second data center and successfully achieved active/passive status.
• Created cells in our active and passive data centers and successfully migrated more than 70 percent of our back-end service traffic to these cells.
• Set in place the requirements and best practices we'll need to follow to keep all cells uniform as we continue to migrate the rest of our infrastructure.
• Kicked off a continuous process of building stronger "blast walls" between cells.

As these cells become more interchangeable, there will be less crosstalk between cells. This unlocks some very interesting opportunities for us in terms of increasing automation around monitoring, troubleshooting, and even shifting workloads automatically.

In September, we also started running active/active experiments across our data centers. This is another mechanism we're testing to improve reliability and minimize failover times. These experiments helped identify a number of system design patterns, largely around data access, that we need to rework as we push toward becoming fully active-active. Overall, the experiment was successful enough to leave it running for the traffic from a limited number of our users.

We're excited to keep driving this work forward to bring greater efficiency and resiliency to the platform. This work on cells and active-active infrastructure, along with our other efforts, will make it possible for us to grow into a reliable, high-performing utility for millions of people and to continue to scale as we work to connect a billion people in real time.
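As a closing sanity check on the 99.99 percent monthly uptime goal mentioned above: 0.01 percent of a 30-day month works out to only a few minutes of allowable disruption. A quick calculation (the function name is ours, for illustration):

```python
def downtime_budget_minutes(availability, days=30):
    """Minutes of user-facing disruption allowed per month at a
    given availability target (e.g. 0.9999 for 'four nines')."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability)
```

At four nines, `downtime_budget_minutes(0.9999)` comes to about 4.3 minutes per month, which is why fast, automatic failover matters more than preventing every individual failure.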


Source: https://blog.roblox.com/2023/12/making-robloxs-infrastructure-efficient-resilient/



