How exactly does Ceph's CRUSH map algorithm work?

Posted: 2016-03-21 02:27

Principles of Ceph's CRUSH Algorithm
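CRUSH's efficiency depends on the depth of the storage hierarchy and on the types of buckets from which it is built. Figure 8 compares the time (Y) that c(r, x) needs to select a single replica from each bucket type as a function of bucket size (X). At a high level, CRUSH scales as O(log n), that is, linearly with the hierarchy depth, provided that individual buckets that may be O(n) (list and straw buckets scale linearly) do not exceed a fixed maximum size. When and where each bucket type should be used depends on the expected number of additions, removals, or re-weightings. List buckets offer a slight performance advantage over straw buckets, but removals from a list bucket can cause excessive data movement. Tree buckets are a good choice for very large or frequently modified buckets, with reasonable computation and reorganization costs. Central to CRUSH's performance, both its running time and the quality of its results, is the integer hash function it uses. Pseudo-random values are computed with a multiple-input integer hash function based on Jenkins' 32-bit hash mix [Jenkins 1997]. In its present form, about 45% of the time spent in the CRUSH mapping function goes to hashing values, which makes the hash key to both overall speed and distribution quality and a ripe target for optimization.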
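The linear cost of a straw bucket can be illustrated with a small sketch. It assumes the usual description of straw selection (every item draws a pseudo-random straw, scaled by a per-item factor, and the longest straw wins). The names `mix32`, `draw`, `straw_choose`, and the `straws` scaling array are hypothetical, and `mix32` is the MurmurHash3 finalizer used as a stand-in for the Jenkins-based hash, so this is not the code from mapper.c.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in 32-bit mixer (MurmurHash3 finalizer); CRUSH itself uses rjenkins1. */
static uint32_t mix32(uint32_t h)
{
    h ^= h >> 16;
    h *= 0x85ebca6bu;
    h ^= h >> 13;
    h *= 0xc2b2ae35u;
    h ^= h >> 16;
    return h;
}

/* Hypothetical helper: combine the input x, an item id and the replica rank r. */
static uint32_t draw(uint32_t x, int32_t item, uint32_t r)
{
    return mix32(mix32(mix32(x) ^ (uint32_t)item) ^ r);
}

/*
 * Straw-style choice: each item draws a 16-bit pseudo-random value that is
 * scaled by its straw factor, and the longest straw wins. The loop visits
 * every item once, so the cost grows linearly with the bucket size.
 * The caller must pass size > 0.
 */
static int32_t straw_choose(const int32_t *items, const uint32_t *straws,
                            size_t size, uint32_t x, uint32_t r)
{
    uint64_t best = 0;
    size_t winner = 0;
    for (size_t i = 0; i < size; i++) {
        uint64_t len = (uint64_t)(draw(x, items[i], r) & 0xffffu) * straws[i];
        if (i == 0 || len > best) {
            best = len;
            winner = i;
        }
    }
    return items[winner];
}
```

Calling `straw_choose` with the same `x` and `r` always returns the same item, which is the property that lets every client compute placements independently.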
4.3.1 Negligent Aging
CRUSH leaves failed devices in place in the storage hierarchy both because failure is typically a temporary condition (failed disks are usually replaced) and because it avoids inefficient data reorganization. If a storage system ages in neglect, the number of devices that are failed but not replaced may become significant. Although CRUSH will redistribute data to non-failed devices, it does so at a small performance penalty due to a higher probability of backtracking in the placement algorithm. We evaluated the mapping speed for a 1,000 device cluster while varying the percentage of devices marked as failed. For the relatively extreme failure scenario in which half of all devices are dead, the mapping calculation time increases by 71%. (Such a situation would likely be overshadowed by heavily degraded I/O performance as each device's workload doubles.)
4.4 Reliability
Data safety is of critical importance in large storage systems, where the large number of devices makes hardware failure the rule rather than the exception. Randomized distribution strategies like CRUSH that decluster replication are of particular interest because they expand the number of peers with which any given device shares data. This has two competing and (generally speaking) opposing effects. First, recovery after a failure can proceed in parallel because smaller bits of replicated data are spread across a larger set of peers, reducing recovery times and shrinking the window of vulnerability to additional failures. Second, a larger peer group means an increased probability of a coincident second failure losing shared data. With 2-way mirroring these two factors cancel each other out, while overall data safety with more than two replicas increases with declustering [Xin et al. 2004].
However, a critical issue with multiple failures is that, in general, one cannot expect them to be independent: in many cases a single event like a power failure or a physical disturbance will affect multiple devices, and the larger peer groups associated with declustered replication greatly increase the risk of data loss. CRUSH's separation of replicas across user-defined failure domains (which does not exist with RUSH or existing hash-based schemes) is specifically designed to prevent concurrent, correlated failures from causing data loss. Although it is clear that the risk is reduced, it is difficult to quantify the magnitude of the improvement in overall system reliability in the absence of a specific storage cluster configuration and associated historical failure data to study. Although we hope to perform such a study in the future, it is beyond the scope of this paper.
5 Future Work
The primitive rule structure currently used by CRUSH is just complex enough to support the data distribution policies we currently envision. Some systems will have specific needs that can be met with a more powerful rule structure.
Although data safety concerns related to coincident failures were the primary motivation for designing CRUSH, study of real system failures is needed to determine their character and frequency before Markov or other quantitative models can be used to evaluate their precise effect on a system's mean time to data loss (MTTDL).
CRUSH's performance is highly dependent on a suitably strong multi-input integer hash function. Because it simultaneously affects both algorithmic correctness (the quality of the resulting distribution) and speed, investigation into faster hashing techniques that are sufficiently strong for CRUSH is warranted.
6 Conclusions
Distributed storage systems present a distinct set of scalability challenges for data placement. CRUSH meets these challenges by casting data placement as a pseudo-random mapping function, eliminating the conventional need for allocation metadata and instead distributing data based on a weighted hierarchy describing available storage. The structure of the cluster map hierarchy can reflect the underlying physical organization and infrastructure of an installation, such as the composition of storage devices into shelves, cabinets, and rows in a data center, enabling custom placement rules that define a broad class of policies to separate object replicas into different user-defined failure domains (with, say, independent power and network infrastructure). In doing so, CRUSH can mitigate the vulnerability to correlated device failures typical of existing pseudo-random systems with declustered replication. CRUSH also addresses the risk of device overfilling inherent in stochastic approaches by selectively diverting data from overfilled devices, with minimal computational cost.
CRUSH accomplishes all of this in an exceedingly efficient fashion, both in terms of computational efficiency and the required metadata. Mapping calculations have O(log n) running time, requiring only tens of microseconds to execute with thousands of devices. This robust combination of efficiency, reliability, and flexibility makes CRUSH an appealing choice for large-scale distributed storage systems.
7 Acknowledgements
R. J. Honicky's excellent work on RUSH inspired the development of CRUSH. Discussions with Richard Golding, Theodore Wong, and the students and faculty of the Storage Systems Research Center were most helpful in motivating and refining the algorithm. This work was supported in part by Lawrence Livermore National Laboratory, Los Alamos National Laboratory, and Sandia National Laboratory under contract B520714. Sage Weil was supported in part by a fellowship from Lawrence Livermore National Laboratory. We would also like to thank the industrial sponsors of the SSRC, including Hewlett Packard Laboratories, IBM, Intel, Microsoft Research, Network Appliance, Onstor, Rocksoft, Symantec, and Yahoo.
8 Availability
The CRUSH source code is licensed under the LGPL, and is available at:
http://www.cs.ucsc.edu/~sage/crush
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SC2006 November 2006, Tampa, Florida, USA 0-/06 $20.00 2006 IEEE
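Source Code Analysis of Ceph's CRUSH Algorithm
1 Source file analysis
The Ceph version analyzed is ceph-0.41.
1.1 Relationship between rules and buckets
http://ceph.newdream.net/wiki/Custom_data_placement_with_CRUSH
1.2 Files in the crush directory
builder.c, builder.h, crush.c, crush.h, CrushWrapper.cc, CrushWrapper.h, CrushWrapper.i, grammar.h, hash.c, hash.h, mapper.c, mapper.h, sample.txt, types.h
[Figure: include relationships among the CRUSH header files]
1.3 In crush.h
crush.h defines the data structures crush_rule, crush_bucket, crush_bucket_uniform, crush_bucket_list, crush_bucket_tree, crush_bucket_straw, and crush_map.
The crush_bucket structure is: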
struct crush_bucket {
    __s32 id;        /* this'll be negative */
    __u16 type;      /* non-zero; type=0 is reserved for devices */
    __u8 alg;        /* one of CRUSH_BUCKET_* */
    __u8 hash;       /* which hash function to use, CRUSH_HASH_* */
    __u32 weight;    /* 16-bit fixed point */
    __u32 size;      /* num items; a size of 0 means the bucket contains no items */
    __s32 *items;    /* array of item ids; the ids are either all negative or all
                        non-negative: negative ids mean the items are buckets,
                        non-negative ids mean the items are devices */

    /*
     * cached random permutation: used for uniform bucket and for
     * the linear search fallback for the other bucket types.
     */
    __u32 perm_x;    /* @x for which *perm is defined */
    __u32 perm_n;    /* num elements of *perm that are permuted/defined */
    __u32 *perm;
};
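The "16-bit fixed point" comment on the weight field can be made concrete with a short sketch. It assumes the common reading that the low 16 bits hold the fraction, so 0x10000 represents a weight of 1.000. The helper names `weight_to_fixed`, `weight_from_fixed`, and `sum_weights` are hypothetical and do not appear in the Ceph source.

```c
#include <stdint.h>

/* Convert a floating-point weight to 16.16 fixed point (0x10000 == 1.0). */
static uint32_t weight_to_fixed(double w)
{
    return (uint32_t)(w * 0x10000 + 0.5);   /* round to nearest */
}

/* Convert a 16.16 fixed-point weight back to floating point. */
static double weight_from_fixed(uint32_t w)
{
    return (double)w / 0x10000;
}

/* Sum the item weights, modeling a bucket weight as the total of its items. */
static uint32_t sum_weights(const uint32_t *weights, uint32_t size)
{
    uint32_t total = 0;
    for (uint32_t i = 0; i < size; i++)
        total += weights[i];
    return total;
}
```

Under this assumption a host holding two devices of weight 1.000 has a total of 0x20000, which corresponds to the "# weight 2.000" comments in the decompiled map shown later.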
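1.4 In crush.c
crush_get_bucket_item_weight returns the weight of an item in a given bucket.
crush_calc_parents computes and records the parent of each device and bucket.
crush_destroy_bucket destroys a bucket structure.
crush_destroy destroys a crush_map structure.
1.5 In builder.c
This file constructs and manipulates crush_map, rule, and bucket structures.
crush_create creates a crush_map structure, that is, it allocates the memory for one.
crush_finalize completes the initialization of a crush_map. It must be called only after all buckets have been added to the crush_map. It performs three operations:
Compute the value of max_devices.
Allocate the device_parents and bucket_parents arrays.
Build the parent map for the devices and buckets in the crush_map (by calling crush_calc_parents).
crush_make_rule creates a crush_rule structure.
crush_rule_set_step sets one step (op, arg1, arg2) of a crush_rule.
crush_add_rule adds a rule to the crush_map.
crush_get_next_bucket_id looks for a free slot and returns an unused bucket id.
crush_add_bucket adds a bucket to the crush_map.
crush_remove_bucket removes the given bucket from the crush_map. By the time a bucket is removed, all of its items have already been removed and its weight is 0, so no further weight adjustment is needed.
crush_make_bucket creates a bucket structure. The caller specifies the CRUSH algorithm (uniform, list, tree, or straw), the hash algorithm, the bucket type (a user-defined type), the size (the number of items), the array of item ids, and the item weights.
crush_bucket_add_item adds an item to a bucket.
crush_bucket_adjust_item_weight adjusts the weight of one item in a bucket and updates the bucket's weight accordingly.
crush_reweight_bucket recomputes the weight of a given bucket in the crush_map.
crush_bucket_remove_item removes the given item from a bucket and adjusts the bucket's weight.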
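The first operation of crush_finalize, computing max_devices, can be sketched as a scan over every bucket's items. This is a simplified model, not the builder.c code: `struct toy_bucket` and `toy_max_devices` are hypothetical names, and the struct keeps only the fields the scan needs.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced bucket: only the fields needed for the scan. */
struct toy_bucket {
    int32_t id;             /* negative bucket id */
    uint32_t size;          /* number of items */
    const int32_t *items;   /* item ids: negative = bucket, non-negative = device */
};

/*
 * Return one more than the largest device id found in any bucket.
 * Negative item ids (child buckets) can never exceed the running
 * maximum, which starts at 0, so they are skipped implicitly.
 */
static int32_t toy_max_devices(const struct toy_bucket **buckets, size_t n)
{
    int32_t max_devices = 0;
    for (size_t b = 0; b < n; b++) {
        if (buckets[b] == NULL)
            continue;                       /* unused slot in the array */
        for (uint32_t i = 0; i < buckets[b]->size; i++) {
            int32_t item = buckets[b]->items[i];
            if (item >= max_devices)
                max_devices = item + 1;
        }
    }
    return max_devices;
}
```

For the 24-disk example used later in this article, such a scan would yield 24, since the largest device id is 23.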
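1.6 In hash.h and hash.c
These files use Robert Jenkins' hash algorithm, described at http://burtleburtle.net/bob/hash/evahash.html
crush_hash32 hashes one 32-bit argument and returns a 32-bit hash value.
crush_hash32_2 hashes two 32-bit arguments and returns a 32-bit hash value.
crush_hash32_3 hashes three 32-bit arguments and returns a 32-bit hash value.
crush_hash32_4 hashes four 32-bit arguments and returns a 32-bit hash value.
crush_hash32_5 hashes five 32-bit arguments and returns a 32-bit hash value.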
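A two-argument hash in this style can be sketched as follows. The mixing macro follows Jenkins' published 96-bit mix from the page linked above. The seed, the two padding constants, and the order of the three mixing rounds are assumptions for illustration and should be checked against hash.c before relying on them. `SKETCH_HASHMIX` and `sketch_hash32_2` are hypothetical names.

```c
#include <stdint.h>

/* Jenkins' reversible mix of three 32-bit words. */
#define SKETCH_HASHMIX(a, b, c) do {                    \
        a = a - b;  a = a - c;  a = a ^ (c >> 13);      \
        b = b - c;  b = b - a;  b = b ^ (a << 8);       \
        c = c - a;  c = c - b;  c = c ^ (b >> 13);      \
        a = a - b;  a = a - c;  a = a ^ (c >> 12);      \
        b = b - c;  b = b - a;  b = b ^ (a << 16);      \
        c = c - a;  c = c - b;  c = c ^ (b >> 5);       \
        a = a - b;  a = a - c;  a = a ^ (c >> 3);       \
        b = b - c;  b = b - a;  b = b ^ (a << 10);      \
        c = c - a;  c = c - b;  c = c ^ (b >> 15);      \
    } while (0)

/* Assumed seed value, for illustration only. */
#define SKETCH_HASH_SEED 1315423911u

/* Hash two 32-bit inputs into one 32-bit value. */
static uint32_t sketch_hash32_2(uint32_t a, uint32_t b)
{
    uint32_t hash = SKETCH_HASH_SEED ^ a ^ b;
    uint32_t x = 231232;    /* assumed padding constant */
    uint32_t y = 1232;      /* assumed padding constant */
    SKETCH_HASHMIX(a, b, hash);
    SKETCH_HASHMIX(x, a, hash);
    SKETCH_HASHMIX(b, y, hash);
    return hash;
}
```

All arithmetic is on unsigned 32-bit values, so overflow wraps and the result is fully determined by the two inputs.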
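1.7 In mapper.h and mapper.c
crush_find_rule finds the id of the crush_rule in the crush_map that matches the given ruleset, type, and size.
crush_do_rule executes a rule on the crush_map for the given input (a hash value) and returns an array holding a set of OSDs. This article analyzes this function in detail.
1.8 In CrushWrapper.h and CrushWrapper.cc
These files contain the CrushWrapper class, a wrapper around the CRUSH algorithm. It mainly provides operations on items, rules, maps, and buckets, plus encode and decode operations that serialize parameters into a bufferlist and parse a bufferlist back into the original parameters. Its most important member is do_rule, which returns multiple OSDs for a given rule, PG number, and other parameters.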
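2 CRUSH algorithm flow
Suppose we build a storage system with 3 racks (rack), 4 hosts per rack (host), and 2 disks per host (device), for a total of 24 disks. The system is expected to grow by adding hosts or adding racks.
2.1 Creating the crush_map
There are three kinds of bucket: root, rack, and host. The items of root are racks, and root uses the straw algorithm. The items of a rack are hosts, and a rack uses the tree algorithm. The items of a host are devices, and a host uses the uniform algorithm. The number and weight of the devices in each host are fixed and will not change, so uniform is chosen for hosts because it is the fastest to compute.
Run the following commands to obtain the crush_map:
# crushtool --num_osds 24 -o crushmap --build host uniform 2 rack tree 4 root straw 0
# crushtool -d crushmap -o flat.txt
View the crush_map:
root@ceph-01:~# cat flat.txt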
# begin crush map

# devices
device 0 device0
device 1 device1
device 2 device2
device 3 device3
device 4 device4
device 5 device5
device 6 device6
device 7 device7
device 8 device8
device 9 device9
device 10 device10
device 11 device11
device 12 device12
device 13 device13
device 14 device14
device 15 device15
device 16 device16
device 17 device17
device 18 device18
device 19 device19
device 20 device20
device 21 device21
device 22 device22
device 23 device23

# types
type 0 device
type 1 host
type 2 rack
type 3 root

# buckets
host host0 {
    id -1    # do not change unnecessarily
    # weight 2.000
    alg uniform    # do not change bucket size (2) unnecessarily
    hash 0    # rjenkins1
    item device0 weight 1.000 pos 0
    item device1 weight 1.000 pos 1
}
host host1 {
    id -2    # do not change unnecessarily
    # weight 2.000
    alg uniform    # do not change bucket size (2) unnecessarily
    hash 0    # rjenkins1
    item device2 weight 1.000 pos 0
    item device3 weight 1.000 pos 1
}
host host2 {
    id -3    # do not change unnecessarily
    # weight 2.000
    alg uniform    # do not change bucket size (2) unnecessarily
    hash 0    # rjenkins1
    item device4 weight 1.000 pos 0
    item device5 weight 1.000 pos 1
}
host host3 {
    id -4    # do not change unnecessarily
    # weight 2.000
    alg uniform    # do not change bucket size (2) unnecessarily
    hash 0    # rjenkins1
    item device6 weight 1.000 pos 0
    item device7 weight 1.000 pos 1
}
host host4 {
    id -5    # do not change unnecessarily
    # weight 2.000
    alg uniform    # do not change bucket size (2) unnecessarily
    hash 0    # rjenkins1
    item device8 weight 1.000 pos 0
    item device9 weight 1.000 pos 1
}
host host5 {
    id -6    # do not change unnecessarily
    # weight 2.000
    alg uniform    # do not change bucket size (2) unnecessarily
    hash 0    # rjenkins1
    item device10 weight 1.000 pos 0
    item device11 weight 1.000 pos 1
}
host host6 {
    id -7    # do not change unnecessarily
    # weight 2.000
    alg uniform    # do not change bucket size (2) unnecessarily
    hash 0    # rjenkins1
    item device12 weight 1.000 pos 0
    item device13 weight 1.000 pos 1
}
host host7 {
    id -8    # do not change unnecessarily
    # weight 2.000
    alg uniform    # do not change bucket size (2) unnecessarily
    hash 0    # rjenkins1
    item device14 weight 1.000 pos 0
    item device15 weight 1.000 pos 1
}
host host8 {
    id -9    # do not change unnecessarily
    # weight 2.000
    alg uniform    # do not change bucket size (2) unnecessarily
    hash 0    # rjenkins1
    item device16 weight 1.000 pos 0
    item device17 weight 1.000 pos 1
}
host host9 {
    id -10    # do not change unnecessarily
    # weight 2.000
    alg uniform    # do not change bucket size (2) unnecessarily
    hash 0    # rjenkins1
    item device18 weight 1.000 pos 0
    item device19 weight 1.000 pos 1
}
host host10 {
    id -11    # do not change unnecessarily
    # weight 2.000
    alg uniform    # do not change bucket size (2) unnecessarily
    hash 0    # rjenkins1
    item device20 weight 1.000 pos 0
    item device21 weight 1.000 pos 1
}
host host11 {
    id -12    # do not change unnecessarily
    # weight 2.000
    alg uniform    # do not change bucket size (2) unnecessarily
    hash 0    # rjenkins1
    item device22 weight 1.000 pos 0
    item device23 weight 1.000 pos 1
}
rack rack0 {
    id -13    # do not change unnecessarily
    # weight 8.000
    alg tree    # do not change pos for existing items unnecessarily
    hash 0    # rjenkins1
    item host0 weight 2.000 pos 0
    item host1 weight 2.000 pos 1
    item host2 weight 2.000 pos 2
    item host3 weight 2.000 pos 3
}
rack rack1 {
    id -14    # do not change unnecessarily
    # weight 8.000
    alg tree    # do not change pos for existing items unnecessarily
    hash 0    # rjenkins1
    item host4 weight 2.000 pos 0
    item host5 weight 2.000 pos 1
    item host6 weight 2.000 pos 2
    item host7 weight 2.000 pos 3
}
rack rack2 {
    id -15    # do not change unnecessarily
    # weight 8.000
    alg tree    # do not change pos for existing items unnecessarily
    hash 0    # rjenkins1
    item host8 weight 2.000 pos 0
    item host9 weight 2.000 pos 1
    item host10 weight 2.000 pos 2
    item host11 weight 2.000 pos 3
}
root root {
    id -16    # do not change unnecessarily
    # weight 24.000
    alg straw
    hash 0    # rjenkins1
    item rack0 weight 8.000
    item rack1 weight 8.000
    item rack2 weight 8.000
}

# rules
rule data {
    type replicated
    min_size 2
    max_size 2
    step take root
    step chooseleaf firstn 0 type host
    step emit
}

# end crush map
2.2 Structure of the crush_map
The buckets member of the crush_map structure is an array of pointers to bucket structures; it holds pointers to the buckets listed above. The pointer to a bucket is stored at index [-1-id] of buckets. The elements of the buckets array are as follows.
{ &host0, &host1, &host2, … , &host11, &rack0, &rack1, &rack2, &root}
pos        0       1       …  11       12      13      14      15
bucket     &host0  &host1  …  &host11  &rack0  &rack1  &rack2  &root
bucket_id  -1      -2      …  -12      -13     -14     -15     -16
Bucket ids are negative to distinguish them from devices, because the items of a bucket can be either devices or buckets. For example, the items array of host0 contains {0, 1}, which are device0 and device1, while the items array of rack2 contains {-9, -10, -11, -12}, which are host8, host9, host10, and host11.
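The index rule [-1-id] and the sign convention for item ids fit in a few lines. The function names `is_bucket_id`, `bucket_slot`, and `slot_to_bucket_id` are hypothetical and serve only to illustrate the mapping described above.

```c
#include <stdint.h>

/* Negative ids denote buckets; non-negative ids denote devices. */
static int is_bucket_id(int32_t id)
{
    return id < 0;
}

/* Index of a bucket in the buckets array: slot = -1 - id. */
static int32_t bucket_slot(int32_t id)
{
    return -1 - id;
}

/* Inverse mapping: the bucket id stored at a given array slot. */
static int32_t slot_to_bucket_id(int32_t slot)
{
    return -1 - slot;
}
```

With the map above, root (id -16) lives at slot 15 and host0 (id -1) at slot 0.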
2.3 Data placement rule
The data placement rule consists of 3 steps.
step take root
step chooseleaf firstn 0 type host
step emit
The replica count is 2.
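Since each step carries an op and two arguments, the rule can be pictured as a small table. The sketch below is an illustration only: the `toy_` names and opcode values are hypothetical, and `toy_numrep` assumes that a `firstn` argument of 0 or less is taken relative to the requested replica count.

```c
#include <stdint.h>

/* Hypothetical opcodes; the real values are defined in crush.h. */
enum toy_op { TOY_TAKE, TOY_CHOOSELEAF_FIRSTN, TOY_EMIT };

/* Type numbers as listed in the "# types" section of the map. */
enum { TYPE_DEVICE = 0, TYPE_HOST = 1, TYPE_RACK = 2, TYPE_ROOT = 3 };

struct toy_step {
    enum toy_op op;
    int32_t arg1;
    int32_t arg2;
};

/* The three steps of "rule data": take root, chooseleaf firstn 0 type host, emit. */
static const struct toy_step toy_rule_data[] = {
    { TOY_TAKE, -16, 0 },                       /* arg1 = id of root */
    { TOY_CHOOSELEAF_FIRSTN, 0, TYPE_HOST },    /* arg1 = n, arg2 = type */
    { TOY_EMIT, 0, 0 },
};

/* Assumed rule: n > 0 is used as is; n <= 0 is added to the requested count. */
static int toy_numrep(int32_t n, int result_max)
{
    return n > 0 ? n : n + result_max;
}
```

With a requested replica count of 2, `firstn 0` resolves to 2 under this assumption, which agrees with the numrep value quoted in the walkthrough in section 2.4.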
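2.4 Analysis of the crush_do_rule function
Given the input x (an object_id or a pg_id), crush_do_rule returns a set of device ids; these devices store the replicas of the object. crush_do_rule also takes the following parameters.
const struct crush_map *map: describes the current layout of the storage system.
int ruleno: selects which rule to apply.
int x: the input value described above.
int force: pins a replica to a specific device; -1 means no device is forced.
int *result: the array into which crush_do_rule writes the device ids.
int result_max: the maximum number of replicas.
const __u32 *weight: the array of current weights of all devices.
The crush_do_rule function is in mapper.c (line numbers below refer to ceph-0.41). Starting at line 509, CRUSH executes the steps one by one.
The first step is "step take root", so CRUSH executes lines 512~523. Because the id of "root" is -16, w[0] = -16 and wsize = 1.
CRUSH then executes the second step, "step chooseleaf firstn 0 type host", which runs lines 525~588. When CRUSH reaches line 541, firstn = 1 and recurse_to_leaf = 1.
Because wsize = 1, the loop at line 542 runs only once.
At line 568, numrep = 2, j = 0, and i = 0.
At line 569, crush_choose is called with map->buckets[-1-w[i]] = &root.
The parameters of crush_choose are as follows.
const struct crush_map *map: describes the current layout of the storage system.
struct crush_bucket *bucket: the bucket from which items are chosen. crush_do_rule passes &root. The items of &root are {&rack0, &rack1, &rack2}, whose ids are {-13, -14, -15}.
const __u32 *weight: the weights of all devices.
int x: the input value.
int numrep: crush_do_rule passes 2.
int type: crush_do_rule passes "host".
int *out: the output. crush_choose selects numrep - outpos items from the bucket and places them in the out array.
int outpos: the selected items are placed in out[outpos], out[outpos+1], …, out[numrep - 1].
int firstn: the "first n" option, which determines the value of r'.
int recurse_to_leaf: if true, CRUSH keeps descending into each selected item (calling crush_choose recursively on each one) and chooses one device from each.
int *out2: if recurse_to_leaf is true, CRUSH places the devices it finds in the out2 array.
crush_choose finds 2 hosts and one device in each host.
When crush_choose first reaches line 356, in is the &root bucket and r = 0. It calls crush_bucket_choose.
Based on the algorithm of root, crush_bucket_choose calls bucket_straw_choose. Suppose bucket_straw_choose selects &rack0 from &root and returns the id of rack0 (-13).
At line 367, crush_choose checks whether the type of rack0 is "host". Because the type of rack0 is "rack", CRUSH executes lines 322~356 again and calls crush_bucket_choose. Based on the algorithm of rack0, crush_bucket_choose calls bucket_tree_choose. Suppose bucket_tree_choose selects host2 from rack0 and returns the id of host2 (-3).
Because recurse_to_leaf is 1, CRUSH executes lines 384~399, chooses one device in host2 (say device4), and places the device id in the out2 array.
When CRUSH reaches line 356 for the second time, in is the &root bucket and r = 1. The process is the same and is not repeated here; suppose CRUSH selects rack1 from root, host6 from rack1, and device13 from host6.
Back in crush_do_rule, lines 579~588 run, after which the w array contains {4, 13}.
Finally CRUSH executes the third step, "step emit", running lines 591~597, which copy the w array to the result array.
{4, 13} stands for device4 and device13, so the devices for x are {device4, device13}.
The CRUSH algorithm has thus mapped x to {device4, device13}.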
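The walkthrough can be condensed into a toy model of the 3 x 4 x 2 hierarchy. This is a sketch under loud assumptions, not the mapper.c logic: `mix32` (the MurmurHash3 finalizer) and a plain modulo replace the straw, tree, and uniform bucket algorithms; a collision is retried with r' = r + number of failures; and after `MAX_TRIES` attempts the code falls back to the first unused host so that it always terminates. All names are hypothetical.

```c
#include <stdint.h>

enum {
    RACKS = 3, HOSTS_PER_RACK = 4, DEVS_PER_HOST = 2,
    NUM_HOSTS = RACKS * HOSTS_PER_RACK, MAX_TRIES = 16
};

/* Stand-in mixer (MurmurHash3 finalizer), not the rjenkins1 hash. */
static uint32_t mix32(uint32_t h)
{
    h ^= h >> 16;  h *= 0x85ebca6bu;
    h ^= h >> 13;  h *= 0xc2b2ae35u;
    h ^= h >> 16;
    return h;
}

/* Pick one of n children at a given level for input x and rank r. */
static uint32_t pick(uint32_t x, uint32_t r, uint32_t level, uint32_t n)
{
    return mix32(mix32(mix32(x) ^ r) ^ level) % n;
}

static int host_used(const int *hosts, int n, int cand)
{
    for (int i = 0; i < n; i++)
        if (hosts[i] == cand)
            return 1;
    return 0;
}

/*
 * Toy "take root; chooseleaf firstn numrep type host; emit":
 * choose numrep distinct hosts, then one device inside each host.
 * Host k owns devices 2k and 2k+1, as in the example map.
 * Returns the number of device ids written to out.
 */
static int toy_chooseleaf(uint32_t x, int numrep, int *out)
{
    int hosts[NUM_HOSTS];
    int n = 0;
    if (numrep > NUM_HOSTS)
        numrep = NUM_HOSTS;
    for (int rep = 0; rep < numrep; rep++) {
        int host = -1;
        uint32_t r = (uint32_t)rep;
        for (uint32_t ftotal = 0; ftotal < MAX_TRIES; ftotal++) {
            r = (uint32_t)rep + ftotal;             /* r' = r + failures */
            uint32_t rack = pick(x, r, 0, RACKS);   /* root -> rack */
            int cand = (int)(rack * HOSTS_PER_RACK +
                             pick(x, r, 1, HOSTS_PER_RACK)); /* rack -> host */
            if (!host_used(hosts, n, cand)) {
                host = cand;
                break;
            }
        }
        /* Simplification: fall back to the first unused host. */
        for (int cand = 0; host < 0 && cand < NUM_HOSTS; cand++)
            if (!host_used(hosts, n, cand))
                host = cand;
        hosts[n] = host;
        out[n] = host * DEVS_PER_HOST + (int)pick(x, r, 2, DEVS_PER_HOST);
        n++;
    }
    return n;
}
```

Calling `toy_chooseleaf(x, 2, out)` yields two device ids on two different hosts, and the same x always yields the same pair. Unlike this toy, the real rule descends through the configured bucket algorithms and honors device weights.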