鱼喃

Listen! Blub blub, the big fish is rambling on again.

Abstract

In the era of big data, high-performance distributed systems are needed to process large volumes of data. However, coordinating many nodes is not easy. One significant problem is distributed consensus: every node in the cluster must eventually agree on the same state without conflicts.

Raft is a distributed consensus algorithm that has been proven to work. This experiment continues the previous one: it implements log replication and then tests the whole system under a variety of abnormal conditions.

Read more »

Abstract

Since Hadoop first appeared, its ecosystem has kept growing, and a great deal of software has been developed around it. Apache HBase and Apache Hive are two examples.

In this experiment, in order to learn these two tools, we use HBase and Hive to continue our earlier research on wuxia novels.

Read more »

Abstract

In the era of big data, high-performance distributed systems are needed to process large volumes of data. However, coordinating many nodes is not easy. One significant problem is distributed consensus: every node in the cluster must eventually agree on the same state without conflicts.

Raft is a distributed consensus algorithm that has been proven to work. This experiment focuses mainly on designing and implementing the leader election described in the Raft algorithm.

Read more »

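Introduction

Redis is an in-memory database: all data is kept in memory, so its capacity is limited by the amount of memory available. When the data set outgrows the memory of a single machine, it is time to consider scaling out with a cluster.
A Redis cluster has two kinds of nodes: masters and replicas. Nodes can fail at runtime, so for high availability at least 3 master nodes are required. When a master fails, a majority vote selects one of the failed master's replicas to take over as master, and the remaining replicas become replicas of the new master. Data in a Redis cluster is sharded: it is divided into a fixed number of slots, and an algorithm determines which master owns each slot. Masters hold no redundant copies of one another's data; redundancy is provided by the replicas.
This post walks through building a 6-node Redis cluster with 3 masters and 3 replicas as an example. All scripts and files are in redis-cluster.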
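
To make the slot idea concrete, here is a minimal Python sketch of how a key can be mapped to a slot and then to a master. The key-to-slot part follows the published Redis Cluster rule (CRC16 of the key, modulo 16384, with `{...}` hash tags). The `MASTER_RANGES` table and the `owner_of` helper are hypothetical names introduced only for illustration, and the even three-way split is an assumption; a real cluster may assign slots differently.

```python
TOTAL_SLOTS = 16384  # number of hash slots in Redis Cluster


def crc16_xmodem(data: bytes) -> int:
    """CRC16 with polynomial 0x1021 and initial value 0 (XMODEM variant)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_slot(key: bytes) -> int:
    """Map a key to its hash slot, honoring a non-empty {hash tag}."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        # Only a non-empty tag between the first '{' and the next '}' counts.
        if end != -1 and end != start + 1:
            key = key[start + 1:end]
    return crc16_xmodem(key) % TOTAL_SLOTS


# Assumption: three masters with an even split of the slot space.
# The node names are placeholders, not taken from the post.
MASTER_RANGES = {
    "master-1": range(0, 5461),
    "master-2": range(5461, 10923),
    "master-3": range(10923, 16384),
}


def owner_of(key: bytes) -> str:
    """Return the (hypothetical) master whose slot range contains the key."""
    slot = key_slot(key)
    for name, slots in MASTER_RANGES.items():
        if slot in slots:
            return name
    raise ValueError(f"slot {slot} is not assigned to any master")


if __name__ == "__main__":
    for k in (b"foo", b"{user1000}.following", b"{user1000}.followers"):
        print(k.decode(), key_slot(k), owner_of(k))
```

Keys that share a hash tag, such as the two `{user1000}` keys above, land in the same slot and therefore on the same master, which is what allows multi-key operations on them.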

Read more »

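Summary

Designing and implementing this distributed crawler system taught us a good deal across several areas. Writing a real crawler and testing it at scale on a cluster exposed crawler-specific issues, especially page encoding, deduplication, and error handling, that tutorial videos usually do not cover. Apache Storm is a distributed stream-processing platform with a simple design but powerful capabilities; building a distributed stream-processing system on it gave us a deeper understanding of Storm. Studying how Storm's nodes and components connect and exchange data was also a great help in learning about distributed systems.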

Read more »

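Testing is an important part of software development, and even more so for distributed systems. It checks whether the program behaves as expected and also measures the system's performance; only testing can uncover problems that static code analysis cannot find. Once the distributed crawler system was coded, it was tested on the platform that had already been set up: the system was run for long periods while its state was observed, and its defects and shortcomings were then analyzed.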

Read more »

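All sorts of problems came up during testing. Through repeated tuning, we distilled the following common problems and their solutions (or mitigations). Because a single design decision sometimes affects several parts at once, or because the parts are inherently related, some of the content overlaps.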

Read more »