鱼喃

Listen! Blub blub, the big fish is rambling on again

StreamSpider Crawler: Proxies

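How could a distributed crawler system do without proxies? Once the system as a whole was basically built, I started thinking about add-on features like this one.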

Three types

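Looking at the parameters, there are three proxy modes: Proxy.Type.HTTP, Proxy.Type.SOCKS and Proxy.Type.DIRECT. Proxy-list websites generally offer HTTP proxies, and you could scrape a few from them, but their availability is painfully poor; for anything reliable you still have to pay or host your own. The proxies I have on hand are all SOCKS-based, so I decided to use SOCKS.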
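
To make the three modes concrete, here is a minimal sketch that builds a java.net.Proxy for each value of Proxy.Type. The class name ProxyTypeDemo, the helper build and the host proxy.example.com are placeholders for illustration, not part of StreamSpider.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;

final class ProxyTypeDemo {
    /** Builds a java.net.Proxy for the given type; DIRECT needs no address. */
    static Proxy build(Proxy.Type type, String host, int port) {
        if (type == Proxy.Type.DIRECT) {
            // new Proxy(DIRECT, ...) is rejected by the JDK, so use the shared constant
            return Proxy.NO_PROXY;
        }
        // createUnresolved avoids a DNS lookup while the object is constructed
        return new Proxy(type, InetSocketAddress.createUnresolved(host, port));
    }

    public static void main(String[] args) {
        // proxy.example.com:1080 is a placeholder address
        for (Proxy.Type type : Proxy.Type.values()) {
            System.out.println(type + " -> " + build(type, "proxy.example.com", 1080));
        }
    }
}
```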

Two approaches

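There are two ways to implement proxying: one is to randomly pick an available proxy and set it as the HttpClient's proxy; the other is to build a proxy gateway, hard-code its address in the crawler, and let the gateway carry out the actual download.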
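
The first approach could look roughly like the sketch below. It assumes a list of proxies that have already been verified as usable; the class RandomProxyPicker and its method pick are hypothetical names, and how the list is obtained and kept fresh is left out.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

final class RandomProxyPicker {
    private final List<InetSocketAddress> available;

    RandomProxyPicker(List<InetSocketAddress> available) {
        if (available.isEmpty()) {
            throw new IllegalArgumentException("no proxies available");
        }
        // Defensive copy so later changes to the caller's list do not affect us
        this.available = new ArrayList<>(available);
    }

    /** Returns one of the known proxies, chosen uniformly at random. */
    InetSocketAddress pick() {
        int index = ThreadLocalRandom.current().nextInt(available.size());
        return available.get(index);
    }
}
```

The picked address would then be handed to whatever builds the HttpClient for the next request.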

The approach used

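Both of the approaches above require running an additional service, which means a fair amount of work. The program happens to run in a Docker Swarm environment, so its load-balancing mechanism can be put to use: create a multi-replica service named ss-proxy, in which each instance randomly picks an available proxy (all of the same type), checks it periodically, and exits when the proxy becomes unavailable. The crawler code uses ss-proxy as the proxy address, and when that name is resolved the Docker network dispatches the connection to one specific proxy server.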
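
The "check periodically, exit when unavailable" part of an ss-proxy instance might be sketched as follows. This is an illustration under assumptions, not the actual ss-proxy code: ProxyHealthCheck, isReachable, watch and the host upstream.example.com are made-up names, and a plain TCP connect only shows that the port answers, not that the proxy really forwards traffic.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class ProxyHealthCheck {
    /** Returns true if a TCP connection to host:port succeeds within timeoutMs. */
    static boolean isReachable(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    /** Checks the proxy every periodSeconds and runs onFailure whenever a check fails. */
    static ScheduledExecutorService watch(String host, int port,
                                          long periodSeconds, Runnable onFailure) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            if (!isReachable(host, port, 3000)) {
                onFailure.run();
            }
        }, periodSeconds, periodSeconds, TimeUnit.SECONDS);
        return scheduler;
    }

    public static void main(String[] args) {
        // Placeholder upstream proxy; exit with a non-zero status once it stops answering
        watch("upstream.example.com", 1080, 30, () -> System.exit(1));
    }
}
```

Exiting the process leaves it to the orchestrator to replace the instance, which is presumably the point of the design: the replacement picks a fresh proxy on startup.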

Example

HttpClient does not support the SOCKS protocol directly, so the existing code needs fairly substantial changes.

Extract the creation of the HttpClient object into its own method:

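// Imports needed by this excerpt (Apache HttpClient 4.x):
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.Socket;
import org.apache.http.HttpHost;
import org.apache.http.client.HttpClient;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.config.Registry;
import org.apache.http.config.RegistryBuilder;
import org.apache.http.conn.socket.ConnectionSocketFactory;
import org.apache.http.conn.socket.PlainConnectionSocketFactory;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.protocol.HttpContext;
import org.apache.http.ssl.SSLContexts;

/*
 * Build an HttpClient for the configured proxy type.
 * Following redirects is disabled because HttpClient cannot properly handle
 * URLs which contain Chinese characters or are in other charsets.
 * Ref: https://my.oschina.net/SmilePlus/blog/682198
 */
// proxy_type, proxy_host, proxy_port and context are fields of the enclosing class
private HttpClient buildHttpClient() {
    HttpClient httpClient;
    // Turn off automatic redirect following, as described in the comment above
    RequestConfig config = RequestConfig.custom().setRedirectsEnabled(false).build();
    if (proxy_type == Proxy.Type.SOCKS) {
        Registry<ConnectionSocketFactory> reg = RegistryBuilder.<ConnectionSocketFactory>create()
            .register("http", new PlainConnectionSocketFactory() {
                @Override
                public Socket createSocket(HttpContext context) throws IOException {
                    // Open the socket through the SOCKS proxy stored in the request context
                    InetSocketAddress socksAddr = (InetSocketAddress) context.getAttribute("socks.address");
                    Proxy proxy = new Proxy(Proxy.Type.SOCKS, socksAddr);
                    return new Socket(proxy);
                }
                @Override
                public Socket connectSocket(int connectTimeout, Socket socket, HttpHost host, InetSocketAddress remoteAddress, InetSocketAddress localAddress, HttpContext context) throws IOException {
                    // Convert address to unresolved
                    InetSocketAddress unresolvedRemote = InetSocketAddress
                        .createUnresolved(host.getHostName(), remoteAddress.getPort());
                    return super.connectSocket(connectTimeout, socket, host, unresolvedRemote, localAddress, context);
                }
            })
            .register("https", new SSLConnectionSocketFactory(SSLContexts.createSystemDefault()) {
                @Override
                public Socket createSocket(HttpContext context) throws IOException {
                    // Same as for plain HTTP: the raw socket goes through the SOCKS proxy
                    InetSocketAddress socksAddr = (InetSocketAddress) context.getAttribute("socks.address");
                    Proxy proxy = new Proxy(Proxy.Type.SOCKS, socksAddr);
                    return new Socket(proxy);
                }
            })
            .build();

        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager(reg);
        httpClient = HttpClients.custom()
            .setDefaultRequestConfig(config)
            .setConnectionManager(cm)
            .build();
        // Pass the proxy address to the socket factories through the request context
        InetSocketAddress socksAddr = new InetSocketAddress(proxy_host, proxy_port);
        context.setAttribute("socks.address", socksAddr);

    } else if (proxy_type == Proxy.Type.HTTP) {
        // HTTP proxies are supported natively through setProxy
        HttpHost proxy = new HttpHost(proxy_host, proxy_port, null);
        httpClient = HttpClientBuilder.create().setDefaultRequestConfig(config)
            .setProxy(proxy)
            .build();
    } else { // direct
        httpClient = HttpClientBuilder.create().setDefaultRequestConfig(config).build();
    }
    return httpClient;
}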
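
The connectSocket override above replaces the already resolved remote address with an unresolved one. As far as I know, the JDK's SOCKS support then passes the hostname on to the proxy instead of an IP address looked up locally, so name resolution happens on the proxy side. The small sketch below only shows what "unresolved" means for an InetSocketAddress; UnresolvedAddressDemo and toUnresolved are illustrative names.

```java
import java.net.InetSocketAddress;

final class UnresolvedAddressDemo {
    /** Builds an address from hostname and port without triggering a DNS lookup. */
    static InetSocketAddress toUnresolved(String hostName, int port) {
        return InetSocketAddress.createUnresolved(hostName, port);
    }

    public static void main(String[] args) {
        InetSocketAddress addr = toUnresolved("example.com", 80);
        System.out.println(addr.isUnresolved());  // true: no lookup was attempted
        System.out.println(addr.getAddress());    // null: there is no IP address yet
        System.out.println(addr.getHostString()); // example.com
    }
}
```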

The original code becomes:

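// Additional imports needed by this excerpt:
import java.net.URI;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.util.EntityUtils;

public void doGet(String url) {
    logger.info(url);
    try {
        // Reset the state left over from the previous request
        statusCode = 0;
        errMsg = null;
        // Create the context first: buildHttpClient() stores the SOCKS address in it
        context = HttpClientContext.create();
        HttpGet httpGet = new HttpGet();
        httpGet.setURI(new URI(url));
        HttpClient httpClient = buildHttpClient();
        // The context must be passed along so the socket factories can read it
        HttpResponse response = httpClient.execute(httpGet, context);
        statusCode = response.getStatusLine().getStatusCode();
        HttpEntity entity = response.getEntity();
        if (entity == null) {
            errMsg = "Entity is null";
            return;
        }
        byte[] bytes = EntityUtils.toByteArray(entity);
        // Note: this decodes the body with the platform default charset
        html = new String(bytes);
    } catch (Exception ex) {
        errMsg = ex.getMessage();
    }
}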

The full code is in httpclient/proxy.

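Note this line. The context has to be passed to execute(), because the custom socket factories read the SOCKS proxy address from it via context.getAttribute("socks.address"):

HttpResponse response = httpClient.execute(httpGet, context);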

References

"java httpclient使用socks5代理(二)使用socks5代理服务" (Using a SOCKS5 proxy with Java HttpClient, part 2: using a SOCKS5 proxy service)