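How could a distributed crawler system go without proxies? With the system as a whole largely built, it is time to look at add-on features like this one.
Three types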
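Looking at the parameters, there are three proxy modes in total: Proxy.Type.HTTP, Proxy.Type.SOCKS, and Proxy.Type.DIRECT. Free proxy listing sites generally offer HTTP proxies, and you could scrape a few from them, but their availability is painful to look at; for anything reliable you have to pay or host your own. The proxies I have on hand are all SOCKS-based, so SOCKS is what I will use.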
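As a small sketch of how the three modes map onto `java.net.Proxy` objects: the class name `ProxyTypes` and the addresses in `main` are placeholders chosen for illustration, not part of the original project.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;

class ProxyTypes {
    // HTTP proxy descriptor; the address is kept unresolved, so no DNS lookup happens here.
    static Proxy http(String host, int port) {
        return new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(host, port));
    }

    // SOCKS proxy descriptor.
    static Proxy socks(String host, int port) {
        return new Proxy(Proxy.Type.SOCKS, InetSocketAddress.createUnresolved(host, port));
    }

    // DIRECT means "no proxy". The Proxy constructor rejects Type.DIRECT,
    // so the predefined Proxy.NO_PROXY constant is returned instead.
    static Proxy direct() {
        return Proxy.NO_PROXY;
    }

    public static void main(String[] args) {
        // Placeholder addresses, for illustration only.
        System.out.println(http("127.0.0.1", 8080).type());
        System.out.println(socks("127.0.0.1", 1080).type());
        System.out.println(direct().type());
    }
}
```

Note that `Proxy.Type.DIRECT` is not really a proxy at all: it describes a plain direct connection, which is why only HTTP and SOCKS take an address.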
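Two approaches
There are two ways to implement proxying. One is to pick a random available proxy and set it as the HttpClient proxy. The other is to build a proxy gateway: the code uses one fixed address, and the gateway itself performs the download.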
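The first approach, picking a random proxy, could look roughly like the sketch below. It assumes the list of available proxies is obtained elsewhere; `RandomProxyPicker` is a hypothetical helper name, not something from the original code.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class RandomProxyPicker {
    private final List<InetSocketAddress> proxies;
    private final Random random;

    // proxies: addresses already known to be usable (how they are found is out of scope here).
    RandomProxyPicker(List<InetSocketAddress> proxies, Random random) {
        if (proxies.isEmpty()) {
            throw new IllegalArgumentException("no proxies available");
        }
        this.proxies = new ArrayList<>(proxies); // defensive copy
        this.random = random;
    }

    // Returns one of the configured proxies, chosen uniformly at random.
    InetSocketAddress pick() {
        return proxies.get(random.nextInt(proxies.size()));
    }
}
```

Each request (or each client) would call `pick()` and use the returned address as its proxy. The cost of this approach is that something still has to maintain the list and drop dead proxies from it.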
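Chosen approach
Both approaches above require running an extra service, which means a fair amount of work. The program happens to run in a Docker Swarm environment, so its load-balancing mechanism can be put to use: create a multi-replica service named ss-proxy in which each instance picks a random available proxy (all of the same type), checks it periodically, and exits when the proxy becomes unavailable. The code simply uses the address ss-proxy, and when that name is resolved the docker network dispatches the connection to one specific proxy server.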
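The "check periodically, exit when unavailable" behaviour of each replica can be sketched as follows. This is only an illustration under assumptions: `ProxyHealthCheck`, the 3-second connect timeout, and the 30-second default interval are made up here, and a successful TCP connect only shows that the port is open, not that the proxy actually forwards traffic.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

class ProxyHealthCheck {
    // Returns true if a TCP connection to host:port can be opened within timeoutMillis.
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    // Usage: java ProxyHealthCheck <host> <port> [intervalSeconds]
    public static void main(String[] args) throws InterruptedException {
        if (args.length < 2) {
            System.err.println("usage: ProxyHealthCheck <host> <port> [intervalSeconds]");
            return;
        }
        String host = args[0];
        int port = Integer.parseInt(args[1]);
        long intervalMillis = (args.length > 2 ? Long.parseLong(args[2]) : 30L) * 1000L;

        // Keep polling while the upstream proxy answers.
        while (isReachable(host, port, 3000)) {
            Thread.sleep(intervalMillis);
        }
        // Exit with a non-zero status so the orchestrator can replace this replica
        // (assuming the service's restart policy restarts exited tasks).
        System.exit(1);
    }
}
```

The design relies on the orchestrator rather than on application code: a replica bound to a dead proxy removes itself, and its replacement picks a new proxy at startup.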
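Example
HttpClient has no built-in support for SOCKS proxies, so the existing code needs fairly substantial changes.
First, extract the creation of the HttpClient object: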
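    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.net.Proxy;
    import java.net.Socket;
    import org.apache.http.HttpHost;
    import org.apache.http.client.HttpClient;
    import org.apache.http.client.config.RequestConfig;
    import org.apache.http.config.Registry;
    import org.apache.http.config.RegistryBuilder;
    import org.apache.http.conn.socket.ConnectionSocketFactory;
    import org.apache.http.conn.socket.PlainConnectionSocketFactory;
    import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
    import org.apache.http.impl.client.HttpClientBuilder;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
    import org.apache.http.protocol.HttpContext;
    import org.apache.http.ssl.SSLContexts; // HttpClient 4.4+; in 4.3 it is org.apache.http.conn.ssl.SSLContexts

    // proxy_type, proxy_host, proxy_port and context are declared elsewhere in the class.
    private HttpClient buildHttpClient() {
        HttpClient httpClient;
        RequestConfig config = RequestConfig.custom().build();
        if (proxy_type == Proxy.Type.SOCKS) {
            // Custom socket factories: every socket is created through the SOCKS proxy
            // whose address is stored in the request context under "socks.address".
            Registry<ConnectionSocketFactory> reg = RegistryBuilder.<ConnectionSocketFactory>create()
                .register("http", new PlainConnectionSocketFactory() {
                    @Override
                    public Socket createSocket(HttpContext context) throws IOException {
                        InetSocketAddress socksAddr = (InetSocketAddress) context.getAttribute("socks.address");
                        Proxy proxy = new Proxy(Proxy.Type.SOCKS, socksAddr);
                        return new Socket(proxy);
                    }

                    @Override
                    public Socket connectSocket(int connectTimeout, Socket socket, HttpHost host,
                            InetSocketAddress remoteAddress, InetSocketAddress localAddress,
                            HttpContext context) throws IOException {
                        // Connect with the unresolved host name instead of the resolved address.
                        InetSocketAddress unresolvedRemote = InetSocketAddress
                            .createUnresolved(host.getHostName(), remoteAddress.getPort());
                        return super.connectSocket(connectTimeout, socket, host, unresolvedRemote, localAddress, context);
                    }
                })
                .register("https", new SSLConnectionSocketFactory(SSLContexts.createSystemDefault()) {
                    @Override
                    public Socket createSocket(HttpContext context) throws IOException {
                        InetSocketAddress socksAddr = (InetSocketAddress) context.getAttribute("socks.address");
                        Proxy proxy = new Proxy(Proxy.Type.SOCKS, socksAddr);
                        return new Socket(proxy);
                    }
                })
                .build();

            PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager(reg);
            httpClient = HttpClients.custom()
                .setDefaultRequestConfig(config)
                .setConnectionManager(cm)
                .build();
            // Publish the proxy address in the context so the factories above can read it.
            InetSocketAddress socksAddr = new InetSocketAddress(proxy_host, proxy_port);
            context.setAttribute("socks.address", socksAddr);
        } else if (proxy_type == Proxy.Type.HTTP) {
            // HTTP proxies are supported directly through setProxy (null scheme defaults to http).
            HttpHost proxy = new HttpHost(proxy_host, proxy_port, null);
            httpClient = HttpClientBuilder.create()
                .setDefaultRequestConfig(config)
                .setProxy(proxy)
                .build();
        } else {
            // DIRECT: no proxy at all.
            httpClient = HttpClientBuilder.create().setDefaultRequestConfig(config).build();
        }
        return httpClient;
    }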
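The `connectSocket` override in the "http" factory swaps the remote address for one built with `InetSocketAddress.createUnresolved`. The sketch below only illustrates what such an address looks like; `UnresolvedAddressDemo` and the host `example.com` are placeholders. When a `java.net.Socket` created with a SOCKS `Proxy` connects to an unresolved address, the JDK passes the host name to the proxy rather than a locally resolved IP.

```java
import java.net.InetSocketAddress;

class UnresolvedAddressDemo {
    // Wraps host and port without performing any DNS lookup.
    static InetSocketAddress unresolved(String host, int port) {
        return InetSocketAddress.createUnresolved(host, port);
    }

    public static void main(String[] args) {
        InetSocketAddress addr = unresolved("example.com", 80); // placeholder host
        System.out.println(addr.isUnresolved());   // true: no lookup was attempted
        System.out.println(addr.getHostString());  // example.com
        System.out.println(addr.getAddress());     // null: there is no InetAddress yet
    }
}
```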
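The original code becomes:

    import java.net.URI;
    import org.apache.http.HttpEntity;
    import org.apache.http.HttpResponse;
    import org.apache.http.client.HttpClient;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.client.protocol.HttpClientContext;
    import org.apache.http.util.EntityUtils;

    public void doGet(String url) {
        logger.info(url);
        try {
            statusCode = 0;
            errMsg = null;
            // Create the context first: buildHttpClient() stores the SOCKS address in it.
            context = HttpClientContext.create();
            HttpGet httpGet = new HttpGet();
            httpGet.setURI(new URI(url));
            HttpClient httpClient = buildHttpClient();
            HttpResponse response = httpClient.execute(httpGet, context);
            statusCode = response.getStatusLine().getStatusCode();
            HttpEntity entity = response.getEntity();
            if (entity == null) {
                errMsg = "Entity is null";
                return;
            }
            byte[] bytes = EntityUtils.toByteArray(entity);
            // Decodes with the platform default charset.
            html = new String(bytes);
        } catch (Exception ex) {
            errMsg = ex.getMessage();
        }
    }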
The full code is in httpclient/proxy.
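Pay attention to this line: the context has to be passed to execute, because the custom socket factories read the SOCKS address from it.

    HttpResponse response = httpClient.execute(httpGet, context);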
References
java httpclient使用socks5代理(二)使用socks5代理服务 (Using a SOCKS5 proxy with Java HttpClient, part 2: using a SOCKS5 proxy service)