How do I use JS with seimicrawler4j?

Posted: 2016-09-01 04:47

seimicrawler

Using SeimiAgent: logging in to JD and scraping it with browser-level rendering, controlled through JS
Time: 09:39:49
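This article exists purely to demonstrate some of SeimiAgent's capabilities to readers who are not yet familiar with it. The target site was picked arbitrarily, with no other purpose.
Introduction to SeimiAgent
SeimiAgent is a WebKit service built on QtWebkit that runs in the background on a server. You send a load request to its HTTP interface (the URL to load, plus parameters such as how long to allow for rendering or which proxy to use); SeimiAgent loads and renders the dynamic page and returns the rendered result to the caller for further processing. A running SeimiAgent service is therefore language-independent: any language or framework can use it through its standard HTTP interface. Its loading and rendering environment is that of a general-purpose browser, so its ability to handle dynamic pages is not a concern. It can also render page snapshots (PNG) and PDFs, and it supports custom JS scripts that operate on the page after the basic rendering. See the official usage documentation for details.
For a more intuitive look, first watch the video shared on Youku.
What follows is a detailed walkthrough with screenshots.
Starting SeimiAgent
In the SeimiAgent directory: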
./bin/seimiagent -p 8000
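Logging in to JD
Send a login request to SeimiAgent. For convenience the demo uses curl to talk to SeimiAgent directly. Because SeimiAgent accepts standard HTTP requests, it can be driven from any language.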
curl -X POST -H "Accept-Charset: UTF-8" -H "Cache-Control: no-cache" -H "Postman-Token: 017ba6d3-8b1a-872e-88eb-ea663ce16313" -H "Content-Type: application/x-www-form-urlencoded" -d 'url=/uc/login&renderTime=6000&script=$("#loginname").val("seimimaster");$("#nloginpwd").val("xxxxx");$(".login-btn>a").click();&contentType=img&useCookie=1' "http://localhost:8000/doload" -o login_jd.png
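This tells SeimiAgent to enable cookies, to run a custom JavaScript snippet that drives the page through the login, and to return the rendered result as an image (to make it easy to show here). The resulting page is shown below (because article space is limited the image is cropped dynamically when embedded; to see the full image, open it in a new window and remove the parameters after the ? in its link):
The page header shows that the login succeeded.
Examples of interacting with SeimiAgent from other languages
Java (OkHttp)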
OkHttpClient client = new OkHttpClient();
MediaType mediaType = MediaType.parse("application/x-www-form-urlencoded");
// Form body: target URL, render time in ms, the login script, image output, cookies enabled
RequestBody body = RequestBody.create(mediaType, "url=https%3A%2F%%2Fuc%2Flogin&renderTime=6000&script=%24(%22%23loginname%22).val(%22seimimaster%22)%3B%24(%22%23nloginpwd%22).val(%22seimi%22)%3B%24(%22.login-btn%3Ea%22).click()%3B&contentType=img&useCookie=1");
Request request = new Request.Builder()
    .url("http://localhost:8000/doload")
    .post(body)
    .addHeader("accept-charset", "UTF-8")
    .addHeader("cache-control", "no-cache")
    .addHeader("postman-token", "96caaa7b--cd15-af884aa19bb4")
    .addHeader("content-type", "application/x-www-form-urlencoded")
    .build();
Response response = client.newCall(request).execute();
Python (requests)
import requests

url = "http://localhost:8000/doload"
payload = "url=https%3A%2F%%2Fuc%2Flogin&renderTime=6000&script=%24(%22%23loginname%22).val(%22seimimaster%22)%3B%24(%22%23nloginpwd%22).val(%22seimi%22)%3B%24(%22.login-btn%3Ea%22).click()%3B&contentType=img&useCookie=1"
headers = {
    'accept-charset': "UTF-8",
    'cache-control': "no-cache",
    'postman-token': "d5-97d6-f24f-deaa",
    'content-type': "application/x-www-form-urlencoded"
}
response = requests.request("POST", url, data=payload, headers=headers)
Node.js (request)
var request = require("request");
var options = { method: 'POST',
  url: 'http://localhost:8000/doload',
  headers:
   { 'content-type': 'application/x-www-form-urlencoded',
     'postman-token': '6d1bc037-3f74-6a2c-d3da-426e2070bc5a',
     'cache-control': 'no-cache',
     'accept-charset': 'UTF-8' },
  form:
   { url: '/uc/login',
     renderTime: '6000',
     script: '$("#loginname").val("seimimaster");$("#nloginpwd").val("seimi");$(".login-btn>a").click();',
     contentType: 'img',
     useCookie: '1' } };
request(options, function (error, response, body) {
  if (error) throw new Error(error);
  // body is the image file stream; handle it as needed
});
Go
package main

import (
	"io/ioutil"
	"net/http"
	"strings"
)

func main() {
	url := "http://localhost:8000/doload"
	payload := strings.NewReader("url=https%3A%2F%%2Fuc%2Flogin&renderTime=6000&script=%24(%22%23loginname%22).val(%22seimimaster%22)%3B%24(%22%23nloginpwd%22).val(%22seimi%22)%3B%24(%22.login-btn%3Ea%22).click()%3B&contentType=img&useCookie=1")
	req, _ := http.NewRequest("POST", url, payload)
	req.Header.Add("accept-charset", "UTF-8")
	req.Header.Add("cache-control", "no-cache")
	req.Header.Add("postman-token", "dd2d6df6-15a3-29b2-de490")
	req.Header.Add("content-type", "application/x-www-form-urlencoded")
	res, _ := http.DefaultClient.Do(req)
	defer res.Body.Close()
	body, _ := ioutil.ReadAll(res.Body)
	// body holds the image bytes; save them like the curl example does
	ioutil.WriteFile("login_jd.png", body, 0644)
}
C# (RestSharp)
var client = new RestClient("http://localhost:8000/doload");
var request = new RestRequest(Method.POST);
request.AddHeader("content-type", "application/x-www-form-urlencoded");
request.AddHeader("postman-token", "614dc816-370b-ac55-097e-e581ddac601c");
request.AddHeader("cache-control", "no-cache");
request.AddHeader("accept-charset", "UTF-8");
request.AddParameter("application/x-www-form-urlencoded", "url=https%3A%2F%%2Fuc%2Flogin&renderTime=6000&script=%24(%22%23loginname%22).val(%22seimimaster%22)%3B%24(%22%23nloginpwd%22).val(%22seimi%22)%3B%24(%22.login-btn%3Ea%22).click()%3B&contentType=img&useCookie=1", ParameterType.RequestBody);
IRestResponse response = client.Execute(request);
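The percent-encoded payload strings in the examples above are hard to read and easy to get wrong by hand. As a minimal sketch, the Python below builds the same kind of form body from readable values using only the standard library. The parameter names (url, renderTime, script, contentType, useCookie) come from the curl example; the helper name build_doload_body and the https://example.com host are placeholders invented for this illustration and are not part of SeimiAgent.

```python
from urllib.parse import urlencode


def build_doload_body(url, script=None, render_time_ms=6000,
                      content_type="img", use_cookie=True):
    """Return an application/x-www-form-urlencoded body for /doload.

    Hypothetical helper: only the parameter names are taken from the
    article's curl example.
    """
    params = {"url": url, "renderTime": str(render_time_ms)}
    if script is not None:
        params["script"] = script
    if content_type:
        params["contentType"] = content_type
    if use_cookie:
        params["useCookie"] = "1"
    # urlencode percent-escapes characters such as $, #, ", ; and >
    return urlencode(params)


# The login script from the article, written in readable form.
LOGIN_SCRIPT = (
    '$("#loginname").val("seimimaster");'
    '$("#nloginpwd").val("xxxxx");'
    '$(".login-btn>a").click();'
)

# example.com is a placeholder host, not the site used in the article.
body = build_doload_body("https://example.com/uc/login", script=LOGIN_SCRIPT)
```

The resulting string can be sent as the POST body to http://localhost:8000/doload with Content-Type: application/x-www-form-urlencoded, as in the examples above. urlencode also escapes parentheses, which the hand-written payloads leave as-is; both forms decode to the same script.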
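Visiting the personal information page
After logging in, go on to visit the personal information page to verify that the cookie carries over.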
curl -X POST -H "Accept-Charset: UTF-8" -H "Cache-Control: no-cache" -H "Postman-Token: 6a6c9ae9-1b18-7c02-d1fb-49" -H "Content-Type: application/x-www-form-urlencoded" -d 'url=/&renderTime=3000&contentType=img&useCookie=1' "http://localhost:8000/doload" -o profile_jd.png
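The result is shown below (because article space is limited the image is cropped dynamically when embedded; to see the full image, open it in a new window and remove the parameters after the ? in its link):
As you can see, the cookie remains valid across requests.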
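Both curl commands write the response straight to a .png file, so a failed render would silently produce a broken image. A small check, sketched below in Python, can confirm that the bytes start with the PNG signature before saving them. It assumes SeimiAgent returns raw PNG bytes when contentType=img is requested, which the article's output files suggest but which is not verified here; looks_like_png and save_snapshot are hypothetical helper names.

```python
# The fixed 8-byte signature that every PNG file starts with.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def looks_like_png(data: bytes) -> bool:
    """Return True if the bytes begin with the PNG file signature."""
    return data.startswith(PNG_SIGNATURE)


def save_snapshot(data: bytes, path: str) -> None:
    """Write the response body to disk, refusing anything that is not a PNG."""
    if not looks_like_png(data):
        raise ValueError("response body is not a PNG image")
    with open(path, "wb") as fh:
        fh.write(data)
```

A caller would pass the raw response body and a target path such as profile_jd.png; an HTML error page or an empty body raises ValueError instead of being saved under an image name.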
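As shown above, a single, very simple JavaScript snippet was enough for SeimiAgent to log in to a system as complex as JD and then render the complex dynamic pages behind the login.
So, you know what to do:
head over to GitHub and star the project as much as you like.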
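Preface
There was a time when extracting information from dynamic pages (AJAX, secondary rendering by in-page JS, and so on) was a constant sore point for crawler developers; put simply, there was no suitable tool. In Java especially, a tool like htmlunit counted as a wonder weapon for parsing dynamic pages, yet it is still incomplete, falls short of browser-level rendering, and fails on pages that are even slightly complex. After going through all that pain, the author decided to build a dynamic page rendering server aimed specifically at scraping, monitoring, and testing scenarios. Browser-level results require building on a browser engine; fortunately there is the open-source WebKit, and more fortunately still there is QtWebkit, which is friendlier to developers. That is how SeimiAgent was born.
Introduction to SeimiAgent
SeimiAgent is a WebKit service built on QtWebkit that runs in the background on a server. You send a load request to its HTTP interface (the URL to load, plus parameters such as how long to allow for rendering or which proxy to use); SeimiAgent loads and renders the dynamic page and returns the rendered result to the caller for further processing. A running SeimiAgent service is therefore language-independent: any language or framework can use it through its standard HTTP interface. Its loading and rendering environment is that of a general-purpose browser, so its ability to handle dynamic pages is not a concern. At present SeimiAgent only supports returning the rendered HTML document; image snapshots and PDF output will be added later to serve more varied needs.
Usage demo (animated image)
Introduction to SeimiCrawler
SeimiCrawler is an agile, independently deployable Java crawler framework with distributed support. It aims to lower, as far as possible, the barrier for newcomers to build a crawler system that is highly usable and performs reasonably well, and to make crawler development more efficient. In the world of SeimiCrawler, most people only need to write the scraping business logic; Seimi takes care of the rest. Its design is heavily inspired by Scrapy, the Python crawler framework, while incorporating the characteristics of the Java language and of Spring. It also hopes to make the more efficient XPath a convenient and common way to parse HTML in China, so SeimiCrawler's default HTML parser is JsoupXpath (an independent extension project, not bundled with jsoup), and by default all HTML data extraction is done with XPath (other parsers can of course be chosen for data processing).
Integration
Deploying SeimiAgent
Downloading and unpacking are not covered here (the animated demo above also shows them); the download address is listed on the project page. Go to SeimiAgent's bin directory and run:
./SeimiAgent -p 8000
This starts the SeimiAgent service listening on port 8000. From here you can in fact use any language to send a page-load request over HTTP and get back the rendered result. What this article covers, though, is how SeimiCrawler integrates with SeimiAgent.
Configuring SeimiCrawler
SeimiCrawler has built-in support for SeimiAgent as of v0.3.0. The developer only needs to configure SeimiAgent's address and port, then choose, when building a specific Request, whether to submit it to SeimiAgent and how. Here is a complete example, explained in its comments:
package cn.wanghaomiao.crawlers;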
import cn.wanghaomiao.seimi.annotation.Crawler;
import cn.wanghaomiao.seimi.def.BaseSeimiCrawler;
import cn.wanghaomiao.seimi.struct.Request;
import cn.wanghaomiao.seimi.struct.Response;
import cn.wanghaomiao.xpath.model.JXDocument;
import org.apache.commons.lang3.StringUtils;
import org.springframework.beans.factory.annotation.Value;

/**
 * This example shows how to use SeimiAgent to scrape information from complex dynamic pages.
 * @author 汪浩淼
 */
@Crawler(name = "seimiagent")
public class SeimiAgentDemo extends BaseSeimiCrawler {
    /**
     * Configured in resource/config/seimi.properties so it is easy to change; you can of course
     * use your own central configuration service instead. This is the address of the SeimiAgent service.
     */
    @Value("${seimiAgentHost}")
    private String seimiAgentHost;

    @Value("${seimiAgentPort}")
    private int seimiAgentPort;

    @Override
    public String[] startUrls() {
        return new String[]{""};
    }

    @Override
    public String seimiAgentHost() {
        return this.seimiAgentHost;
    }

    @Override
    public int seimiAgentPort() {
        return this.seimiAgentPort;
    }

    @Override
    public void start(Response response) {
        // Hand this request to SeimiAgent and allow 5000 ms for rendering
        Request seimiAgentReq = Request.build("", "getTotalTransactions")
                .useSeimiAgent()
                .setSeimiAgentRenderTime(5000);
        push(seimiAgentReq);
    }

    /**
     * Gets the total transaction amount from the Souyidai home page.
     * @param response
     */
    public void getTotalTransactions(Response response) {
        try {
            JXDocument doc = response.document();
            String trans = StringUtils.join(doc.sel("//div[@class='homepage-amount']/div[@class='number font-arial']/div/span/text()"), "");
            logger.info("Final Res:{}", trans);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Configuration file seimi.properties
seimiAgentHost=127.0.0.1
seimiAgentPort=8000
Starting
public class Boot {
    public static void main(String[] args) {
        Seimi s = new Seimi();
        s.start("seimiagent");
    }
}
Once SeimiCrawler starts, you can see the Souyidai total transaction amount you were after. The complete demo lives in the SeimiCrawler project.
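Seimi Basics Series 2: Storing data with SeimiCrawler and MyBatis
26 Jul 2016
Many people have recently been asking about integrating SeimiCrawler with MyBatis, so this article is offered as a starting point. Readers who do not know SeimiCrawler yet can also use it as a brief introduction.
Introduction to SeimiCrawler
SeimiCrawler is an agile, independently deployable Java crawler framework with distributed support. It aims to lower, as far as possible, the barrier for newcomers to build a crawler system that is highly usable and performs reasonably well, and to make crawler development more efficient. In the world of SeimiCrawler, most people only need to write the scraping business logic; Seimi takes care of the rest. Its design is inspired by Scrapy, the Python crawler framework, while incorporating the characteristics of the Java language and of Spring. It also hopes to make the more efficient XPath a convenient and common way to parse HTML in China, so SeimiCrawler's default HTML parser is JsoupXpath (an independent extension project, not bundled with jsoup), and by default all HTML data extraction is done with XPath (other parsers can of course be chosen for data processing). Combined with SeimiAgent, it handles the rendering and scraping of complex dynamic pages.
Now for the MyBatis integration itself, using MySQL as the database.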
Adding the dependencies
<dependency>
    <groupId>cn.wanghaomiao</groupId>
    <artifactId>SeimiCrawler</artifactId>
    <version>1.2.0</version>
</dependency>
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-dbcp2</artifactId>
    <version>2.1.1</version>
</dependency>
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-pool2</artifactId>
    <version>2.4.2</version>
</dependency>
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.37</version>
</dependency>
<dependency>
    <groupId>org.mybatis</groupId>
    <artifactId>mybatis-spring</artifactId>
    <version>1.3.0</version>
</dependency>
<dependency>
    <groupId>org.mybatis</groupId>
    <artifactId>mybatis</artifactId>
    <version>3.4.1</version>
</dependency>
Table structure
Assume a database named xiaohuo containing the following table:
CREATE TABLE `blog` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `title` varchar(300) DEFAULT NULL,
  `content` text,
  `update_time` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00' ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
The corresponding model object
package cn.wanghaomiao.model;

import cn.wanghaomiao.seimi.annotation.Xpath;
import org.apache.commons.lang3.StringUtils;
import org.apache.commons.lang3.builder.ToStringBuilder;

public class BlogContent {
    private Integer id;

    @Xpath("//h1[@class='postTitle']/a/text()|//a[@id='cb_post_title_url']/text()")
    private String title;

    // This could also be written as @Xpath("//div[@id='cnblogs_post_body']//text()")
    @Xpath("//div[@id='cnblogs_post_body']/allText()")
    private String content;

    public Integer getId() {
        return id;
    }

    public void setId(Integer id) {
        this.id = id;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getContent() {
        return content;
    }

    public void setContent(String content) {
        this.content = content;
    }

    @Override
    public String toString() {
        if (StringUtils.isNotBlank(content) && content.length() > 100) {
            // Truncate the content so the log output is easier to read
            this.content = StringUtils.substring(content, 0, 100) + "...";
        }
        return ToStringBuilder.reflectionToString(this);
    }
}
Configuration files for the MyBatis integration
Add mybatis-config.xml under resources
It holds some basic global settings:
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE configuration
        PUBLIC "-//mybatis.org//DTD Config 3.0//EN"
        "http://mybatis.org/dtd/mybatis-3-config.dtd">
<configuration>
    <settings>
        <setting name="mapUnderscoreToCamelCase" value="true"/>
    </settings>
</configuration>
Add seimi-mybatis.xml under resources
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:context="http://www.springframework.org/schema/context"
       xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context.xsd">
    <context:annotation-config />
    <bean id="mybatisDataSource" class="org.apache.commons.dbcp2.BasicDataSource">
        <property name="driverClassName" value="${database.driverClassName}"/>
        <property name="url" value="${database.url}"/>
        <property name="username" value="${database.username}"/>
        <property name="password" value="${database.password}"/>
    </bean>
    <bean id="sqlSessionFactory" class="org.mybatis.spring.SqlSessionFactoryBean" abstract="true">
        <property name="configLocation" value="classpath:mybatis-config.xml"/>
    </bean>
    <bean id="seimiSqlSessionFactory" parent="sqlSessionFactory">
        <property name="dataSource" ref="mybatisDataSource"/>
    </bean>
    <bean class="org.mybatis.spring.mapper.MapperScannerConfigurer">
        <property name="basePackage" value="cn.wanghaomiao.dao.mybatis"/>
        <property name="sqlSessionFactoryBeanName" value="seimiSqlSessionFactory"/>
    </bean>
</beans>
The ${database.driverClassName} placeholders appear because the SeimiCrawler demo project also has dynamic configuration set up; you can hard-code the values here instead and skip reading any other configuration.
Add the DAO under the cn.wanghaomiao.dao.mybatis directory
package cn.wanghaomiao.dao.mybatis;

import cn.wanghaomiao.model.BlogContent;
import org.apache.ibatis.annotations.Insert;
import org.apache.ibatis.annotations.Options;
import org.apache.ibatis.annotations.Param;

public interface MybatisStoreDAO {
    // Insert one blog row; the generated primary key is written back to blog.id
    @Insert("insert into blog (title,content,update_time) values (#{blog.title},#{blog.content},now())")
    @Options(useGeneratedKeys = true, keyProperty = "blog.id")
    int save(@Param("blog") BlogContent blog);
}
At this point the MyBatis side is ready.
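To make the DAO's behaviour concrete: it inserts a row and, through useGeneratedKeys, writes the auto-generated primary key back into the object. The sketch below illustrates the same insert-then-read-key pattern in Python, with an in-memory SQLite database standing in for MySQL. It is not MyBatis, the table is a simplified copy of the blog table, and the function names make_demo_db and save_blog are made up for this illustration.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory SQLite database with a simplified blog table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE blog ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "title TEXT, content TEXT, update_time TEXT)"
    )
    return conn


def save_blog(conn: sqlite3.Connection, title: str, content: str) -> int:
    """Insert one row and return its auto-generated primary key."""
    cur = conn.execute(
        "INSERT INTO blog (title, content, update_time) "
        "VALUES (?, ?, CURRENT_TIMESTAMP)",
        (title, content),
    )
    conn.commit()
    # lastrowid plays the role of the key that MyBatis writes back to blog.id
    return cur.lastrowid
```

Calling save_blog twice on a fresh database returns two increasing ids, which mirrors how the crawler can read blog.getId() right after save() returns.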
Writing the crawler
package cn.wanghaomiao.crawlers;

import cn.wanghaomiao.dao.mybatis.MybatisStoreDAO;
import cn.wanghaomiao.model.BlogContent;
import cn.wanghaomiao.seimi.annotation.Crawler;
import cn.wanghaomiao.seimi.def.BaseSeimiCrawler;
import cn.wanghaomiao.seimi.struct.Request;
import cn.wanghaomiao.seimi.struct.Response;
import cn.wanghaomiao.xpath.model.JXDocument;
import org.springframework.beans.factory.annotation.Autowired;

import java.util.List;

/**
 * Stores the parsed data directly in the database, using MyBatis.
 * @author 汪浩淼
 */
@Crawler(name = "mybatis")
public class DatabaseMybatisDemo extends BaseSeimiCrawler {
    @Autowired
    private MybatisStoreDAO storeToDbDAO;

    @Override
    public String[] startUrls() {
        return new String[]{"/"};
    }

    @Override
    public void start(Response response) {
        JXDocument doc = response.document();
        try {
            // Collect the links to the individual posts and queue each one
            List<Object> urls = doc.sel("//a[@class='titlelnk']/@href");
            logger.info("{}", urls.size());
            for (Object s : urls) {
                push(Request.build(s.toString(), "renderBean"));
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public void renderBean(Response response) {
        try {
            // Map the page onto the model using its @Xpath annotations
            BlogContent blog = response.render(BlogContent.class);
            logger.info("bean resolve res={},url={}", blog, response.getUrl());
            // Store the record in the DB through the MyBatis DAO
            int changeNum = storeToDbDAO.save(blog);
            int blogId = blog.getId();
            logger.info("store success,blogId = {},changeNum={}", blogId, changeNum);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Next, start it up:
public class Boot {
    public static void main(String[] args) {
        Seimi s = new Seimi();
        s.start("mybatis");
    }
}
The following log output appears:
00:25:18 INFO c.w.crawlers.DatabaseMybatisDemo - store success,blogId = 257,changeNum=1
00:25:18 INFO c.w.crawlers.DatabaseMybatisDemo - bean resolve res=cn.wanghaomiao.model.BlogContent@3edc08c3[id=<null>,title=CoordinatorLayout自定义Bahavior特效及其源码分析CoordinatorLayout自定义Bahavior特效及其源码分析,content=@[CoordinatorLayout, Bahavior] CoordinatorLayout是android support design包中可以算是最重要的一个东西，运用它可以做出一些不错的特效...],url=/soaringEveryday/p/5711545.html
00:25:18 INFO c.w.crawlers.DatabaseMybatisDemo - store success,blogId = 258,changeNum=1
00:25:18 INFO c.w.crawlers.DatabaseMybatisDemo - store success,blogId = 259,changeNum=1
00:25:18 INFO c.w.crawlers.DatabaseMybatisDemo - store success,blogId = 260,changeNum=1
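Integration complete!
For packaging, deploying, and starting a project in a production environment, the maven-seimicrawler-plugin packaging plugin is recommended; see the project documentation for details.
The complete demo project lives in the SeimiCrawler project.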
