What does "query predicate pushdown via HTTP/2.0 server push side filters" mean?

[ORC-40] C++ Reader does not support predicate pushdown - ASF JIRA
Type: New Feature
Resolution: Unresolved
Description: The Java reader can push down predicates to filter the data being read. The C++ reader doesn't do that yet.
Assignee: Unassigned
Reporter: Owen O'Malley
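For context on what the Java reader does here: pushdown is driven by a SearchArgument handed to the row reader, which lets whole stripes and row groups be skipped using their column statistics. Below is a rough Scala sketch against the org.apache.orc API; the file path and column name are hypothetical, and the builder signatures differ somewhat between ORC/Hive versions, so treat it as illustrative rather than code from the issue.

// Rough sketch only: assumes the org.apache.orc reader API plus the
// hive-storage-api SearchArgument classes; path and column are hypothetical.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hive.ql.io.sarg.{PredicateLeaf, SearchArgumentFactory}
import org.apache.orc.OrcFile

object OrcPushdownSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    val reader = OrcFile.createReader(new Path("/tmp/t.orc"), OrcFile.readerOptions(conf))

    // Predicate c1 < 10: the reader can skip stripes/row groups whose
    // column statistics show no value below 10.
    val sarg = SearchArgumentFactory.newBuilder()
      .startAnd()
      .lessThan("c1", PredicateLeaf.Type.LONG, java.lang.Long.valueOf(10L))
      .end()
      .build()

    val rows = reader.rows(reader.options().searchArgument(sarg, Array("c1")))
    // ... iterate over the (already filtered) batches here ...
    rows.close()
  }
}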
Hive Optimization Techniques
The main Hive optimization techniques:
Column Pruning
- As the name suggests, discard columns which are not needed:
  select a, b from t where e < 10;
  t contains 5 columns (a, b, c, d, e); columns c and d are discarded.
- Select only the relevant columns.
- Enabled by default:
  hive.optimize.cp=true
Predicate Pushdown
- Push predicates closer to the table scan.
- Enabled by default:
  hive.optimize.ppd=true
- Predicates are pushed across joins, so these two queries are evaluated the same way:
  select * from t1 join t2 on (t1.c1 = t2.c2 and t1.c1 < 10)
  select * from t1 join t2 on (t1.c1 = t2.c2) where t1.c1 < 10
- Special rules apply for outer joins:
  Left outer join: predicates on the left-side aliases are pushed.
  Right outer join: predicates on the right-side aliases are pushed.
  Full outer join: none of the predicates are pushed.
- Non-deterministic functions (e.g. rand()) are not pushed. Mark such a UDF with the annotation below (a short UDF sketch follows this list):
  @UDFType(deterministic=false)
  The entire expression containing a non-deterministic function is not pushed, e.g.:
  c1 < 10 and c2 > rand()
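A minimal sketch of such an annotated UDF in Scala, using Hive's Java UDF API; the class name and logic are invented for illustration.

import org.apache.hadoop.hive.ql.exec.UDF
import org.apache.hadoop.hive.ql.udf.UDFType

// Marking the UDF as non-deterministic tells the optimizer that any predicate
// containing it must not be pushed past the join or down to the table scan.
@UDFType(deterministic = false)
class RandomBucket extends UDF {
  private val rng = new java.util.Random()
  // Hive calls evaluate() once per row; the result changes on every call.
  def evaluate(buckets: Int): Int = rng.nextInt(math.max(buckets, 1))
}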
Partition Pruning
- Reduce the list of partitions to be scanned.
- Currently works on the parse tree, with some known bugs around partition predicates on subquery aliases, as in:
  select * from
  (select c1, count(1) from t group by c1) subq
  where subq.prtn = 100;
  select * from t1 join
  (select * from t2) subq on (t1.c1 = subq.c2)
  where subq.prtn = 100;
  Placing the partition predicate directly inside the subquery on the partitioned table ensures pruning (see the EXPLAIN EXTENDED sketch below).
- hive.mapred.mode=nonstrict
  In strict mode, a scan of a complete partitioned table (i.e. without a partition predicate) fails.
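One way to double-check whether pruning kicked in is to look at the plan: EXPLAIN EXTENDED lists the concrete partition paths that will be scanned. A rough sketch over Hive JDBC; the HiveServer2 URL, table, and column names are assumptions for illustration.

import java.sql.DriverManager

object PartitionPruningCheck {
  def main(args: Array[String]): Unit = {
    // Hypothetical HiveServer2 endpoint; requires the hive-jdbc driver on the classpath.
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default")
    val stmt = conn.createStatement()
    // Predicate placed directly on the partition column `prtn` of the base table,
    // so the planner can prune the scan to a single partition.
    val rs = stmt.executeQuery(
      "EXPLAIN EXTENDED SELECT c1, count(1) FROM t WHERE prtn = 100 GROUP BY c1")
    while (rs.next()) println(rs.getString(1))
    conn.close()
  }
}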
Hive QL - Join
INSERT OVERWRITE TABLE pv_users
SELECT pv.pageid,u.age
FROM page_view pv
JOIN user u
ON (pv.userid=u.userid);
The rightmost table is streamed, whereas the inner tables' rows for a given join key are buffered in memory, so use the largest table as the rightmost table.
hive.mapred.mode=nonstrict
In strict mode, a Cartesian product is not allowed.
INSERT OVERWRITE TABLE pv_users
SELECT pv.pageid,u.age
FROM page_view pv JOIN user u
ON (pv.userid = u.userid)
JOIN newuser x ON (u.userid = x.userid);
Same join key: merged into 1 map-reduce job; this holds for any number of tables with the same join key.
1 map-reduce job instead of 'n'
The merging happens for OUTER joins also
INSERT OVERWRITE TABLE pv_users
SELECT pv.pageid,u.age
FROM page_view pv JOIN user u
ON (pv.userid = u.userid)
JOIN newuser x ON (u.age = x.age);
Different join keys - 2 map-reduce jobs
INSERT OVERWRITE TABLE tmptable SELECT *
FROM page_view pv JOIN user u
ON (pv.userid = u.userid);
INSERT OVERWRITE TABLE pv_users
SELECT x.pageid,x.age
FROM tmptable x JOIN newuser y ON (x.age = y.age);
Join Optimizations
- User-specified small tables are stored in hash tables on the mapper, backed by JDBM.
- No reducer needed.
INSERT INTO TABLE pv_users
SELECT /*+ MAPJOIN(pv) */ pv.pageid, u.age
FROM page_view pv JOIN user u
ON (pv.userid = u.userid);
Optimization phase:
- n-way map-join if (n-1) tables are map-side readable.
- The mapper reads all (n-1) tables before processing the main table under consideration.
- Map-side readable tables are cached in memory and backed by JDBM persistent hash tables.
Parameters:
- hive.join.emit.interval=1000
- hive.mapjoin.size.key=10000
- hive.mapjoin.cache.numrows=10000
Possible further improvements:
- Sizes/statistics to determine join order
- Sizes to enforce map-join
- Better techniques for handling skew for a given key
- Using sorted properties of the table
- Fragmented joins
- n-way joins for different join keys by replicating data
Hive QL - Group By
SELECT pageid, age, count(1)
FROM pv_users
GROUP BY pageid, age;
Group by Optimizations
- Map-side partial aggregations:
  Hash-based aggregates.
  Serialized keys/values in hash tables.
  90% speed improvement on the query: select count(1)
- Load balancing for data skew.
Parameters (set per session; a JDBC sketch follows below):
- hive.map.aggr=true
- hive.groupby.skewindata=false
- hive.groupby.mapaggr.checkinterval=100000
- hive.map.aggr.hash.percentmemory=0.5
- hive.map.aggr.hash.min.reduction=0.5
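A minimal sketch of applying these parameters for one session over Hive JDBC (same pattern as the partition-pruning sketch above; the server URL and pv_users table are assumptions). Skew handling is switched on here only to show where the knob lives.

import java.sql.DriverManager

object GroupBySettingsSketch {
  def main(args: Array[String]): Unit = {
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default")
    val stmt = conn.createStatement()
    stmt.execute("SET hive.map.aggr=true")           // map-side partial aggregation
    stmt.execute("SET hive.groupby.skewindata=true") // extra MR stage to spread a skewed key
    val rs = stmt.executeQuery(
      "SELECT pageid, age, count(1) FROM pv_users GROUP BY pageid, age")
    while (rs.next()) println(s"${rs.getString(1)}\t${rs.getString(2)}\t${rs.getLong(3)}")
    conn.close()
  }
}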
Multi GroupBy
FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
SELECT gender, count(DISTINCT userid),count(userid)
GROUP BY gender
INSERT OVERWRITE TABLE pv_age_sum
SELECT age, count(DISTINCT userid)
GROUP BY age
- n+1 map-reduce jobs instead of 2n.
- Single scan of the input table.
- Requires the same distinct key across all group-bys.
- Always use multi-group-by.
Merging of small files
- Lots of small files create problems for downstream jobs; for example, a map-only filter query like the one below can leave one small file per mapper:
  SELECT * FROM T WHERE x < 10;
- hive.merge.mapfiles=true
- hive.merge.mapredfiles=false
- hive.merge.size.per.task=256*1000*1000
- Merging increases the running time of the current query.
scala - Does Spark optimize DataFrame sample() function to do predicate pushdown - Stack Overflow
My question is whether Spark's DataFrame.sample() function makes use of predicate pushdown, i.e. samples records before deserializing them.
If it made such an optimization, the Parquet reader would sample records first and deserialize only (for example) 10% of the records when fraction=0.1.
val df = sqlContext.read.load("datastore.parquet")
// the next one would be optimized to use predicate/filter pushdown - and will be rather fast on parquet dataset
df.filter($"col1"===1).count
== Physical Plan ==
TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[count#158L])
TungstenExchange SinglePartition
TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[currentCount#161L])
Filter (event_type#0 = clk)
Scan ParquetRelation[hdfs://...PARQUET][event_type#0]
// the question here - is sample implementation optimized?
df.sample(false, 0.1).count
== Physical Plan ==
TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[count#153L])
TungstenExchange SinglePartition
TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[currentCount#156L])
Sample 0.0, 0.1, false, -4505344
Scan ParquetRelation[hdfs://...PARQUET][]
// combined operation
df.filter($"col1"===1).sample(false, 0.1).count
== Physical Plan ==
TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[count#163L])
TungstenExchange SinglePartition
TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[currentCount#166L])
Sample 0.0, 0.1, false, 3692713
Filter (event_type#0 = clk)
Scan ParquetRelation[hdfs://...PARQUET][event_type#0]
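Reading the plans above: in both sampled queries the Sample operator sits on top of Scan ParquetRelation, and the sample-only scan requests no columns or filters, which suggests sample() itself is not pushed into the Parquet reader; only filter() is. A small sketch of comparing the two cases via explain() on a newer Spark (2.x+) build, where pushed predicates appear as PushedFilters in the scan node; the dataset path and column name are placeholders, and the exact plan text varies by Spark version.

import org.apache.spark.sql.SparkSession

object SamplePushdownCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sample-pushdown-check").getOrCreate()
    import spark.implicits._

    // Hypothetical dataset path and column name.
    val df = spark.read.parquet("hdfs:///data/datastore.parquet")

    // Filter: shows up inside the scan node as PushedFilters in recent Spark versions.
    df.filter($"col1" === 1).explain()

    // Sample: appears as its own Sample operator above the scan, i.e. rows are
    // still read and deserialized before being sampled.
    df.sample(withReplacement = false, fraction = 0.1).explain()

    spark.stop()
  }
}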
