简述
最近在一个群里看到join和in的争论,到底什么时候用in,什么时候用join,我做了实验来验证
预备知识
join type介绍
join type并非指join类型,而是查询类型 explain sql后的type字段正是join type Mysql中有以下join type,性能从高到底为: * system : 系统常量表 * const : 常量表 * eq_ref : 主键或者唯一索引 * ref : 非空索引 * ref_or_null : 可空索引 * range :索引列使用 BETWEEN, IN, >=, LIKE等范围查询 * index : 循序扫描索引列 * ALL : 顺序扫描整张表
详见:https://dev.mysql.com/doc/internals/en/optimizer-determining-join-type.html
in和or等效
在Mysql中,这两个语句是等效的,in查询会被转换为or
column1 IN (1,2,3)
column1 = 1 OR column1 = 2 OR column1 = 3
详见:https://dev.mysql.com/doc/internals/en/optimizer-range-join-type.html
Mysql的Join算法(NLJ)
Mysql的join使用的NLJ算法(内循环join算法) 假设三个表之间的连接 t1,t2以及 t3是使用以下类型的连接来执行:
Table Join Type
t1 range
t2 ref
t3 ALL
如果使用简单的NLJ算法,则会按如下方式处理连接:
for each row in t1 matching range {
for each row in t2 matching reference key {
for each row in t3 {
if row satisfies join conditions, send to client
}
}
}
Mysql使用Block Nested-Loop (BNL)进行处理,也就是join_buffer缓存联合数据,join_buffer_size 控制此项
for each row in t1 matching range {
for each row in t2 matching reference key {
store used columns from t1, t2 in join buffer
if buffer is full {
for each row in t3 {
for each t1, t2 combination in join buffer {
if row satisfies join conditions, send to client
}
}
empty join buffer
}
}
}
if buffer is not empty {
for each row in t3 {
for each t1, t2 combination in join buffer {
if row satisfies join conditions, send to client
}
}
}
join_buffer_size 详见:https://dev.mysql.com/doc/refman/8.0/en/nested-loop-joins.html
join优化: https://dev.mysql.com/doc/refman/8.0/en/nested-join-optimization.html
实测数据
使用Mysql5.7测试
tag_count:3000,post_count:300000,tag_post_count:300000
多对多,使用where in ,子查询
sql:select * from posts where id in (select post_id from tag_post where tag_id in (select id from tags where type = "3" ) )
total:57653
cost:414.41ms
多对多,拆分成3个查询,分别执行
total tag_id:573
total post_id:57653
total:57653
cost:446.82ms
多对多,使用join
sql:select * from posts left join tag_post on posts.id = tag_post.post_id left join tags on tags.id = tag_post.tag_id where tags.type = "3"
total:57653
cost:295.14ms
一对多,使用子查询
sql:select * from posts where tag_id in (select id from tags where type=3)
total:57100
cost:260.81ms
一对多,使用in,并且拆分成两个查询
total tag_id:573
total:57100
cost:219.88ms
一对多,使用join
sql:select * from posts left join tags on tags.id = posts.tag_id where tags.type = "3"
total:57100
cost:223.49ms
结论
- in不慢,即使是上万的id。in会被mysql转换为or。
- in子查询比join慢,子查询有时无法正常使用索引
- 表记录越多,join越慢,如果是特别大的表,用in会快很多
- 拆分查询很有帮助,不过要注意IO固定开销