Mysql中in vs join

简述

最近在一个群里看到join和in的争论,到底什么时候用in,什么时候用join,我做了实验来验证

预备知识

join type介绍

join type并非指join类型,而是查询类型 explain sql后的type字段正是join type Mysql中有以下join type,性能从高到底为: * system : 系统常量表 * const : 常量表 * eq_ref : 主键或者唯一索引 * ref : 非空索引 * ref_or_null : 可空索引 * range :索引列使用 BETWEEN, IN, >=, LIKE等范围查询 * index : 循序扫描索引列 * ALL : 顺序扫描整张表

详见:https://dev.mysql.com/doc/internals/en/optimizer-determining-join-type.html

in和or等效

在Mysql中,这两个语句是等效的,in查询会被转换为or

column1 IN (1,2,3)
column1 = 1 OR column1 = 2 OR column1 = 3

详见:https://dev.mysql.com/doc/internals/en/optimizer-range-join-type.html

Mysql的Join算法(NLJ)

Mysql的join使用的NLJ算法(内循环join算法) 假设三个表之间的连接 t1,t2以及 t3是使用以下类型的连接来执行:

Table   Join Type
t1      range
t2      ref
t3      ALL

如果使用简单的NLJ算法,则会按如下方式处理连接:

for each row in t1 matching range {
  for each row in t2 matching reference key {
    for each row in t3 {
      if row satisfies join conditions, send to client
    }
  }
}

Mysql使用Block Nested-Loop (BNL)进行处理,也就是join_buffer缓存联合数据,join_buffer_size 控制此项

for each row in t1 matching range {
  for each row in t2 matching reference key {
    store used columns from t1, t2 in join buffer
    if buffer is full {
      for each row in t3 {
        for each t1, t2 combination in join buffer {
          if row satisfies join conditions, send to client
        }
      }
      empty join buffer
    }
  }
}

if buffer is not empty {
  for each row in t3 {
    for each t1, t2 combination in join buffer {
      if row satisfies join conditions, send to client
    }
  }
}

join_buffer_size 详见:https://dev.mysql.com/doc/refman/8.0/en/nested-loop-joins.html

join优化: https://dev.mysql.com/doc/refman/8.0/en/nested-join-optimization.html

实测数据

使用Mysql5.7测试

tag_count:3000,post_count:300000,tag_post_count:300000

多对多,使用where in ,子查询
sql:select * from posts where id in (select post_id from tag_post where tag_id in (select id from tags where type = "3" ) )
total:57653
cost:414.41ms

多对多,拆分成3个查询,分别执行
total tag_id:573
total post_id:57653
total:57653
cost:446.82ms

多对多,使用join
sql:select * from posts left join tag_post on posts.id = tag_post.post_id left join tags on tags.id = tag_post.tag_id where tags.type = "3"
total:57653
cost:295.14ms

一对多,使用子查询
sql:select * from posts where tag_id in (select id from tags where type=3)
total:57100
cost:260.81ms

一对多,使用in,并且拆分成两个查询
total tag_id:573
total:57100
cost:219.88ms

一对多,使用join
sql:select * from posts left join tags on tags.id = posts.tag_id where tags.type = "3"
total:57100
cost:223.49ms

结论

  • in不慢,即使是上万的id。in会被mysql转换为or。
  • in子查询比join慢,子查询有时无法正常使用索引
  • 表记录越多,join越慢,如果是特别大的表,用in会快很多
  • 拆分查询很有帮助,不过要注意IO固定开销