teaGod's blog Help

数据开发中的一些SQL经验总结

声明:
  本篇博客不涉及任何SQL性能考虑,只是讨论在OLAP场景下的一些理论解决方法,用来解决日常生产开发中的一些常见需求。
  同时,为了精简篇幅,SQL全部使用了全表查询的写法,没有加任何过滤条件,请注意仔细甄别。

两张表之间,没有任何关联关系。按日期聚合统计时SQL该怎么写?

举个例子,假设有两张表,一张是放款表,一张是还款表。现在要求用一条SQL,统计出每日的放款金额和还款金额。

CREATE TABLE disburse ( id BIGINT PRIMARY KEY AUTO_INCREMENT COMMENT '主键ID', disburse_date DATE NOT NULL COMMENT '放款日期', amount DECIMAL(15, 2) NOT NULL COMMENT '放款金额' ) ENGINE=InnoDB COMMENT='放款记录表';
CREATE TABLE repay ( id BIGINT PRIMARY KEY AUTO_INCREMENT COMMENT '主键ID', repay_date DATE NOT NULL COMMENT '还款日期', amount DECIMAL(15, 2) NOT NULL COMMENT '还款金额' ) ENGINE=InnoDB COMMENT='还款记录表';

有两种写法,我简称为 横着写 或者 竖着写。 横着写就是用 join ,竖着写就是用 union all ,下面请看代码。

SELECT a.stat_date, a.disburse_amount, b.repay_amount FROM ( SELECT disburse_date AS stat_date, SUM(amount) AS disburse_amount FROM disburse GROUP BY disburse_date ) AS a JOIN ( SELECT repay_date AS stat_date, SUM(amount) AS repay_amount FROM repay GROUP BY repay_date ) b ON a.stat_date = b.stat_date;
SELECT stat_date, SUM(disburse_amount) AS disburse_amount, SUM(repay_amount) AS repay_amount FROM ( SELECT disburse_date AS stat_date, SUM(amount) AS disburse_amount, 0 AS repay_amount FROM disburse GROUP BY disburse_date UNION SELECT repay_date AS stat_date, 0 AS disburse_amount, SUM(amount) AS repay_amount FROM repay GROUP BY repay_date ) t GROUP BY stat_date;

我为什么说这个join写法有坑呢?因为可能某天有放款但是没还款;或者反过来,有还款没放款。那么在用日期关联的时候,就可能会丢数据,不管你是把join改成left join还是right join都解决不了问题。

下面是针对这中join写法的两种优化版,解决了上述问题

SELECT t.stat_date, IFNULL(a.disburse_amount, 0) AS disburse_amount, IFNULL(b.repay_amount, 0) AS repay_amount FROM ( SELECT DISTINCT stat_date FROM ( SELECT disburse_date as stat_date FROM disburse UNION ALL SELECT repay_date as stat_date FROM repay ) AS all_date ) AS t LEFT JOIN ( SELECT disburse_date AS stat_date, SUM(amount) AS disburse_amount FROM disburse GROUP BY disburse_date ) AS a ON t.stat_date = a.stat_date LEFT JOIN ( SELECT repay_date AS stat_date, SUM(amount) AS repay_amount FROM repay GROUP BY repay_date ) b ON t.stat_date = b.stat_date;
SELECT COALESCE(a.stat_date, b.stat_date) AS stat_date, IFNULL(a.disburse_amount, 0) AS disburse_amount, IFNULL(b.repay_amount, 0) AS repay_amount FROM ( SELECT disburse_date AS stat_date, SUM(amount) AS disburse_amount FROM disburse GROUP BY disburse_date ) AS a FULL JOIN ( SELECT repay_date AS stat_date, SUM(amount) AS repay_amount FROM repay GROUP BY repay_date ) b ON a.stat_date = b.stat_date;

   第一种优化版写法,新加了一张全日期的日期表,作为主表,然后left join另外两个表。第二种写法是用full join ,即全外连接。


   两种写法都有个弊端,那就是都会存在空值的情况,你可以看到我在SELECT中大量使用了COALESCEIFNULL等函数对空值做特殊处理。 而且在性能方面,两种写法都很糟糕。一个引入了一张额外的表,且写法很啰嗦。一种引入了full join ,这是一种性能很差的语法,且有些数据库不支持。


   简而言之,遇到这种需求, union all写法是最优解。

有转化关系的SQL该怎么写?

有转化关系的数据,意思就是各个统计字段间具有层级依赖特性,像个漏斗一样是分层的,每层数据都来自上层,但是一层比一层少。最经典的比如安装注册转化、曝光点击转化等。

举个例子,假设有两张表,一张是安装表,一张是注册表,两张表用设备id关联,一个设备只有一个设备id,一个用户只有一个设备。

现在要求用一条SQL,按日期统计安装数、24小时内注册数、T0注册数、T0(中午12点)注册数、T0(晚上6点)注册数、T1注册数、T30注册数。

CREATE TABLE install ( id BIGINT PRIMARY KEY AUTO_INCREMENT COMMENT '主键ID', device_id VARCHAR(100) NOT NULL COMMENT '设备唯一标识', install_time DATETIME(6) NOT NULL COMMENT '安装时间(精确到微秒)' ) ENGINE=InnoDB COMMENT='安装表';
CREATE TABLE register ( id BIGINT PRIMARY KEY AUTO_INCREMENT COMMENT '用户ID/主键', device_id VARCHAR(100) NOT NULL COMMENT '设备唯一标识', register_time DATETIME(6) NOT NULL COMMENT '注册时间(精确到微秒)' ) ENGINE=InnoDB COMMENT='用户表';

   注册数是来自安装数,所以肯定小于等于安装数,就像漏斗一样。但是这个漏斗的出口是随着时间变化,越来越粗的。 所以这种需求一般都会要求一次性查看多个时期内的转化数据。比如上文提到的T0安装数,即用户安装当天即完成注册的数量;T1安装数,即用户安装当天或第二天完成注册的总数量。


   一般遇到这种需求,没有经验的小白先不要慌(其实我当时已经有一点想骂街了)。他的核心解决思路,其实就是两表join后, CASE函数里用两个时间字段做对比。一个不变的时间,一个变化的时间。 拿本例来说,不变的时间就是安装时间,变化的时间就是注册时间。下面请看代码

SELECT DATE(i.install_time) AS install_date, COUNT(DISTINCT i.device_id) AS install_num, COUNT(DISTINCT CASE WHEN TIMESTAMPDIFF(HOUR, i.install_time, r.register_time) BETWEEN 0 AND 24 THEN r.id ELSE NULL END) AS register_in_24hour, COUNT(DISTINCT CASE WHEN DATEDIFF(DATE(r.register_time), DATE(i.install_time)) BETWEEN 0 AND 0 THEN r.id ELSE NULL END) AS register_t0, COUNT(DISTINCT CASE WHEN DATE(r.register_time) = DATE(i.install_time) AND TIME(r.register_time) < '12:00:00' THEN r.id ELSE NULL END) AS register_t0_before_12, COUNT(DISTINCT CASE WHEN DATE(r.register_time) = DATE(i.install_time) AND TIME(r.register_time) < '18:00:00' THEN r.id END) AS register_t0_before_18, COUNT(DISTINCT CASE WHEN DATEDIFF(DATE(r.register_time), DATE(i.install_time)) BETWEEN 0 AND 1 THEN r.id ELSE NULL END) AS register_t1, COUNT(DISTINCT CASE WHEN DATEDIFF(DATE(r.register_time), DATE(i.install_time)) BETWEEN 0 AND 30 THEN r.id ELSE NULL END) AS register_t30 FROM install AS i LEFT JOIN register AS r ON i.device_id = r.device_id GROUP BY DATE(i.install_time)

因为数据库有着丰富的时间日期处理函数,所以理论上上面的写法会有多个变种,但是核心思路不会变。
再进阶一点,上面的SQL是按日统计的,如果老板想按照周维度或者月维度去看这个报表,该怎么统计?(已经想疯狂骂街了)其实简单换一下group by条件就好了,下面请看代码

SELECT CONCAT(MIN(DATE(i.install_time)),"~",MAX(DATE(i.install_time))) AS install_date, COUNT(DISTINCT i.device_id) AS install_num, COUNT(DISTINCT CASE WHEN TIMESTAMPDIFF(HOUR, i.install_time, r.register_time) BETWEEN 0 AND 24 THEN r.id ELSE NULL END) AS register_in_24hour, COUNT(DISTINCT CASE WHEN DATEDIFF(DATE(r.register_time), DATE(i.install_time)) BETWEEN 0 AND 0 THEN r.id ELSE NULL END) AS register_t0, COUNT(DISTINCT CASE WHEN DATE(r.register_time) = DATE(i.install_time) AND TIME(r.register_time) < '12:00:00' THEN r.id ELSE NULL END) AS register_t0_before_12, COUNT(DISTINCT CASE WHEN DATE(r.register_time) = DATE(i.install_time) AND TIME(r.register_time) < '18:00:00' THEN r.id END) AS register_t0_before_18, COUNT(DISTINCT CASE WHEN DATEDIFF(DATE(r.register_time), DATE(i.install_time)) BETWEEN 0 AND 1 THEN r.id ELSE NULL END) AS register_t1, COUNT(DISTINCT CASE WHEN DATEDIFF(DATE(r.register_time), DATE(i.install_time)) BETWEEN 0 AND 30 THEN r.id ELSE NULL END) AS register_t30 FROM install AS i LEFT JOIN register AS r ON i.device_id = r.device_id GROUP BY YEARWEEK(i.install_time, 1);
SELECT DATE_FORMAT(install_time, '%Y-%m') AS install_date, COUNT(DISTINCT i.device_id) AS install_num, COUNT(DISTINCT CASE WHEN TIMESTAMPDIFF(HOUR, i.install_time, r.register_time) BETWEEN 0 AND 24 THEN r.id ELSE NULL END) AS register_in_24hour, COUNT(DISTINCT CASE WHEN DATEDIFF(DATE(r.register_time), DATE(i.install_time)) BETWEEN 0 AND 0 THEN r.id ELSE NULL END) AS register_t0, COUNT(DISTINCT CASE WHEN DATE(r.register_time) = DATE(i.install_time) AND TIME(r.register_time) < '12:00:00' THEN r.id ELSE NULL END) AS register_t0_before_12, COUNT(DISTINCT CASE WHEN DATE(r.register_time) = DATE(i.install_time) AND TIME(r.register_time) < '18:00:00' THEN r.id END) AS register_t0_before_18, COUNT(DISTINCT CASE WHEN DATEDIFF(DATE(r.register_time), DATE(i.install_time)) BETWEEN 0 AND 1 THEN r.id ELSE NULL END) AS register_t1, COUNT(DISTINCT CASE WHEN DATEDIFF(DATE(r.register_time), DATE(i.install_time)) BETWEEN 0 AND 30 THEN r.id ELSE NULL END) AS register_t30 FROM install AS i LEFT JOIN register AS r ON i.device_id = r.device_id GROUP BY DATE_FORMAT(install_time, '%Y-%m');

未完待续。。。

Last modified: 07 December 2025