3

I have always had trouble getting SUMs on join tables, there is always an issue, I can get the results I need by running two queries, I am wondering if this two queries can be combine to make one join query, here is the queries I have and my attempt to join the query

Query 1

SELECT last_name, first_name, DATE_FORMAT( (mil_date),  '%m/%d/%y' ) AS dates, 
SUM( drive_time ) MINUTES FROM bhds_mileage LEFT JOIN bhds_teachers i 
ON i.ds_id = bhds_mileage.ds_id 
WHERE mil_date BETWEEN  '2016-04-11' AND  '2016-04-30'
AND bhds_mileage.ds_id =5
GROUP BY CONCAT( YEAR( mil_date ) ,  '/', WEEK( mil_date ) ) ,    
bhds_mileage.ds_id
ORDER BY last_name ASC , dates ASC 

the output in minutes is 271, 281, 279

Query 2

SELECT last_name, first_name, DATE_FORMAT((tm_date), '%m/%d/%y') AS dates,   
SUM(tm_hours) total FROM bhds_timecard LEFT JOIN bhds_teachers i 
ON i.ds_id = bhds_timecard.ds_id 
WHERE tm_date BETWEEN '2016-04-11' AND '2016-04-30' AND bhds_timecard.ds_id = 5
GROUP BY CONCAT(YEAR(tm_date), '/', WEEK(tm_date)), bhds_timecard.ds_id 
ORDER BY last_name ASC, dates ASC

The output here is 33.00, 36.00, 26.75

Now my attempt to join the query

SELECT last_name, first_name, DATE_FORMAT((tm_date), '%m/%d/%y') AS dates,  
SUM(tm_hours) total,  SUM( drive_time ) MINUTES FROM bhds_timecard 
LEFT JOIN bhds_teachers i ON i.ds_id = bhds_timecard.ds_id 
LEFT JOIN bhds_mileage ON DATE_FORMAT((bhds_timecard.tm_date), '%m/%d/%y') = 
DATE_FORMAT((bhds_mileage.mil_date), '%m/%d/%y') AND bhds_timecard.ds_id = bhds_mileage.ds_id
WHERE tm_date BETWEEN '2016-04-11' AND '2016-04-30' AND bhds_timecard.ds_id = 5
GROUP BY CONCAT(YEAR(tm_date), '/', WEEK(tm_date)), bhds_timecard.ds_id 

parenthesis is what is expected

this outputs 1044 (271), 1086 (281), 1215 (279)

2

2 Answers 2

8

When you use multiple joins in the main query, you end up with a cross product of all the tables, so the sums get multiplied by the number of rows matching in another table. You need to move the sums into subqueries.

SELECT last_name, first_name, DATE_FORMAT(LEAST(mil_date, tm_date),  '%m/%d/%y' ) AS dates, 
        total, minutes
FROM bhds_teachers AS i
LEFT JOIN (
    SELECT ds_id, YEARWEEK(mil_date) AS week, MIN(mil_date) AS mil_date, SUM(drive_time) AS minutes
    FROM bhds_mileage
    WHERE mil_date BETWEEN '2016-04-11' AND  '2016-04-30'
    AND bhds_mileage.ds_id = 5
    GROUP BY ds_id, week) AS m 
ON m.ds_id = i.ds_id
LEFT JOIN (
    SELECT ds_id, YEARWEEK(tm_date) AS week, MIN(tm_date) AS tm_date, SUM(tm_hours) AS total
    WHERE tm_date BETWEEN '2016-04-11' AND '2016-04-30' AND bhds_timecard.ds_id = 5
    GROUP BY ds_id, week) AS t 
ON t.ds_id = i.ds_id AND t.week = m.week
2

There's a couple of issues... partial Cartesian product (cross product) between bhds_mileage and bhds_timecard, because each detail row (within a group) from one table will be "cross joined" with the detail rows from the other table. And that happens before the GROUP BY operation collapses the rows, and computes the SUM. That explains why you are seeing "inflated" values.

The workaround for that is to compute at least one of the SUM() aggregates in an inline view... get the SUM() / GROUP BY() done like one of your first queries does. For clarity, you could do the same thing for both of the original queries, and then join the results from the inline views.

MySQL doesn't natively support FULL outer join. One of the tables will need to be the driving table. For example, we could use _timecard as the driving table, but that would mean that we have to return a row for a given week from _timecard in order to return a corresponding row from _mileage. That is, without a row in _timecard, we couldn't get a row from _mileage.

We notice that the join to bhds_teacher is an outer join. If we have a foreign key constraint between ds_id in both _mileage and _timecard, referencing _teacher, then that wouldn't necessarily need to be an outer join, we could use an inner join, and use _teacher as the driving table for two outer joins.

Another issue is the non-aggregates in the SELECT list... e.g. DATE_FORMAT((tm_date), '%m/%d/%y')

The GROUP BY is on year and week, so the value from the DATE_FORMAT is indeterminate... it could be from any tm_date within the group. There's no guarantee that you'll get the first day of the week, the earliest date within the week or whatever.

Also, the second parameter for WEEK function is omitted, so that will default to the default_week_format system variable. Personally, I'd avoid the YEAR, WEEK and CONCAT functions, and go with a simpler DATE_FORMAT, using an date format string that explicitly includes the mode parameter for the week.

If you want to join on "week", then the join predicate should be on the "week" value, not one indeterminate date within the week.

(There may be some specific constraints on the data which we aren't aware of... if there are rows in _mileage for a given week, on a monday, then we are guaranteed to have a _timecard for that same monday. In the more general case, we wouldn't have that guarantee.)

Even if we do have that guarantee, we aren't guaranteed that the non-aggregate in the SELECT list won't return the date from a Tuesday _timecard, and a Thursday _mileage... (unless there's some sort of guarantee that the data will include only rows with "Monday" dates on _timecard and _mileage). Withouat that, the non-aggregate expression is not a reliable expression for a join predicate.

Assuming ds_id is unique on _teacher, and is referenced by foreign keys ds_id from both _mileage and _timecard, then something like this:

SELECT i.last_name
     , i.first_name
     , tm.dates
     , tm.total_hours
     , mm.total_minutes
  FROM bhds_teacher i 
  LEFT
  JOIN ( SELECT t.ds_id
              , DATE_FORMAT( t.tm_date,'%Y/%U')          AS week_
              , DATE_FORMAT( MIN(t.tm_date) ,'%m/%d/%y') AS dates
              , SUM(t.tm_hours)                          AS total_hours
           FROM bhds_timecard t
          WHERE t.tm_date BETWEEN '2016-04-11' AND '2016-04-30'   -- <
            AND t.ds_id = 5                                       -- <
          GROUP
             BY t.ds_id
              , DATE_FORMAT( t.tm_date,'%Y/%U')                   -- week
       ) tm
    ON tm.ds_id = i.ds_id
  LEFT
  JOIN ( SELECT m.ds_id
              , DATE_FORMAT( m.mil_date,'%Y/%U')           AS week_
              , DATE_FORMAT( MIN(m.mil_date), '%m/%d/%y' ) AS dates
              , SUM( m.drive_time )                        AS total_minutes 
           FROM bhds_mileage m
          WHERE m.mil_date BETWEEN '2016-04-11' AND '2016-04-30'  -- <
            AND m.ds_id = 5                                       -- <
          GROUP
             BY m.ds_id
              , DATE_FORMAT( m.mil_date,'%Y/%U')                  -- week
       ) mm
    ON mm.ds_id = i.ds_id
   AND mm.week_ = tm.week_
 WHERE i.ds_id = 5                                                -- <
 ORDER
    BY i.last_name ASC, tm.dates ASC

Not the answer you're looking for? Browse other questions tagged or ask your own question.