Jump to content

COUNT(*) vs COUNT(tid)


KT Walrus
Go to solution Solved by Matt,

Recommended Posts

I'm working on implementing an app where the Items can be mapped to multiple Containers using a Map table to map Item ColumnId to Container ColumnId.

To do this, I'm using an INNER JOIN of the Map table in getItemsWithPermission() passing \IPS\Db::SELECT_DISTINCT flag.

This is mostly working, but sometimes getItemsWithPermission() is being called to get a COUNT instead of the actual Items.

Here is an example of the wrong SQL being built for the COUNT query:

SELECT DISTINCT COUNT(*) as cnt FROM `puce_topics`  INNER JOIN  `puce_forum_topic_map` ON puce_topics.tid=puce_forum_topic_map.topic_id AND ( puce_forum_topic_map.forum_id IN(1510000,1516600) ) WHERE puce_topics.approved < 2 AND puce_topics.approved!=-2 AND puce_topics.approved!=-3;

In my test case, with only 4 rows in puce_topics, it returns 6, because 2 of the topics are mapped to both the forum_ids in the query.

This COUNT query should really be written as:

SELECT DISTINCT COUNT(DISTINCT(puce_topics.tid)) as cnt FROM `puce_topics`  INNER JOIN  `puce_forum_topic_map` ON puce_topics.tid=puce_forum_topic_map.topic_id AND ( puce_forum_topic_map.forum_id IN(1510000,1516600) ) WHERE puce_topics.approved < 2 AND puce_topics.approved!=-2 AND puce_topics.approved!=-3;

which returns 4, which is correct.

An even better fix is to construct this COUNT query:

SELECT COUNT(DISTINCT(puce_topics.tid)) as cnt FROM `puce_topics`  INNER JOIN  `puce_forum_topic_map` ON puce_topics.tid=puce_forum_topic_map.topic_id AND ( puce_forum_topic_map.forum_id IN(1510000,1516600) ) WHERE puce_topics.approved < 2 AND puce_topics.approved!=-2 AND puce_topics.approved!=-3;

That is, if \IPS\Db::SELECT_DISTINCT flag is set, unset it for constructing the COUNT query and don't use * but the actual databaseColumnId.

Please fix in the November release, if possible. Or, if this isn't a bug, let me know what I am doing wrong. It seems to be that SELECT DISTINCT COUNT(*) is wrong and it sould be SELECT COUNT(DISTINCT(tid)) instead to count the number of distinct Items.

I might try using a Subquery on the map table instead of an INNER JOIN as a workaround but would prefer using a JOIN rather than a Subquery (I think JOINs might optimize better).

Link to comment
Share on other sites

Has anyone from IPS read this, or should I contact Customer Support for resolution? 

I need this fixed in the November release. I've fixed it locally (at least for my use case), but SELECT DISTINCT COUNT(*) just doesn't work. It needs to be SELECT COUNT(DISTINCT(<pkey>)). Also, there may be several places where this needs to be fixed since it seems fairly common to use COUNT(*) and not COUNT(<pkey>) throughout.

Link to comment
Share on other sites

  • 3 weeks later...
  • Management
  • Solution

Hi,

Unfortunately, I'm not keen on adding a DISTINCT onto that query as this effectively introduces a GROUP BY onto this query which is used very often. It'll not scale well at all.

I would use a sub-query and I think this would optimised a lot better for this purpose.

Link to comment
Share on other sites

On 11/15/2021 at 8:33 AM, Matt said:

Hi,

Unfortunately, I'm not keen on adding a DISTINCT onto that query as this effectively introduces a GROUP BY onto this query which is used very often. It'll not scale well at all.

I would use a sub-query and I think this would optimised a lot better for this purpose.

@MattCould you at least make the 'COUNT(*) as cnt' be returned by a protected static function in \IPS\Content\Item class and called in getItemsWithPermission()?

Then in my Item class, I could simply override and return 'COUNT(DISTINCT(puce_topics.tid))' so I get the correct count.

In my case, the cnt will be less than 200 without DISTINCT and less than 100 with DISTINCT, so I don't care if COUNT(DISTINCT) doesn't "scale well at all" as this query will never return a huge number for my table. I want the correct count so my code can rely on the count being exact and not approximate.

 

Link to comment
Share on other sites

On 11/19/2021 at 8:42 AM, Matt said:

Modifying getItemsWithPermission() fills me with overwhelming anxiety. 😃

I recommend a sub query, it negates the need to hook onto the count(*).

Not enjoying this reply 😀

I don't like that the current getItemsWithPermission() returns wrong count when my query needs DISTINCT.

But, since the Nodes I am querying with getItemsWithPermission() will have less than 1000 distinct Items, even if an Item may exist in multiple nodes (via a map table), I have decided to query for Item IDs prior to calling parent::getItemsWithPermission() and use "id IN (1,12,34,423,...)" where clause to find those items.

This avoids the bug in "COUNT(*)" versus "COUNT(DISTINCT tid))" in IC4 Core.

Would still like to see this bug fixed as it can be a subtle error in returning an overcount for some SELECT DISTINCT queries I might write in the future (long after I have forgotten this defect).

Link to comment
Share on other sites

  • Management
On 11/20/2021 at 9:37 PM, KT Walrus said:

Not enjoying this reply 😀

I don't like that the current getItemsWithPermission() returns wrong count when my query needs DISTINCT.

But, since the Nodes I am querying with getItemsWithPermission() will have less than 1000 distinct Items, even if an Item may exist in multiple nodes (via a map table), I have decided to query for Item IDs prior to calling parent::getItemsWithPermission() and use "id IN (1,12,34,423,...)" where clause to find those items.

This avoids the bug in "COUNT(*)" versus "COUNT(DISTINCT tid))" in IC4 Core.

Would still like to see this bug fixed as it can be a subtle error in returning an overcount for some SELECT DISTINCT queries I might write in the future (long after I have forgotten this defect).

It's not a bug though 😕 

I know you need this specific thing, but it's not enough to warrant a change to our core code when there is a better alternative.

On 11/21/2021 at 12:15 AM, Square Wheels said:

I don't pretend to understand what you are talking about, but I thought using (*) returned all fields and put more "strain" on the database?

No, it doesn't. 😀

Link to comment
Share on other sites

  • Management
10 hours ago, Square Wheels said:

OK, how about Select *?

SELECT * will return all the columns of each row into memory, but it won't put any additional strain on the MySQL server.

When you add DISTINCT(*) you are asking MySQL to create a temporary table (either in memory or on disk) where it puts all the results without the distinct, then filters them to only return the distinct items. It may also disrupt the optimiser which means that indexes that can be used to move the pointer very quickly through the table are not used, so instead MySQL has to look at each row, only to discard it as there is no match.

Link to comment
Share on other sites

On 11/23/2021 at 5:52 AM, Matt said:

It's not a bug though 😕 

How is this not a bug?

My query requires DISTINCT flag, and getItemsWithPermission() doesn't honor DISTINCT when it does "COUNT(*)" instead of "COUNT(DISTINCT(id))".

That said, I am no longer using DISTINCT but fetching IDs first and using 'id IN(...)' in WHERE clause... So, not a bug affecting me, but still a bug waiting to bite me or other developers in the future...

Link to comment
Share on other sites

  • Recently Browsing   0 members

    • No registered users viewing this page.
×
×
  • Create New...