tricky query

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
24 messages Options
12
Reply | Threaded
Open this post in threaded view
|

tricky query

Merlin Moncure
I need a fast way (sql only preferred) to solve the following problem:

I need the smallest integer that is greater than zero that is not in the
column of a table.  In other words, if an 'id' column has values
1,2,3,4,6 and 7, I need a query that returns the value of 5.

I've already worked out a query using generate_series (not scalable) and
pl/pgsql.  An SQL only solution would be preferred, am I missing
something obvious?

Merlin

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
       subscribe-nomail command to [hidden email] so that your
       message can get through to the mailing list cleanly
Reply | Threaded
Open this post in threaded view
|

Re: tricky query

John Arbash Meinel
Merlin Moncure wrote:

>I need a fast way (sql only preferred) to solve the following problem:
>
>I need the smallest integer that is greater than zero that is not in the
>column of a table.  In other words, if an 'id' column has values
>1,2,3,4,6 and 7, I need a query that returns the value of 5.
>
>I've already worked out a query using generate_series (not scalable) and
>pl/pgsql.  An SQL only solution would be preferred, am I missing
>something obvious?
>
>Merlin
>
>
Not so bad. Try something like this:

SELECT min(id+1) as id_new FROM table
    WHERE (id+1) NOT IN (SELECT id FROM table);

Now, this requires probably a sequential scan, but I'm not sure how you
can get around that.
Maybe if you got trickier and did some ordering and limits. The above
seems to give the right answer, though.

I don't know how big you want to scale to.

You might try something like:
SELECT id+1 as id_new FROM t
    WHERE (id+1) NOT IN (SELECT id FROM t)
    ORDER BY id LIMIT 1;

John
=:->


signature.asc (261 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: tricky query

Bruno Wolff III
In reply to this post by Merlin Moncure
On Tue, Jun 28, 2005 at 10:21:16 -0400,
  Merlin Moncure <[hidden email]> wrote:
> I need a fast way (sql only preferred) to solve the following problem:
>
> I need the smallest integer that is greater than zero that is not in the
> column of a table.  In other words, if an 'id' column has values
> 1,2,3,4,6 and 7, I need a query that returns the value of 5.
>
> I've already worked out a query using generate_series (not scalable) and
> pl/pgsql.  An SQL only solution would be preferred, am I missing
> something obvious?

I would expect that using generate series from the 1 to the max (using
order by and limit 1 to avoid extra sequential scans) and subtracting
out the current list using except and then taking the minium value
would be the best way to do this if the list is pretty dense and
you don't want to change the structure.

If it is sparse than you can do a special check for 1 and if that
is present find the first row whose successor is not in the table.
That shouldn't be too slow.

If you are willing to change the structure you might keep one row for
each number and use a flag to mark which ones are empty. If there are
relatively few empty rows at any time, then you can create a partial
index on the row number for only empty rows.

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
       subscribe-nomail command to [hidden email] so that your
       message can get through to the mailing list cleanly
Reply | Threaded
Open this post in threaded view
|

Re: tricky query

Cosimo Streppone
In reply to this post by Merlin Moncure
Merlin Moncure wrote:

> I need a fast way (sql only preferred) to solve the following problem:
> I need the smallest integer that is greater than zero that is not in the
> column of a table.
>
> I've already worked out a query using generate_series (not scalable) and
> pl/pgsql.  An SQL only solution would be preferred, am I missing
> something obvious?

Probably not, but I thought about this "brute-force" approach... :-)
This should work well provided that:

- you have a finite number of integers. Your column should have a biggest
   integer value with a reasonable maximum like 100,000 or 1,000,000.
   #define YOUR_MAX 99999

- you can accept that query execution time depends on smallest integer found.
   The bigger the found integer, the slower execution you get.

Ok, so:

Create a relation "integers" (or whatever) with every single integer from 1 to
YOUR_MAX:

   CREATE TABLE integers (id integer primary key);
   INSERT INTO integers (id) VALUES (1);
   INSERT INTO integers (id) VALUES (2);
   ...
   INSERT INTO integers (id) VALUES (YOUR_MAX);

Create your relation:

   CREATE TABLE merlin (id integer primary key);
   <and fill it with values>

Query is simple now:

   SELECT a.id FROM integers a
     LEFT JOIN merlin b ON a.id=b.id
         WHERE b.id IS NULL
      ORDER BY a.id LIMIT 1;

Execution times with 100k tuples in "integers" and
99,999 tuples in "merlin":

   >\timing
   Timing is on.
   >select i.id from integers i left join merlin s on i.id=s.id where s.id is
null order by i.id limit 1;
    99999

   Time: 233.618 ms
   >insert into merlin (id) values (99999);
   INSERT 86266614 1
   Time: 0.579 ms
   >delete from merlin where id=241;
   DELETE 1
   Time: 0.726 ms
   >select i.id from integers i left join merlin s on i.id=s.id where s.id is
null order by i.id limit 1;
    241

   Time: 1.336 ms
   >

--
Cosimo


---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend
Reply | Threaded
Open this post in threaded view
|

Re: tricky query

Merlin Moncure
In reply to this post by Merlin Moncure
> Merlin Moncure wrote:
>
> > I need a fast way (sql only preferred) to solve the following
problem:
> > I need the smallest integer that is greater than zero that is not in
the
> > column of a table.
> >
> > I've already worked out a query using generate_series (not scalable)
and
> > pl/pgsql.  An SQL only solution would be preferred, am I missing
> > something obvious?
 
> Probably not, but I thought about this "brute-force" approach... :-)
> This should work well provided that:
>
> - you have a finite number of integers. Your column should have a
biggest
>    integer value with a reasonable maximum like 100,000 or 1,000,000.
>    #define YOUR_MAX 99999
[...]
:-) generate_series function does the same thing only a little bit
faster (although less portable).

generate_series(m,n) returns set of integers from m to n with time
complexity n - m.  I use it for cases where I need to increment for
something, for example:

select now()::date + d from generate_series(0,355) as d;

returns days from today until 355 days from now.

Merlin

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
    (send "unregister YourEmailAddressHere" to [hidden email])
Reply | Threaded
Open this post in threaded view
|

Re: tricky query

Merlin Moncure
In reply to this post by Merlin Moncure
> Not so bad. Try something like this:
>
> SELECT min(id+1) as id_new FROM table
>     WHERE (id+1) NOT IN (SELECT id FROM table);
>
> Now, this requires probably a sequential scan, but I'm not sure how
you
> can get around that.
> Maybe if you got trickier and did some ordering and limits. The above
> seems to give the right answer, though.

it does, but it is still faster than generate_series(), which requires
both a seqscan and a materialization of the function.
 
> I don't know how big you want to scale to.

big. :)

merlin

---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend
Reply | Threaded
Open this post in threaded view
|

Re: tricky query

John Arbash Meinel
In reply to this post by John Arbash Meinel
John A Meinel wrote:

> Merlin Moncure wrote:
>
>> I need a fast way (sql only preferred) to solve the following problem:
>>
>> I need the smallest integer that is greater than zero that is not in the
>> column of a table.  In other words, if an 'id' column has values
>> 1,2,3,4,6 and 7, I need a query that returns the value of 5.
>>
>> I've already worked out a query using generate_series (not scalable) and
>> pl/pgsql.  An SQL only solution would be preferred, am I missing
>> something obvious?
>>
>> Merlin
>>
>>
>
> Not so bad. Try something like this:
>
> SELECT min(id+1) as id_new FROM table
>    WHERE (id+1) NOT IN (SELECT id FROM table);
>
> Now, this requires probably a sequential scan, but I'm not sure how you
> can get around that.
> Maybe if you got trickier and did some ordering and limits. The above
> seems to give the right answer, though.
>
> I don't know how big you want to scale to.
>
> You might try something like:
> SELECT id+1 as id_new FROM t
>    WHERE (id+1) NOT IN (SELECT id FROM t)
>    ORDER BY id LIMIT 1;
>
> John
> =:->
Well, I was able to improve it to using appropriate index scans.
Here is the query:

SELECT t1.id+1 as id_new FROM id_test t1
    WHERE NOT EXISTS
        (SELECT t2.id FROM id_test t2 WHERE t2.id = t1.id+1)
    ORDER BY t1.id LIMIT 1;

I created a test table which has 90k randomly inserted rows. And this is
what EXPLAIN ANALYZE says:

                                                               QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..12.10 rows=1 width=4) (actual time=0.000..0.000 rows=1 loops=1)
   ->  Index Scan using id_test_pkey on id_test t1  (cost=0.00..544423.27 rows=45000 width=4) (actual time=0.000..0.000 rows=1 loops=1)
         Filter: (NOT (subplan))
         SubPlan
           ->  Index Scan using id_test_pkey on id_test t2  (cost=0.00..6.01 rows=1 width=4) (actual time=0.000..0.000 rows=1 loops=15)
                 Index Cond: (id = ($0 + 1))
 Total runtime: 0.000 ms
(7 rows)

The only thing I have is a primary key index on id_test(id);

John
=:->


signature.asc (261 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: tricky query

Sam Mason
In reply to this post by Merlin Moncure
Merlin Moncure wrote:
>I've already worked out a query using generate_series (not scalable) and
>pl/pgsql.  An SQL only solution would be preferred, am I missing
>something obvious?

I would be tempted to join the table to itself like:

  SELECT id+1
  FROM foo
  WHERE id > 0
    AND i NOT IN (SELECT id-1 FROM foo)
  LIMIT 1;

Seems to work for me.  Not sure if that's good enough for you, but
it may help.

  Sam

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to [hidden email]
Reply | Threaded
Open this post in threaded view
|

Re: tricky query

John Arbash Meinel
In reply to this post by Merlin Moncure
Merlin Moncure wrote:

>>Not so bad. Try something like this:
>>
>>SELECT min(id+1) as id_new FROM table
>>    WHERE (id+1) NOT IN (SELECT id FROM table);
>>
>>Now, this requires probably a sequential scan, but I'm not sure how
>>
>>
>you
>
>
>>can get around that.
>>Maybe if you got trickier and did some ordering and limits. The above
>>seems to give the right answer, though.
>>
>>
>
>it does, but it is still faster than generate_series(), which requires
>both a seqscan and a materialization of the function.
>
>
>
>>I don't know how big you want to scale to.
>>
>>
>
>big. :)
>
>merlin
>
>
See my follow up post, which enables an index scan. On my system with
90k rows, it takes no apparent time.
(0.000ms)
John
=:->


signature.asc (261 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: tricky query

Merlin Moncure
In reply to this post by Merlin Moncure
John Meinel wrote:
> See my follow up post, which enables an index scan. On my system with
> 90k rows, it takes no apparent time.
> (0.000ms)
> John
> =:->

Confirmed.  Hats off to you, the above some really wicked querying.
IIRC I posted the same question several months ago with no response and
had given up on it.  I think your solution (smallest X1 not in X) is a
good candidate for general bits, so I'm passing this to varlena for
review :)

SELECT t1.id+1 as id_new FROM id_test t1
    WHERE NOT EXISTS
        (SELECT t2.id FROM id_test t2 WHERE t2.id = t1.id+1)
    ORDER BY t1.id LIMIT 1;

Merlin

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster
Reply | Threaded
Open this post in threaded view
|

Re: tricky query

John Arbash Meinel
Merlin Moncure wrote:

>John Meinel wrote:
>
>
>>See my follow up post, which enables an index scan. On my system with
>>90k rows, it takes no apparent time.
>>(0.000ms)
>>John
>>=:->
>>
>>
>
>Confirmed.  Hats off to you, the above some really wicked querying.
>IIRC I posted the same question several months ago with no response and
>had given up on it.  I think your solution (smallest X1 not in X) is a
>good candidate for general bits, so I'm passing this to varlena for
>review :)
>
>SELECT t1.id+1 as id_new FROM id_test t1
>    WHERE NOT EXISTS
>        (SELECT t2.id FROM id_test t2 WHERE t2.id = t1.id+1)
>    ORDER BY t1.id LIMIT 1;
>
>Merlin
>
>
Just be aware that as your table fills it's holes, this query gets
slower and slower.
I've been doing some testing. And it starts at 0.00 when the first entry
is something like 3, but when you start getting to 16k it starts taking
more like 200 ms.

So it kind of depends how your table fills (and empties I suppose).

The earlier query was slower overall (since it took 460ms to read in the
whole table).
I filled up the table such that 63713 is the first empty space, and it
takes 969ms to run.
So actually if your table is mostly full, the first form is better.

But if you are going to have 100k rows, with basically random
distribution of empties, then the NOT EXISTS works quite well.

Just be aware of the tradeoff. I'm pretty sure the WHERE NOT EXISTS will
always use a looping structure, and go through the index in order.

John
=:->


signature.asc (261 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: tricky query

Sam Mason
In reply to this post by John Arbash Meinel
John A Meinel wrote:
>SELECT t1.id+1 as id_new FROM id_test t1
>   WHERE NOT EXISTS
>       (SELECT t2.id FROM id_test t2 WHERE t2.id = t1.id+1)
>   ORDER BY t1.id LIMIT 1;

This works well on sparse data, as it only requires as many index
access as it takes to find the first gap.   The simpler "NOT IN"
version that everybody seems to have posted the first time round
has a reasonably constant (based on the number of rows, not gap
position) startup time but the actual time spent searching for the
gap is much lower.

I guess the version you use depends on how sparse you expect the
data to be.  If you expect your query to have to search through
more than half the table before finding the gap then you're better
off using the "NOT IN" version, otherwise the "NOT EXISTS" version
is faster -- on my system anyway.

Hope that's interesting!


  Sam

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
       choose an index scan if your joining column's datatypes do not
       match
Reply | Threaded
Open this post in threaded view
|

Re: tricky query

Bruno Wolff III
In reply to this post by Merlin Moncure
On Tue, Jun 28, 2005 at 12:02:09 -0400,
  Merlin Moncure <[hidden email]> wrote:

>
> Confirmed.  Hats off to you, the above some really wicked querying.
> IIRC I posted the same question several months ago with no response and
> had given up on it.  I think your solution (smallest X1 not in X) is a
> good candidate for general bits, so I'm passing this to varlena for
> review :)
>
> SELECT t1.id+1 as id_new FROM id_test t1
>     WHERE NOT EXISTS
>         (SELECT t2.id FROM id_test t2 WHERE t2.id = t1.id+1)
>     ORDER BY t1.id LIMIT 1;

You need to rework this to check to see if row '1' is missing. The
above returns the start of the first gap after the first row that
isn't missing.

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
       choose an index scan if your joining column's datatypes do not
       match
Reply | Threaded
Open this post in threaded view
|

Re: tricky query

Cosimo Streppone
In reply to this post by John Arbash Meinel
John A Meinel wrote:
> John A Meinel wrote:
>> Merlin Moncure wrote:
>>
>>> I need the smallest integer that is greater than zero that is not in the
>>> column of a table.  In other words, if an 'id' column has values
>>> 1,2,3,4,6 and 7, I need a query that returns the value of 5.
>>
 >> [...]
 >
> Well, I was able to improve it to using appropriate index scans.
> Here is the query:
>
> SELECT t1.id+1 as id_new FROM id_test t1
>    WHERE NOT EXISTS
>        (SELECT t2.id FROM id_test t2 WHERE t2.id = t1.id+1)
>    ORDER BY t1.id LIMIT 1;

I'm very interested in this "tricky query".
Sorry John, but if I populate the `id_test' relation
with only 4 tuples with id values (10, 11, 12, 13),
the result of this query is:

   cosimo=> create table id_test (id integer primary key);
   NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index 'id_test_pkey'
for table 'id_test'
   CREATE TABLE
   cosimo=> insert into id_test values (10); -- and 11, 12, 13, 14
   INSERT 7457570 1
   INSERT 7457571 1
   INSERT 7457572 1
   INSERT 7457573 1
   INSERT 7457574 1
   cosimo=> SELECT t1.id+1 as id_new FROM id_test t1 WHERE NOT EXISTS (SELECT
t2.id FROM id_test t2 WHERE t2.id = t1.id+1) ORDER BY t1.id LIMIT 1;
    id_new
   --------
        15
   (1 row)

which if I understand correctly, is the wrong answer to the problem.
At this point, I'm starting to think I need some sleep... :-)

--
Cosimo


---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

               http://www.postgresql.org/docs/faq
Reply | Threaded
Open this post in threaded view
|

Re: tricky query

Merlin Moncure
In reply to this post by Merlin Moncure
> On Tue, Jun 28, 2005 at 12:02:09 -0400,
>   Merlin Moncure <[hidden email]> wrote:
> >
> > Confirmed.  Hats off to you, the above some really wicked querying.
> > IIRC I posted the same question several months ago with no response
and
> > had given up on it.  I think your solution (smallest X1 not in X) is
a

> > good candidate for general bits, so I'm passing this to varlena for
> > review :)
> >
> > SELECT t1.id+1 as id_new FROM id_test t1
> >     WHERE NOT EXISTS
> >         (SELECT t2.id FROM id_test t2 WHERE t2.id = t1.id+1)
> >     ORDER BY t1.id LIMIT 1;
>
> You need to rework this to check to see if row '1' is missing. The
> above returns the start of the first gap after the first row that
> isn't missing.

Correct.  

In fact, I left out a detail in my original request in that I had a
starting value (easily supplied with where clause)...so what I was
really looking for was a query which started at a supplied value and
looped forwards looking for an empty slot.  John's supplied query is a
drop in replacement for a plpgsql routine which does exactly this.

The main problem with the generate_series approach is that there is no
convenient way to determine a supplied upper bound.  Also, in some
corner cases of my problem domain the performance was not good.

Merlin



---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

               http://www.postgresql.org/docs/faq
Reply | Threaded
Open this post in threaded view
|

Re: tricky query

John Arbash Meinel
Merlin Moncure wrote:

>>On Tue, Jun 28, 2005 at 12:02:09 -0400,
>>  Merlin Moncure <[hidden email]> wrote:
>>
>>
>>>Confirmed.  Hats off to you, the above some really wicked querying.
>>>IIRC I posted the same question several months ago with no response
>>>
>>>
>and
>
>
>>>had given up on it.  I think your solution (smallest X1 not in X) is
>>>
>>>
>a
>
>
>>>good candidate for general bits, so I'm passing this to varlena for
>>>review :)
>>>
>>>SELECT t1.id+1 as id_new FROM id_test t1
>>>    WHERE NOT EXISTS
>>>        (SELECT t2.id FROM id_test t2 WHERE t2.id = t1.id+1)
>>>    ORDER BY t1.id LIMIT 1;
>>>
>>>
>>You need to rework this to check to see if row '1' is missing. The
>>above returns the start of the first gap after the first row that
>>isn't missing.
>>
>>
>
>Correct.
>
>In fact, I left out a detail in my original request in that I had a
>starting value (easily supplied with where clause)...so what I was
>really looking for was a query which started at a supplied value and
>looped forwards looking for an empty slot.  John's supplied query is a
>drop in replacement for a plpgsql routine which does exactly this.
>
>The main problem with the generate_series approach is that there is no
>convenient way to determine a supplied upper bound.  Also, in some
>corner cases of my problem domain the performance was not good.
>
>Merlin
>
>
Actually, if you already have a lower bound, then you can change it to:

SELECT t1.id+1 as id_new FROM id_test t1
    WHERE t1.id > id_min
        AND NOT EXISTS
        (SELECT t2.id FROM id_test t2 WHERE t2.id = t1.id+1)
    ORDER BY t1.id LIMIT 1;

This would actually really help performance if you have a large table
and then empty entries start late.

On my system, where the first entry is 64k, doing where id > 60000
speeds it up back to 80ms instead of 1000ms.
John
=:->


signature.asc (261 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: tricky query

Merlin Moncure
In reply to this post by Merlin Moncure
Cosimo wrote:

> I'm very interested in this "tricky query".
> Sorry John, but if I populate the `id_test' relation
> with only 4 tuples with id values (10, 11, 12, 13),
> the result of this query is:
>
>    cosimo=> create table id_test (id integer primary key);
>    NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index
> 'id_test_pkey'
> for table 'id_test'
>    CREATE TABLE
>    cosimo=> insert into id_test values (10); -- and 11, 12, 13, 14
>    INSERT 7457570 1
>    INSERT 7457571 1
>    INSERT 7457572 1
>    INSERT 7457573 1
>    INSERT 7457574 1
>    cosimo=> SELECT t1.id+1 as id_new FROM id_test t1 WHERE NOT EXISTS
> (SELECT
> t2.id FROM id_test t2 WHERE t2.id = t1.id+1) ORDER BY t1.id LIMIT 1;
>     id_new
>    --------
>         15
>    (1 row)
>
> which if I understand correctly, is the wrong answer to the problem.
> At this point, I'm starting to think I need some sleep... :-)

Correct, in that John's query returns the first empty slot above an
existing  filled slot (correct behavior in my case).  You could flip
things around a bit to get around thist tho.

Merlin

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
       choose an index scan if your joining column's datatypes do not
       match
Reply | Threaded
Open this post in threaded view
|

Re: tricky query

Sebastian Hennebrueder
In reply to this post by John Arbash Meinel
John A Meinel schrieb:

> John A Meinel wrote:
>
>>
>
> Well, I was able to improve it to using appropriate index scans.
> Here is the query:
>
> SELECT t1.id+1 as id_new FROM id_test t1
>    WHERE NOT EXISTS
>        (SELECT t2.id FROM id_test t2 WHERE t2.id = t1.id+1)
>    ORDER BY t1.id LIMIT 1;
>
> I created a test table which has 90k randomly inserted rows. And this is
> what EXPLAIN ANALYZE says:
>
>        


As Cosimo stated the result can be wrong. The result is always wrong
when the id with value 1 does not exist.

--
Best Regards / Viele Grüße

Sebastian Hennebrueder

----

http://www.laliluna.de

Tutorials for JSP, JavaServer Faces, Struts, Hibernate and EJB

Get support, education and consulting for these technologies - uncomplicated and cheap.


---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend
Reply | Threaded
Open this post in threaded view
|

optimized counting of web statistics

Billy extyeightysix
In reply to this post by Merlin Moncure
Hola folks,

I have a web statistics Pg database (user agent, urls, referrer, etc)
that is part of an online web survey system. All of the data derived
from analyzing web server logs is stored in one large table with each
record representing an analyzed webserver log entry.

Currently all reports are generated when the logs are being analyzed
and before the data ever goes into the large table I mention above.
Well, the time has come to build an interface that will allow a user
to make ad-hoc queries against the stats and that is why I am emailing
the performance list.

I need to allow the user to specify any fields and values in a query.
For example,

"I want to see a report about all users from Germany that have flash
installed"  or
"I want to see a report about all users from Africa that are using FireFox 1"

I do not believe that storing all of the data in one big table is the
correct way to go about this. So, I am asking for suggestions,
pointers and any kind of info that anyone can share on how to store
this data set in an optimized manner.

Also, I have created a prototype and done some testing using the
colossal table. Actually finding all of the rows that satisfy the
query is pretty fast and is not a problem.  The bottleneck in the
whole process is actually counting each data point (how many times a
url was visited, or how many times a url referred the user to the
website). So more specifically I am wondering if there is way to store
and retrieve the data such that it speeds up the counting of the
statistics.

Lastly, this will become an open source tool that is akin to urchin,
awstats, etc. The difference is that this software is part of a suite
of tools for doing online web surveys and it maps web stats to the
survey respondent data.  This can give web site managers a very clear
view of what type of people come to the site and how those types use
the site.

Thanks in advance,

exty

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
    (send "unregister YourEmailAddressHere" to [hidden email])
Reply | Threaded
Open this post in threaded view
|

Re: optimized counting of web statistics

Billy extyeightysix
> The bottleneck in the
> whole process is actually counting each data point (how many times a
> url was visited, or how many times a url referred the user to the
> website). So more specifically I am wondering if there is way to store
> and retrieve the data such that it speeds up the counting of the
> statistics.

This is misleading, the counting is being done by perl.  so what is
happening is that I am locating all of the rows via a cursor and then
calculating the stats using perl hashes.  no counting is being in the
DB.  maybe it would be much faster to count in the db somehow?


exty

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to [hidden email]
12