[SQL] Duplicated records

[SQL] Duplicated records

Lucas-21
Hi.
How can I delete the duplicated records with a "DELETE FROM table WHERE ..."
clause?
The problem is that I imported the data from a dBase (.dbf) file; that import
did not build any constraints (unique, primary key, ...), and it is run
regularly.

 select * from table1; --id may be primary key
 - Table1 -
 id | field 2
 0  | 'aaa'
 1  | 'bbb'
 2  | 'ccc'
 0  | 'aaa'  <<== The data is duplicated
 1  | 'bbb'
 2  | 'ccc'
 0  | 'aaa'
 1  | 'bbb'
 2  | 'ccc'
Is there a way to delete the duplicated data without building another table
with constraints and copying the data into it?
Something like "DELETE FROM table1 WHERE ...?"

Thanks,
Lucas

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to [hidden email]
Re: [SQL] Duplicated records

pfc-2

> How can I delete the duplicated records with a "DELETE FROM table WHERE ..."
> clause?
> The problem is that I imported the data from a dBase (.dbf) file; that import
> did not build any constraints (unique, primary key, ...), and it is run
> regularly.

        If you have no primary key, how can you reference a record in order to
delete it?
        I'd say use a temporary table...
        If you have complete row dupes (i.e. the entire row is duplicated), use
        SELECT DISTINCT * FROM table
        (or group by all the columns)

        If only the primary key is duplicated but the other fields differ, you
have to decide which one you want to keep!
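The group-on-every-column idea above can be sketched in Python with SQLite; the table name and sample rows below are made up for illustration, not taken from the thread:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER, field2 TEXT)")
conn.executemany("INSERT INTO table1 VALUES (?, ?)",
                 [(0, "aaa"), (1, "bbb"), (0, "aaa"), (0, "aaa")])

# Grouping by every column (equivalent to SELECT DISTINCT *) collapses
# complete-row duplicates; HAVING reports how many copies each row had.
dupes = conn.execute("""
    SELECT id, field2, COUNT(*) AS copies
    FROM table1
    GROUP BY id, field2
    HAVING COUNT(*) > 1
""").fetchall()
print(dupes)  # → [(0, 'aaa', 3)]
```

Running the HAVING query first is a cheap way to see how bad the duplication is before deciding how to clean it up.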

Re: [despammed] [SQL] Duplicated records

Andreas Kretschmer
In reply to this post by Lucas-21
On 24.05.2005 at 17:59:31 -0300, [hidden email] wrote:
> Hi.
> How can I delete the duplicated records with "DELETE FROM TABLE WHERE..."
> clause??

Please read http://www.gtsm.com/oscon2003/deletetid.html

It's a very good article about this problem.


Regards, Andreas
--
Andreas Kretschmer    (contact: see header)
Heynitz:  035242/47212,      D1: 0160/7141639
GnuPG-ID 0x3FFF606C http://wwwkeys.de.pgp.net
 ===    Schollglas Unternehmensgruppe    ===

Re: [SQL] Duplicated records

Lucas-21
Hi.
Thanks for the article...
I have read it, but the query runs very slowly.
My table has approx. 180,000 correct records, but the entire table holds
approx. 360,000 records including the duplicates.
I tried to execute a query to delete the duplicated records, but it ran very,
very slowly... look:

# select * from lanctos order by numos;
  numos | field1 | field2 | field3 |...
  00001 | test   | T2-2   | 2      |...
  00001 | test   | T2-2   | 2      |...
  00002 | Blabla | 0      | ABC    |...
  00002 | Blabla | 0      | ABC    |...
  00003 | Llllll | Oooooo | Rrrrrr |...
  00003 | Llllll | Oooooo | Rrrrrr |...
...

The records are entirely duplicated (all fields contain the same data).
Treating the "numos" field as the primary key, I executed the query:

# DELETE from lanctos where not oid=(select oid from lanctos as l2 where
l2.numos=lanctos.numos limit 1);

I have tested other queries with the EXPLAIN command to examine the execution
time, and this query was the best performer I found... but it is still slow.
Another query is:

# DELETE from lanctos where not exists (select '1' from lanctos as l2 where
l2.numos=lanctos.numos and not l2.oid=lanctos.oid);

Is there a way to delete those duplicated records faster? Remember, the
table has approx. 360,000 records...
Would it be better to create another table and copy the data? How should I
create it?

Thanks.


Quoting Andreas Kretschmer <[hidden email]>:
> On 24.05.2005 at 17:59:31 -0300, [hidden email] wrote:
>> Hi.
>> How can I delete the duplicated records with "DELETE FROM TABLE WHERE..."
>> clause??
>
> Please read http://www.gtsm.com/oscon2003/deletetid.html
>
> It's a very good article about this problem.
>


Re: [SQL] Duplicated records

Alvaro Herrera
On Wed, May 25, 2005 at 01:58:07PM -0300, [hidden email] wrote:

> The records are entirely duplicated (all fields contain the same data).
> Treating the "numos" field as the primary key, I executed the query:
>
> # DELETE from lanctos where not oid=(select oid from lanctos as l2 where
> l2.numos=lanctos.numos limit 1);

That's because you don't have an index on the Oid column.  You may want
to try using the ctid column instead; it's the physical row position in
the table so it doesn't need to scan the whole table each time.
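The keep-one-row-per-key delete that this suggestion leads to can be sketched in Python with SQLite, where the implicit rowid plays the role PostgreSQL's ctid plays here (the lanctos/numos names are from the thread; the sample rows are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lanctos (numos TEXT, field1 TEXT)")
conn.executemany("INSERT INTO lanctos VALUES (?, ?)",
                 [("00001", "test"), ("00001", "test"),
                  ("00002", "Blabla"), ("00002", "Blabla")])

# Keep the first physical copy of each numos and delete the rest.
# SQLite's rowid stands in for PostgreSQL's ctid (physical row address).
conn.execute("""
    DELETE FROM lanctos
    WHERE rowid NOT IN (SELECT MIN(rowid) FROM lanctos GROUP BY numos)
""")

rows = conn.execute("SELECT numos, field1 FROM lanctos ORDER BY numos").fetchall()
print(rows)  # → [('00001', 'test'), ('00002', 'Blabla')]
```

In PostgreSQL the analogous statement would group on the key and keep the minimum ctid per group, avoiding the per-row correlated subquery that made the original DELETE so slow.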

HTH,

--
Alvaro Herrera (<alvherre[a]surnet.cl>)
"The only difference is that Saddam would kill you on private, where the
Americans will kill you in public" (Mohammad Saleh, 39, a building contractor)

Re: [despammed] Re: [SQL] Duplicated records

Andreas Kretschmer
In reply to this post by Lucas-21
On 25.05.2005 at 13:58:07 -0300, [hidden email] wrote:
> Hi.
> Thanks for the article...
> But, I have read it and the query works very slow...
> My table has approx. 180,000 correct records, but the entire table holds
> approx. 360,000 records including the duplicates.

How often is this necessary?


> Is there a way to delete those duplicated records faster??? Remembering the
> table have aprox 360.000 records...

I don't know, but I think you should prevent duplicated records in the
future, and do the cleanup (delete the duplicates) now.

Btw.: you wrote that there is a primary key on the first column. Really?

,----[  psql session, output translated from German  ]
| test_db=# create table blub (id int primary key, name varchar);
| NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "blub_pkey" for table "blub"
| CREATE TABLE
| test_db=# insert into blub values (1, 'x');
| INSERT 970706 1
| test_db=# insert into blub values (1, 'y');
| ERROR:  duplicate key violates unique constraint "blub_pkey"
`----

In other words: if there is a primary key on the first column, you cannot
insert duplicates.


Regards, Andreas
--
Andreas Kretschmer    (contact: see header)
Heynitz:  035242/47212,      D1: 0160/7141639
GnuPG-ID 0x3FFF606C http://wwwkeys.de.pgp.net
 ===    Schollglas Unternehmensgruppe    ===

Re: [SQL] Duplicated records

Lucas-21
In reply to this post by Lucas-21
Thanks....
The CTAS (CREATE TABLE AS SELECT) command works fine, with great performance!
I think it is the best way to correct the data... (apparently)
I didn't know about SELECT DISTINCT. I am going to read about it.

Thank you.

Quoting Bricklen Anderson <[hidden email]>:

>> Is there a way to delete those duplicated records faster? Remember, the
>> table has approx. 360,000 records...
>> Would it be better to create another table and copy the data? How should I
>> create it?
>>
> for 180k rows (which isn't many) just do a CTAS (Create Table As Select):
>
> create table lanctos_distinct as select distinct * from lanctos;
> drop table lanctos;
> alter table lanctos_distinct rename to lanctos;
>


Re: [SQL] Duplicated records

Andrew Hammond-2

[hidden email] wrote:
> Thanks....
> The CTAS (CREATE TABLE AS SELECT) command works fine, with great performance!
> I think it is the best way to correct the data... (apparently)
> I didn't know about SELECT DISTINCT. I am going to read about it.

You might want to wrap this in some locking and throw in some
constraints to avoid dupes in the future...

BEGIN;
LOCK TABLE lanctos IN ACCESS EXCLUSIVE MODE;
CREATE TABLE lanctos_distinct AS SELECT DISTINCT * FROM lanctos;
DROP TABLE lanctos;
ALTER TABLE lanctos_distinct RENAME TO lanctos;
ALTER TABLE lanctos ALTER id SET NOT NULL;
CREATE UNIQUE INDEX lanctos_id_idx ON lanctos (id);
ALTER TABLE lanctos ADD CONSTRAINT lanctos_id_pkey PRIMARY KEY (id);
COMMIT;

As always, don't forget to ANALYZE the new table.
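The whole dedupe-and-swap flow above can be sketched from Python with SQLite (table and column names from the thread; SQLite has no ACCESS EXCLUSIVE lock mode and autocommits script statements, so the explicit LOCK and transaction wrapper from the PostgreSQL version are omitted here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lanctos (id INTEGER, field2 TEXT);
    INSERT INTO lanctos VALUES (0,'aaa'),(1,'bbb'),(2,'ccc'),
                               (0,'aaa'),(1,'bbb'),(2,'ccc');
""")

# CTAS into a deduplicated copy, swap it in, then add the unique index
# so the duplicates cannot come back on the next import.
conn.executescript("""
    CREATE TABLE lanctos_distinct AS SELECT DISTINCT * FROM lanctos;
    DROP TABLE lanctos;
    ALTER TABLE lanctos_distinct RENAME TO lanctos;
    CREATE UNIQUE INDEX lanctos_id_idx ON lanctos (id);
""")

rows = conn.execute("SELECT id, field2 FROM lanctos ORDER BY id").fetchall()
print(rows)  # → [(0, 'aaa'), (1, 'bbb'), (2, 'ccc')]
```

Note that this approach only works when the rows are complete duplicates; if two rows share an id but differ elsewhere, DISTINCT keeps both and the unique index creation fails, forcing you to decide which copy to keep.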

--
Andrew Hammond    416-673-4138    [hidden email]
Database Administrator, Afilias Canada Corp.
CB83 2838 4B67 D40F D086 3568 81FC E7E5 27AF 4A9A
