Check for duplicates before inserting new rows

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

Check for duplicates before inserting new rows

Rich Shepard
This is a new issue for me: I've received data from a different source and
need to add non-duplicates to two tables in the database. Each row in the
new data has a station ID and associated measurements.

The existing database includes a table for station information and another
for measurements made there.

I want to learn which stations and which measurements in the new data set
are not included in the existing tables. And, I don't want to try inserting
the new data and have postgres tell me when it's found duplicates,
especially since there are two tables involved.

My research into how to do this has not found a solution so I ask for
pointers to resources that will teach me how to add these new data to both
tables.

Regards,

Rich



Reply | Threaded
Open this post in threaded view
|

Re: Check for duplicates before inserting new rows

George Woodring
I would suggest creating a temp table based on the original table and loading the data into it first.  You can then purge the duplicates.

George Woodring
iGLASS Networks
www.iglass.net


On Fri, Sep 4, 2020 at 9:21 AM Rich Shepard <[hidden email]> wrote:
This is a new issue for me: I've received data from a different source and
need to add non-duplicates to two tables in the database. Each row in the
new data has a station ID and associated measurements.

The existing database includes a table for station information and another
for measurements made there.

I want to learn which stations and which measurements in the new data set
are not included in the existing tables. And, I don't want to try inserting
the new data and have postgres tell me when it's found duplicates,
especially since there are two tables involved.

My research into how to do this has not found a solution so I ask for
pointers to resources that will teach me how to add these new data to both
tables.

Regards,

Rich



Reply | Threaded
Open this post in threaded view
|

Re: Check for duplicates before inserting new rows

Rich Shepard
On Fri, 4 Sep 2020, George Woodring wrote:

> I would suggest creating a temp table based on the original table and
> loading the data into it first. You can then purge the duplicates.

George,

I hadn't thought of this. Using a duplicate table without a PK would work
well if there's only one attribute that needs checking.

In my case there are two tables involved: locations and measurements. So, I
can create a temporary table from the new data which holds only the
locations. I can then isolate the new location rows that do not already
exist in that table, then insert the new data to eliminate the duplicated
measurements.

Thanks,

Rich



Reply | Threaded
Open this post in threaded view
|

Re: Check for duplicates before inserting new rows

Rich Shepard
In reply to this post by Rich Shepard
On Fri, 4 Sep 2020, Olivier Gautherot wrote:

> First of all, what version of PostgreSQL are you using?

Olivier,

12.2.

> One way would be to add a UNIQUE constraint and perform for each row of the
> source table a INSERT ... ON CONFLICT DO NOTHING (see
> https://www.postgresql.org/docs/current/sql-insert.html#SQL-ON-CONFLICT)
>
> If it is a 2-way merge, I would encapsulate the process in a function
> (which will create a transaction to protect your process) and add a column
> to trace the rows that have been merged. For this purpose, you can use the
> ON CONFLICT DO UPDATE ...

Lots for me to ponder.

The new data has attributes stored in two database tables: location and
measurements. I'll need to think about how both could be handled in a single
function. For example, the new data contains a row with a location not
already in the existing location table. That location needs to be added to
the location table and its associated measurement attributes then added to
the existing measurements table.

Thanks for the suggestion,

Rich


Reply | Threaded
Open this post in threaded view
|

Re: Check for duplicates before inserting new rows

Chris Sterritt


On 04/09/2020 15:46, Rich Shepard wrote:
On Fri, 4 Sep 2020, Olivier Gautherot wrote:

First of all, what version of PostgreSQL are you using?

Olivier,

12.2.

One way would be to add a UNIQUE constraint and perform for each row of the
source table a INSERT ... ON CONFLICT DO NOTHING (see
https://www.postgresql.org/docs/current/sql-insert.html#SQL-ON-CONFLICT)

If it is a 2-way merge, I would encapsulate the process in a function
(which will create a transaction to protect your process) and add a column
to trace the rows that have been merged. For this purpose, you can use the
ON CONFLICT DO UPDATE ...

Lots for me to ponder.

The new data has attributes stored in two database tables: location and
measurements. I'll need to think about how both could be handled in a single
function. For example, the new data contains a row with a location not
already in the existing location table. That location needs to be added to
the location table and its associated measurement attributes then added to
the existing measurements table.

Thanks for the suggestion,

Rich



Assuming some simple table structures (I've not included PK or FK definitions for simplicity):

create table station (station_id integer, station_data text);

create table measurement (measurement_id bigserial, station_id integer, ameasurement text);

-- Add some test stations
insert into station(station_id, station_data)
values (1, 'station1'),
 (2, 'station2'),
 (3, 'station3') ;


*Query to identify new measurements:
select station_id, ameasurement
from (VALUES (1,'1meas1'), -- this represents your new test data set
             (1,'1meas2'),
             (2,'2meas1'),
             (3,'3meas1')) as m(station_id, ameasurement)
except
select station_id, ameasurement
from measurement;

The query above will give you a list of the new values which are not currently stored in table measurement.
Run it and we get all 4 rows returned.

Add a measurement row:
insert into measurement(station_id, ameasurement)
values (2,'2meas1');

Now if you repeat the check for new measurements with the same query as at *, you only get 3 rows.

Cheers, Chris Sterritt
Reply | Threaded
Open this post in threaded view
|

Re: Check for duplicates before inserting new rows

Chris Sterritt
In reply to this post by Rich Shepard


On 04/09/2020 14:21, Rich Shepard wrote:
This is a new issue for me: I've received data from a different source and
need to add non-duplicates to two tables in the database. Each row in the
new data has a station ID and associated measurements.

The existing database includes a table for station information and another
for measurements made there.

I want to learn which stations and which measurements in the new data set
are not included in the existing tables. And, I don't want to try inserting
the new data and have postgres tell me when it's found duplicates,
especially since there are two tables involved.

My research into how to do this has not found a solution so I ask for
pointers to resources that will teach me how to add these new data to both
tables.

Regards,

Rich



To insert data into both tables:


drop table if exists station;
drop table if exists measurement;

create table station (station_id serial, station_data text);

create table measurement (measurement_id bigserial, station_id integer, ameasurement text);

insert into station(station_data)
values ('station1'),
 ('station2'),
 ('station3') ;

with src_data as (select station_data, ameasurement
                  from (VALUES ('station1','meas1'),
                               ('station2','meas2'),
                               ('station3','meas3'),
                               ('station4','meas4')) as m(station_data, ameasurement)),
     ins_station as (insert into station  (station_data)
                     select station_data from src_data
                     except
                     select station_data from station
                     returning station_id, station_data
                    )
insert into measurement (station_id, ameasurement)
    select s.station_id, sd.ameasurement
    from src_data sd
    join (SELECT station_id, station_data FROM ins_station UNION SELECT station_id, station_data FROM station) s using (station_data)
    except
    select station_id, ameasurement
    from measurement;

select * from station;
select * from measurement;


Regards,
Chris Sterritt
Reply | Threaded
Open this post in threaded view
|

Re: Check for duplicates before inserting new rows

Rich Shepard
In reply to this post by Rich Shepard
On Fri, 4 Sep 2020, Chris Sterritt wrote:

> Assuming some simple table structures (I've not included PK or FK definitions
> for simplicity):

Chris,

Thanks very much.

Stay well,

Rich