Bulk inserts into two (related) tables

Bulk inserts into two (related) tables

Rich Shepard
I'm cleaning and formatting an 800-line data file to be inserted into a
database. Some of the input file fields will be inserted into an
'organizations' table, letting postgres assign sequential org_id numbers.
Other fields will be inserted into a separate 'people' table associated with
each organization. The people table insert allows postgres to assign the
person_id number and this table's foreign key, people.org_id ==
organizations.org_id.

I could insert all new rows into the organizations table, then produce a
list of the org_id and org_name to manually insert the foreign key in the
related people table. That is time- and effort-intensive.

Is there a way to insert the two tables sequentially without manually adding
the organizations.org_id to the appropriate foreign key column (people.org_id)
of the people table?

Rich



Re: Bulk inserts into two (related) tables

Adrian Klaver-4
On 5/21/19 9:56 AM, Rich Shepard wrote:

> I'm cleaning and formatting a 800-line data file to be inserted into a
> database. Some of the input file fields will be inserted into an
> 'organizations' table letting postgres assign sequential org_id numbers.
> Other fields will be inserted into a separate 'people' table associated
> with
> each organization. The people table insert allows postgres to assign the
> person_id number and this table's foreign key, people.org_id ==
> organizations.org_id.
>
> I could insert all new rows into the organizations table, then produce a
> list of the org_id and org_name to manually insert the foreign key in the
> related people table. Time and effort intense.
>
> Is there a way to insert the two tables sequentially without manually
> adding
> the organizations.org_id to the appropriate foreign key column
> (people.org_id)
> of the people table?

Well you are not going to know the org_id until the organization table
is loaded, which means something like:

1) On the first run through the file, load the organizations table.

2) Build a mapping of org_id to organization.

3) Run through the data file again and load the people data, using the
mapping in 2) to provide the people.org_id.
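That two-pass load can be sketched as follows (illustrative Python, with SQLite standing in for Postgres; the table layout and the sample rows are assumptions, not the OP's actual file):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE organizations (org_id INTEGER PRIMARY KEY, org_name TEXT);
CREATE TABLE people (person_id INTEGER PRIMARY KEY, person_name TEXT,
                     org_id INTEGER REFERENCES organizations(org_id));
""")

# Stand-in for the cleaned data file: (org_name, person_name) pairs.
rows = [("Acme", "Bob"), ("Acme", "Janice"), ("Globex", "Chris")]

# Pass 1: load the organizations table, letting the DB assign org_id.
for org in {org for org, _ in rows}:
    conn.execute("INSERT INTO organizations (org_name) VALUES (?)", (org,))

# Pass 2: build the org_name -> org_id mapping, then load the people.
org_map = dict(conn.execute("SELECT org_name, org_id FROM organizations"))
for org, person in rows:
    conn.execute("INSERT INTO people (person_name, org_id) VALUES (?, ?)",
                 (person, org_map[org]))
```

The only requirement is that each input row carries something (here org_name) that identifies its organization.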

>
> Rich
>
>


--
Adrian Klaver
[hidden email]



Re: Bulk inserts into two (related) tables

Francisco Olarte
In reply to this post by Rich Shepard
Rich:

On Tue, May 21, 2019 at 6:56 PM Rich Shepard <[hidden email]> wrote:

> I'm cleaning and formatting a 800-line data file to be inserted into a
> database. Some of the input file fields will be inserted into an
> 'organizations' table letting postgres assign sequential org_id numbers.
> Other fields will be inserted into a separate 'people' table associated with
> each organization. The people table insert allows postgres to assign the
> person_id number and this table's foreign key, people.org_id ==
> organizations.org_id.
> I could insert all new rows into the organizations table, then produce a
> list of the org_id and org_name to manually insert the foreign key in the
> related people table. Time and effort intense.
> Is there a way to insert the two tables sequentially without manually adding
> the organizations.org_id to the appropriate foreign key column (people.org_id)
> of the people table?

From how you say it, I assume you have some data in your original
dumps which can relate both; let's assume it's org_name, but it may be an
org-code. If you do not have it, it means you cannot match people to
orgs in your data, and all is lost.

You can do it in a couple of steps: first do a copy (org_id, org_name,
org.others) into the organizations table,

then create a temporary table and copy (people.id, people.org_name,
people.*) into it,

and then insert into people "select tmp_people.id, orgs.org_id as
people_org_id, tmp_people.* from tmp_people, orgs where
tmp_people.org_name = orgs.org_name",

and drop the temp table.

If the matching data is some kind of code you do not want in the final
organizations table, create a temporary organizations table, copy into
it, join it with the temporary people table, insert into the final
organizations table slicing with a select, and drop it.

If you are using a sequence or similar thing for generating
organization ids and use the second approach, remember to use the same
sequence to generate the default values into the temporary table, or
chaos will ensue ( not really, but you'll need to advance it manually
).

This is a classic problem, and the classic way to solve it is just that:
prepare a script which loads some temporary tables and then inserts
joins into the final ones. On small data sets like yours you can just edit
everything into a single sql script.
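The staging-table approach above can be sketched like this (illustrative Python with SQLite standing in for Postgres; in Postgres the two loads would be COPY statements, and all names are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE organizations (org_id INTEGER PRIMARY KEY, org_name TEXT);
CREATE TABLE people (person_id INTEGER PRIMARY KEY, person_name TEXT,
                     org_id INTEGER);
-- Staging table: raw people rows plus the org_name that links them.
CREATE TEMP TABLE tmp_people (person_name TEXT, org_name TEXT);
""")

conn.execute("INSERT INTO organizations (org_name) VALUES ('Acme'), ('Globex')")
conn.executemany("INSERT INTO tmp_people VALUES (?, ?)",
                 [("Bob", "Acme"), ("Janice", "Acme"), ("Chris", "Globex")])

# The insert-join: resolve each staged person's org_name to its org_id.
conn.execute("""
INSERT INTO people (person_name, org_id)
SELECT t.person_name, o.org_id
FROM tmp_people t
JOIN organizations o ON o.org_name = t.org_name
""")
conn.execute("DROP TABLE tmp_people")
```

Everything after the staging loads is a single INSERT ... SELECT, which is what makes this approach scale past hand-editing.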

Francisco Olarte.



Re: Bulk inserts into two (related) tables

Rich Shepard
In reply to this post by Adrian Klaver-4
On Tue, 21 May 2019, Adrian Klaver wrote:

> Well you are not going to know the org_id until the organization table is
> loaded, which means something like:
>
> 1) First run through file load the organizations table.
>
> 2) Build a mapping of org_id to organization.
>
> 3) Run through data file again and load people data using the mapping in 2)
> to provide the people.org_id.

Adrian,

That's what I planned to do. I wondered if there was a more direct way known
to those with more experience than I have. The short answer is 'no.'

Thanks,

Rich



Re: Bulk inserts into two (related) tables

Michael Lewis
In reply to this post by Adrian Klaver-4
For each row-
Insert into organizations table if the record does not exist, returning ID.
Insert into people using that ID.

Else, load all the data with an empty ID column on the person table, then just update the person table afterward and drop the org name column.

Perhaps I am missing something.
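The per-row idea can be sketched as a get-or-create helper (illustrative Python; in Postgres the insert would use RETURNING, here SQLite's lastrowid stands in, and all names are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE organizations (org_id INTEGER PRIMARY KEY, org_name TEXT UNIQUE);
CREATE TABLE people (person_id INTEGER PRIMARY KEY, person_name TEXT,
                     org_id INTEGER);
""")

def org_id_for(conn, org_name):
    """Insert the organization if it does not exist yet; return its id."""
    row = conn.execute("SELECT org_id FROM organizations WHERE org_name = ?",
                       (org_name,)).fetchone()
    if row:
        return row[0]
    cur = conn.execute("INSERT INTO organizations (org_name) VALUES (?)",
                       (org_name,))
    return cur.lastrowid  # Postgres: INSERT ... RETURNING org_id

for org, person in [("Acme", "Bob"), ("Acme", "Janice"), ("Globex", "Chris")]:
    conn.execute("INSERT INTO people (person_name, org_id) VALUES (?, ?)",
                 (person, org_id_for(conn, org)))
```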

Re: Bulk inserts into two (related) tables

Rich Shepard
In reply to this post by Francisco Olarte
On Tue, 21 May 2019, Francisco Olarte wrote:

> From how you say it, I assume you have some data in your original
> dumps which can relate boths, lets assume it's org_name, but may be an
> org-code. If you do not have it it means you cannot match people to
> orgs in your data, all is lost.

Francisco,

Not yet with these new data.

I'll manually insert the org_id numbers from the organizations table into
the people table.

Thanks,

Rich



Re: Bulk inserts into two (related) tables

Adrian Klaver-4
In reply to this post by Rich Shepard
On 5/21/19 10:22 AM, Rich Shepard wrote:

> On Tue, 21 May 2019, Adrian Klaver wrote:
>
>> Well you are not going to know the org_id until the organization table
>> is loaded, which means something like:
>>
>> 1) First run through file load the organizations table.
>>
>> 2) Build a mapping of org_id to organization.
>>
>> 3) Run through data file again and load people data using the mapping
>> in 2) to provide the people.org_id.
>
> Adrian,
>
> That's what I planned to do. I wondered if there was a more direct way
> known
> to those with more experience than I have. The short answer is 'no.'

The other way is to create the org_id for each organization ahead of
time and put it into the data file. Either way you have to create the
org_id for the FK relationship; it is just a matter of where and when.

>
> Thanks,
>
> Rich
>
>


--
Adrian Klaver
[hidden email]



Re: Bulk inserts into two (related) tables

Rich Shepard
In reply to this post by Michael Lewis
On Tue, 21 May 2019, Michael Lewis wrote:

> For each row-
> Insert into organizations table if the record does not exist, returning ID.
> Insert into people using that ID.

Michael,

The org_id will not exist until I run the insert script.

> Else, load all the data with empty ID column on person table,then just
> update the person table afterward and drop the org name column.

Then I'll copy the org_id numbers to the appropriate row(s) in the people
table.

Thanks,

Rich



Re: Bulk inserts into two (related) tables

Rich Shepard
In reply to this post by Adrian Klaver-4
On Tue, 21 May 2019, Adrian Klaver wrote:

> The other way is to create the org_id for each organization ahead of time
> and put it into the data file. Either way you have create the org_id for
> the FK relationship, it is just a matter of where and when.

Adrian,

I had thought of that but overlooked it. The max(org_id) number is 338 so I
can assign sequential numbers above that as I format the data for the insert
scripts.

Thanks for the reminder,

Rich



Re: Bulk inserts into two (related) tables

Ron-2
In reply to this post by Rich Shepard
On 5/21/19 12:27 PM, Rich Shepard wrote:

> On Tue, 21 May 2019, Michael Lewis wrote:
>
>> For each row-
>> Insert into organizations table if the record does not exist, returning ID.
>> Insert into people using that ID.
>
> Michael,
>
> The org_id will not exist until I run the insert script.
>
>> Else, load all the data with empty ID column on person table,then just
>> update the person table afterward and drop the org name column.
>
> Then I'll copy the org_id numbers to the appropriate row(s) in the people
> table.

A Perl/Python/etc script would solve your problem.

--
Angular momentum makes the world go 'round.



Re: Bulk inserts into two (related) tables

Will Hartung
In reply to this post by Rich Shepard

On May 21, 2019, at 9:56 AM, Rich Shepard <[hidden email]> wrote:

I could insert all new rows into the organizations table, then produce a
list of the org_id and org_name to manually insert the foreign key in the
related people table. Time and effort intense.

You can either use a script for the entire process, or convert the people table to INSERT statements that have a SELECT for the foreign key as part of the insert.

INSERT INTO PEOPLE (id, name, org_id) VALUES (1, 'Bob', (SELECT org_id FROM org WHERE org_name='Main Office'))
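The same scalar subselect also works inside a multi-row VALUES list, one pair of parentheses per row (a runnable sketch, with SQLite standing in for Postgres and illustrative data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE org (org_id INTEGER PRIMARY KEY, org_name TEXT);
CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, org_id INTEGER);
""")
conn.execute("INSERT INTO org (org_name) VALUES ('Main Office'), ('Branch')")

# One statement, several rows, each resolving its own org_id via a subselect.
conn.execute("""
INSERT INTO people (name, org_id) VALUES
  ('Bob',    (SELECT org_id FROM org WHERE org_name = 'Main Office')),
  ('Janice', (SELECT org_id FROM org WHERE org_name = 'Branch'))
""")
```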



Re: Bulk inserts into two (related) tables

Rich Shepard
On Tue, 21 May 2019, Will Hartung wrote:

> You can either use a script for the entire process, or, convert the people
> table to INSERT statements that have a SELECT for the foreign key as part
> of the insert.
>
> INSERT INTO PEOPLE (id, name, org_id) VALUES (1, ‘Bob’, (SELECT org_id
> FROM org WHERE org_name=‘Main Office’))

Thanks, Will. That's a good alternative.

Regards,

Rich



Re: Bulk inserts into two (related) tables

jfinzel
In reply to this post by Rich Shepard
On Tue, May 21, 2019 at 12:24 PM Rich Shepard <[hidden email]> wrote:
On Tue, 21 May 2019, Francisco Olarte wrote:

> From how you say it, I assume you have some data in your original
> dumps which can relate boths, lets assume it's org_name, but may be an
> org-code. If you do not have it it means you cannot match people to
> orgs in your data, all is lost.

Francisco,

Not yet with these new data.

I'll manually insert the org_id numbers from the organizations table into
the people table.

To me, this is the key to your problem: what will either make this a time saver or a time waster. Somehow you are accounting for what uniquely identifies organizations, right?

Say there are 5 fields that correspond to an organization.  I assume then you are creating only one new org_id for each unique combination of these fields?

Then take Francisco's suggestion, only use an md5 of the organization fields to create yourself a unique identifier.  Then you can use ctid (the unique internal identifier for each row) to join back.  You can use SQL like this:

SELECT md5(row(org_name, org_stuff_1, org_stuff_2)::text) AS hash_identifier
FROM table;

Assume in example below that your unique "org" rows are the first 3 fields:

1. Load your data file into a loading table like so:
CREATE TABLE loader (org_name text, org_stuff_1 text, org_stuff_2 text, person_name text);
\copy loader from 'my_data.csv' with csv header
ALTER TABLE loader ADD COLUMN org_id INT;

Example data:
INSERT INTO loader VALUES ('a', ' ', ' ', 'Jerry');
INSERT INTO loader VALUES ('a', ' ', 'b', 'Bob');
INSERT INTO loader VALUES ('a', ' ', 'b', 'Janice');
INSERT INTO loader VALUES ('a', ' ', 'c', 'Chris');
INSERT INTO loader VALUES ('b', ' ', 'c', 'Jason');
INSERT INTO loader VALUES ('a', ' ', ' ', 'Alice');

2. Load org table:
test=# CREATE TABLE organizations (org_id serial primary key, org_name text, org_stuff_1 text, org_stuff_2 text);
CREATE TABLE
test=# INSERT INTO organizations (org_name, org_stuff_1, org_stuff_2)
test-# SELECT DISTINCT org_name, org_stuff_1, org_stuff_2
test-# FROM loader;
INSERT 0 4

3. Build mapping directly and update:
-- build hash of org fields in loader table, take ctid in order to map back later
WITH map_source AS (
SELECT ctid, md5(row(org_name, org_stuff_1, org_stuff_2)::text) AS hash_identifier
FROM loader)

-- build hash of org fields in organizations table to join back to loader and bring in org_id of course
, map_org AS (
SELECT org_id, md5(row(org_name, org_stuff_1, org_stuff_2)::text) AS hash_identifier
FROM organizations)

-- map by joining together on hash_identifier
, final_map AS (
SELECT org_id, ctid
FROM map_source l
INNER JOIN map_org o USING (hash_identifier)
)

-- Perform update
UPDATE loader l
SET org_id = fm.org_id
FROM final_map fm
WHERE fm.ctid = l.ctid;

Final data ready for the person table to be populated:
test=# table organizations;
 org_id | org_name | org_stuff_1 | org_stuff_2
--------+----------+-------------+-------------
      1 | a        |             | b
      2 | a        |             |
      3 | a        |             | c
      4 | b        |             | c
(4 rows)

test=# table loader;
 org_name | org_stuff_1 | org_stuff_2 | person_name | org_id
----------+-------------+-------------+-------------+--------
 a        |             |             | Jerry       |      2
 a        |             | b           | Bob         |      1
 a        |             | b           | Janice      |      1
 a        |             | c           | Chris       |      3
 b        |             | c           | Jason       |      4
 a        |             |             | Alice       |      2
(6 rows)


Hope this helps!
Thanks,
Jeremy


Re: Bulk inserts into two (related) tables

Francisco Olarte
In reply to this post by Rich Shepard
On Tue, May 21, 2019 at 7:24 PM Rich Shepard <[hidden email]> wrote:
> > From how you say it, I assume you have some data in your original
> > dumps which can relate boths, lets assume it's org_name, but may be an
> > org-code. If you do not have it it means you cannot match people to
> > orgs in your data, all is lost.
> Not yet with these new data.
> I'll manually insert the org_id numbers from the organizations table into
> the people table.

I'm curious: what org_id do you put (manually) on the people? You must
have some way to match it (like, say, "I have an organization line,
followed by lines for people in this organization").

Because if you cannot match them by hand, the DB is not going to be of use.

Knowing how you know the org for a person is crucial for a solution
(i.e., for the previous example I'll just add a synthetic org_code line
to each one and do the two-temp-tables trick for a long file, or just
advance the sequence (manually if needed, to reserve org_ids) and
then do a perl one-liner to augment the data; but I've been perling
since the 4.019 times, so this may not be as easy for others).

Francisco Olarte.



Re: Bulk inserts into two (related) tables

Francisco Olarte
In reply to this post by jfinzel
Jeremy:

On Tue, May 21, 2019 at 11:58 PM Jeremy Finzel <[hidden email]> wrote:

> Then take Francisco's suggestion, only use an md5 of the organization fields to create yourself a unique identifier.  Then you can use ctid (unique internal identifier for each row) to join back.  You use SQL like this:

Sadly my suggestion only works if you can (manually) assign an
organization line to a people line; md5, field concatenation, and
everything else is just optimization.

From what the OP has already told us, we have a heap of people, a heap of
organizations and a magic device to assign one to the others; the
org_id assignment (using an unknown algorithm, we do not know if he
wants sequences, texts or cat gifs as IDs) is easy, but the pairing part
is unsolvable with the data we have.

At this moment I think the only useful link for this is
http://xyproblem.info/ ( for the OP, not for U ).

Francisco Olarte.



Re: Bulk inserts into two (related) tables

Rich Shepard
In reply to this post by Francisco Olarte
On Wed, 22 May 2019, Francisco Olarte wrote:

> I'm curious, what org_id do you put (manually) to the people? you must
> have some way to match it ( like, say, "I have an organization line,
> followed by lines for people in this organization"

Francisco,

The first data entered was in small chunks so I'd add a few rows to the
organization table, output a file of org_id and org_name, then use the
source data to associate that org_id to the people associated with it.

With this new data source I want to populate the two tables more quickly. I
think the time consuming part is associating people with their organization.
I'll do more thinking about this.

I don't do perl, though.

Thanks,

Rich



Re: Bulk inserts into two (related) tables

jfinzel
On Wed, May 22, 2019 at 7:40 AM Rich Shepard <[hidden email]> wrote:
On Wed, 22 May 2019, Francisco Olarte wrote:

> I'm curious, what org_id do you put (manually) to the people? you must
> have some way to match it ( like, say, "I have an organization line,
> followed by lines for people in this organization"

Francisco,

The first data entered was in small chunks so I'd add a few rows to the
organization table, output a file of org_id and org_name, then use the
source data to associate that org_id to the people associated with it.

With this new data source I want to populate the two tables more quickly. I
think the time consuming part is associating people with their organization.
I'll do more thinking about this.

I don't do perl, though.

There's absolutely no need to use anything beyond SQL here, though you could if you want to.

I really wonder how much we are just talking past each other simply because we don't know what your data looks like, so we can't show you how our examples apply to your use case.  If you provided a sample scrubbed data file, this whole thread probably would have been much shorter :).  Can you do that?

You said here again the most time consuming part is associating people with their organization.  Well, that's the whole question I was trying to optimize on.  You told Francisco that the data file does not have a unique org name that could be used as a unique organization identifier.  However you seem to have contradicted that by responding favorably to this solution:

INSERT INTO PEOPLE (id, name, org_id) VALUES (1, 'Bob', (SELECT org_id FROM org WHERE org_name='Main Office'))

How can this solution work if you said org_name cannot be used to link a person in the data file?

So, the question again becomes: does your data file have a combination of org fields that allows you to assign a person to a unique organization?
  • If the answer is still no, how could you ever assign people to a unique organization?  In that case you need to massage your data file first before trying to build a database schema, as Francisco noted.  This is basically a non-starter for your database schema.
  • If the answer is yes, that means you can use the combination of those fields to uniquely identify an organization, and thus link people to it later.  That's the reason for the md5 - to easily use many fields in combination as a unique id
Thanks,
Jeremy

Re: Bulk inserts into two (related) tables

Rich Shepard
On Wed, 22 May 2019, Jeremy Finzel wrote:

> There's absolutely no need to use anything beyond SQL here, though you
> could if you want to.

Jeremy,

This is a new experience for me so I didn't think of a SQL solution.

> I really wonder how much we are just talking past each other simply because
> we don't know what your data looks like, so we can't show you how our
> examples apply to your use case. If you provided a sample scrubbed data
> file, this whole thread probably would have been much shorter :).  Can you
> do that?

Not necessary; see below. Also, these data come from a regulator and
are provided as an Excel spreadsheet. If they were extracted from a database
then that was very poorly designed, because there's no consistency in how
fields/columns are formatted. This requires manual cleaning.

Each row in the source file (exported from the spreadsheet as .csv and
renamed to .txt for processing in emacs and awk) is a mixture of attributes
that belong in either or both of the organization and people tables in my
database. An awk script will extract the appropriate fields for each table.

> You told Francisco that the data file does not have a unique org name that
> could be used as a unique organization identifier. However you seem to
> have contradicted that by responding favorably to this solution:

The org_name is not the PK; the org_id is. This was assigned by postgres
when the original rows were inserted. Now, I can add the org_id in the
values to be inserted as I know the maximum org_id number in that table.

> INSERT INTO PEOPLE (id, name, org_id) VALUES (1, ‘Bob’, (SELECT org_id FROM
> org WHERE org_name=‘Main Office’))

Question: do I use this same syntax for each row to be inserted or can I
make it one long insert statement by separating the parenthesized values
with commas as I do when I update multiple rows in a table?

Thanks very much,

Rich



Re: Bulk inserts into two (related) tables

lup
>
> Each row in the source file (exported from the spreadsheet as .csv and
> renamed to .txt for processing in emacs and awk) is a mixture of attributes
Absolutely no need to rename the .csv for those tools.
> that belong in either or both of the organization and people tables in my
> database. An awk script will extract the appropriate fields for each table.




Re: Bulk inserts into two (related) tables

Adrian Klaver-4
In reply to this post by Rich Shepard
On 5/22/19 7:38 AM, Rich Shepard wrote:

> On Wed, 22 May 2019, Jeremy Finzel wrote:
>
>> There's absolutely no need to use anything beyond SQL here, though you
>> could if you want to.
>
> Jeremy,
>
> This is a new experience for me so I didn't think of a SQL solution.
>
>> I really wonder how much we are just talking past each other simply
>> because
>> we don't know what your data looks like, so we can't show you how our
>> examples apply to your use case. If you provided a sample scrubbed data
>> file, this whole thread probably would have been much shorter :).  Can
>> you
>> do that?
>
> Not necessary; see below. Also, these data come from a regulator and
> provided as an Excel spreadsheet. If they were extracted from a database
> then that was very poorly designed because there's no consistency in how
> fields/columns are formatted. This requires manual cleaning.
>
> Each row in the source file (exported from the spreadsheet as .csv and
> renamed to .txt for processing in emacs and awk) is a mixture of attributes
> that belong in either or both of the organization and people tables in my
> database. An awk script will extract the appropriate fields for each table.

So does the people data have an organization attribute?

If so why not just assign the org_id while cleaning up the data?

>
>> You told Francisco that the data file does not have a unique org name
>> that
>> could be used as a unique organization identifier. However you seem to
>> have contradicted that by responding favorably to this solution:
>
> The org_name is not the PK; the org_id is. This was assigned by postgres
> when the original rows were inserted. Now, I can add the org_id in the
> values to be inserted as I know the maximum org_id number in that table.
>
>> INSERT INTO PEOPLE (id, name, org_id) VALUES (1, ‘Bob’, (SELECT org_id
>> FROM
>> org WHERE org_name=‘Main Office’))
>
> Question: do I use this same syntax for each row to be inserted or can I
> make it one long insert statement by separating the parenthesized values
> with commas as I do when I update multiple rows in a table?
>
> Thanks very much,
>
> Rich
>
>
>


--
Adrian Klaver
[hidden email]

