Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

datacopy CSV with null trailing columns not importing with -u #231

Open
jdearing-neudesic opened this issue May 6, 2021 · 7 comments
Open
Labels

Comments

@jdearing-neudesic
Copy link

I have a csv of data as follows:

1,APPAREL,
2,"BIOTECH, HEALTH CARE & PHARMA",
3,"FOOD, BEVERAGE & AGRICULTURE",

The this column is parentId, and its a self referencing foreign key:

table peergroup:
    columns:
    - id:
        identity: by default
        not_null: true
        type: integer
    - name:
        not_null: true
        type: character varying
    - parentid:
        type: integer
    foreign_keys:
      fk_peergroup_peergroup:
        columns:
        - parentid
        references:
          columns:
          - id
          schema: model
          table: peergroup
    owner: postgres
    primary_key:
      pk_peergroup:
        columns:
        - id
    unique_constraints:
      uq_peergroup_name:
        columns:
        - name

If I run yamltodb without -u and copy the generated commands into psql \copy works fine, If I do it with \u I get

Traceback (most recent call last):
  File "/home/jdearing/.local/bin/yamltodb", line 8, in <module>
    sys.exit(main())
  File "/home/jdearing/.local/lib/python3.8/site-packages/pyrseas/yamltodb.py", line 70, in main
    db.dbconn.copy_from(stmt[3], stmt[1])
  File "/home/jdearing/.local/lib/python3.8/site-packages/pgdbconn/dbconn.py", line 157, in copy_from
    curs.copy_from(f, table, sep)
psycopg2.errors.InvalidTextRepresentation: invalid input syntax for type integer: ""
CONTEXT:  COPY peergroup, line 1, column parentid: ""
@jmafc
Copy link
Member

jmafc commented May 7, 2021

It seems odd that psql would accept the input file since the third column values are missing. Have you tried adding a 0 at the end of each line to test?
@dvarrazzo Sorry to bother you again, but if SQL copy (I realize it's a Postgres invention) is supposed to accept missing data values (as it apparently does) and supply a default (I'm not sure it's a NULL or a 0 that is being injected), then it seems that psycopg copy_from() would have to abide by this (weird IMHO) rule.

@jmafc jmafc added the yamltodb label May 7, 2021
@jdearing-neudesic
Copy link
Author

The docks for the COPY STATEMENT which I fully acknowledge is distincy from the \copy instruction of the psql executable claims that the default null is "an unquoted empty string in CSV format". Therefor since the CSV in question has a trailing comma on each line, unlike the other CSVs in my project which don't end in a comma, I can conclude that at least by the standards of the server side COPY statement that CSV has encoded a NULL value for parentId column.

@tbussmann
Copy link

tbussmann commented May 7, 2021

To track this down:

Docs on \copy:

The syntax of this command is similar to that of the SQL COPY command. All options other than the data source/destination are as specified for COPY.

Docs on COPY:

COPY table_name [ ( column_name [, ...] ) ]
    FROM { 'filename' | PROGRAM 'command' | STDIN }
    [ [ WITH ] ( option [, ...] ) ]
    [ WHERE condition ]

FORMAT
Selects the data format to be read or written: text, csv (Comma Separated Values), or binary. The default is text.
DELIMITER
Specifies the character that separates columns within each row (line) of the file. The default is a tab character in text format, a comma in CSV format. This must be a single one-byte character. This option is not allowed when using binary format.
NULL
Specifies the string that represents a null value. The default is \N (backslash-N) in text format, and an unquoted empty string in CSV format. You might prefer an empty string even in text format for cases where you don't want to distinguish nulls from empty strings. This option is not allowed when using binary format.

So everything seems to depend on the FORMAT parameter. Let's see how that is set:

# expected format: (\copy, table, from, path, csv)
db.dbconn.copy_from(stmt[3], stmt[1])
here we pass only two parameters (likely path and table name) to pgdbconn's copy_from. There, the separator is set to it's default , and passed to psycopg2's curs_copy_from. This adds a default value to NULL and constructs a COPY SQL command like COPY %s%s FROM stdin WITH DELIMITER AS %s NULL AS %s

To conclude: DELIMITER is fixed set to pgdbconn's default ,, NULL is fixed set to psycopg2's default \N. In the end we are doing a simple text import which is the Postgres default for FORMAT which is not set (and not setable with the given abstraction layers) by perseas/psycopg2. Let's try to find out why:

Remember the syntax of that COPY command constructed? This doesn't look like what we have leant from the docs in the beginning. So read them again to find:

The following syntax was used before PostgreSQL version 9.0 and is still supported:

COPY table_name [ ( column [, ...] ) ]
    FROM { 'filename' | STDIN }
    [ [ WITH ]
          [ BINARY ]
          [ OIDS ]
          [ DELIMITER [ AS ] 'delimiter' ]
          [ NULL [ AS ] 'null string' ]
          [ CSV [ HEADER ]
                [ QUOTE [ AS ] 'quote' ]
                [ ESCAPE [ AS ] 'escape' ]
                [ FORCE NOT NULL column [, ...] ] ] ]

Ha! FORMAT was not available back then!

@jmafc
Copy link
Member

jmafc commented May 7, 2021

Thanks for the detailed trace, Tobias. Aside: your postgresql.org links use latest in the path instead of current.
Although FORMAT may not have been available in 8.4 or earlier, CSV as a standalone keyword was supported (in fact, \copy even in PG 13 still accepts csv without a preceding format). In any case, going back to the commit a3fc1bc which implemented support for -u in processing \copy statements, it seems I had intended to add csv as an argument but I dropped the ball somewhere. Adding it now, would mean messing with pgdbconn, which IMO is another reason to reintegrate that code back into pyrseas.

@tbussmann
Copy link

Aside: your postgresql.org links use latest in the path instead of current.

thanks, fixed. Obviously a lack of QA ;)

Although FORMAT may not have been available in 8.4 or earlier, CSV as a standalone keyword was supported

right. And that does change NULL defaults, just like FORMAT CSV does. But the current defaults are supplied by the wrappers, not PostgreSQL itself.

Adding it now, would mean messing with pgdbconn, which IMO is another reason to reintegrate that code back into pyrseas.

actually - if I understand the code correctly - skipping pgdbconn would still not allow to specify 'CSV' as psycopg does not support this parameter at all. But it would be possible to pass an empty string for the 'null' parameter of it's curs_copy_from to mimic that behaviour.

@jmafc
Copy link
Member

jmafc commented May 7, 2021

Thanks Tobias for the clarification. I see now that psycopg2 copy_from and copy_to do not support the csv parameter in any way, only copy_expert does.
@jdearing-neudesic I'm afraid that at this point, the only thing we can do is document that if you're going to use yamltodb -u to load some tables, columns with SQL NULLs will have to be encoded as \N. Implementing a workaound with copy_expert seems to me as a step backward for very little gain, and adding an empty string in our code would only fix your situation but would break anyone using the standard default for NULLs, even those using dbtoyaml to export the data in the first place.

@dvarrazzo
Copy link
Contributor

@dvarrazzo Sorry to bother you again, but if SQL copy (I realize it's a Postgres invention) is supposed to accept missing data values (as it apparently does) and supply a default (I'm not sure it's a NULL or a 0 that is being injected), then it seems that psycopg copy_from() would have to abide by this (weird IMHO) rule.

Hello,

copy_from and copy_to were a bad idea, for the same reason that there aren't cursor.select() or a cursor.insert() methods each one with a million of parameters and a few hundreds missing because they were introduced in a later Postgres release. After acknowledging this design error, the maintainer at the time added copy_expert(), which is the only thing that makes sense, because it accommodates any future Postgres feature and, like execute(), doesn't assume knowledge of the statement to send, but only of its protocol.

So no, there is no intention to support other parameters in copy_from(): the method is dropped from psycopg3, which only supports a cursor.copy() method (which supports not only file-like objects but also record streams and adaptations like normal queries).

The psycopg2.sql module, and its psycopg3.sql counterpart in the next release, allow to generate syntactically valid queries escaping table and field names correctly and allowing the user to use whatever feature is supported by the server they are talking to.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants