Lingual 1.1 silently drops first line of tsv source files #23

Closed
alexanderdean opened this issue Apr 25, 2014 · 5 comments

@alexanderdean

Setup:

#!/bin/bash

# Config
hdfs_path=/local/lingual-tsv-test/
hbase_table=out
hbase_col_family=fields
export LINGUAL_PLATFORM=hadoop
export HADOOP_USER_NAME=hadoop

# Lingual 1.1
lingual catalog --init
lingual catalog --provider --add cascading:cascading-hbase:2.2.0:provider
lingual catalog --schema IN --add
lingual catalog --schema IN --stereotype TSVFILE --add \
--columns A,B \
--types   string,string
lingual catalog --schema IN --table IN --stereotype TSVFILE --add "${hdfs_path}" --format tsv

First row lost

Script:

# Test data - note no newline at start
printf "1\ta\n2\tb\n3\tc\n" > /tmp/lossy.tsv
hadoop fs -copyFromLocal /tmp/lossy.tsv "${hdfs_path}lossy.tsv"

# First line missing
lingual shell --sql - <<- EOQ
    SELECT * from "IN"."IN";
EOQ

Output:

{utcTimestamp=1398421144285, currentTimestamp=1398421144285, localTimestamp=1398421144285, timeZone=sun.util.calendar.ZoneInfo[id="Etc/UTC",offset=0,dstSavings=0,useDaylight=false,transitions=0,lastRule=null]}
{utcTimestamp=1398421144285, currentTimestamp=1398421144285, localTimestamp=1398421144285, timeZone=sun.util.calendar.ZoneInfo[id="Etc/UTC",offset=0,dstSavings=0,useDaylight=false,transitions=0,lastRule=null]}
{utcTimestamp=1398421144285, currentTimestamp=1398421144285, localTimestamp=1398421144285, timeZone=sun.util.calendar.ZoneInfo[id="Etc/UTC",offset=0,dstSavings=0,useDaylight=false,transitions=0,lastRule=null]}
+----+----+
| A  | B  |
+----+----+
| 2  | b  |
| 3  | c  |
+----+----+
2 rows selected (1.26 seconds)

Newline at start fixes it

Script:

# Test data - add newline at start
printf "\n1\ta\n2\tb\n3\tc\n" > /tmp/not-lossy.tsv
hadoop fs -rmr "${hdfs_path}"
hadoop fs -copyFromLocal /tmp/not-lossy.tsv "${hdfs_path}not-lossy.tsv"

# All three rows are returned this time
lingual shell --sql - <<- EOQ
    SELECT * from "IN"."IN";
EOQ

Output:

{utcTimestamp=1398421279816, currentTimestamp=1398421279816, localTimestamp=1398421279816, timeZone=sun.util.calendar.ZoneInfo[id="Etc/UTC",offset=0,dstSavings=0,useDaylight=false,transitions=0,lastRule=null]}
{utcTimestamp=1398421279816, currentTimestamp=1398421279816, localTimestamp=1398421279816, timeZone=sun.util.calendar.ZoneInfo[id="Etc/UTC",offset=0,dstSavings=0,useDaylight=false,transitions=0,lastRule=null]}
{utcTimestamp=1398421279816, currentTimestamp=1398421279816, localTimestamp=1398421279816, timeZone=sun.util.calendar.ZoneInfo[id="Etc/UTC",offset=0,dstSavings=0,useDaylight=false,transitions=0,lastRule=null]}
+----+----+
| A  | B  |
+----+----+
| 1  | a  |
| 2  | b  |
| 3  | c  |
+----+----+
3 rows selected (1.318 seconds)
@alexanderdean changed the title from "Lingual 1.1 drops first line of tsv source files" to "Lingual 1.1 silently drops first line of tsv source files" on Apr 25, 2014
@joeposner
Contributor

Closing as not a bug: the default behavior with .tsv is to assume the first line is the header. Either the code here that creates the tsv file should write a header row (the preferred option), or the header=false property should be set in the catalog definition.
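
To illustrate the header behavior described above without needing a Lingual install, here is a minimal sketch in plain shell. It assumes the header row simply repeats the stereotype columns A and B, and it uses tail -n +2 as a stand-in for a reader that consumes line 1 as a header; the temp-file names are hypothetical and this is not Lingual's actual reader.

```shell
#!/bin/bash
# Hypothetical scratch files for the comparison.
with_header=$(mktemp)
without_header=$(mktemp)

# Assumption: the header names match the stereotype columns A,B.
printf "A\tB\n1\ta\n2\tb\n3\tc\n" > "${with_header}"
# Same data as the issue's lossy.tsv, with no header row.
printf "1\ta\n2\tb\n3\tc\n" > "${without_header}"

# tail -n +2 skips line 1, mimicking a reader that treats it as the header.
rows_with_header=$(tail -n +2 "${with_header}" | wc -l | tr -d ' ')
rows_without_header=$(tail -n +2 "${without_header}" | wc -l | tr -d ' ')

echo "data rows read (header present): ${rows_with_header}"
echo "data rows read (header absent):  ${rows_without_header}"
```

With a header row present, all three data rows survive; without one, the first data row is consumed as the header, which matches the two-row result in the report.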

@alexanderdean
Author

Thanks for clarifying; it looks like --properties header=false will do the trick.

@alexanderdean
Author

Hmm, actually I don't fully understand this. What's the relationship between the header column names in the TSV files and the column names specified in the --stereotype? If they don't match, no error is reported. Which one would be used in that case?

@joeposner
Contributor

That's definitely something that we need to fix; we should report an error more clearly if you try to set up a mismatch like that.

Currently you have to be careful not to make a stereotype that contradicts a source that's self-defining.
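
Until such an error is reported, one way to be careful is to compare a file's first line against the stereotype columns before loading it. The following is a minimal sketch in plain shell; check_tsv_header is a hypothetical helper, not part of Lingual, and it assumes an exact, case-sensitive match against the comma-separated list passed to --columns.

```shell
#!/bin/bash
# check_tsv_header FILE EXPECTED
# Hypothetical helper, not part of Lingual.
# EXPECTED is a comma-separated column list, e.g. the value given to --columns.
check_tsv_header() {
    local file="$1" expected="$2" actual
    # Take the first line and turn its tabs into commas for comparison.
    actual=$(head -n 1 "${file}" | tr '\t' ',')
    if [ "${actual}" = "${expected}" ]; then
        echo "header matches: ${actual}"
    else
        echo "header mismatch: expected '${expected}', found '${actual}'" >&2
        return 1
    fi
}
```

For example, check_tsv_header /tmp/lossy.tsv A,B would fail for the header-less file from the report, because its first line is a data row rather than the column names.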

@alexanderdean
Author

Awesome, added #25
