Subject:[jira] [Commented] (SPARK-22335) Union for DataSet uses column order instead of types for union
From:Carlos Bribiescas (JIRA) (ji@apache.org)
Date:Oct 24, 2017 6:56:00 am
List:org.apache.spark.issues

[
https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16216950#comment-16216950
]

Carlos Bribiescas commented on SPARK-22335:

-------------------------------------------

I think if unionByName replaced union then it would be a solution. It's
definitely a workaround... But as the API stands it feels like a bug, since
Dataset is supposed to be typed.

Again, I suspect it has to do with the optimizer pushing the typing to a later
step, after the union by column order happens. If this is the root cause of the
bug, I worry about how else it's being manifested, that is, what other bugs it may
cause. I'll have to think about it a bit more.
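
For reference, here is a minimal sketch of the unionByName workaround mentioned above, reusing the AB case class from the reproduction in the issue description. It assumes Spark 2.3.0 or later, where Dataset.unionByName is available (it is not in the affected 2.2.0 release), and a local SparkSession; the object name and app name are placeholders chosen for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Same shape as the case class in the reported reproduction.
case class AB(a: String, b: String)

// Hypothetical driver object, for illustration only.
object UnionByNameSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is acceptable for trying this out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("union-by-name-sketch")
      .getOrCreate()
    import spark.implicits._

    // Two Datasets with the same fields but a different underlying column order.
    val abDs = Seq(("aThing", "bThing")).toDF("a", "b").as[AB]
    val baDs = Seq(("bThing", "aThing")).toDF("b", "a").as[AB]

    // unionByName resolves columns by name instead of by position,
    // so the differing column order of baDs should no longer matter.
    abDs.unionByName(baDs).map(_.a).show()

    spark.stop()
  }
}
```

This only sidesteps the problem at the call site; union itself still resolves columns by position.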

Union for DataSet uses column order instead of types for union

--------------------------------------------------------------

Key: SPARK-22335
URL: https://issues.apache.org/jira/browse/SPARK-22335
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.2.0
Reporter: Carlos Bribiescas

I see that union uses column order for a DF. This to me is "fine" since they aren't
typed. However, for a Dataset, which is supposed to be strongly typed, it actually
gives the wrong result. If you try to access the members by name, it will use
the order. Here is a reproducible case.

2.2.0

{code:java}
case class AB(a : String, b : String)

val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")

abDf.union(baDf).show() // as linked ticket states, it's "Not a problem"

val abDs = abDf.as[AB]
val baDs = baDf.as[AB]

abDs.union(baDs).show() // This gives wrong result since a Dataset[AB] should be correctly mapped by type, not by column order

abDs.union(baDs).map(_.a).show() // This gives wrong result since a Dataset[AB] should be correctly mapped by type, not by column order
abDs.union(baDs).rdd.take(2) // This also gives wrong result
baDs.map(_.a).show() // However, this gives the correct result, even though columns were out of order.
abDs.map(_.a).show() // This is correct too
baDs.select("a","b").as[AB].union(abDs).show() // This is the same workaround for linked issue, slightly modified. However this seems wrong since it's supposed to be strongly typed

baDs.rdd.toDF().as[AB].union(abDs).show() // This however gives correct result, which is logically inconsistent behavior
abDs.rdd.union(baDs.rdd).toDF().show() // Simpler example that gives correct result
{code}

So it's inconsistent and a bug IMO. And I'm not sure that the suggested
workaround is really fair, since I'm supposed to be getting values of type `AB`. More
importantly, I think the issue is bigger when you consider that it happens even
if you read from parquet (as you would expect). And that it's inconsistent when
going to/from rdd.

I imagine it's just lazily converting to a typed DS instead of doing so initially. So
either that typing could be prioritized to happen before the union, or unioning
of DF could be done with column order taken into account. Again, this is
speculation.
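
As a rough illustration of the "apply the typing before the union" idea, the sketch below generalizes the select("a","b").as[AB] workaround from the reproduction: it reorders both sides to the field order of the encoder's schema before calling union. The object and method names (AlignedUnion, alignedUnion) are hypothetical helpers, not part of Spark, and the sketch only handles top-level columns.

```scala
import org.apache.spark.sql.{Dataset, Encoder}
import org.apache.spark.sql.functions.col

// Hypothetical helper, not a Spark API.
object AlignedUnion {
  // Reorders the columns of both Datasets to the field order declared by
  // the encoder for T, then performs the usual positional union.
  def alignedUnion[T: Encoder](left: Dataset[T], right: Dataset[T]): Dataset[T] = {
    // Field names in the order the type T declares them.
    val ordered = implicitly[Encoder[T]].schema.fieldNames.map(col)
    left.select(ordered: _*).as[T]
      .union(right.select(ordered: _*).as[T])
  }
}
```

With the Datasets from the reproduction, this would be called as AlignedUnion.alignedUnion(abDs, baDs) with spark.implicits._ in scope, so that the encoder for AB can be resolved.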