4 messages in org.apache.spark.issues: [jira] [Commented] (SPARK-20055) Documentation for CSV datasets in SQL programming guide

From / Sent On:
Apache Spark (JIRA), Oct 4, 2017 6:16 am
Jorge Machado (JIRA), Oct 4, 2017 6:17 am
Andrew Ash (JIRA), Oct 5, 2017 11:10 am
Jorge Machado (JIRA), Oct 5, 2017 9:40 pm
Subject:[jira] [Commented] (SPARK-20055) Documentation for CSV datasets in SQL programming guide
From:Jorge Machado (JIRA) (ji@apache.org)
Date:Oct 5, 2017 9:40:00 pm
List:org.apache.spark.issues

[ https://issues.apache.org/jira/browse/SPARK-20055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16194129#comment-16194129 ]

Jorge Machado commented on SPARK-20055:

---------------------------------------

[~aash] Should I copy-paste those options? There are already some docs:

{code:java}
def csv(paths: String*): DataFrame

Loads CSV files and returns the result as a DataFrame.

This function will go through the input once to determine the input schema if
inferSchema is enabled. To avoid going through the entire data once, disable the
inferSchema option or specify the schema explicitly using schema.

You can set the following CSV-specific options to deal with CSV files:

* sep (default ,): sets the single character as a separator for each field and value.
* encoding (default UTF-8): decodes the CSV files by the given encoding type.
* quote (default "): sets the single character used for escaping quoted values where the separator can be part of the value. If you would like to turn off quotations, you need to set not null but an empty string. This behaviour is different from com.databricks.spark.csv.
* escape (default \): sets the single character used for escaping quotes inside an already quoted value.
* comment (default empty string): sets the single character used for skipping lines beginning with this character. By default, it is disabled.
* header (default false): uses the first line as names of columns.
* inferSchema (default false): infers the input schema automatically from data. It requires one extra pass over the data.
* ignoreLeadingWhiteSpace (default false): a flag indicating whether or not leading whitespaces from values being read should be skipped.
* ignoreTrailingWhiteSpace (default false): a flag indicating whether or not trailing whitespaces from values being read should be skipped.
* nullValue (default empty string): sets the string representation of a null value. Since 2.0.1, this applies to all supported types including the string type.
* nanValue (default NaN): sets the string representation of a non-number value.
* positiveInf (default Inf): sets the string representation of a positive infinity value.
* negativeInf (default -Inf): sets the string representation of a negative infinity value.
* dateFormat (default yyyy-MM-dd): sets the string that indicates a date format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to date type.
* timestampFormat (default yyyy-MM-dd'T'HH:mm:ss.SSSXXX): sets the string that indicates a timestamp format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to timestamp type.
* maxColumns (default 20480): defines a hard limit of how many columns a record can have.
* maxCharsPerColumn (default -1): defines the maximum number of characters allowed for any given value being read. By default, it is -1, meaning unlimited length.
* mode (default PERMISSIVE): allows a mode for dealing with corrupt records during parsing. It supports the following case-insensitive modes.
** PERMISSIVE: sets other fields to null when it meets a corrupted record, and puts the malformed string into a field configured by columnNameOfCorruptRecord. To keep corrupt records, a user can set a string type field named columnNameOfCorruptRecord in a user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. When the length of parsed CSV tokens is shorter than the expected length of the schema, it sets null for the extra fields.
** DROPMALFORMED: ignores the whole corrupted records.
** FAILFAST: throws an exception when it meets corrupted records.
* columnNameOfCorruptRecord (default is the value specified in spark.sql.columnNameOfCorruptRecord): allows renaming the new field having the malformed string created by PERMISSIVE mode. This overrides spark.sql.columnNameOfCorruptRecord.
* multiLine (default false): parse one record, which may span multiple lines.
{code}
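
For reference, a minimal usage sketch of a few of the options listed above, assuming the usual spark-shell SparkSession named {{spark}}; the separator, option values, and file path are illustrative only and not from the ticket:

{code:scala}
// Hypothetical example: read a semicolon-separated CSV file that has a header line,
// letting Spark infer the column types in one extra pass over the data.
val df = spark.read
  .option("sep", ";")            // field separator (illustrative)
  .option("header", "true")      // use the first line as column names
  .option("inferSchema", "true") // infer column types from the data
  .csv("examples/src/main/resources/people.csv") // hypothetical path
df.show()
{code}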

Documentation for CSV datasets in SQL programming guide

-------------------------------------------------------

Key: SPARK-20055
URL: https://issues.apache.org/jira/browse/SPARK-20055
Project: Spark
Issue Type: Improvement
Components: Documentation
Affects Versions: 2.2.0
Reporter: Hyukjin Kwon

I guess things that are commonly used and important are documented there, rather than
documenting everything and every option in the programming guide -
http://spark.apache.org/docs/latest/sql-programming-guide.html. It seems JSON datasets
(http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets) are
documented whereas CSV datasets are not.

Nowadays, they are pretty similar in APIs and options. Some options are notable for both,
in particular ones such as {{wholeFile}}. Moreover, several options such as {{inferSchema}}
and {{header}} are important for CSV because they affect the types and column names of the
data (see the sketch after this description).

In that sense, I think we might better document CSV datasets with some examples too,
because reading CSV is a pretty common use case. Also, I think we could leave some
pointers to the API documentation of the options for both (rather than duplicating the
documentation).

So, my suggestion is:
- Add a CSV Datasets section.
- Add links for the options of both JSON and CSV that point to each API documentation.
- Fix trivial minor issues together in both sections.
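
As a rough illustration of why {{header}} and {{inferSchema}} matter for column names and types, here is a minimal sketch of the kind of example such a CSV Datasets section could include. It assumes the usual spark-shell SparkSession named {{spark}}; the file path is illustrative only and not from the ticket.

{code:scala}
// Without options: every column is typed as string and named _c0, _c1, ...
val raw = spark.read.csv("examples/src/main/resources/people.csv") // hypothetical path
raw.printSchema()

// With header and inferSchema: column names come from the first line and
// types are inferred in one extra pass over the data.
val typed = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("examples/src/main/resources/people.csv")
typed.printSchema()
{code}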