-
-
Notifications
You must be signed in to change notification settings - Fork 141
[AVRO] Generation of Avro schemas with logical types. #133
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Decimals are looking good. I have made some great progress here. I now have schema generation support for {
"type" : "record",
"name" : "FixedDecimalType",
"namespace" : "com.fasterxml.jackson.dataformat.avro.schema.TestLogicalTypes$",
"fields" : [ {
"name" : "value",
"type" : {
"type" : "fixed",
"name" : "foo",
"namespace" : "com.fasterxml.jackson.dataformat.avro.schema",
"size" : 8,
"logicalType" : "decimal",
"precision" : 5,
"scale" : 0
}
} ]
} {
"type" : "record",
"name" : "BytesDecimalType",
"namespace" : "com.fasterxml.jackson.dataformat.avro.schema.TestLogicalTypes$",
"fields" : [ {
"name" : "value",
"type" : {
"type" : "bytes",
"logicalType" : "decimal",
"precision" : 5,
"scale" : 0
}
} ]
} Right now I am going down the path of using the conversions that are built into Avro for this. @cowtowncoder The big thing I could use your assistance for is direction on the deserialization. Should I be looking into extending AvroParser with methods like |
Excellent progress. One update from my side: I am now leaning towards creating 2.10 branch, as sort of forward-looking minor version, with some of concepts from 3.0 (basically simple But. I think this feature could make lots of sense for 2.10. That would make more sense than adding it in 2.9 patch. I hope to release 2.9.6 sometime in May (next 2 weeks), and then create 2.10 branch. |
@cowtowncoder I have decimals finished and working properly. I added some testing so now I'm doing round trips as well as comparing the serialized output against the equivalent GenericRecord generated by Avro. I want to make sure that the output from Jackson is readable Avro. I need a little help on the date types. Take |
…avro module to stay java 7 while supporting mapping of java.time out to their own serializers.
…t is how the official Avro project does it.
@cowtowncoder We are nearly there. I need your help with one more spot and I'm finished assuming there are no changes required for merging. I'm not sure how to override how For completeness I have implemented all of the Avro Logical Types. I also created another module that will allow us to have java.time.* support for |
Nevermind on the AnnotationIntrospector. I made a mistake and inherited from the wrong class. |
…son-dataformat-avro`
…is`, and `time-micros`.
@cowtowncoder I think I'm done. I have every logical type implemented with their java 7 and java 8 types. The build is failing for the java 8 portion. It's in a different module. Let me know if you want it moved someplace else. Please let me know what changes you need from here. Thanks for your help. It's much appreciated. I'm a huge fan of Jackson. |
@jcustenborder Excellent. I hope to get 2.9 release done soon, and after that could consider merging this. Java 8 issue is bit tricky since we'll really only require Java 8 with Jackson 3.x. I think what we could do is to merge Java 7 portions in 2.10 branch, but Java 8 ones only in But I should probably have a look at code changes -- if changes should be safe for 2.9 (aside from Java 8 part), could just merge for 2.9, but announce changes for 2.10. That is, official addition of features for that. |
Ah ok. I should have looked at code a bit earlier, as I think misunderstood approach. One thing I am not quite sure about is the addition of many new annotations. I was kind of assuming that these would not (always?) be needed, since:
I could perhaps see the need for some cases if information is not otherwise sufficient (like override to decide between BYTES/FIXED encoding). And obviously things like If new annotations are needed, on the other hand, a smaller set -- perhaps just one annotation? -- would seem better; with |
@cowtowncoder Sorry about the delay. I was on vacation. What about moving to a single annotation with an enum for the avro type and logical type? Decimal is really the only weird one because it can be either a fixed or a bytes hence the implementation. My goal was to put something in place where the current code in the wild would still work as before but still support the new logical types. That's why I had so many annotations. I could reduce this all to a single annotation. |
…ute for a single AvroType attribute that can handle all types.
@cowtowncoder I removed all of the attributes for a single attribute and deprecated AvroFixedSize. This will allow you to use a single AvroType attribute that can be used to create logical types and standard types such as a fixed. For example you can create a fixed with |
@cowtowncoder Sorry to bug you. Just wanted to close the gap on this one. Is this closer to what you were looking for? |
@cowtowncoder Sorry to bug you again. Please let me know if you need more changes on this one or if you want me to break any more of it out. I'm also happy to assist on what's needed here for the 3.0 release. |
@cowtowncoder Sorry to bug you again. Do you need me to do anything else with this pull? |
@cowtowncoder Do you need me to do anything else with this pull? |
AvroType logicalType = _findAnnotation(a, AvroType.class); | ||
if (null != logicalType) { | ||
switch (logicalType.logicalType()) { | ||
case TIMESTAMP_MILLISECOND: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
timestamps have no timezone, explicit or implicit. it is more correct to model them using java.time.Instant. this can be converted into any XXXDateTime class by providing a timezone, or stating it is implicit - this information is not available from the encoded timestamp itself
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lukejackson That makes sense because AVRO stores everything without timezone as well. With jackson I was under the impression that I had to create a serializer and deserializer for each of the java.time.* classes. Is there another approach that I can take?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My suggestion would be to add java.time.Instant support - this is arguably the canonical representation of timestamps in Java 8+.
As for the other XXXDateTime types, I think there is an argument for them not to be supported, given there is no corresponding type in Avro.
If you do introduce a conversion, it will introduce information not on the wire (e.g. timezone) and also be lossy, given a date time with a time zone can map to two different timestamps during daylight savings time transitions.
I would argue that given these limitations, it should be left to the user to perform the conversion and understand their assumptions and associated limitations.
Note the closest Avro does have is its date and time-xxx logical types. To model a LocalDateTime in Avro requires two fields with these corresponding types. To model a ZonedDateTIme requires a third field with the zone id.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get what you are saying now. Not having Instant support was an oversight. My last comment I thought you were telling me to convert to instant and delegate that to a serializer. I'll add instant support. LocalDateTime and LocalDate would be the closest thing to the avro specification given that neither Local* or the avro specification include timezone information.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I was talking about a user of this library doing the conversion themselves, rather than Jackson itself.
I can see how LocalDate is the closest to the avro spec for date, but not sure I understand what you mean by LocalDateTime - I don't think there is anything in the avro spec that relates to that?
Don't want to sound nit picking, and appreciate your effort here. The concepts behind the date and time types are deceptively complicated. Given apparent recent stagnation in the Avro project (and there is a long standing PR to add java.time there), I am hoping that this Jackson module could be a workable alternative to core Avro for some of my use cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what I was thinking when I started this PR. The Local* time classes go not have timezone information so they translate perfectly to the corresponding avro logical types. Not having instant was just an oversight. I've already started on adding it to this pull. Here is what I'm thinking. Thinking about your comments on OffsetDateTime and ZonedDateTime. You are correct. I was planning on doing some lossy conversions to support these types. Alternatively I could create logical types for them. The point of logical types in Avro is to support things outside of the specification. To do this I could create a logical type called zoned-date
, zoned-time-millis
, zoned-time-micros
, zoned-timestamp-millis
, and zoned-timestamp-millis
. This would allow me to store the timezone information along with the date/timestamp information.
This was my original thoughts.
Java Type | Avro Logical Type | Notes |
---|---|---|
LocalDate | Date | Lossless conversion |
LocalTime | time-millis, time-micros | Lossless conversion |
LocalDateTime | timestamp-millis, timestamp-millis | Lossless conversion |
Instant | timestamp-millis, timestamp-millis | Lossless conversion |
OffsetDateTime | timestamp-millis, timestamp-millis | Lossy conversion? Custom logical type? Timestamp adjusted to UTC. |
ZonedDateTime | timestamp-millis, timestamp-millis | Lossy conversion? Custom logical type? Timestamp adjusted to UTC. |
OffsetTime | time-millis, time-micros | Lossy conversion? Custom logical type? Timestamp adjusted to UTC. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hadn't thought about adding new logical types to Avro.
I think the way forward depends on what you're looking to achieve. If it's a way to represent all the Java related time classes in Avro - i.e. the Java POJO is the canonical schema and the Avro representation is just derived from that - then I can understand you'd look for a way to encode them in Avro. There are some options:
- reusing another type which is close in meaning (e.g. timestamp-millis for LocalDateTime and have any consumers - which may not be this library - know that they should convert the timestamp into a ZonedDateTime with UTC time zone and convert to LocalDateTime)
- encoding the types as strings, and parsing them as such. I believe this is what Jackson does when encoding Java times to JSON.
- adding your own logical types. this would mean that any consumer - which again may not be this library - would also have to implement them.
- using the record type composed of types supported by Avro. this is what Jackson does when converting POJOs to JSON - they become JSON dictionaries. Avro records would be the equivalent here. e.g. LocalDateTime would be a record containing a date field and a time-millis/time-micros field.
If instead you are looking for a way to directly represent Avro records as Java classes - i.e. the Avro schema is the canonical schema and the Java POJOs are just derived from that - then you just need to support the Avro types and no more, and leave it to the user to define their schemas to model the data they want to encode (e.g. manually define a record representing local date and local time), or let them choose how to shoehorn the value into an existing Avro type (e.g. represent as a string, as a long, etc.)
For my use cases, the Avro schemas are the canonical schemas. As such I only use the Avro types. This is because I use Avro to communicate between applications written in multiple languages, such as Java, C++, Python, and I rely on Kafka tools that only support Avro types (e.g. Connect and its various sinks).
Where the Avro types don't support my use case I either use a different type (e.g. long for nano precision timestamps) or multiple fields (e.g. combination of date and time-millis for a local date time), and I'm happy to accept that these fields would appear as such in Java, C++, Python, my database tables, etc.
big +1 from me on getting this feature reviewed and merged |
throw new IllegalStateException("Date type should not have any time fields set to non-zero values."); | ||
} | ||
long unixMillis = calendar.getTimeInMillis(); | ||
int output =(int) (unixMillis / MILLIS_PER_DAY); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: can use TimeUnit.MILLISECONDS.toDays(unixMillis) and avoid the MILLIS_PER_DAY constant
|
||
@Override | ||
public void serialize(Date value, JsonGenerator jsonGenerator, SerializerProvider serializerProvider) throws IOException { | ||
Calendar calendar = Calendar.getInstance(UTC); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to avoid Thai Buddhist or Japanese Imperial calendars, it may be safer to explicitly instantiate a GregorianCalendar here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alternatively you can convert via Instant and ZonedDateTime, avoiding Calendar altogether
That's precisely why it's wrong. Encoding a |
Can't say that I agree with you there. LocalDateTime does not have timezone
information. Given the point of this PR is to add support for Logical
Types, how would you store it? Forcing users to use ZonedDateTime doesn't
necessarily fit either. Following the LogicalType specification there is no
place to store the timezone offset. Jackson serializes LocalDateTime in
JSON without storing the timezone.
…On Mon, Sep 2, 2019 at 12:27 PM Marcos Passos ***@***.***> wrote:
That's precisely why it's wrong. Encoding a LocalDateTime in UTC is not
serialization, but conversion. We should not make assumptions like that.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#133?email_source=notifications&email_token=AABTHAH2N2QIWJ2JSWKGWWLQHVEHLA5CNFSM4E7CB4FKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD5WJPZQ#issuecomment-527210470>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AABTHAAFOOCNFIYRLLZ54P3QHVEHLANCNFSM4E7CB4FA>
.
|
I think you missed my point. |
No I saw that. I just disagree.
…On Mon, Sep 2, 2019 at 12:42 PM Marcos Passos ***@***.***> wrote:
I think you missed my point. LocalDateTime cannot be stored as a timestamp, so it should not be mapped like so.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
|
@jcustenborder I would recommend you read more about Java 8 "local" date/time types: they are NOT timestamps -- "zoned" variant ( https://stackoverflow.com/questions/32437550/whats-the-difference-between-instant-and-localdatetime But as to my earlier question: "The goal for this PR was to provide support for Avro Logical Types" is bit general as support can mean multiple things. What I was earlier assuming (incorrectly I think), was that it'd
but I think this would not be doing that but rather
That is fine, but leads to follow-up question of exactly where this should be supported. I am ok with format module adding such generation, although in some sense it might make sense to have separate "Avro schema generation" module, as there is one for JSON Schema module. |
@cowtowncoder I think the disconnect is my comments are centered around binary compatibility with Avro data. My goal for this PR is to be able to write and write logical type data using Jackson that is compatible with Apache Avro. We have had some of these discussions before. LocalDateTime is a timestamp without timezone information. The Avro library assumes that objects of LocalDateTime are UTC. The priority of this PR is to read and write data in a way that is compatible with Apache Avro. |
There is no such thing like "timestamp without timezone information." A timestamp is a point in the timeline, always in UTC. What Avro does is conversion, rather than serialization - and it's wrong, IMHO. |
Conversion is fine given it's a known conversion. If we were talking about serialization in general I would agree with you. We are talking about conforming to a specific specification. Avro has a known specification for logical types. Both timestamp-millis and timestamp-micros state they are from UTC.
|
<scope>test</scope> | ||
</dependency> | ||
<!-- A bit of help to reduce boiler-plate in dummy test classes --> | ||
<dependency> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quick note: no, unfortunately I will not allow Lombok as a dependency (some usage did sneak in, and I have been busy removing it). Originally reason was that Lombok use prevented clean build out of cloned repo (some component had to be locally installed).
|
||
public abstract class BaseTimeJsonDeserializer<T> extends JsonDeserializer<T> { | ||
private final TimeUnit resolution; | ||
final ZoneId zoneId = ZoneId.of("UTC"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
static?
First of all, thank you for explanation of where mapping of Does Avro lib generate I'm bit at loss here as I can see the issue but not good solution. I would not want to be propagating what seem like misguided processing of date/time values. |
|
||
import com.fasterxml.jackson.core.Version; | ||
import com.fasterxml.jackson.databind.introspect.Annotated; | ||
import com.fasterxml.jackson.dataformat.avro.AvroAnnotationIntrospector; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will need to use different Java package if this remains as separate component, as Java Module system does not allow Split packages (classes for single Java package coming from multiple modules, that is, jars with module-info).
But this is work-in-progress so it's fine until packaging details decided.
if (null != logicalType) { | ||
switch (logicalType.logicalType()) { | ||
case TIMESTAMP_MILLISECOND: | ||
if (a.getRawType().isAssignableFrom(LocalDateTime.class)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For serializers, should check isAssignableFrom
(since subtype serialization probably fine); for deserializers fine (and perhaps preferable) to do equality check -- can not substitute base value to subtype property.
|
||
@Override | ||
public T deserialize(JsonParser p, DeserializationContext ctxt) throws IOException { | ||
final long input = p.getLongValue(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should probably add checking for other input types, give more meaningful error message in case underlying physical token is not a 64-bit long (schema mismatch)? And unit test for the same would be useful as well.
import java.time.temporal.ChronoUnit; | ||
import java.util.concurrent.TimeUnit; | ||
|
||
public abstract class BaseTimeJsonDeserializer<T> extends JsonDeserializer<T> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest extending StdScalarDeserializer
instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, on naming, since this is Avro-specific, maybe rather BaseTimeAvroDeserializer (or something).
import java.time.temporal.ChronoUnit; | ||
import java.util.concurrent.TimeUnit; | ||
|
||
public abstract class BaseTimeJsonSerializer<T> extends JsonSerializer<T> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As with deserializer, probably better extend StdScalarSerializer
; change Json
-> Avro
*/ | ||
int scale() default 0; | ||
|
||
enum LogicalType { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Javadocs would be crucial here, to explain mapping details.
protected final static ScalarDecoder READER_LONG = new LongReader(); | ||
protected final static ScalarDecoder READER_NULL = new NullReader(); | ||
protected final static ScalarDecoder READER_STRING = new StringReader(); | ||
protected final static ScalarDecoder READER_BOOLEAN = new ScalarDecoder.BooleanDecoder(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These could be changed to factory methods I suppose?
import java.io.IOException; | ||
import java.util.UUID; | ||
|
||
public class AvroUUIDDeserializer extends JsonDeserializer<UUID> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there need to add this? Wouldn't standard UUIDDeserializer
from jackson-databind
work?
(it actually supports both textual and 16-byte binary value)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After playing with this a bit on 2.10, it looks like default serialization of java.util.UUID
with Avro is unfortunately String
. I can work on changing that in 2.11, but for 2.10 this would actually make sense I guess.
String.format("Cannot encode decimal with scale %s as scale %s.", decimal.scale(), scale) | ||
); | ||
} | ||
jsonGenerator.writeBinary(decimal.unscaledValue().toByteArray()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This works, but it'd probably be acceptable to just cast JsonGenerator
as AvroGenerator
if that simplifies things.
On goal here: ok, yes. I can see that with serializer, deserializer, use of annotations it would implemented read/write support for Java 8 date/time types, using binary representations. And not just generate schema information. I can probably come around wrt Local date/time types as well, with some more time. But with that, some follow-up questions:
|
Ohhhhh. Looks like Avro backend is NOT overriding |
Thank you for reviewing this! Much appreciated. I'll go back over your comments and get some updates to address your feedback. I really appreciate your help. |
@jcustenborder At this point, I think I have a good understanding of what would be needed here, at high level. But discussion might be easiest to have over email (and summarize here as needed). But for others benefit, here's what I think:
So.... with that, I was thinking that it would be great to essentially.
The trick, really, is just to figure out sort of minimal set of changes needed in Avro module 2.10. Thank you everyone for working on getting this done: it is pretty ambitious but important undertaking, and once working, should open up new possibilities, beyond what I had thought practical. |
@jcustenborder Ok. After one more re-read, things much more clear. Changes to existing Avro backend make sense, the only possible caveat being Would it be possible to split this into two parts? I think that Avro-package changes are only related to internal API, not externally usable (... or if someone uses it, I think that'd be unsupported), so I would be ok with this going in one of 2.10.x patches if it does not make it in 2.10.0 (we only have 1 more day until that). |
I think this is probably obsoleted by later work, closing. |
This is the first part. We're not ready to merge. I want to get feedback on the approach I have so far. I have schema generation working. An Avro Decimal must have the precision set. This means that I have to know this during schema generation. Because of this I decided to go with an annotation based approach. This seems to work pretty well so far. For example this is how.
is generated by this java.
Using annotations gives us a weird benefit / problem with existing code. If you don't have an annotation a BigDecimal will be serialized as a string. This is great if you have existing code but a pain if you forget the attribute.
I went with a similar approach with all of the other logical types. In order to utilize them you need to use the corresponding annotation.
#132