Encode column names with encoding of context #210

Open · wants to merge 6 commits into base: main
Conversation

Adrian-Hirt

We noticed some behaviour that might lead to unexpected failures when migrating from the mysql2 adapter to the trilogy adapter.
In the result set returned by a query, the column names can be accessed via the fields attribute; however, the encoding of these strings differs from the encoding of the database.

Behaviour in mysql2:

When a query is executed via the database adapter (or rather its client), the column names (in the fields attribute of the result) respect the encoding set for the database connection. Running a query on a database with encoding: utf8mb4 specified at client creation respects this encoding for the column names, returning the strings with Encoding::UTF-8.

Behaviour in trilogy:

Running the same query on the same database using the trilogy adapter returns the column names encoded as Encoding::US-ASCII, ignoring the database encoding.
This can lead to unexpected encoding issues when the client uses the names from the query to further process the data.
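To illustrate the kind of breakage this can cause, here is a small sketch with a hypothetical non-ASCII column name (not taken from the PR itself): a UTF-8 byte sequence mislabelled as US-ASCII becomes an invalid string and no longer compares equal to the properly encoded name.

```ruby
# Hypothetical column name containing non-ASCII characters. If the
# driver tags the raw UTF-8 bytes as US-ASCII, the string is invalid
# and stops comparing equal to the correctly encoded name.
utf8_name  = "größe"
ascii_name = utf8_name.dup.force_encoding(Encoding::US_ASCII)

puts ascii_name.valid_encoding?   # false: bytes above 0x7F are not ASCII
puts(ascii_name == utf8_name)     # false: encodings differ, neither is ASCII-only
```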

Tests:

Behaviour with mysql2 Adapter:

require 'mysql2'

client = Mysql2::Client.new(host: '127.0.0.1', port: 3306, username: 'root', encoding: 'utf8mb4', collation: 'utf8mb4_unicode_520_ci', database: 'mydatabase')

if client.ping
  result = client.query('SELECT * FROM users LIMIT 10')
  puts result.fields.map(&:encoding).join(', ')
end

Result: UTF-8, UTF-8, UTF-8, UTF-8, UTF-8, UTF-8, UTF-8

Behaviour with trilogy 2.9.0:

require 'trilogy'

client = Trilogy.new(host: '127.0.0.1', port: 3306, username: 'root', encoding: 'utf8mb4', collation: 'utf8mb4_unicode_520_ci', database: 'mydatabase')

if client.ping
  result = client.query('SELECT * FROM users LIMIT 10')
  puts result.fields.map(&:encoding).join(', ')
end

Result: US-ASCII, US-ASCII, US-ASCII, US-ASCII, US-ASCII, US-ASCII, US-ASCII
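For reference, a possible client-side workaround until a fix lands (illustrative only, simulated here with plain strings rather than a live Trilogy result) is to retag the field names with the connection's encoding; this is safe because the byte content itself is unchanged, only the encoding label is wrong.

```ruby
# Simulate field names as returned with the wrong encoding label,
# then retag them with the connection's encoding. force_encoding
# changes only the label, never the bytes.
fields = ["id", "name"].map { |f| f.dup.force_encoding(Encoding::US_ASCII) }
fixed  = fields.map { |f| f.dup.force_encoding(Encoding::UTF_8) }

puts fixed.map(&:encoding).join(', ')   # UTF-8, UTF-8
```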

Behaviour after patch:

require 'trilogy'

client = Trilogy.new(host: '127.0.0.1', port: 3306, username: 'root', encoding: 'utf8mb4', collation: 'utf8mb4_unicode_520_ci', database: 'mydatabase')

if client.ping
  result = client.query('SELECT * FROM users LIMIT 10')
  puts result.fields.map(&:encoding).join(', ')
end

Result with patch: UTF-8, UTF-8, UTF-8, UTF-8, UTF-8, UTF-8, UTF-8


With this patch, trilogy returns the strings in the same encoding as the database, which matches the behaviour of the previously used mysql2 adapter.
I am not sure whether returning the strings in US-ASCII encoding is a deliberate choice, so feel free to accept or reject this PR; either is fine, I mainly wanted to bring this to your attention.

Let me know if you need any other information. I'd also appreciate someone looking over my usage of the Ruby C API methods to ensure everything is used as intended, as I have only limited experience with the C API.

The CI passed in my fork, see here and here.

Have a pleasant day!

@@ -811,10 +811,12 @@ static VALUE read_query_response(VALUE vargs)
}
}

rb_encoding *conn_enc = rb_to_encoding(ctx->encoding);
Contributor:

rb_to_encoding is quite costly. We should probably do it only once after connect or something, and keep a rb_encoding * in ctx.

Contributor:

At the very least, it should be moved outside the loop, so that if you have 20 fields you don't do the conversion 20 times.

Author:

Makes sense, I moved the call to rb_to_encoding into the rb_trilogy_connect method and store the result as rb_encoding *conn_encoding in the ctx. I also replaced the call to rb_to_encoding in rb_trilogy_query with this value from the ctx struct. Let me know if you need any other adaptations.
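The pattern being discussed, resolving the encoding once at connect time and reusing it for every field instead of once per field in the parsing loop, can be sketched at the Ruby level. The class and method names here are illustrative stand-ins for the C-level trilogy_ctx struct and rb_to_encoding, not real trilogy API.

```ruby
# Illustrative sketch: the costly encoding lookup happens once, at
# "connect" time, and the cached object is reused for every field.
class FakeCtx
  attr_reader :conn_encoding, :lookups

  def initialize(encoding_name)
    @lookups = 0
    @conn_encoding = resolve_encoding(encoding_name)  # done once at connect
  end

  # Stand-in for rb_to_encoding(): the expensive call we want to avoid
  # repeating per field.
  def resolve_encoding(name)
    @lookups += 1
    Encoding.find(name)
  end

  # Per-field work reuses the cached encoding object.
  def tag_field(raw_name)
    raw_name.dup.force_encoding(conn_encoding)
  end
end

ctx = FakeCtx.new('UTF-8')
fields = Array.new(20) { |i| ctx.tag_field("col_#{i}") }
puts fields.first.encoding  # UTF-8
puts ctx.lookups            # 1: twenty fields, a single lookup
```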

@@ -811,10 +811,12 @@ static VALUE read_query_response(VALUE vargs)
}
}

rb_encoding *conn_enc = rb_to_encoding(ctx->encoding);

#ifdef HAVE_RB_INTERNED_STR
Contributor:

This should be updated to HAVE_RB_ENC_INTERNED_STR and extconf.rb should also be updated accordingly.
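For context on why interning matters here: rb_enc_interned_str returns a frozen, deduplicated string in the given encoding, so repeated field names can share one object across queries. The closest Ruby-level counterpart to the deduplication is String#-@, which this sketch uses to show the effect.

```ruby
# Interned (frozen, deduplicated) strings: two occurrences of the same
# field name resolve to the very same object instead of two allocations.
a = -"user_name"
b = -"user_name"

puts a.frozen?    # true
puts a.equal?(b)  # true: same object from the interned-string table
```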

Author:

Thanks, I changed this to use HAVE_RB_ENC_INTERNED_STR and added the corresponding have_func to extconf.rb.

@@ -456,6 +457,7 @@ static VALUE rb_trilogy_connect(VALUE self, VALUE encoding, VALUE charset, VALUE

RB_OBJ_WRITE(self, &ctx->encoding, encoding);
Contributor:

Suggested change (remove this line):
RB_OBJ_WRITE(self, &ctx->encoding, encoding);

Contributor:

Ah, you're keeping both. I think we can get rid of the reference to encoding. Just turn encoding into a rb_encoding *, so we don't need to mark it or anything.

Author:

Should I remove encoding from the ctx (and also from the trilogy_ctx struct) altogether?

It's only used in the RB_OBJ_WRITE call (which, if I understand correctly, writes the encoding value to ctx->encoding) and in the rb_to_encoding call; storing it is probably obsolete now, since we have the rb_encoding pointer in the ctx struct that can be used to encode values.

Contributor:

That's what I'm suggesting yes. That member is no longer used, so no point keeping it.

Author:

Ah sorry, didn't see your other comment in the conversation tab 😅 I applied these changes, now encoding in the ctx struct is the rb_encoding * value which can be directly used :)

@byroot (Contributor)

byroot commented Dec 5, 2024

Looks good to me now, but I'm not a maintainer so I can't merge.

@Adrian-Hirt (Author)

@byroot Great, thanks for reviewing, in that case let's wait for someone from the maintainers to review this as well :)

@composerinteralia (Contributor)

Makes sense to me. It's still not quite the same as mysql2, which uses the default internal encoding, but that might also be fine—we are not obligated to match mysql2 behavior.
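The mysql2 convention referenced above, transcoding result strings into Encoding.default_internal when one is configured, can be sketched as follows. This is a simplified, hypothetical helper, not mysql2's actual code path; real drivers also handle invalid bytes and binary columns.

```ruby
# Simplified sketch of the mysql2 convention: tag the raw bytes with
# the connection encoding, then transcode to the default internal
# encoding if the application has configured one.
def tag_and_transcode(bytes, conn_encoding)
  str = bytes.dup.force_encoding(conn_encoding)
  internal = Encoding.default_internal
  internal ? str.encode(internal) : str
end

name = tag_and_transcode("größe".b, Encoding::UTF_8)
puts name.encoding  # UTF-8, unless Encoding.default_internal is set
```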

@Adrian-Hirt (Author)

Adrian-Hirt commented Dec 6, 2024

> Makes sense to me. It's still not quite the same as mysql2, which uses the default internal encoding, but that might also be fine—we are not obligated to match mysql2 behavior.

Not matching the exact mysql2 behaviour is totally fine; I only mentioned it because it broke some code that relied on the non-ASCII encoding of column names 😅

Let me know if you need me to change anything else for you to be able to accept this PR 😄
