Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode collation #1162

Open
Kas-code opened this issue Feb 13, 2023 · 6 comments
Open

Unicode collation #1162

Kas-code opened this issue Feb 13, 2023 · 6 comments

Comments

@Kas-code
Copy link

When specifying the SQLite.CollationAttribute on columns, we only have the options BINARY, NOCASE and RTRIM. There is no Unicode collation option. Is it possible to create a custom collation or any other way to make WHERE queries work on unicode values?

@csm101
Copy link
Contributor

csm101 commented Jul 14, 2023

If what you need is to make text searches "accent-insensitive" (or any weird character-insensitive... like allow german users to search indifferently for "große" or "grosse"), I have done it in my custom version of sqlite-net (this feature is not available in my github repository.. it is currently in the private repository of my company).

what I did is to use Unidecode.NET (see my version published here https://github.com/csm101/Unidecode.NET/tree/array_instead_of_dictionary.. this one fixes the two open issues in the official repository and has a 3X performance improvement... I have issued a pull request but I haven't received any feedback yet)

I used this library to implement a custom sqlite Unidecode function, so to implement accent insensitive searches i do this:

   conn.Query<MyRecord>("select * from mytable where Unidecode(germanfield) like '%'||?||'%'", "große".Unidecode()) 

(in my customized sqlite-net there is a mechanism for registering sqlite functions implemented in c#)

If there is some interest about this, I could try to extract the relevant code from my private repository and issue a pull request...

@Kas-code
Copy link
Author

Kas-code commented Jul 14, 2023

@csm101
Edit: I don't think that Unidecode would be a viable solution for my problem because I'm using Chinese text, which doesn't have a simple translation to ASCII.
I have a database of chinese sentences, and searching by the mandarin field returns odd results if I allow the database to perform the query. So instead I have to pull the whole table into memory first and then perform the search in C# like the following:

public async Task<List<Sentence>> GetByMandarin(string mandarin)
{
    return (await _database.Table<Sentence>().ToListAsync()).Where(s => s.Mandarin == mandarin).ToList();
}

This is obviously hugely inefficient so it would be great if the database could perform this operation.

@csm101
Copy link
Contributor

csm101 commented Jul 14, 2023

I have no knowledge about chinese: my target languages were only european languages and cirillyc,
but Unidecode has a huge translation table, and I see that it translates some symbols to the equivalent "phonetic" rapresentation... which surely aren't european...

have a look at this translation table here (from my version of unicode):
https://github.com/csm101/Unidecode.NET/blob/array_instead_of_dictionary/assets/unidecoder-decodemap.txt

Does this conversion map address mandarin (I have no idea)?

Anyway you could still implement your own "MandarinLike()" sql function in c#... I made it pretty simple to do in my sqlite version:

this is how, for example I added a "GPSDistance(lat1,long2, lat2, long2)" function that calculates the distance in meters between two gps coordinates:

public class SQLiteGPSDistanceFunction : SQLiteFunction
{ 
  public SQLiteGPSDistanceFunction(SQLiteConnection connection)
    : base(connection, FunctionName: "GPSDistance", ArgumentCount: 4, Deterministic: true)
  {
  }

  public override void Execute(SqliteFunctionRunContext ctx)
  {
    if ( ctx.IsNull(0) || ctx.IsNull(1) || ctx.IsNull(2)|| ctx.IsNull(3))
    {
      ctx.SetNullResult();
      return;
    }
    var latit1 = ctx.DoubleParam(0);
    var longit1 = ctx.DoubleParam(1);
    var latit2 = ctx.DoubleParam(2);
    var longit2 = ctx.DoubleParam(3);

    double distance = CoordinatesFunctions.Distance(latit1, longit1, latit2, longit2);
    ctx.SetResult(distance);
  }
}

and I also added a static global event handler to register these functions whenever a sqlite connection connects to any database:

    SQLiteConnection.AfterLogon +=
      (connection, args) =>
      {
        _ = new SQLiteUnidecodeFunction(connection); 
        _ = new SQLiteGPSDistanceFunction(connection); 
      };

I addressed only simple functions, but sqlite apis would allow me to implement also custom aggregate functions (like count(), max) or even custom analytic functions (select count(*) over (partition by field1, field2, order by ...) )

@csm101
Copy link
Contributor

csm101 commented Jul 14, 2023

and... since I managed to get the attention of an author of the project.. would you mind reviewing my pull request for cancellable queries? #1176

Actually I needed to be able to cancel long running queries for the same implementation I needed this Unidecode function, that is: to implement this kind of search dialog in android:
image
thank you!

@Kas-code
Copy link
Author

Sorry I'm not an author on the project, when I said I would be happy to review your pull request, I mean I can read it, but I don't think I'm able to approve it into the repo.
Once I've made an implementation of SQLiteFunction for MandarinLike(), how would I use that in queries? for example the code that I posted in my previous comment?

@csm101
Copy link
Contributor

csm101 commented Jul 14, 2023

My bad, I tought you had write access to the repository and you could approve pull requests,
anyway... if you have added your custom function to sqlite, you can use it in plain SQL commands:

return await _database.Query("select * from sentences s where MandarinEquals(s.Mandarin, ?) = true"));

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants