Skip to content

Latest commit

 

History

History
499 lines (424 loc) · 18.1 KB

ticdc-canal-json.md

File metadata and controls

499 lines (424 loc) · 18.1 KB
title summary
TiCDC Canal-JSON Protocol
Learn the concept of TiCDC Canal-JSON Protocol and how to use it.

TiCDC Canal-JSON Protocol

Canal-JSON is a data exchange format protocol defined by Alibaba Canal. In this document, you can learn how Canal-JSON data formats are implemented in TiCDC, including the TiDB extension field, the definitions of the Canal-JSON data formats, and comparison with the official Canal.

Use Canal-JSON

When using Message Queue (MQ) as the downstream Sink, you can specify Canal-JSON in sink-uri. TiCDC wraps and constructs Canal-JSON messages with Event as the basic unit, and sends TiDB data change Events to the downstream.

There are three types of Events:

  • DDL Event: Represents a DDL change record. It is sent after an upstream DDL statement is successfully executed. The DDL Event is sent to the MQ Partition with the index being 0.
  • DML Event: Represents a row data change record. This type of Event is sent when a row change occurs. It contains the information about the row after the change occurs.
  • WATERMARK Event: Represents a special time point. It indicates that the Events received before this point is complete. It applies only to the TiDB extension field and takes effect when you set enable-tidb-extension to true in sink-uri.

The following is an example of using Canal-JSON:

{{< copyable "shell-regular" >}}

cdc cli changefeed create --pd=http://127.0.0.1:2379 --changefeed-id="kafka-canal-json" --sink-uri="kafka://127.0.0.1:9092/topic-name?kafka-version=2.4.0&protocol=canal-json"

TiDB extension field

The Canal-JSON protocol is originally designed for MySQL. It does not contain important fields such as the TiDB-specific unique identifier for the CommitTS transaction. To solve this problem, TiCDC appends a TiDB extension field to the Canal-JSON protocol format. After you set enable-tidb-extension to true (false by default) in sink-uri, TiCDC behaves as follows when generating Canal-JSON messages:

  • TiCDC sends DML Event and DDL Event messages that contain a field named _tidb.
  • TiCDC sends WATERMARK Event messages.

The following is an example:

{{< copyable "shell-regular" >}}

cdc cli changefeed create --pd=http://127.0.0.1:2379 --changefeed-id="kafka-canal-json-enable-tidb-extension" --sink-uri="kafka://127.0.0.1:9092/topic-name?kafka-version=2.4.0&protocol=canal-json&enable-tidb-extension=true"

Definitions of message formats

This section describes the formats of DDL Event, DML Event and WATERMARK Event, and how the data is resolved on the consumer side.

DDL Event

TiCDC encodes a DDL Event into the following Canal-JSON format.

{
    "id": 0,
    "database": "test",
    "table": "",
    "pkNames": null,
    "isDdl": true,
    "type": "QUERY",
    "es": 1639633094670,
    "ts": 1639633095489,
    "sql": "drop database if exists test",
    "sqlType": null,
    "mysqlType": null,
    "data": null,
    "old": null,
    "_tidb": {     // TiDB extension field
        "commitTs": 163963309467037594
    }
}

The fields are explained as follows.

Field Type Description
id Number The default value is 0 in TiCDC.
database String The name of the database where the row is located
table String The name of the table where the row is located
pkNames Array The names of all the columns that make up the primary key
isDdl Bool Whether the message is a DDL event
type String Event types defined by Canal-JSON
es Number 13-bit (millisecond) timestamp when the event that generated the message happened
ts Number 13-bit (millisecond) timestamp when TiCDC generated the message
sql String When isDdl is true, records the corresponding DDL statement
sqlType Object When isDdl is false, records how the data type of each column is represented in Java
mysqlType object When isDdl is false, records how the data type of each column is represented in MySQL
data Object When isDdl is false, records the name of each column and its data value
old Object Only if the message is generated by an update Event, records the name of each column and the data value before the update
_tidb Object TiDB extension field. It exists only if you set enable-tidb-extension to true. The value of commitTs is the TSO of the transaction that caused the row to change.

DML Event

TiCDC encodes a row of DML data change event as follows:

{
    "id": 0,
    "database": "test",
    "table": "tp_int",
    "pkNames": [
        "id"
    ],
    "isDdl": false,
    "type": "INSERT",
    "es": 1639633141221,
    "ts": 1639633142960,
    "sql": "",
    "sqlType": {
        "c_bigint": -5,
        "c_int": 4,
        "c_mediumint": 4,
        "c_smallint": 5,
        "c_tinyint": -6,
        "id": 4
    },
    "mysqlType": {
        "c_bigint": "bigint",
        "c_int": "int",
        "c_mediumint": "mediumint",
        "c_smallint": "smallint",
        "c_tinyint": "tinyint",
        "id": "int"
    },
    "data": [
        {
            "c_bigint": "9223372036854775807",
            "c_int": "2147483647",
            "c_mediumint": "8388607",
            "c_smallint": "32767",
            "c_tinyint": "127",
            "id": "2"
        }
    ],
    "old": null,
    "_tidb": {     // TiDB extension field
        "commitTs": 163963314122145239
    }
}

WATERMARK Event

TiCDC sends a WATERMARK Event only when you set enable-tidb-extension to true. The value of the type field is TIDB_WATERMARK. The Event contains the _tidb field, and the field contains only one parameter watermarkTs. The value of watermarkTs is the TSO recorded when the Event is sent.

When you receive an Event of this type, all Events with commitTs less than watermarkTs have been sent. Because TiCDC provides the "At Least Once" semantics, data might be sent repeatedly. If a subsequent Event with commitTs less than watermarkTs is received, you can safely ignore this Event.

The following is an example of the WATERMARK Event.

{
    "id": 0,
    "database": "",
    "table": "",
    "pkNames": null,
    "isDdl": false,
    "type": "TIDB_WATERMARK",
    "es": 1640007049196,
    "ts": 1640007050284,
    "sql": "",
    "sqlType": null,
    "mysqlType": null,
    "data": null,
    "old": null,
    "_tidb": {     // TiDB extension field
        "watermarkTs": 429918007904436226
    }
}

Data resolution on the consumer side

As you can see from the example above, Canal-JSON has a uniform data format, with different field filling rules for different Event types. You can use a uniform method to resolve this JSON format data, and then determine the Event type by checking the field values.

  • When isDdl is true, the message contains a DDL Event.
  • When isDdl is false, you need to further check the type field. If type is TIDB_WATERMARK, it is a WATERMARK Event; otherwise, it is a DML Event.

Field descriptions

The Canal-JSON format records the corresponding data type in the mysqlType field and the sqlType field.

MySQL Type field

In the mysqlType field, the Canal-JSON format records the string of MySQL Type in each column. For more information, see TiDB Data Types.

SQL Type field

In the sqlType field, the Canal-JSON format records Java SQL Type of each column, which is the data type corresponding to the data in JDBC. Its value can be calculated by MySQL Type and the specific data value. The mapping is as follows:

MySQL Type Java SQL Type Code
Boolean -6
Float 7
Double 8
Decimal 3
Char 1
Varchar 12
Binary 2004
Varbinary 2004
Tinytext 2005
Text 2005
Mediumtext 2005
Longtext 2005
Tinyblob 2004
Blob 2004
Mediumblob 2004
Longblob 2004
Date 91
Datetime 93
Timestamp 93
Time 92
Year 12
Enum 4
Set -7
Bit -7
JSON 12

Integer types

You need to consider whether integer types have the Unsigned constraint and the value size, which corresponds to different Java SQL Type Codes respectively, as shown in the following table.

MySQL Type String Value Range Java SQL Type Code
tinyint [-128, 127] -6
tinyint unsigned [0, 127] -6
tinyint unsigned [128, 255] 5
smallint [-32768, 32767] 5
smallint unsigned [0, 32767] 5
smallint unsigned [32768, 65535] 4
mediumint [-8388608, 8388607] 4
mediumint unsigned [0, 8388607] 4
mediumint unsigned [8388608, 16777215] 4
int [-2147483648, 2147483647] 4
int unsigned [0, 2147483647] 4
int unsigned [2147483648, 4294967295] -5
bigint [-9223372036854775808, 9223372036854775807] -5
bigint unsigned [0, 9223372036854775807] -5
bigint unsigned [9223372036854775808, 18446744073709551615] 3

The following table shows the mapping relationships between Java SQL Types in TiCDC and their codes.

Java SQL Type Java SQL Type Code
CHAR 1
DECIMAL 3
INTEGER 4
SMALLINT 5
REAL 7
DOUBLE 8
VARCHAR 12
DATE 91
TIME 92
TIMESTAMP 93
BLOB 2004
CLOB 2005
BIGINT -5
TINYINT -6
Bit -7

For more information about Java SQL Types, see Java SQL Class Types.

Comparison of TiCDC Canal-JSON and the official Canal

The way that TiCDC implements the Canal-JSON data format, including the Update Event and the mysqlType field, differs from the official Canal. The following table shows the main differences.

Item TiCDC Canal-JSON Canal
Event of Update Type The old field contains all the column data The old field contains only the modified column data
mysqlType field For types with parameters, it does not contain the information of the type parameter For types with parameters, it contains the full information of the type parameter

Event of Update Type

For an Event of Update Type:

  • In TiCDC, the old field contains all the column data
  • In the official Canal, the old field contains only the modified column data

Assume that the following SQL statements are executed sequentially in the upstream TiDB:

create table tp_int
(
    id          int auto_increment,
    c_tinyint   tinyint   null,
    c_smallint  smallint  null,
    c_mediumint mediumint null,
    c_int       int       null,
    c_bigint    bigint    null,
    constraint pk
        primary key (id)
);

insert into tp_int(c_tinyint, c_smallint, c_mediumint, c_int, c_bigint)
values (127, 32767, 8388607, 2147483647, 9223372036854775807);

update tp_int set c_int = 0, c_tinyint = 0 where c_smallint = 32767;

For the update statement, TiCDC outputs an Event message with type as UPDATE, as shown below. The update statement only modifies the c_int and c_tinyint columns. The old field in the output event message contains all the column data.

{
    "id": 0,
    ...
    "type": "UPDATE",
    ...
    "sqlType": {
        ...
    },
    "mysqlType": {
        ...
    },
    "data": [
        {
            "c_bigint": "9223372036854775807",
            "c_int": "0",
            "c_mediumint": "8388607",
            "c_smallint": "32767",
            "c_tinyint": "0",
            "id": "2"
        }
    ],
    "old": [                              // In TiCDC, this field contains all the column data.
        {
            "c_bigint": "9223372036854775807",
            "c_int": "2147483647",        // Modified column
            "c_mediumint": "8388607",
            "c_smallint": "32767",
            "c_tinyint": "127",           // Modified column
            "id": "2"
        }
    ]
}

For the official Canal, the old field in the output event message contains only the modified column data, as shown below.

{
    "id": 0,
    ...
    "type": "UPDATE",
    ...
    "sqlType": {
        ...
    },
    "mysqlType": {
        ...
    },
    "data": [
        {
            "c_bigint": "9223372036854775807",
            "c_int": "0",
            "c_mediumint": "8388607",
            "c_smallint": "32767",
            "c_tinyint": "0",
            "id": "2"
        }
    ],
    "old": [                              // In Canal, this field contains only the modified column data.
        {
            "c_int": "2147483647",        // Modified column
            "c_tinyint": "127",           // Modified column
        }
    ]
}

mysqlType field

For the mysqlType field, if a type contains parameters, the official Canal contains the full information of the type parameter. TiCDC does not contain such information.

In the following example, the table-defining SQL statement contains a parameter for each column, such as the ones for decimal, char, varchar and enum. By comparing the Canal-JSON formats generated by TiCDC and the official Canal, you can see that TiCDC only contains the basic MySQL information in the mysqlType field. If you need the full information of the type parameter, you need to implement it by other means.

Assume that the following SQL statements are executed sequentially in the upstream TiDB:

create table t (
    id     int auto_increment,
    c_decimal    decimal(10, 4) null,
    c_char       char(16)      null,
    c_varchar    varchar(16)   null,
    c_binary     binary(16)    null,
    c_varbinary  varbinary(16) null,
    c_enum enum('a','b','c') null,
    c_set  set('a','b','c')  null,
    c_bit  bit(64)            null,
    constraint pk
        primary key (id)
);

insert into t (c_decimal, c_char, c_varchar, c_binary, c_varbinary, c_enum, c_set, c_bit)
values (123.456, "abc", "abc", "abc", "abc", 'a', 'a,b', b'1000001');

The output of TiCDC is as follows:

{
    "id": 0,
    ...
    "isDdl": false,
    "sqlType": {
        ...
    },
    "mysqlType": {
        "c_binary": "binary",
        "c_bit": "bit",
        "c_char": "char",
        "c_decimal": "decimal",
        "c_enum": "enum",
        "c_set": "set",
        "c_varbinary": "varbinary",
        "c_varchar": "varchar",
        "id": "int"
    },
    "data": [
        {
            ...
        }
    ],
    "old": null,
}

The output of the official Canal is as follows:

{
    "id": 0,
    ...
    "isDdl": false,
    "sqlType": {
        ...
    },
    "mysqlType": {
        "c_binary": "binary(16)",
        "c_bit": "bit(64)",
        "c_char": "char(16)",
        "c_decimal": "decimal(10, 4)",
        "c_enum": "enum('a','b','c')",
        "c_set": "set('a','b','c')",
        "c_varbinary": "varbinary(16)",
        "c_varchar": "varchar(16)",
        "id": "int"
    },
    "data": [
        {
            ...
        }
    ],
    "old": null,
}

Changes in TiCDC Canal-JSON

Changes in the Old field of the Delete events

From v5.4.0, the old field of the Delete events has changed.

The following is a Delete event message. Before v5.4.0, the old field contains the same content as the "data" field. In v5.4.0 and later versions, the old field is set to null. You can get the deleted data by using the "data" field.

{
    "id": 0,
    "database": "test",
    ...
    "type": "DELETE",
    ...
    "sqlType": {
        ...
    },
    "mysqlType": {
        ...
    },
    "data": [
        {
            "c_bigint": "9223372036854775807",
            "c_int": "0",
            "c_mediumint": "8388607",
            "c_smallint": "32767",
            "c_tinyint": "0",
            "id": "2"
        }
    ],
    "old": null,
    // The following is an example before v5.4.0. The `old` field contains the same content as the "data" field.
    "old": [
        {
            "c_bigint": "9223372036854775807",
            "c_int": "0",
            "c_mediumint": "8388607",
            "c_smallint": "32767",
            "c_tinyint": "0",
            "id": "2"
        }
    ]
}