adding more granular diff format for autoedits model training #6173

Open · wants to merge 15 commits into base: main
Conversation

@hitesh-1997 (Contributor) commented Nov 21, 2024

Context

The PR makes the following high-level changes:

  1. The current auto-edits model has trouble understanding the most recent diffs: it may suggest deleting a recently added line, or re-suggest a change that was recently deleted. One reason is that it lacks a separate view of short-term and long-term diffs.
  2. Introduces a more granular diff format for training the auto-edits model. Currently we use only a single diff format. The PR computes line-level diffs for the changes made in the editor and ensures that all contiguous changes are grouped together as a single entity. Additionally, it derives strategies to calculate the diff at different granularity levels. Refer to the class for the entry point.
  3. Introduces a helper function for the diff format that simulates document changes using markers. Refer to the helper function here.
  4. Refactors recent-edits handling to separate long-term and short-term diffs.
  5. Initially, the data is logged to telemetry so it can be used to train and evaluate the model offline.
  6. One final change logs 10 seconds of the user's diff data to analytics to capture the short-term diffs.
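The grouping described in item 2 can be illustrated with a small sketch. The `LineHunk` type and `groupContiguousLines` helper below are hypothetical names for illustration, not the PR's actual API:

```typescript
// Hypothetical sketch: group changed line numbers into hunks, so that
// contiguous edits are treated as a single entity when computing diffs.
interface LineHunk {
  startLine: number;
  endLine: number;
}

function groupContiguousLines(changedLines: number[]): LineHunk[] {
  const sorted = [...changedLines].sort((a, b) => a - b);
  const hunks: LineHunk[] = [];
  for (const line of sorted) {
    const last = hunks[hunks.length - 1];
    if (last && line <= last.endLine + 1) {
      // Adjacent (or duplicate) line: extend the current hunk.
      last.endLine = Math.max(last.endLine, line);
    } else {
      // Gap of more than one line: start a new hunk.
      hunks.push({ startLine: line, endLine: line });
    }
  }
  return hunks;
}
```

For example, edits on lines 1, 2, 3, 7, 8, and 10 would collapse into three hunks: 1–3, 7–8, and 10.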
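The marker-based simulation of document changes in item 3 might look roughly like the following. The `<I>`/`<D>` marker syntax and the `applyMarkers` helper are illustrative assumptions; the PR's actual helper may differ:

```typescript
// Hypothetical test helper: a document is written with <I>…</I> (inserted)
// and <D>…</D> (deleted) markers. The helper derives the "before" and
// "after" text, simulating the edits the markers describe.
function applyMarkers(marked: string): { before: string; after: string } {
  const before = marked
    .replace(/<I>[\s\S]*?<\/I>/g, '')       // inserted text is absent before the edit
    .replace(/<D>([\s\S]*?)<\/D>/g, '$1');  // deleted text is still present before
  const after = marked
    .replace(/<D>[\s\S]*?<\/D>/g, '')       // deleted text is gone after the edit
    .replace(/<I>([\s\S]*?)<\/I>/g, '$1');  // inserted text is present after
  return { before, after };
}
```

Writing test fixtures this way keeps the before/after states in a single annotated string instead of two near-duplicate documents.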
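The short-term versus long-term split described in items 1, 4, and 6 could be sketched as follows. The names, the `TimedEdit` shape, and the placement of the 10-second window are assumptions based on the description above:

```typescript
// Hypothetical sketch: partition recorded edits into a short-term bucket
// (within the last `shortTermMs` milliseconds) and a long-term bucket,
// so the model can be given separate views of recent and older changes.
interface TimedEdit {
  timestamp: number; // epoch milliseconds when the edit occurred
  diff: string;      // diff text for the edit
}

function splitEdits(
  edits: TimedEdit[],
  now: number,
  shortTermMs: number = 10_000 // 10-second window, per item 6
): { shortTerm: TimedEdit[]; longTerm: TimedEdit[] } {
  const shortTerm: TimedEdit[] = [];
  const longTerm: TimedEdit[] = [];
  for (const edit of edits) {
    (now - edit.timestamp <= shortTermMs ? shortTerm : longTerm).push(edit);
  }
  return { shortTerm, longTerm };
}
```

Keeping the two buckets separate lets the prompt (and the logged training data) distinguish an edit made seconds ago from the accumulated session history.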

Test plan

Added unit tests for the various changes.

@hitesh-1997 hitesh-1997 marked this pull request as ready for review November 24, 2024 00:12