-
Notifications
You must be signed in to change notification settings - Fork 6
Brainstorming
- Each SVN commit results in one or fewer commits in each branch of each Git repo
- Each SVN commit specifies paths that are added and deleted. These paths can be files or directories.
- When I write "path-tree" I mean to denote an SVN path, and if it is a directory, all of its descendant paths.
- "revnum" refers to the SVN revision in which added files appear in SVN and deleted files disappear.
- To determine the repositories and branches affected, we can walk deleted path-trees at revnum-1 and added path-trees at revnum.
Note: the ruleset can cause something to be deleted merely by mapping it elsewhere in the next revision.
Note: does not include recording of merge parents:
var patrie: Map((SVN revnum, SVN path) -> (repo, branch, path)
var branchHash, lastBranchHash: Map((Git repository, Git branch) -> SHA1)
var transactions: Map((Git repository, Git branch) -> git-fast-import stream)
function changed(R,B) :=
return branchHash[(R,B)] != lastBranchHash[(R,B)]
foreach SVN revision V
// Hash the contents of every branch
branchHash = {}
foreach SVN file path P
R, B, L = patrie[V,P]
H = SHA1 of SVN file at V,P // using svn_fs_file_checksum
branchHash[(R,B)].accumulate(hash((L,H)))
// open a Git transaction for each changed branch
foreach Git repo R
foreach Git branch B in R
if changed(R,B)
transactions[(R,B)].open()
transactions[(R,B)].deleteAllFiles()
// rewrite each changed branch
foreach SVN file path P
R, B, L = patrie[V,P]
if changed(R,B)
transactions[(R,B)].writeFile(L, SVNContents(V,P))
// Commit each changed branch
foreach Git repo R
foreach Git branch B in R
if changed(R,B)
transactions[(R,B)].commit()
lastBranchHash = branchHash
Now there are two questions:
- Is it correct?
- Can we optimize it?
- Avoid exploring every SVN file path when computing the set of changed branches
- Avoid exploring every SVN file path when finding files to rewrite
- Avoid exploring every SVN file path when computing new branch hashes
- Avoid rewriting every file in a changed branch
Git branch contents can change between SVN revisions for the following reasons:
- A file mapped to the branch was changed in SVN
- The set of rules mapping to that branch in SVN changed (because of revision ranges in the ruleset). Note: this implies we may gain efficiency by limiting rules to revision ranges only where necessary.
Subversion provides svn_fs_paths_changed2
, which can tell us about all the paths that were changed in SVN. We can easily see which branches are covered by this list of paths.
The set of rules in effect changes from revision to revision due to revision limits in the ruleset. These changes need to be detected and handled.
We can use a map from revision number to sets of branches targeted by rules whose limits begin or end in that revision. We can traverse that map as we process transactions, and re-evaluate all branches changed there. In fact, we know (a superset of) which paths in which branches are changing due to rule changes.
Definition: a Git address is a tuple: ( repository, refname, path)
Every rule maps a given SVN path into a Git address over some revision range.
When a rule starts or ends, it potentially affects a Git subtree rooted at some Git address. It is possible, if challenging, to determine the details of how the Git subtree in question has changed. However, that would be an expensive optimization to code, and—because the tree is generally accounted for completely by the rule in question—would rarely offer a great speed improvement. Instead, it makes sense to rewrite the Git subtree completely, using git-fast-import
's filedelete
command to clear the subtree, and adding all the files mapped into that subtree from SVN. By remembering the last mark
in each branch, we can arrange to leave a “backups/
…” tag on that commit when the rewritten Git subtree contains no files, and delete the Git branch.
To find the areas of the SVN tree that map into the Git subtree rooted at a given address, we'll need to enhance patrie
with a reverse mapping (from Git address to SVN path), and the ability to find all rules that map to the address or its subtree addresses.
invalid_paths = PathSet()
for r0 in rules_ending(revnum - 1):
git_addr = r0.git_address()
git.delete(git_addr, revnum) # will get rewritten if anything is mapped here
# Any part of SVN that was mapped into this part of the git
# tree now may end up mapped differently. Since every SVN
# path we invalidate gets completely remapped, it's OK to
# invalidate more than necessary.
for r1 in reverse_matches(git_addr, revnum - 1):
invalid_paths.add(r1.svn_path)
The same logic applies to rules starting in the current revision:
for r0 in rules_starting(revnum):
git_addr = r0.git_address()
git.delete(git_addr, revnum) # will get rewritten if anything is mapped here
for r1 in reverse_matches(git_addr, revnum):
invalid_paths.add(r1.svn_path)