Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filter file contents, Was: Suggestion: add --as-binary option to convert line endings #95

Merged
merged 1 commit into from
Jun 22, 2018

Conversation

atykhyy
Copy link
Contributor

@atykhyy atykhyy commented May 28, 2017

Converting line endings in text files to LF is often needed when migrating from Mercurial to Git. Right now it is usually done by git filter-branch and other heavy-weight approaches, but it is very easy to do in fast-export provided that binary files have distinct filenames (e.g. distinct extensions). This change adds an option --as-binary that turns on auto-conversion of line endings to LF and takes a list of binary file extensions, e.g. hg-fast-export ..\repo --as-binary=.so,.dll. As an implementation side-effect, dot-files (e.g. .hgignore) are treated as text.

@frej
Copy link
Owner

frej commented Jun 4, 2017

My spontaneous reaction is that this is outside the scope of fast-export. Filter branch is the tool to use when you have to mangle the content of the version controlled files.

Additionally I see a number of problems with the implementation (just to be clear, even if these problems are addressed I'm not very likely to include the proposed functionality):

  • The name of the new flag does not in any way indicate that it does line ending conversion for all files not listed by the flag.

  • The only supported conversion is from CRLF to LF, what about the reciprocal? What about CR<-->LF if your source files are from a classic Mac? Which conversions should be supported?

  • How do you handle binary files without an extension? Shouldn't the pattern be a proper glob?

@atykhyy
Copy link
Contributor Author

atykhyy commented Jun 13, 2017

Indeed, using filter-branch after fast-import is the usual approach. However, converting line endings is the most common, and often the only, post-import operation when migrating repos from hg to git. Wouldn't it be convenient to take care of that with just an extra flag or two and not have to filter-branch (which is quite slow) afterwards?

I believe restricting conversion to target just LF is justified because git uses LF line endings for text files internally, and it takes some contortions to force git to store them in a different format even if one wanted it to. I implemented the rest of your very reasonable suggestions.

@frej frej added the feature-request This is a feature request label Jun 17, 2017
@frej
Copy link
Owner

frej commented Jun 17, 2017

[edited to add a hg-hash parameter to the filter]

As I said, I think this is outside the scope of fast-export. Since your first proposal this simple filter has already been extended to handle one more line ending. Pretty soon someone else will want support for converting character encodings, running clang-format on c-code, stripping old forgotten $Id$-tags, etc, and I don't want fast-export to evolve into emacs or acquire more flags than ls :)

What about adding a new flag --filter-contents <filter> <hg-hash> where <filter> is an executable filter which is called as <filter> <file-path> <hg-hash>? That way the filter can do whatever it wants. I would definitely merge such a patch as it adds a lot of flexibility for the small increase in fast-export complexity.

@Himura2la
Copy link

--filter-contents is cool. Especially if we create an example with line endings conversion using this feature.

@atykhyy
Copy link
Contributor Author

atykhyy commented Jun 11, 2018

How about this one? I pass filectx.filenode() for <hg-hash> - is that what you had in mind? I added a <binary> flag to the filter's command line arguments since Mercurial filectx provides it and it might be useful to text-processing scripts so they can pass-through binary files.

@Himura2la
Copy link

Himura2la commented Jun 12, 2018

@atykhyy can you please add this option to .sh? Seems like it only requires a documentation

@Himura2la
Copy link

Himura2la commented Jun 12, 2018

OK, thanks for quick reaction. But I think I don't understand how to use the filter. I guess the filtering script has no access to the filtered files, it simply executes in a git repo with empty index... So HOW to properly change the contents? Should I amend each file in each commit? I still don't know how to do it having only paths that does not exist in fs and hashes... To be honest, I expected that the filtering script should have an access to files directly, but if there's a way to convert line endings or make global substitutions this way, I'd be glad to know...

@atykhyy
Copy link
Contributor Author

atykhyy commented Jun 12, 2018

Yes, amend each file. There is no way the filtering script can have direct access to files in Git, they aren't imported until the output of hg-fast-export gets to git fast-import. I suppose the script can access files in Mercurial using the provided path and hash value, but this should not be necessary. If your goal is just to convert line endings, and you have no binary files in your repo, --filter-contents=dos2unix should work; if dos2unix is confused about its arguments, and/or if you have binary files, use a simple script like

#!/bin/sh
if "$3"; then dos2unix; else cat; fi

instead. Send the output of hg-fast-export to a file to check the results. I have used intermediate scripts that parse the output of hg-fast-export and pipe it on to git fast-import to split Mercurial repositories, fixing paths in .csproj files etc. In fact I'd like to add --filter-names to enable the filter script(s) to skip and rename files, functionality that is not easy to squeeze into --filter-contents as it accepts the (possibly binary) exported file as its standard input and writes the filtered file to its standard output.

@Himura2la
Copy link

Ah, so the file contents are piped into the script and the stdout is used as file contents! That's super convenient, I came up with the following script for documentation:

#!/bin/sh
#
# This script is executed in your **git** repository root
# for every file in every commit.
#
# The file contents is piped into it, and its output is used
# as file contents. It also takes three arguments:

path=$1
hash=$2
is_binary=$3

# Let's perform a line endings conversion without touching the binaries:

if [ "$is_binary" -eq "1" ]; then cat; else
    dos2unix
fi

Copy link
Owner

@frej frej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While doing the documentation, could you add a section to the README explaining the feature? Why not use the dos2unix example in from the discussion with @Himura2la?

Copy link
Owner

@frej frej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise this looks good, address the comments and squash it into a single commit, and I'll merge it. I don't know why the comment about binascii.hexlify below shows up as "outdated", but have a look at it too.

f=ctx.filectx(file)
d=f.data()
if filter_contents:
a=filter_contents + [filename,binascii.hexlify(f.filenode()),'1' if f.isbinary() else '0']
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know, that's why I'm asking, but I would expect that converting the raw node to hex should already be available from one of the mercurial packages, why not use it instead of pulling in an additional package?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll look for one.

@atykhyy
Copy link
Contributor Author

atykhyy commented Jun 15, 2018

OK. What do you think about --filter-names or similar to rename/exclude files (see above)?

@Himura2la
Copy link

Himura2la commented Jun 15, 2018

If the --filter-names could be used for this #129, I'd be happy

@frej frej changed the title Suggestion: add --as-binary option to convert line endings Filter file contents, Was: Suggestion: add --as-binary option to convert line endings Jun 15, 2018
@frej
Copy link
Owner

frej commented Jun 15, 2018

@atykhyy: I think --filter-names is a good idea, I have created #130 for it, in order to not mix up two mostly orthogonal features.

@atykhyy
Copy link
Contributor Author

atykhyy commented Jun 17, 2018

Added example to readme and squashed.

Copy link
Owner

@frej frej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some variable naming nitpicks, looks good otherwise.

README.md Outdated
# $2 = Mercurial's hash of the file
# $3 = "1" if Mercurial reports the file as binary, otherwise "0"

if "$3"; then dos2unix; else cat; fi
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try

X=1
if "$X" ; then echo true ; else echo false ; fi

and you'll see that this is not valid sh, you need to do if [ "$3" == "1" ]; ...

d=f.data()
if filter_contents:
import subprocess
a=filter_contents + [filename,node.hex(f.filenode()),'1' if f.isbinary() else '0']
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a? give it a more descriptive name please, what about filter_cmd?

if encoding:
filename=file.decode(encoding).encode('utf8')
else:
filename=file
f=ctx.filectx(file)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

f what is it? Call it file_info so I that I don't mix it up with all the other file related variables.

import subprocess
a=filter_contents + [filename,node.hex(f.filenode()),'1' if f.isbinary() else '0']
try:
p=subprocess.Popen(a,stdin=subprocess.PIPE,stdout=subprocess.PIPE)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

subproc instead of p

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@atykhyy
Copy link
Contributor Author

atykhyy commented Jun 19, 2018

Fixed variable naming.

@frej frej merged commit 89db1d9 into frej:master Jun 22, 2018
frej added a commit that referenced this pull request Jun 22, 2018
@frej
Copy link
Owner

frej commented Jun 22, 2018

Merged, thank you for your contribution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature-request This is a feature request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants