-
Notifications
You must be signed in to change notification settings - Fork 83
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for --wrap option #129
base: master
Are you sure you want to change the base?
Conversation
The previous behavior of `bagit.py` was asymmetrical with respect to how it treated long tag lines: it would handle folded lines when reading the `bag-info.txt` file, but it would not fold long lines when writing long tag values into `bag-info.txt`. This was hardwired into the function `_make_tag_file()`, which explicitly stripped line endings from tag values before writing them. Section 2.2.2 of the BagIt spec (https://tools.ietf.org/html/rfc8493) states the following: It is RECOMMENDED that lines not exceed 79 characters in length. Long values MAY be continued onto the next line by inserting a LF, CR, or CRLF, and then indenting the next line with one or more linear white space characters (spaces or tabs). Except for linebreaks, such padding does not form part of the value. This commit adds a new command-line option, `--wrap`, and a new parameter named `line_width` to the functions `make_bag()` and `save()`, to make it possible to follow the recommendation. The default value is 0, which means don't wrap (which is the original behavior). An integer value greater than 0 causes line-wrapping to be performed on a best-effort basis to limit line lengths to the given value.
1 similar comment
If the given line length is shorter than the length of a tag/property name, then we should start the value on the next line in order to try to keep to the desired length. This commit adds a test for that situation and makes line wrapping behave a little bit better.
bagit.py
Outdated
@@ -1499,6 +1527,17 @@ def _make_parser(): | |||
" without performing checksum validation to detect corruption." | |||
), | |||
) | |||
parser.add_argument( | |||
"--wrap", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea but I was wondering whether “wrap” might be a little too generic a name for the option. What do you think about something like --wrap-tags
or --max-tag-line-length
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I struggled with naming the option! The current result was the outcome of trying to satisfy two goals: (1) emulating the existing bagit.py
style of using mostly single words (for those options other than the metadata fields), and (2) accounting for the fact that the argument parser displays the associated variable name as part of the help text. The latter means that one ends up having to think about the argument name and the variable jointly. The current combination leads to the following output for bagit.py -h
:
--wrap LINE_LENGTH Limit line lengths in the tag file (bag-info.txt) by
wrapping long values and indenting subsequent lines
with a space character. (Default: don't.)
I had hoped this was clear enough, but I have to agree that by itself, the current combination leaves ambiguous what is being wrapped. I'm fine with changing it. I lean towards --wrap-tags
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on --wrap-tags
. I'd prefer clarity to single words and with that we don't to touch either dest
or add a metavar
option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Just mentioning it in case it wasn't obvious: all these discussed changes are in the update to the PR made a few days ago.)
bagit.py
Outdated
@@ -137,7 +138,8 @@ def find_locale_dir(): | |||
|
|||
|
|||
def make_bag( | |||
bag_dir, bag_info=None, processes=1, checksums=None, checksum=None, encoding="utf-8" | |||
bag_dir, bag_info=None, processes=1, checksums=None, checksum=None, | |||
encoding="utf-8", line_length=0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering whether line_length
should be something like max_tag_line_length
so it's a little more obvious what it applies to
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree line_length
is a bit too general. In thinking what would make sense with an option named --wrap-tags
(c.f. conversation above), I worry max_tag_line_length
might be too much:
--wrap-tags MAX_TAG_LINE_LENGTH
Limit line lengths in the tag file (bag-info.txt) by
wrapping long values and indenting subsequent lines
with a space character. (Default: don't.)
What about tag_wrap_column
? That leads to:
--wrap-tags TAG_WRAP_COLUMN
Limit line lengths in the tag file (bag-info.txt) by
wrapping long values and indenting subsequent lines
with a space character. (Default: don't.)
bagit.py
Outdated
first_break = prop_width + len(txt.split(None, 1)[0]) | ||
if line_length <= first_break: | ||
# Start value on the next line. | ||
txt = '\n'.join(textwrap.wrap(txt, width=line_length, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two calls are almost identical but it takes time to confirm that. Do you think it would be worth assigning initial_indent
first so we can have an identical textwrap call? I'm also wondering whether this block of code is now long enough that we should break it into a separate function so this would just be for txt in values: txt = wrap_tag(h, txt, line_length)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding the duplication: although initial_indent
could be pulled out so that there's only one call to wrap
, the problem is that there's a 2nd step after the call to wrap
that depends on the same condition (see line 1257). So, the resulting alternative code would have a conditional for setting initial_indent
, followed by the call to wrap
, followed by an identical conditional to adjust the output of wrap
. I find this alternative less readable than the current code (I tried it to see), although I realize this may be a matter of personal preference. Still, what about simply moving the parameter initial_indent
higher up in the call to wrap
, so that it's easier to spot the difference quickly?
Regarding splitting it out to a separate function: agreed. However, I might recommend naming the function differently, if you don't object, because in this case it wouldn't necessarily wrap the tag value. So perhaps something like _format_tag_value
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went ahead and implemented the changes discussed above in the latest commit.
This attempts to address issues raised during the review of [pull request 129](LibraryOfCongress#129). Changes involve renaming the `--wrap` argument and associated internal variable, and factoring out the line wrapping code into a separate function.
This attempts to address issues raised during the review of [pull request 129](#129). Changes involve renaming the `--wrap` argument and associated internal variable, and factoring out the line wrapping code into a separate function.
f.write("%s: %s\n" % (h, txt)) | ||
|
||
|
||
def _format_tag_value(header, txt, tag_wrap_column): | ||
txt = force_unicode(txt) | ||
if tag_wrap_column > 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be if text_wrap_column:
because all of the documentation specifies setting a non-zero value?
I pushed https://github.com/LibraryOfCongress/bagit-python/tree/wrap-tag-lines to have an updated version of your pull request which is rebased against the current version of bagit-python and has a little extra test coverage for the wrapping behaviour. |
The previous behavior of
bagit.py
was asymmetrical with respect to howit treated long tag lines: it would handle folded lines when reading
the
bag-info.txt
file, but it would not fold long lines when writinglong tag values into
bag-info.txt
. This was hardwired into thefunction
_make_tag_file()
, which explicitly stripped line endings fromtag values before writing them.
Section 2.2.2 of the BagIt spec (https://tools.ietf.org/html/rfc8493)
states the following:
This PR adds a new command-line option,
--wrap
, and a newparameter named
line_width
to the functionsmake_bag()
andsave()
, to make it possible to follow the recommendation. Thedefault value is 0, which means don't wrap (which is the original
behavior). An integer value greater than 0 causes line-wrapping to be
performed on a best-effort basis to limit line lengths to the given
value.
This addresses issue #126.