Acronyms such as OCLC, DBMS, and OLAP are converted to lowercase #172

Open
LeeiFrankJaw opened this issue Oct 29, 2024 · 13 comments

Comments

@LeeiFrankJaw

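Build environment

Package version: gbt7714 v2.1.5
\bibliographystyle{gbt7714-numerical}

Problem description

In the example in section 4.6.2 on page 7 of the national standard (GB/T 7714), "OCLC" keeps its capitals, but the compiled output is all lowercase. The standard does use sentence case, but I have never seen acronyms lowercased as well.

bib database code: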

@incollection{Stonebraker15,
  title =        "New DBMS Architectures",
  author =       "Stonebraker, Michael",
  year =         2015,
  month =        "Dec",
  booktitle =    "Readings in Database Systems",
  edition =      5,
  publisher =    "redbook.io",
  chapter =      4,
  tags =         "Database, Data Warehouse, Data Management",
  languages =    "eng",
  url =          "http://www.redbook.io/ch4-newdbms.html"
}

@techreport{Codd93,
  title =        "Providing OLAP to User-Analysts: An IT Mandate",
  author =       "Edgar F. Codd and S. B. Codd and C. T. Salley",
  year =         1993,
  month =        "Jan",
  publisher =    "E.F. Codd and Associates",
  tags =         "OLAP",
  languages =    "eng",
  url =          "https://web.archive.org/web/20040127222836/http://dev.hyperion.com/resource_library/white_papers/providing_olap_to_user_analysts_0.cfm"
}

Screenshots:
Screenshot from 2024-10-29 18-10-22

Above is the compiled output; below is the example from the standard.

Screenshot from 2024-10-29 18-12-35

@LeeiFrankJaw
Author

Anyway, I found a workaround: protect the all-caps acronyms in the bib file with braces.

@incollection{Stonebraker15,
  title =        "New {DBMS} Architectures",
  author =       "Stonebraker, Michael",
  year =         2015,
  month =        "Dec",
  booktitle =    "Readings in Database Systems",
  edition =      5,
  publisher =    "redbook.io",
  chapter =      4,
  tags =         "Database, Data Warehouse, Data Management",
  languages =    "eng",
  url =          "http://www.redbook.io/ch4-newdbms.html"
}

@techreport{Codd93,
  title =        "Providing {OLAP} to User-Analysts: An {IT} Mandate",
  author =       "Edgar F. Codd and S. B. Codd and C. T. Salley",
  year =         1993,
  month =        "Jan",
  publisher =    "E.F. Codd and Associates",
  tags =         "OLAP",
  languages =    "eng",
  url =          "https://web.archive.org/web/20040127222836/http://dev.hyperion.com/resource_library/white_papers/providing_olap_to_user_analysts_0.cfm"
}

@LeeiFrankJaw
Author

The page range is also displayed incorrectly; I will open a separate issue for that.

@LeeiFrankJaw
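Author

> The page range is also displayed incorrectly; I will open a separate issue for that.

I read your changelog, and this appears to be a bug in the standard itself. The standard is honestly a piece of crap; nobody would use it if it were not mandated. With so many mature styles available, who would choose it voluntarily?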

@LeeiFrankJaw
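Author

I later found some more issues. The image below is page 4 of the standard.

Screenshot from 2024-10-30 11-43-57

As you can see, the first letter of the subtitle is lowercase as well. In this package's output, however, the first letter of the subtitle stays uppercase. Keeping it uppercase is actually what matches English convention, and it also follows APA (APA is the only mainstream style I know of that uses sentence case in the reference list). Yet in the standard's examples the first letter of the subtitle is lowercase.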

@LeeiFrankJaw
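Author

There is also the en dash problem. In bib files downloaded from the ACM DL, the page separator is by default the Unicode character U+2013 (here is an example), but this package only handles the case where the separator is a hyphen-minus. Most people now use either LuaTeX or XeTeX, so Unicode characters as the default will only become more common.

When I have time, I will submit pull requests for all the problems I have run into.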
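The separator problem described in this comment can be illustrated outside of BibTeX. The Python snippet below is only a minimal sketch and not the package's bst code: the name `normalize_page_range` and its `separator` parameter are made up for this example, and it assumes a page range is separated by either U+2013 or a run of one or more hyphen-minus characters.

```python
import re

# Separators seen in downloaded bib files: U+2013 (EN DASH) or a run of
# U+002D (HYPHEN-MINUS), optionally surrounded by spaces.
_RANGE_SEPARATOR = re.compile(r"\s*(?:\u2013|-+)\s*")


def normalize_page_range(pages: str, separator: str = "-") -> str:
    """Rewrite every page-range separator in `pages` as `separator`.

    Hypothetical helper for illustration only; the default separator is
    an assumption, not a value taken from the package.
    """
    return _RANGE_SEPARATOR.sub(separator, pages.strip())


if __name__ == "__main__":
    # One sample per separator style mentioned in the thread.
    for raw in ["799\u2013808", "4087--4088", "179-86"]:
        print(raw, "->", normalize_page_range(raw))
```

Making the output separator a parameter keeps the input handling (which characters count as a separator) apart from the style decision (which character the bibliography should print).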

@zepinglee
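Owner

> Anyway, I found a workaround: protect the all-caps acronyms in the bib file with braces.

Yes, that is the correct way to use BibTeX.

The two capitalization conventions are called title case and sentence case; see "Letter case in headings and publication titles".

The correct BibTeX practice is to record every title in the bib file in title case and let the style decide whether to convert it to sentence case. This automatic conversion turns some proper nouns, symbols, and units into an inappropriate lowercase form. Such terms can be enclosed in braces, and BibTeX will then leave them untouched. For example: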

@book{forster,
...
title = {Lectures on {Riemann} Surfaces},
...
}
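The result is "Lectures on Riemann surfaces", where "Riemann" is not lowercased.

In the rare case where most entries in the bib database are already in sentence case and BibTeX should not convert case automatically, change #1 'sentence.case.title := to #0 in the load.config function of the corresponding bst file.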

@zepinglee
Owner

> The page range is also displayed incorrectly

What is wrong with it?

@zepinglee
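Owner

> I later found some more issues. The image below is page 4 of the standard.
>
> As you can see, the first letter of the subtitle is lowercase as well. In this package's output, however, the first letter of the subtitle stays uppercase. Keeping it uppercase is actually what matches English convention, and it also follows APA (APA is the only mainstream style I know of that uses sentence case in the reference list). Yet in the standard's examples the first letter of the subtitle is lowercase.

> In this package's output, however, the first letter of the subtitle stays uppercase.

Yes, this is how BibTeX's built-in function change.case$ behaves; see its documentation, btxhak.pdf: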

... it converts to lower case all letters except the very first character in the string, which it leaves alone, and except the first character following any colon and then nonnull white space, which it also leaves alone;

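This built-in function is not easy to override, so the result differs slightly from the standard. That should not matter much, because the standard does not explicitly prescribe a capitalization scheme either.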
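The rule quoted from btxhak.pdf, together with brace protection, can be mimicked in a few lines of Python. This is a simplified sketch, not BibTeX's actual implementation: the name `change_case_t` is invented here, "special characters" such as {\'o} (which real BibTeX does convert) are ignored, and Python's `str.lower` is used for every character.

```python
def change_case_t(title: str) -> str:
    """Rough imitation of change.case$ in "t" mode (illustrative only).

    Lowercases everything at brace depth 0 except the very first
    character and the first character after a colon plus whitespace.
    """
    out = []
    depth = 0           # current brace nesting level
    prev_colon = False  # True after ':' until a non-space character is seen
    for i, ch in enumerate(title):
        if ch == "{":
            depth += 1
            prev_colon = False
            out.append(ch)
        elif ch == "}":
            depth = max(depth - 1, 0)
            prev_colon = False
            out.append(ch)
        elif depth > 0:
            out.append(ch)  # brace-protected text is copied unchanged
        else:
            is_first = i == 0
            after_colon = prev_colon and title[i - 1].isspace()
            out.append(ch if is_first or after_colon else ch.lower())
            if ch == ":":
                prev_colon = True
            elif not ch.isspace():
                prev_colon = False
    return "".join(out)


if __name__ == "__main__":
    print(change_case_t("Providing OLAP to User-Analysts: An IT Mandate"))
    print(change_case_t("Providing {OLAP} to User-Analysts: An {IT} Mandate"))
```

The first call shows why unprotected acronyms end up lowercase while the "An" after the colon keeps its capital; the second shows the effect of bracing the acronyms.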

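> Keeping it uppercase is actually what matches English convention

In several mainstream American styles, including Chicago and IEEE, sentence case requires a capital letter after the colon. This is not universal, though: Vancouver, for example, does not capitalize there; see the fourth example at https://www.nlm.nih.gov/bsd/uniform_requirements.html.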

Forooghian F, Yeh S, Faia LJ, Nussenblatt RB. Uveitic foveal atrophy: clinical features and associations. Arch Ophthalmol. 2009 Feb;127(2):179-86. PubMed PMID: 19204236; PubMed Central PMCID: PMC2653214.

> (APA is the only mainstream style I know of that uses sentence case in the reference list)

IEEE and Vancouver also use sentence case.

@LeeiFrankJaw
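Author

I had almost finished writing my thesis in LaTeX when my advisor told me to rewrite it in Word for submission. What a lousy school.

As for the issues raised here, most of them are already resolved in my fork; see b3c9d79. Once I have time to update the code documentation, I can open a pull request.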

@LeeiFrankJaw
Author

LeeiFrankJaw commented Nov 21, 2024

I found another bug in the standard.

Screenshot from 2024-10-29 18-12-35

"Dublin Core" here is a proper noun and should keep its capitals; see the Wikipedia entry and the usage in the body of the cited article.

@LeeiFrankJaw
Author

LeeiFrankJaw commented Nov 21, 2024

My fork can now produce this result. I am recording it here for reference.

The bib database file is as follows.

@techreport{Codd93,
  title =        "Providing OLAP to User-Analysts: An IT Mandate",
  author =       "Edgar F. Codd and S. B. Codd and C. T. Salley",
  year =         1993,
  month =        "Jan",
  institution =  "Arbor Software",
  address =      {Palo Alto, CA},
  publisher =    "E.F. Codd and Associates",
  url =          "https://web.archive.org/web/20040127222836/http://dev.hyperion.com/resource_library/white_papers/providing_olap_to_user_analysts_0.cfm"
}

@incollection{Stonebraker15,
  title =        "New DBMS Architectures",
  author =       "Stonebraker, Michael",
  year =         2015,
  month =        "Dec",
  booktitle =    "Readings in Database Systems",
  edition =      5,
  publisher =    "redbook.io",
  chapter =      4,
  url =          "http://www.redbook.io/ch4-newdbms.html"
}

@INPROCEEDINGS{Wang14,
  author =       {Wang, Yan-Dong and Goldstone, Robin and Yu, Wei-Kuan and
                  Wang, Teng},
  booktitle =    {28th International Parallel and Distributed Processing
                  Symposium. Phoenix, AZ, 2014},
  title =        {Characterization and Optimization of Memory-Resident
                  MapReduce on HPC Systems},
  publisher =    {IEEE},
  year =         2014,
  pages =        {799–808},
  doi =          {10.1109/IPDPS.2014.87}
}

@Standard{ISO9075-2:2023,
  title =        {Information technology — database languages SQL —
                  part 2: foundation (SQL/foundation).  ISO/IEC
                  9075-2:2023},
  url =          {https://www.iso.org/standard/76584.html},
  author =       "{ISO/IEC JTC 1/SC 32}"
}

@online{4.6.2:4,
  author       = {{Online Computer Library Center, Inc}},
  title        = {About {OCLC}: History of Cooperation},
  urldate      = {2012-03-27},
  url          = {http://www.oclc.org/about/cooperation.en.html},
}

@online{4.6.2:5,
  author       = {Hopkinson, A},
  title        = {UNIMARC and Metadata: {Dublin Core}},
  year         = {2009},
  date         = {2009-04-22},
  urldate      = {2013-03-27},
  url          = {http://archive.ifla.org/IV/ifla64/138-161e.htm},
}

@article{10-1:16,
  author =       {Kusch, P and Hessel, M M},
  title =        {Perturbations in the A {$^1\Sigma_u^+$} state of {Na$_2$}},
  journal =      {J Chem Phys},
  year =         1975,
  volume =       63,
  pages =        {4087--4088},
  url =          {https://doi.org/10.1063/1.431885}
}

The generated output is as follows.

I noticed that the files thu-numeric.dtx and year-suffix-overflow.dtx under the test/testbst/ directory also contain many similar examples.

@zepinglee
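Owner

> I found another bug in the standard.
>
> "Dublin Core" here is a proper noun and should keep its capitals; see the Wikipedia entry and the usage in the body of the cited article.

The standard has quite a few errors; getting the capitalization wrong here is one of the minor ones.

> My fork can now produce this result. I am recording it here for reference.
>
> I noticed that the files thu-numeric.dtx and year-suffix-overflow.dtx under the test/testbst/ directory also contain many similar examples.

I do not understand what effect you are trying to achieve.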

@LeeiFrankJaw
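Author

LeeiFrankJaw commented Nov 22, 2024

> I do not understand what effect you are trying to achieve.

What I mean is that my branch now implements the following three points.

  1. This is the point named in the title of this issue. Titles in the bib file such as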

    • Providing OLAP to User-Analysts: An IT Mandate
    • New DBMS Architectures
    • Characterization and Optimization of Memory-Resident MapReduce on HPC Systems

    become

    • Providing OLAP to user-analysts: an IT mandate
    • New DBMS architectures
    • Characterization and optimization of memory-resident MapReduce on HPC systems

    Parts protected by braces remain protected; for example, UNIMARC and Metadata: {Dublin Core} becomes UNIMARC and metadata: {Dublin Core}. The files thu-numeric.dtx and year-suffix-overflow.dtx contain quite a few such cases, for example

    • MLC
    • the A {$^1\Sigma_u^+$} state
    • CW-10
    • JOUAV
    • VertiKUL
    • HexH2O
    • VoloCity
    • ExynAero
    • DroneHunter
    • SkyTy
    • CityAirbus
    • MK
    • UAVs

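    The cases in the latter file all seem to be models of drones and concept aircraft. The more interesting one is the second example from the former file. While implementing this, I had in mind things like "Plan A" or programming languages such as APL, J, and C; I did not expect the test suite to actually contain such a case.

  2. This is the problem raised in the fifth comment of this issue. The first word after a colon is now also lowercased according to certain rules. This behavior is controlled by the variable lowercase.word.after.colon, which is enabled by default.

  3. This is the problem raised in the sixth comment of this issue. I want bib files downloaded from the ACM Digital Library or other databases to be usable as they are. The bib files from these sites use U+2013 (EN DASH) directly as the page separator instead of one or two U+002D (HYPHEN-MINUS) characters. Of the sites I have seen so far, most use a single HYPHEN-MINUS, IEEE uses two, and the remaining few use EN DASH, the best known of which is the ACM site.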
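The thread does not spell out the rules the fork applies, so the following Python sketch is only a guess that reproduces the title examples listed in point 1; it is not the fork's bst code. The function names are hypothetical, and the `lowercase_after_colon` argument merely echoes the lowercase.word.after.colon variable mentioned in point 2. The assumed heuristic: keep the first word, keep any brace-protected token, keep any word that has a capital letter after its first character (acronyms, CamelCase names), and lowercase everything else.

```python
import re


def _lower_unless_acronym(word: str) -> str:
    # A capital after the first character suggests an acronym (OLAP)
    # or a CamelCase name (MapReduce); leave such words untouched.
    if any(c.isupper() for c in word[1:]):
        return word
    return word.lower()


def smart_sentence_case(title: str, lowercase_after_colon: bool = True) -> str:
    """Heuristic sentence casing that preserves acronyms (illustrative)."""
    out = []
    depth = 0            # brace nesting level carried across tokens
    first_word = True
    after_colon = False
    # Alternate between whitespace runs and non-whitespace tokens.
    for token in re.findall(r"\s+|\S+", title):
        if token.isspace():
            out.append(token)
            continue
        protected = depth > 0 or "{" in token
        depth += token.count("{") - token.count("}")
        keep = (protected or first_word
                or (after_colon and not lowercase_after_colon))
        if keep:
            out.append(token)
        else:
            # Treat each part of a hyphenated word separately.
            parts = token.split("-")
            out.append("-".join(_lower_unless_acronym(p) for p in parts))
        first_word = False
        after_colon = token.endswith(":")
    return "".join(out)


if __name__ == "__main__":
    print(smart_sentence_case("Providing OLAP to User-Analysts: An IT Mandate"))
    print(smart_sentence_case("UNIMARC and Metadata: {Dublin Core}"))
```

One known gap in this sketch: a lone capital that is not the first word, such as the "A" in "the A ... state", would be lowercased unless it is wrapped in braces.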

LeeiFrankJaw added a commit to LeeiFrankJaw/gbt7714-bibtex-style that referenced this issue Dec 2, 2024
This squashed commit basically fixes issue [145][1] and the 3 points mentioned
in issue [172][2].  It now supports basic UTF-8 characters and performs
proper case conversion for common Latin letters with diacritics.  At the
same time, it is backward compatible with most existing syntax.
Nothing can demonstrate this better than the following examples.

```
"200 \LaTeX \ae Foö{bar}{\'o \ae}{{\'o}}" smart.upper.case top$
"Perturbations in the A $\Sigma_u^+$ state of Na$_2$" smart.sentence.case top$
```

The above code produces `200 \LaTeX \AE FOÖ{bar}{\'O \AE}{{\'o}}` and
`Perturbations in the A $\Sigma_u^+$ state of Na$_2$. {\H {c}a{d{e}}}o`.
It mostly respects the `x_change_case` procedure implemented in
[_bibtex.web_][3].  One obvious difference is that commands (except
those for single letters with diacritics) at brace level 0 won't undergo
case transformation.  Brace protection is still honored.  For UTF-8
characters and simple math expressions, you probably won't need braces
though, as indicated in the above example.

[1]: zepinglee#145
[2]: zepinglee#172 (comment)
[3]: https://tug.org/svn/texlive/trunk/Build/source/texk/web2c/bibtex.web?revision=57915&view=markup#l8884

Squashed commit of the following:

commit bc1574f
Author: Lei Zhao <[email protected]>
Date:   Mon Dec 2 10:46:11 2024 +0800

    Rewrite normalize.page.range again

commit b54d856
Author: Lei Zhao <[email protected]>
Date:   Fri Nov 29 03:31:35 2024 +0800

    Rewrite normalize.page.range

commit bdb9fdd
Author: Lei Zhao <[email protected]>
Date:   Thu Nov 28 08:22:00 2024 +0800

    Implement functions for texchar semantics

commit a1d3f5b
Author: Lei Zhao <[email protected]>
Date:   Wed Nov 27 17:26:57 2024 +0800

    Support Latin Extended-A

commit 775f797
Author: Lei Zhao <[email protected]>
Date:   Wed Nov 27 16:33:01 2024 +0800

    Implement smart.upper.case

commit 20bffbd
Author: Lei Zhao <[email protected]>
Date:   Sat Nov 23 00:46:39 2024 +0800

    Support polymorphism when tokenizing

commit 31ef32e
Author: Lei Zhao <[email protected]>
Date:   Thu Nov 21 21:57:27 2024 +0800

    Update tests for Dublin Core entry

commit 4abe601
Author: Lei Zhao <[email protected]>
Date:   Thu Nov 21 09:08:03 2024 +0800

    Increase compatibility of font selection

commit aad1cb9
Author: Lei Zhao <[email protected]>
Date:   Thu Nov 21 06:48:52 2024 +0800

    Use page.range.separator

    Rewrite `hyphenate` and rename it to `normalize.page.range`.  Use
    `page.range.separator` to configure separator in page ranges.  Also
    update DocStrip options for added or modifed configuration variables.

commit 0450382
Author: Lei Zhao <[email protected]>
Date:   Thu Nov 21 04:49:56 2024 +0800

    Follow the existing convention

    Follow the convention of using function to define constants

commit cf3c5c6
Author: Lei Zhao <[email protected]>
Date:   Thu Nov 21 02:06:25 2024 +0800

    Refactoring

     * Now `is.all.lower` returns true for empty strings.  Following the
       convention of modern predicate logic, it assumes no existential
       import.  Update functions which depend on it.

     * The second argument of the return value of `split.first.char.from.str`
       is of polymorphic type.  It returns an empty string for an empty
       string instead of a null char.

     * Some functions are rewritten to enable short-circuit evaluation.

commit 352b89b
Author: Lei Zhao <[email protected]>
Date:   Wed Nov 20 19:30:07 2024 +0800

    Do some refactoring and renaming

commit bbf5add
Author: Lei Zhao <[email protected]>
Date:   Wed Nov 20 07:57:18 2024 +0800

    Enable lowercase.word.after.colon by default

    Also update tests for this

commit 6a7c3bc
Author: Lei Zhao <[email protected]>
Date:   Wed Nov 20 07:49:36 2024 +0800

    Update tests for smart.sentence.case

commit b3c9d79
Author: Lei Zhao <[email protected]>
Date:   Wed Nov 20 06:40:38 2024 +0800

    Add basic UTF-8 support

commit 456687e
Author: Lei Zhao <[email protected]>
Date:   Sat Nov 16 05:53:13 2024 +0800

    Remove ignore.extra.interword.space

    This feature is extraneous since the extra spaces are already
    preprocessed by the BibTeX.

commit b15d673
Author: Lei Zhao <[email protected]>
Date:   Thu Nov 14 20:33:20 2024 +0800

    Improve smart.sentence.case.lower.token

commit d7e7f53
Author: Lei Zhao <[email protected]>
Date:   Thu Nov 14 19:18:29 2024 +0800

    Basically finish the smart lowercase feature

commit 5da7b11
Author: Lei Zhao <[email protected]>
Date:   Mon Nov 11 22:22:24 2024 +0800

    Update the source dtx file

commit dc88ba8
Author: Lei Zhao <[email protected]>
Date:   Mon Nov 11 20:38:39 2024 +0800

    Add en.dash.in.pages option

    Also process UTF-8 en dash (–)