Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for converting from Hijri calendar to undate and undate interval #107

Open
wants to merge 25 commits into
base: develop
Choose a base branch
from

Conversation

rlskoeser
Copy link
Member

@rlskoeser rlskoeser commented Nov 21, 2024

  • lark parser for Hijri dates - supports year, year month, and year month day
  • lark transformer that turns the lark parse tree into an undate objects
  • undate converter that connects the lark parser and transformer into the undate converter class
  • unit tests for all, borrowing examples from Princeton Geniza Project (PGP) and ISMI datasets

related to #10

improvements after initial review

  • improve documentation (with example usage) and include in sphinx docs
  • add a calendar property to undate (default gregorian)
  • initialize undate object with numeric year, month, day and add optional calendar (default gregorian)
  • use calendar to convert earliest/latest to gregorian for comparison; confirm date precision is set as expected
  • remove 3-digit year constraint in parser

Questions:

  • How reasonable is it to limit support to 3+ digit Hijri years? If we can make this assumption the parsing is much faster and more efficient, AND we can support both the year month day format in your ISMI data as well as the day month year in PGP data. ... but it might be an unreasonable limitation.
  • We don't yet have a notion of a precision or size for undate intervals, but this code suggests that it would be a useful option, for when the length of the interval is known
  • I'd like to write a notebook to test this logic against PGP dates, but prefer to do that in a subsequent PR - does that make sense, or would a demo notebook make it easier to review this code?

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced HijriDateConverter for converting Hijri calendar dates to Gregorian dates.
    • Added grammar rules for parsing Hijri dates, supporting various formats.
    • Implemented HijriDateTransformer for transforming parsed Hijri dates into Undate or UndateInterval objects.
  • Bug Fixes

    • Updated EDTF date parsing rules to include the 20th day of the month.
  • Tests

    • Added comprehensive test suites for HijriDateConverter, hijri_parser, and HijriDateTransformer to ensure functionality and error handling.

Copy link
Contributor

coderabbitai bot commented Nov 21, 2024

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

The pull request introduces several changes, primarily focusing on the addition of the HijriDateConverter class and related components for converting Hijri calendar dates to Gregorian dates. The pyproject.toml file is updated to modify dependencies, including the addition of convertdate and a change in the lark dependency. New grammar specifications for parsing Hijri dates are also introduced, along with tests to validate the new functionality.

Changes

File Path Change Summary
pyproject.toml Updated dependencies: added convertdate, changed lark to lark[interegular].
src/undate/converters/calendars/__init__.py Added HijriDateConverter to __all__ for public API.
src/undate/converters/calendars/hijri/__init__.py Imported HijriDateConverter and updated __all__ to include it.
src/undate/converters/calendars/hijri/converter.py Introduced HijriDateConverter class with a parse method for converting Hijri dates.
src/undate/converters/calendars/hijri/hijri.lark Added grammar rules for parsing Hijri dates, including definitions for years, months, and days.
src/undate/converters/calendars/hijri/parser.py Introduced parser.py to initialize hijri_parser using hijri.lark.
src/undate/converters/calendars/hijri/transformer.py Added HijriDateTransformer class for transforming Hijri date parse trees.
src/undate/converters/edtf/edtf.lark Updated day rule regex to include the 20th day of the month.
src/undate/converters/edtf/parser.py Changed import from os.path to pathlib, removed commented-out test cases.
src/undate/undate.py Updated comments and type annotations in UndateInterval class.
tests/test_converters/test_calendars/test_hijri/test_hijri_converter.py Added tests for HijriDateConverter functionality.
tests/test_converters/test_calendars/test_hijri/test_hijri_parser.py Introduced tests for hijri_parser to validate various Hijri date formats.
tests/test_converters/test_calendars/test_hijri/test_hijri_transformer.py Added tests for HijriDateTransformer class and its conversion logic.

Possibly related PRs

Suggested reviewers

  • ColeDCrawford
  • robcast
  • jdamerow

🐰 In the garden where dates bloom bright,
A converter hops with all its might.
From Hijri to Gregorian, it leaps with glee,
Transforming dates for you and me!
With tests in place, it’s sure to shine,
A joyful change, oh how divine! 🌼


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@rlskoeser
Copy link
Member Author

@coderabbitai review

Copy link
Contributor

coderabbitai bot commented Nov 21, 2024

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 11

🧹 Outside diff range and nitpick comments (18)
tests/test_converters/test_calendars/test_hijri/test_hijri_converter.py (3)

7-12: Consider improving test method name and documentation.

  1. The method name test_parse_ has a trailing underscore which is unusual. Consider renaming it to something more descriptive like test_parse_hijri_dates or test_parse_valid_dates.
  2. Add a docstring to explain the test's purpose and the significance of the test dates (if any).

Example improvement:

-    def test_parse_(self):
+    def test_parse_hijri_dates(self):
+        """Test parsing of valid Hijri dates in different formats.
+        
+        Tests conversion of:
+        - Specific day (7 Jumādā I 1243)
+        - Month only (Rajab 495)
+        - Year only (441)
+        """

14-18: Add context about test data selection.

Consider adding a comment explaining why these specific dates were chosen for testing. Are they edge cases or examples from real data (e.g., from Princeton Geniza Project or ISMI datasets mentioned in PR objectives)?


20-24: Consider adding more year parsing test cases.

Given the PR's question about limiting support to Hijri years with three or more digits, consider adding test cases for:

  1. Two-digit years (to verify they're rejected if that's the intended behavior)
  2. Four-digit years
  3. Edge cases around year boundaries
tests/test_converters/test_calendars/test_hijri/test_hijri_parser.py (3)

5-5: Enhance the test scope documentation.

The current comment could be more descriptive about what constitutes a "valid date" and what aspects of parsing are being tested.

Consider expanding the comment to:

-# for now, just test that valid dates can be parsed
+# For now, we only verify that the parser can handle valid Hijri dates without errors.
+# TODO: Add tests for:
+# - Invalid date validation
+# - Edge cases (e.g., leap years)
+# - Error messages for malformed inputs

7-46: Consider adding more comprehensive test cases.

While the current test cases cover various format variations well, consider adding:

  1. Boundary cases:
    • Minimum/maximum valid years
    • Edge cases for different months (e.g., months with varying days)
  2. Invalid inputs:
    • Invalid day numbers (e.g., 31st in a 30-day month)
    • Invalid month names
    • Malformed strings

Here's a suggested addition to the test cases:

# Add to testcases list:
    # boundary cases
    "1 Muḥarram 1",  # minimum year
    "29 Dhū al-Ḥijjah 9999",  # maximum year
    
    # more complex formats
    "early Ramaḍān 903",  # with period qualifier
    "mid-Shawwāl 1042",  # with period qualifier and hyphen

1-51: Consider adding test fixtures and helper functions.

To improve maintainability and reduce duplication as more test cases are added, consider:

  1. Creating fixtures for common test data
  2. Adding helper functions for result validation
  3. Organizing test cases into classes by format type

Example structure:

@pytest.fixture
def valid_hijri_dates():
    return {
        "year_only": ("521", {"year": 521}),
        "month_year": ("Muḥarram 900", {"year": 900, "month": 1}),
        # ... more test cases with expected results
    }

class TestHijriParser:
    def test_year_format(self, valid_hijri_dates):
        date_string, expected = valid_hijri_dates["year_only"]
        result = hijri_parser.parse(date_string)
        self._assert_parsed_result(result, expected)

    @staticmethod
    def _assert_parsed_result(result, expected):
        for key, value in expected.items():
            assert getattr(result, key) == value
src/undate/converters/calendars/hijri/hijri.lark (3)

10-12: Consider adding upper bound for year validation.

The current pattern /\d{3,}/ allows unlimited digits, which could accept invalid years. Consider:

-year: /\d{3,}/
+year: /\d{3,4}/  // Assuming max 4 digits for reasonable Hijri years

Also, consider adding a comment documenting the expected year range for the Hijri calendar.


28-28: Optimize day pattern for better readability.

The current day pattern can be simplified:

-day: /[1-9]/ | /[12][0-9]/ | /30/
+day: /([1-9]|[12][0-9]|30)/

33-51: Enhance month patterns for consistency and robustness.

Several improvements could be made to the month patterns:

  1. Use consistent pattern style (regex vs quotes):
-month_7: "Rajab"
+month_7: /Rajab/
  1. Consider adding case-insensitive flag for better flexibility:
-month_1: /al-Mu[ḥh]arram/ | /Mu[ḥh]arram/
+month_1: /(?i)al-mu[ḥh]arram/ | /(?i)mu[ḥh]arram/
  1. Add comments for each variant to document the source/reason for inclusion.
src/undate/converters/edtf/edtf.lark (1)

Line range hint 1-100: Clarify the relationship between EDTF and Hijri date parsing

Given that this PR aims to add Hijri calendar support, please clarify:

  1. Will the EDTF grammar serve as a base for Hijri date parsing?
  2. How will the parsing logic handle the differences between EDTF and Hijri date formats?

Consider adding documentation comments at the top of this file to explain its role in the broader date parsing architecture.

src/undate/converters/calendars/hijri/transformer.py (4)

7-9: Enhance class docstring with more details.

Consider expanding the docstring to include:

  • Input format expectations
  • Return value details
  • Examples of usage
 class HijriDateTransformer(Transformer):
-    """Transform a Hijri date parse tree and return an Undate or
-    UndateInterval."""
+    """Transform a Hijri date parse tree into Gregorian dates.
+
+    Transforms parsed Hijri dates into either a single Undate (for complete dates)
+    or an UndateInterval (for partial dates).
+
+    Args:
+        items: Parse tree containing year, month, and/or day nodes
+
+    Returns:
+        Undate: For complete dates with year, month, and day
+        UndateInterval: For partial dates missing month or day
+
+    Example:
+        >>> transformer = HijriDateTransformer()
+        >>> tree = parser.parse("450 AH")
+        >>> result = transformer.transform(tree)
+    """

28-29: Consider implementing the TODO for date precision.

The TODO raises a valid point about precision for partial dates. This could be important for academic or historical research purposes.

Would you like me to help design and implement the precision feature for UndateInterval? This could include:

  • Adding a precision enum (YEAR, MONTH, DAY)
  • Extending UndateInterval to store precision
  • Updating the transformation logic to set appropriate precision

32-35: Remove commented-out code.

The commented-out year method adds unnecessary noise. If it's not needed, remove it.

-    # this does nothing
-    # def year(self, items):
-    #     return Tree(data="year", children=[items[0]])

45-47: Add more comprehensive calendar constants.

Consider adding more calendar-specific constants for better maintainability.

 MIN_MONTH, MIN_DAY = 1, 1
 MAX_MONTH = 12
+MAX_DAY = 30  # Maximum days in Islamic calendar month
+MIN_YEAR = 1  # Minimum valid Hijri year
+MAX_YEAR = 9999  # Maximum supported year
tests/test_converters/test_calendars/test_hijri/test_hijri_transformer.py (1)

7-42: Consider adding more edge cases to the test suite.

The current test cases are well-structured and include real-world examples with proper documentation. However, consider adding these edge cases:

  • Invalid month names
  • Out-of-range day values
  • Edge cases around Islamic calendar year boundaries
  • Different variations of month name spellings
pyproject.toml (1)

12-12: Consider adding version constraints for dependencies.

While the project is in Pre-Alpha, it's good practice to specify version constraints to ensure reproducible builds and prevent compatibility issues:

-dependencies = ["lark[interegular]", "numpy", "convertdate"]
+dependencies = [
+    "lark[interegular]>=1.1.5",
+    "numpy>=1.24.0",
+    "convertdate>=2.4.0"
+]
src/undate/undate.py (1)

450-452: Enhance the TODO comment with implementation details

The TODO suggestion about adding precision/length field using DatePrecision is valuable, especially for Hijri date conversions. Consider enhancing the comment to specify:

  1. Whether the precision should be stored per interval or inherited from constituent Undates
  2. How precision would affect interval comparisons and duration calculations
  3. Examples of precision usage in Hijri date conversion scenarios

Here's a suggested enhancement for the TODO comment:

-    # TODO: let's think about adding an optional precision / length /size field
-    # using DatePrecision
+    # TODO: Consider adding an optional precision field using DatePrecision to:
+    # 1. Track precision of the interval (independent or derived from Undate precision)
+    # 2. Support precise interval comparisons and duration calculations
+    # 3. Handle varying precision levels in calendar conversions (e.g., Hijri dates)
src/undate/converters/calendars/hijri/converter.py (1)

47-48: Consider implementing reverse conversion functionality

Supporting conversion from an Undate or UndateInterval back to a Hijri date could enhance the utility of the HijriDateConverter. This feature would allow users to display dates in the Hijri calendar format when needed.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between c83ed49 and 778c67b.

📒 Files selected for processing (13)
  • pyproject.toml (1 hunks)
  • src/undate/converters/calendars/__init__.py (1 hunks)
  • src/undate/converters/calendars/hijri/__init__.py (1 hunks)
  • src/undate/converters/calendars/hijri/converter.py (1 hunks)
  • src/undate/converters/calendars/hijri/hijri.lark (1 hunks)
  • src/undate/converters/calendars/hijri/parser.py (1 hunks)
  • src/undate/converters/calendars/hijri/transformer.py (1 hunks)
  • src/undate/converters/edtf/edtf.lark (1 hunks)
  • src/undate/converters/edtf/parser.py (1 hunks)
  • src/undate/undate.py (1 hunks)
  • tests/test_converters/test_calendars/test_hijri/test_hijri_converter.py (1 hunks)
  • tests/test_converters/test_calendars/test_hijri/test_hijri_parser.py (1 hunks)
  • tests/test_converters/test_calendars/test_hijri/test_hijri_transformer.py (1 hunks)
✅ Files skipped from review due to trivial changes (3)
  • src/undate/converters/calendars/init.py
  • src/undate/converters/calendars/hijri/parser.py
  • src/undate/converters/edtf/parser.py
🧰 Additional context used
🪛 Ruff (0.7.0)
src/undate/converters/calendars/hijri/converter.py

45-45: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

🔇 Additional comments (10)
src/undate/converters/calendars/hijri/__init__.py (2)

3-3: LGTM! Clean public API declaration.

The __all__ declaration explicitly defines the public API, following Python best practices.


1-1: LGTM! Verify the import path.

The import statement follows Python package conventions and aligns with the PR objective.

Let's verify the existence and location of the imported module:

✅ Verification successful

Import path verified and correctly structured

The import statement is valid as:

  • The HijriDateConverter class exists in src/undate/converters/calendars/hijri/converter.py
  • The import path follows the correct package structure
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the existence and location of the HijriDateConverter module

# Check if the converter module exists
fd --type f "converter.py" --exec grep -l "class HijriDateConverter"

# Verify the import path structure
fd --type d "undate/converters/calendars/hijri"

Length of output: 169

tests/test_converters/test_calendars/test_hijri/test_hijri_converter.py (1)

1-6: LGTM! Clean imports and class structure.

The imports are appropriate and the test class follows proper naming conventions.

src/undate/converters/calendars/hijri/hijri.lark (2)

1-2: LGTM: Standard whitespace handling configuration.

The whitespace configuration is correctly set up using Lark's common whitespace import.


1-53: Verify grammar integration and test coverage.

Please ensure that:

  1. All date formats are covered in the test suite
  2. Edge cases are handled (e.g., invalid dates, boundary years)
  3. Integration with the converter handles all grammar productions correctly
✅ Verification successful

Let me gather more specific information about the test coverage and integration.


Let me search for the converter implementation one more time with a different approach.


Let me check the converter implementation to complete the verification.


Grammar integration and test coverage are comprehensive

The verification shows:

  1. Extensive test coverage in test_hijri_parser.py covering all date formats:

    • Single year (e.g., "521")
    • Month-year with variants (e.g., "al-Muḥarram 900", "Safar 581")
    • Day-month-year (e.g., "7 Jumādā I 1243")
    • Year-month and year-month-day formats
  2. Proper integration with converter system:

    • HijriDateConverter properly implements parsing through the grammar
    • Transformer correctly handles all date formats and precision levels
    • Error handling for invalid inputs is implemented
    • Calendar-specific labeling is maintained
  3. Edge cases are handled:

    • Different month name variants (with/without accents)
    • Valid day ranges (1-30)
    • Three-digit year assumption is properly enforced
    • Invalid input detection via UnexpectedCharacters exception
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for test files and patterns
echo "Searching for test files..."
fd -e py test_ -x grep -l "hijri" {}

# Look for test patterns in the codebase
echo "Checking test patterns..."
rg "test.*hijri" -A 5

# Verify converter integration
echo "Checking converter integration..."
ast-grep --pattern 'class $CONVERTER {
  $$$
  parse($INPUT) {
    $$$
  }
  $$$
}'

Length of output: 498


Script:

#!/bin/bash
# Check test file contents for coverage of different date formats
echo "Checking test coverage..."
cat ./tests/test_converters/test_calendars/test_hijri/test_hijri_parser.py

# Check transformer implementation
echo "Checking transformer..."
cat ./tests/test_converters/test_calendars/test_hijri/test_hijri_transformer.py

# Look for the actual converter implementation
echo "Looking for converter class..."
fd -e py -x grep -l "HijriConverter" {}

Length of output: 3628


Script:

#!/bin/bash
# Look for converter implementation using ripgrep
echo "Searching for converter implementation..."
rg -l "class.*HijriConverter"

# Check imports to find converter location
echo "Checking imports..."
rg "from.*hijri.*converter.*import"

# Look for any file containing 'hijri' in the converter directory
echo "Checking converter directory..."
fd -e py . src/undate/converters/calendars/hijri/

Length of output: 663


Script:

#!/bin/bash
# Check the converter implementation
echo "Checking converter implementation..."
cat src/undate/converters/calendars/hijri/converter.py

# Check the parser implementation
echo "Checking parser implementation..."
cat src/undate/converters/calendars/hijri/parser.py

Length of output: 2421

tests/test_converters/test_calendars/test_hijri/test_hijri_transformer.py (3)

1-6: LGTM! Well-organized imports.

The imports are logically grouped and include all necessary components for testing the Hijri date transformer.


53-54: Consider adding precision validation for UndateInterval.

The precision check is only performed for Undate objects, but UndateInterval instances should also maintain precision information.

Let's verify if UndateInterval supports precision:


1-56: Verify test coverage for the Hijri transformer implementation.

The test suite provides good coverage of the basic functionality, but let's verify the coverage metrics.

✅ Verification successful

Test coverage for the Hijri transformer is comprehensive

The test suite provides excellent coverage of the HijriDateTransformer implementation:

  • Tests cover all public methods (hijri_date and month)
  • Includes test cases for all three precision levels (day, month, year)
  • Verifies both single dates (Undate) and date intervals (UndateInterval)
  • Contains real-world examples from Princeton Geniza Project and ISMI data
  • Validates correct date conversions against external reference (muqawwim.com)
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for any untested public methods in the transformer
ast-grep --pattern 'class HijriDateTransformer {
  $$$
  def $method($$$) {
    $$$
  }
  $$$
}'

Length of output: 98


Script:

#!/bin/bash
# Let's first locate the transformer implementation file
fd "transformer.py" --type f

# Then examine its content and the HijriDateTransformer class
rg -A 5 "class HijriDateTransformer" 

# Also check test coverage using test file patterns
fd "test.*transformer.py" --type f

Length of output: 951


Script:

#!/bin/bash
# Let's examine the full implementation of HijriDateTransformer
cat src/undate/converters/calendars/hijri/transformer.py

# And check all test cases in the test file
cat tests/test_converters/test_calendars/test_hijri/test_hijri_transformer.py

Length of output: 4241

pyproject.toml (1)

12-12: LGTM! Dependencies align with PR objectives.

The updated dependencies support the new Hijri calendar conversion functionality:

  • lark[interegular] provides enhanced parsing capabilities for Hijri date formats
  • convertdate enables calendar conversion operations
src/undate/undate.py (1)

445-447: LGTM: Type annotations are clear and consistent

The explicit type annotations for class attributes using Union[Type, None] are well-defined and maintain compatibility with Python versions pre-3.10. They align well with the class usage and will improve type checking for the new Hijri date conversion feature.

Comment on lines +26 to +32
def test_parse_error(self):
# a string we can't parse should raise an error
with pytest.raises(ValueError):
HijriDateConverter().parse("January 2, 1991")
# empty string should also error
with pytest.raises(ValueError):
HijriDateConverter().parse("")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Enhance error handling test coverage.

Consider adding more error test cases:

  1. Malformed Hijri dates (e.g., invalid month names)
  2. Out-of-range values (e.g., invalid day numbers)
  3. Incomplete dates (e.g., month without year)
  4. Mixed calendar formats

Example additions:

def test_parse_error_cases(self):
    """Test various error cases for Hijri date parsing."""
    invalid_cases = [
        "13 Jumādā I 1243",  # Invalid day
        "Foo 495",           # Invalid month
        "Rajab",             # Incomplete date
        "1243/5/7",         # Wrong format
    ]
    for invalid_date in invalid_cases:
        with pytest.raises(ValueError):
            HijriDateConverter().parse(invalid_date)

Comment on lines 1 to 32
import pytest
from undate.converters.calendars import HijriDateConverter
from undate.undate import Undate, UndateInterval


class TestHijriDateConverter:
def test_parse_(self):
# day
date_str = "7 Jumādā I 1243"
date = HijriDateConverter().parse(date_str)
assert date == Undate(1827, 11, 26)
assert date.label == f"{date_str} {HijriDateConverter.calendar_name}"

# month
date_str = "Rajab 495"
date = HijriDateConverter().parse(date_str)
assert date == UndateInterval(Undate(1102, 4, 28), Undate(1102, 5, 27))
assert date.label == f"{date_str} {HijriDateConverter.calendar_name}"

# year
date_str = "441"
date = HijriDateConverter().parse(date_str)
assert date == UndateInterval(Undate(1049, 6, 11), Undate(1050, 5, 31))
assert date.label == f"{date_str} {HijriDateConverter.calendar_name}"

def test_parse_error(self):
# a string we can't parse should raise an error
with pytest.raises(ValueError):
HijriDateConverter().parse("January 2, 1991")
# empty string should also error
with pytest.raises(ValueError):
HijriDateConverter().parse("")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codebase verification

Test coverage needs expansion for ISMI dataset formats

The current test cases don't fully cover the date formats found in the ISMI dataset. Specifically:

  • Missing test for Muharram month format (e.g., "6. Muḥarram 888")
  • No tests for dates with diacritical marks (e.g., "Ḏu al-Ḥijjaẗ")
  • No tests for year ranges in Islamic calendar
  • No tests handling the ±2 days ambiguity mentioned in ISMI data

The test file should be expanded to include these real-world use cases from the ISMI dataset to ensure robust handling of actual manuscript dates.

🔗 Analysis chain

Verify test coverage against real-world data.

The PR objectives mention examples from Princeton Geniza Project (PGP) and ISMI datasets. Let's verify if the current test cases cover the date formats found in these datasets.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for Hijri date patterns in the codebase to find more test cases
# Look for files that might contain example dates from PGP and ISMI datasets

# Search for potential test data files
fd -e json -e csv -e txt -e md . | grep -i -E 'test.*data|pgp|ismi|hijri|dates'

# Search for Hijri date patterns in these files
rg -i "hijri|jumada|rajab|ramadan" 

Length of output: 5539


Script:

#!/bin/bash
# Let's check the ISMI sample data to understand the date formats
cat examples/use-cases/ismi/data/ismi-om4-date-samples.json

# Also check the ISMI README for any documentation about date formats
cat examples/use-cases/ismi/README.md

Length of output: 5394

src/undate/converters/calendars/hijri/hijri.lark Outdated Show resolved Hide resolved
src/undate/converters/edtf/edtf.lark Outdated Show resolved Hide resolved
src/undate/converters/calendars/hijri/transformer.py Outdated Show resolved Hide resolved
src/undate/converters/calendars/hijri/transformer.py Outdated Show resolved Hide resolved
src/undate/converters/calendars/hijri/converter.py Outdated Show resolved Hide resolved
rlskoeser and others added 7 commits November 21, 2024 17:08
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
@robcast
Copy link
Contributor

robcast commented Nov 22, 2024

Should a month or year in another calendar be represented in a single Undate or should it always be a UndateInterval?

I would find it more elegant if non-Gregorian months or years could also be represented as a single Undate. Does that make other Undate functions too complex? The default constructor assumes Gregorian and it would have to be changed.

Does this change what precision=year means? Is it too weird to have an Undate "year" based on the Gregorian calendar where the Gregorian text representation does not align on a year? Maybe we should have a "non-gregorian" flag for cases when you really care about this.

@rlskoeser
Copy link
Member Author

Should a month or year in another calendar be represented in a single Undate or should it always be a UndateInterval?

I would find it more elegant if non-Gregorian months or years could also be represented as a single Undate. Does that make other Undate functions too complex? The default constructor assumes Gregorian and it would have to be changed.

Does this change what precision=year means? Is it too weird to have an Undate "year" based on the Gregorian calendar where the Gregorian text representation does not align on a year? Maybe we should have a "non-gregorian" flag for cases when you really care about this.

@robcast these are fantastic questions and I like where you are going with it. I had been intending to add a calendar property to Undate so that we could be explicit about what calendar the date is in. Can you help me sketch out what it would look like and how it would operate? I agree that this would be much more elegant and powerful.

Here's what I'm thinking based on what you're proposing: when parsing a Hijri date, instead of converting to Gregorian before creating Undate objects, we initialize the Undate with numeric year, month, day values and a calendar. This allows us to keep them a single Undate instead of an UndateInterval, and also preserves the date precision logic. There are lots of benefits to this approach. If we do that, then would we calculate the earliest/latest dates in the Hijri calendar or Gregorian? Or have options for both? Or maybe the comparison logic needs to be calendar aware and we only do the conversion when it matters or is explicitly requested? I'm leaning towards this last option, but let me know what you think!

@rlskoeser
Copy link
Member Author

rlskoeser commented Nov 22, 2024

decisions from discussion with @robcast :

  • revise this code to test out the new approach
    • add a calendar property to undate (default gregorian)
    • initialize undate object with numeric year, month, day and add optional calendar (default gregorian)
    • when initializing with hijri calendar, convert to gregorian for earliest/latest to compare
    • confirm date precision is set as expected
  • 3-digit year is not a reasonable constraint; choose a format (day month year) and document that constraint

@rlskoeser
Copy link
Member Author

@robcast I've updated this PR based on the approach we discussed. I'm really happy with how it's working and how powerful this is going to be. I went ahead and made a minimal Gregorian calendar class for consistency with the pieces of information that could be calendar specific.

To see if this will scale, I took a first pass at implementing a calendar converter for the Hebrew Anno Mundi based on the Hijri implementation and with reference to my previous geniza project code. PR for that is here, if you want to see how it builds on this: #108

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants