scraper: replace request_page_with_retry with a retry library #105

Open
nicolassaw opened this issue Sep 2, 2024 · 7 comments
@nicolassaw
Contributor

scraper: replace request_page_with_retry with a retry library

@nicolassaw
Contributor Author

I attempted to integrate this retry library with the request_page_with_retry method but hit a wall. The customized retry procedure seems to handle errors (or wait for the request to come back?) differently from the retry library, in a way I can't sort out. The original code takes a lot longer to run and waits for the pages to load, while the retry-library implementation runs super fast and doesn't wait as long. It's unclear to me why.

@newswim Any ideas? Otherwise, I'm going to abandon this ticket because I've sunk some time into it and it's not high priority or necessary for production.

If nobody has ideas after a while, I'll go ahead and close the ticket.

@newswim
Collaborator

newswim commented Sep 13, 2024

I can take a look at it. I'm unfamiliar with it, but I agree it's definitely just a nice-to-have.

Cc @Matt343 just in case there's an obvious thing we're doing wrong.

@nicolassaw
Contributor Author

@Matt343
Collaborator

Matt343 commented Sep 14, 2024

So the issue is that you're calling write_debug_and_quit, which in turn calls sys.exit(), and that is a non-retryable situation. To prevent that from happening, just print the error info and then re-raise the error with raise e instead.
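
A minimal sketch of that distinction, assuming a hand-rolled retry loop rather than any particular retry library; `retry`, `quit_on_failure`, `raise_on_failure`, and `run` are hypothetical names used only for illustration. `sys.exit()` raises `SystemExit`, which derives from `BaseException` rather than `Exception`, so a retry wrapper that catches `Exception` never gets a chance to retry:

```python
import sys


def retry(func, attempts=3):
    """Call func up to `attempts` times, retrying on any Exception."""
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except Exception as e:  # SystemExit is not an Exception subclass
            last_error = e
    raise last_error


calls = {"quit": 0, "raise": 0}


def quit_on_failure():
    # Hypothetical stand-in for a handler that logs and then calls sys.exit().
    calls["quit"] += 1
    try:
        raise ValueError("verification text not found")
    except ValueError as e:
        print(f"debug: {e}")
        sys.exit(1)


def raise_on_failure():
    # Same logging, but the error is re-raised so the retry loop sees it.
    calls["raise"] += 1
    try:
        raise ValueError("verification text not found")
    except ValueError as e:
        print(f"debug: {e}")
        raise e


def run(func):
    """Return the name of the exception type that escapes the retry loop."""
    try:
        retry(func)
    except BaseException as e:
        return type(e).__name__
```

Under these assumptions, `run(quit_on_failure)` ends after a single call because `SystemExit` passes straight through the loop, while `run(raise_on_failure)` uses all three attempts before the `ValueError` surfaces.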

@nicolassaw
Contributor Author

Hmm. I'm not so sure. If write_debug_and_quit were triggering within request_page, you'd see "[verification text, which I think is "Record Count"] couldn't be found in page", and the HTML of the failed page would be saved to the logging folder. Instead, the search returns a results page showing 0 results (I logged the HTML below: it was the proper results page, but empty), and no HTML of a failed page is saved to the logging folder. So the code isn't exactly failing; the results page comes back with 0 results, so no new HTML is written, which is not what the unit test expects. So the unit test is throwing an unexpected assertion error.


It makes me think that the earlier scrape_search_page method within the scraper isn't working correctly because it's running too fast, and that this then messes up the actual search in an odd way that returns 0 on the results page, when it should be ~81 cases for this judicial officer on this date (which is what happens when you run the unit test for main).

HTML example of what we're getting for the results page, with the correct verification text (so it's not really throwing an error to activate write_debug_and_quit):

<html>
  <head>
    <link rel="stylesheet" type="text/css" href="CSS/PublicAccess.css">
  </head>
  
  <body>
    <form id="SearchParameters" action="CourtCalendarSearchResults.aspx" method="post" style="display:none;">
      <input id="SearchType" name="SearchType" type="hidden" value='JUDOFFC'/>   <!-- The type of search this is: by case or by party -->
      <input id="SearchMode" name="SearchMode" type="hidden" value='JUDOFFC'/>   <!-- The specific search mode of SearchType - like Search Case by CaseNumber -->
      <input id="HearingTypeIDs" name="HearingTypeIDs" type="hidden" value=''/>
      <input id="SearchBy" name="SearchBy" type="hidden" value='3'/>
      <input id="NameTypeKy" name="NameTypeKy" type="hidden" value=''/>
      <input id="CaseCategories" name="CaseCategories" type="hidden" value='CR'/>
      <input id="CourtCaseSearchValue" name="CourtCaseSearchValue" value=''/>
      <input id="UseSoundex" name="UseSoundex" type="hidden" value=''/>
      <input id="LastName" name="LastName" type="hidden" value=''/>
      <input id="FirstName" name="FirstName" type="hidden" value=''/>
      <input id="MiddleName" name="MiddleName" type="hidden" value=''/>
      <input id="cboJudOffc" name="cboJudOffc" type="hidden" value=''/>
      <input id="cboMagist" name="cboMagist" type="hidden" value=''/>
      <input id="DateSettingOnAfter" name="DateSettingOnAfter" type="hidden" value='07/01/2024'/>
      <input id="DateSettingOnBefore" name="DateSettingOnBefore" type="hidden" value='07/01/2024'/>
      <input id="SearchParams" name="SearchParams" type="hidden" value=''/>
      <input id="SortType" name="SortType" type="hidden" value=''/>
    </form>    
    <?xml version="1.0" encoding="utf-8"?><table cellspacing="0" cellpadding="0" width="100%" border="0" style="table-layout: fixed; margin:0px; padding:0px;" xmlns:msxsl="urn:schemas-microsoft-com:xslt" xmlns:PublicAccessUser="urn:PublicAccessUser"><tr><td class="ssHeaderTitleBanner">Court Calendar Results</td></tr></table><table cellspacing="0" cellpadding="0" width="100%" border="0" style="table-layout: fixed; margin:0px; padding:0px;" xmlns:msxsl="urn:schemas-microsoft-com:xslt" xmlns:PublicAccessUser="urn:PublicAccessUser"><tr><td bgcolor="#000000" height="20px"><table cellspacing="0" cellpadding="0" width="100%" border="0"><tr><td align="left" style="padding-left: 5px"><font size="1"><a class="ssBlackNavBarHyperlink" href="#MainContent">Skip to Main Content</a>&nbsp;<a class="ssBlackNavBarHyperlink" href="logout.aspx">Logout</a>&nbsp;<a class="ssBlackNavBarHyperlink" href="MyAccount.aspx?ReturnURL=default.aspx">My Account</a>&nbsp;<a class="ssBlackNavBarHyperlink" href="default.aspx">Search Menu</a>&nbsp;<a class="ssBlackNavBarHyperlink" href="Search.aspx?ID=900">New Calendar Search</a>&nbsp;<a class="ssBlackNavBarHyperlink" href="Search.aspx?ID=900&amp;RefineSearch=1">Refine Search</a>&nbsp;</font></td><td align="center" class="ssBlackNavBarLocation"></td><td align="right" style="padding-right: 10px"><table cellspacing="0" cellpadding="0" border="0"><tr><td class="ssBlackNavBarLocation">
                          Location : All Courts</td><td><font size="1"><a class="ssBlackNavBarHyperlink" target="_blank" href="help.aspx">Help</a></font></td></tr></table></td></tr></table></td></tr></table><a id="MainContent" name="MainContent" tabindex="-1" xmlns:msxsl="urn:schemas-microsoft-com:xslt" xmlns:user="http://www.tylertechnologies.com"></a><table border="0" cellpadding="0" cellspacing="0" width="100%" style="table-layout: fixed; font-size: 8pt; font-family: arial" xmlns:msxsl="urn:schemas-microsoft-com:xslt" xmlns:user="http://www.tylertechnologies.com"><tr><td style="width:85px;"><b>Record Count: </b></td><td style="text-align:left;"><b>0</b></td></tr><tr><td id="SearchParamList" colspan="2"></td></tr></table><table border="0" cellpadding="1" cellspacing="0" width="100%" style="table-layout: fixed; font-size: 8pt; font-family: arial" xmlns:msxsl="urn:schemas-microsoft-com:xslt" xmlns:user="http://www.tylertechnologies.com"><col width="25%" /><col width="35%" /><col width="20%" /><col width="20%" /><tr><td colspan="4"><table border="0" cellpadding="0" cellspacing="0" width="100%" style="table-layout: fixed; font-size: 8pt; font-family: arial"><col width="25%" /><col width="35%" /><col width="20%" /><col width="20%" /><tr style="padding-top:5px;"><td colspan="4"><label for="SortBy" style="font-weight:bold;">Sort By  </label><select id="SortBy" name="SortBy" onChange="SwitchCourtSort(this.value)"><option value="CN" selected="true">
            Case Number
          </option><option value="DT">
            Date and Time
          </option><option value="DN">
            Defendant Name
          </option><option value="HT">
            Hearing Type
          </option><option value="JN">
            Judicial Officer Name
          </option><option value="PN">
            Plaintiff Name
          </option></select></td></tr><tr><td nowrap="true"> </td><td nowrap="true"> </td><td nowrap="true"> </td><td nowrap="true"><b>Date</b></td></tr><tr><td nowrap="true"><b>Case Number</b></td><td nowrap="true"></td><td nowrap="true"><b>Judicial Officer</b></td><td nowrap="true"><b>Time</b></td></tr><tr><th class="ssSearchResultHeaderBottom" nowrap="true">Type</th><th class="ssSearchResultHeaderBottom" nowrap="true">Style</th><th class="ssSearchResultHeaderBottom" nowrap="true">Physical Location</th><th class="ssSearchResultHeaderBottom" nowrap="true">Hearing Type</th></tr></table></td></tr><tr height="100"><td colspan="4" align="center"><b>No cases matched your search criteria.</b></td></tr></table>
  </body>
  
  <script language="javascript">
    function SwitchCourtSort(sSortType)
    {
      // Set sort type to form element
      document.getElementById("SortType").value = sSortType;
      
      SearchParameters.submit();
      return true;
    }

  </script>
</html>
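
The verification check passes on a page like this because the text "Record Count" is present even though the count itself is 0. A minimal sketch of telling the two cases apart, assuming the count always appears in a bold cell right after the "Record Count:" label as in the HTML above; `record_count` is a hypothetical helper, not part of the scraper:

```python
import re

# Matches the label cell followed by the bold count cell, as in the logged page.
RECORD_COUNT_RE = re.compile(
    r"Record Count:\s*</b>\s*</td>\s*<td[^>]*>\s*<b>\s*(\d+)\s*</b>",
    re.IGNORECASE,
)


def record_count(html):
    """Return the record count shown on a results page, or None if absent."""
    match = RECORD_COUNT_RE.search(html)
    return int(match.group(1)) if match else None


# Fragment copied from the logged results page.
sample = (
    '<tr><td style="width:85px;"><b>Record Count: </b></td>'
    '<td style="text-align:left;"><b>0</b></td></tr>'
)
```

Here `record_count(sample)` is 0 even though `"Record Count" in sample` is true. Whether a zero count should be logged or treated as retryable depends on whether empty result pages are ever legitimate for a given search.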

@Matt343
Collaborator

Matt343 commented Sep 14, 2024

Oh interesting. I wonder if it's related to removing this sleep(ms_wait / 1000 * (i + 1)) from the beginning of the request function? That would trigger a sleep on the first try, but now we don't do that. That's the only other change I can see that should have any impact, unless I'm missing something obvious :)
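
For reference, a minimal sketch of the delay that expression produces, assuming ms_wait is in milliseconds and i is the zero-based attempt index; `delay_seconds` is a hypothetical helper name. The delay grows linearly and is non-zero even on the first attempt:

```python
def delay_seconds(ms_wait, attempt):
    """Delay in seconds before a zero-based attempt: ms_wait / 1000 * (i + 1)."""
    return ms_wait / 1000 * (attempt + 1)


# With an assumed ms_wait of 500, the first three attempts wait 0.5 s, 1.0 s, 1.5 s.
delays = [delay_seconds(500, i) for i in range(3)]
```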

@nicolassaw
Contributor Author

I tried that and had it sleep for 5 whole seconds before making each request and still no dice. Haunted, perhaps. 👻
