Run scraper on Google Cloud #238

Open
11 tasks
Ghemechis opened this issue Jul 2, 2024 · 0 comments
Labels: data pipeline (items related to the scrapers of the data pipeline), sprint-12 (items assigned to sprint 12)

Ghemechis commented Jul 2, 2024

Item type: data pipeline

**Description:** Run a web scraper on Google Cloud Platform (GCP) to capture and store text data scraped from websites. The goal is to gather scraped text efficiently and persist it on Google Cloud, with the scraper deployed in a Docker container. The scraped text files will be stored on Google Cloud and used for prompt engineering within the LangChain framework, supporting AI agent response generation.
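The storage step described above could be sketched roughly as follows. This is only an illustration, not the project's implementation: the names `TextSink`, `GcsTextSink`, `InMemorySink`, `store_page`, and `object_name_for`, as well as the hash-based object naming scheme, are assumptions made for this example. The GCS variant assumes the `google-cloud-storage` package is installed and that credentials are available in the container environment.

```python
import hashlib
from typing import Protocol


class TextSink(Protocol):
    """Anything that can persist a named piece of text (hypothetical interface)."""

    def write(self, name: str, text: str) -> None: ...


def object_name_for(url: str, prefix: str = "scraped") -> str:
    """Derive a stable object name from a URL (naming scheme is an assumption)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return f"{prefix}/{digest}.txt"


class GcsTextSink:
    """Writes text objects to a Google Cloud Storage bucket."""

    def __init__(self, bucket_name: str) -> None:
        # Imported lazily so the rest of the module works without the package.
        from google.cloud import storage

        self._bucket = storage.Client().bucket(bucket_name)

    def write(self, name: str, text: str) -> None:
        blob = self._bucket.blob(name)
        blob.upload_from_string(text, content_type="text/plain")


class InMemorySink:
    """Stand-in sink for local runs and tests; keeps objects in a dict."""

    def __init__(self) -> None:
        self.objects: dict[str, str] = {}

    def write(self, name: str, text: str) -> None:
        self.objects[name] = text


def store_page(sink: TextSink, url: str, text: str) -> str:
    """Persist the scraped text for one URL and return the object name used."""
    name = object_name_for(url)
    sink.write(name, text)
    return name
```

Keeping the sink behind a small interface lets the scraper be exercised locally with `InMemorySink` and switched to `GcsTextSink("<bucket-name>")` inside the deployed container, where the bucket name is a placeholder to be supplied by the team.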

User Story

  • As a developer,
  • I want to run the scraping functionality on Google Cloud Platform
  • so that I can collect text data from various sources and store it in Google Cloud, where it will be used for prompt engineering in the LangChain framework to generate responses for AI agents.

Acceptance Criteria

  • Deploy the scraping functionality on Google Cloud Platform to perform automated data extraction from multiple sources.
  • Ensure the scraper handles large-scale scraping operations reliably and efficiently.
  • Configure the scraper to save scraped text files directly to Google Cloud Storage (GCS).
  • Implement batch processing or streaming capabilities as needed to manage large volumes of data.
  • Implement error handling and retry mechanisms to ensure robust performance and data integrity.
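The batch processing and retry criteria above could be approached along these lines. This is a minimal sketch under stated assumptions: `batched`, `with_retries`, and `scrape_in_batches` are hypothetical helper names, the `fetch` callable stands in for whatever the scraper actually uses to download a page, and exponential backoff is one possible retry policy rather than a decided one.

```python
import time
from itertools import islice
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[list[T]]:
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def with_retries(
    func: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `func`, retrying on any exception with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last error.
            sleep(base_delay * 2**attempt)
    raise AssertionError("unreachable")


def scrape_in_batches(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    batch_size: int = 10,
    attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[tuple[dict[str, str], list[str]]]:
    """Yield (results, failed_urls) per batch so each batch can be persisted
    before the next one starts."""
    for batch in batched(urls, batch_size):
        results: dict[str, str] = {}
        failed: list[str] = []
        for url in batch:
            try:
                results[url] = with_retries(
                    lambda: fetch(url), attempts=attempts, sleep=sleep
                )
            except Exception:
                failed.append(url)  # Recorded for a later re-run.
        yield results, failed
```

Yielding one batch at a time keeps memory bounded for large URL lists, and returning the failed URLs separately makes it possible to re-queue them instead of losing them silently.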

Definition of Done

  • The feature has been fully implemented.
  • The feature has been manually tested and works as expected without critical bugs.
  • The feature code is documented with clear explanations of its functionality and usage.
  • The feature code has been reviewed and approved by at least one team member.
  • The feature branches have been merged into the main branch and closed.
  • The feature's purpose, functionality, and usage have been documented in the project wiki on GitHub.
@Ghemechis Ghemechis converted this from a draft issue Jul 2, 2024
@Ghemechis Ghemechis moved this from Product Backlog to Sprint Backlog in amos2024ss06-feature-board Jul 2, 2024
@Ghemechis Ghemechis moved this from Sprint Backlog to Product Backlog in amos2024ss06-feature-board Jul 2, 2024
@tubamos tubamos added this to the Part A: Data acquisition milestone Jul 3, 2024
@tubamos tubamos added the data pipeline and sprint-11 labels Jul 3, 2024
@Ghemechis Ghemechis moved this from Product Backlog to Sprint Backlog in amos2024ss06-feature-board Jul 3, 2024
@tubamos tubamos moved this from Sprint Backlog to Product Backlog in amos2024ss06-feature-board Jul 9, 2024
@tubamos tubamos moved this from Product Backlog to Sprint Backlog in amos2024ss06-feature-board Jul 10, 2024
@tubamos tubamos moved this from Sprint Backlog to Product Backlog in amos2024ss06-feature-board Jul 10, 2024
@tubamos tubamos moved this from Product Backlog to Sprint Backlog in amos2024ss06-feature-board Jul 10, 2024
@preetvadaliya preetvadaliya self-assigned this Jul 10, 2024
@tubamos tubamos added the sprint-12 label and removed the sprint-11 label Jul 14, 2024
@tubamos tubamos moved this from Sprint Backlog to Awaiting Review in amos2024ss06-feature-board Jul 16, 2024
@tubamos tubamos moved this from Awaiting Review to In Progress in amos2024ss06-feature-board Jul 17, 2024
Projects
Status: In Progress
Development

No branches or pull requests

3 participants