Merge pull request #11 from imohitmayank/nov_23
Selenium and BS4 section added in DStools and more.
imohitmayank authored Nov 15, 2023
2 parents 39c8dfc + 6b2faaa commit bdbf3e8
Showing 6 changed files with 112 additions and 19 deletions.
2 changes: 1 addition & 1 deletion docs/audio_intelligence/whisper.md
@@ -107,7 +107,7 @@ result = model.transcribe("japanese.wav", language="Japanese", task="translate",

!!! Note
Auto language detection only works if you don't specify it explicitly using the `language` param in the `transcribe` function. The package uses only the first 30 seconds to detect the language.
Also, whisper's translation is not that accurate hence an alternative approach could be to perform transcription using Whisper but use [another package](../python/python_snippets.md#machine-translation) to translate the transcription.
Also, Whisper's translation is not that accurate, hence an alternative approach could be to perform the transcription using Whisper but use [another package](../data_science_tools/python_snippets.md#machine-translation) to translate the transcription.

- The package also provides CLI support, here is an example,

6 changes: 5 additions & 1 deletion docs/data_science_tools/python_snippets.md
@@ -298,10 +298,14 @@ pd.DataFrame(result).plot(x='size', y='mean')
# list all supported python versions
conda search python

# create a new conda environment (with new python version)
# create a new global conda environment (with new python version)
# note, py39 is the name of the env
conda create -n py39 python=3.9

# create a new local conda environment
# (under the ./venv folder in the current directory, with a new python version)
# note: a local env is addressed by its path (-p/--prefix), which cannot be combined with -n/--name
conda create -p ./venv python=3.9

# list all of the environments
conda info --envs
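# activate the local environment created above
# (assumes it was created with `conda create -p ./venv ...`)
conda activate ./venv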

117 changes: 103 additions & 14 deletions docs/data_science_tools/scraping_websites.md
@@ -1,28 +1,32 @@
# Scraping websites using Scrapy
# Scraping Websites

## Introduction

- Scraping is the process of traversing and collecting data from the web pages. People scrap websites for many reasons -- be it to get information about companies or fetch latest news or get stock prices informations or just create a dataset for the next big AI model :sunglasses:
- In this article, we will be using [Scrapy](https://docs.scrapy.org) to scrape data from the [Devgan](http://devgan.in/all_sections_ipc.php) website that hosts details of difference sections in [Indian Penal Code](https://en.wikipedia.org/wiki/Indian_Penal_Code) (IPC). I think this example provides the perfect blend of basic and advanced scraping techniques. So let's get started!
- Scraping is the process of traversing and collecting data from web pages. People scrape websites for many reasons -- to get information about companies, fetch the latest news, get stock price information, or just create a dataset for the next big AI model :sunglasses:
- In this article, we will be focusing on two different techniques to scrape websites.
- For static websites, we can use [Scrapy](https://docs.scrapy.org). As an example, we will scrape data from the [Devgan](http://devgan.in/all_sections_ipc.php) website that hosts details of the different sections of the [Indian Penal Code](https://en.wikipedia.org/wiki/Indian_Penal_Code) (IPC). [[Github Code]](https://github.com/imohitmayank/ipc_semantic_search/tree/main/devganscrap)
- For dynamic websites, we can use [Selenium](https://selenium-python.readthedocs.io/getting-started.html) in combination with [BeautifulSoup4](https://pypi.org/project/beautifulsoup4/) (BS4). As an example, we will scrape data from Google search results. In short, Selenium is an open-source tool for browser automation, and it will be used to automate the process of opening the browser and loading the website *(as dynamic websites populate the data only once the page is completely loaded)*. For extracting the data from the loaded page, we will use BS4.

!!! Tip
The complete code is also available on [GitHub](https://github.com/imohitmayank/ipc_semantic_search/tree/main/devganscrap)
!!! Warning
This article is purely for educational purposes. I would highly recommend reviewing a website's Terms of Service (ToS) or getting the website owner's permission before scraping it.

## Understanding the website
## Scraping Static Websites using Scrapy

- Before we even start scraping, we need to understand the structure of the website. This is very important, as we want to (1) get an idea of what we want to scrap and (2) where those data are located.
### Understanding the website

- Before we even start scraping, we need to understand the structure of the website. This is very important, as we want to (1) get an idea of what we want to scrape, and (2) find out where that data is located.

<figure markdown>
![](../imgs/scrapy_devgan.png)
<figcaption>The flow of scraping section descriptions from Devgan website</figcaption>
</figure>

- Our goal is to scrap the description for each section in IPC. As per the website flow above, the complete process can be divided into two parts,
- Our goal is to scrape the description of each section in the IPC. As per the website flow above, the complete process can be divided into two parts,
- First, we need to go to the main page and extract the link of each section.
- Next, we need to visit each individual section page and extract the description present there.


## Data extraction methods
### Data extraction methods

- Now, let's also look into the different methods exposed by Scrapy to extract data from web pages. The basic idea is that Scrapy downloads the web page's HTML source code and parses it using different parsers. For this, we can use either `XPath` or `CSS` selectors. The choice is purely up to us.
- We can begin with the main page and try to find the link of each section. For this, you can open the inspect option in your browser by right-clicking on any of the sections and selecting `inspect`. This should show you the source code. Next, try to find the position of the tag where each section is defined. Refer to the image below, and you can see that each section is within an `<a>` tag inside the `<div id="content">` tag. The `href` attribute gives you the link of the section and the `<span>` tag inside gives the section name *(see the quick `scrapy shell` sketch below)*.
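- As a quick illustration, you can test such selectors interactively in `scrapy shell` *(a minimal sketch; the exact queries are assumptions based on the structure described above)*,

``` python
# run `scrapy shell "http://devgan.in/all_sections_ipc.php"` and try the following;
# the selectors are illustrative assumptions based on the structure discussed above

# CSS: every <a> tag inside <div id="content">
for section in response.css("div#content a"):
    link = section.attrib["href"]            # link of the section
    name = section.css("span::text").get()   # section name from the inner <span> tag

# the same extraction using XPath
links = response.xpath('//div[@id="content"]/a/@href').getall()
names = response.xpath('//div[@id="content"]/a/span/text()').getall()
```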
@@ -41,7 +45,7 @@
You can refer to the [Scrapy official doc](https://docs.scrapy.org/en/latest/intro/tutorial.html#extracting-data) for more details on creating `CSS` or `XPath` queries.


## Setup Scrapy project
### Setup Scrapy project

- First, let us install the Scrapy package using pip. It can be easily done by running the following command in the terminal: `pip install scrapy`. Do make sure to create your own virtual environment (VE), activate it, and then install the package in that environment. If you have any confusion regarding VEs, refer to my [snippets](http://mohitmayank.com/a_lazy_data_science_guide/python/python_snippets/#conda-cheat-sheet) on the same topic.
- Next, let us set up the Scrapy project. Go to your directory of choice and run the command `scrapy startproject tutorial`. This will create a project with the folder structure shown below. We can ignore most of the files created here; our main focus will be on the `spiders/` directory.
Expand All @@ -63,9 +67,9 @@ tutorial/
The above folder structure is taken from the [Scrapy Official Tutorial](https://docs.scrapy.org/en/latest/intro/tutorial.html#)


## Create your Spider
### Create your Spider

- Usually we create one spider to scrap one website. For this example we will do exactly the same for Devgan website. So let's create a spider `spiders/devgan.py`. The code is shown below,
- Usually, we create one spider to scrape one website. For this example, we will do exactly that for the Devgan website. So let's create a spider in `spiders/devgan.py`. The code is shown below,

``` python linenums="1"
# import
@@ -115,7 +119,7 @@ class DevganSpider(scrapy.Spider):
- `Line 31-35:` We extract the description from the section page using the `CSS` query. We add the description to the metadata and return the complete record that is to be persisted.
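- Putting the pieces together, here is a condensed sketch of such a two-step spider *(a minimal illustration only -- the full version lives in the linked repository, and the exact CSS queries are assumptions based on the page structure discussed earlier)*,

``` python
# a minimal two-step spider sketch; the selectors are illustrative assumptions
import scrapy

class DevganSpider(scrapy.Spider):
    name = "devgan"
    start_urls = ["http://devgan.in/all_sections_ipc.php"]

    def parse(self, response):
        # step 1: collect the link and name of every section from the main page
        for section in response.css("div#content a"):
            item = {
                "link": response.urljoin(section.attrib["href"]),
                "section": section.css("span::text").get(),
            }
            yield response.follow(section, callback=self.parse_section, cb_kwargs={"item": item})

    def parse_section(self, response, item):
        # step 2: extract the description from the individual section page
        item["description"] = " ".join(response.css("div#content ::text").getall()).strip()
        yield item
```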


## Executing the spider
### Executing the spider

- To run the spider, traverse to the root directory and execute the following command: `scrapy crawl devgan -O sections_details.csv -t csv`. Here, `devgan` is the name of the spider we created earlier, `-O` sets the output file name to `sections_details.csv`, and `-t` defines the output format as `csv`. This will create a csv file with all details of the sections as separate columns, as shown below *(only 2 rows shown)*

@@ -124,4 +128,89 @@ class DevganSpider(scrapy.Spider):
| IPC Section 1... | http://... | Section 1 | This Act shall be called the Indian Penal Code... |
| IPC Section 2... | http://... | Section 2 | Every person shall be liable to punishment un.... |

And that's it! Cheers! :smile:
And that's it! Cheers! :smile:

## Scraping Dynamic Websites using Selenium and BS4

### Understanding the website

- Well, everyone knows and has used Google at least once in their life. Nevertheless, as an example, if we want to find all of the latest news from TechCrunch, this is how the Google search will look.

<figure markdown>
![](../imgs/dt_scrapwebsite_googlesearch.png)
<figcaption>Google search result shows the news for the 14th of Nov, 2023</figcaption>
</figure>

- On looking at the page source of the above screen, you will only see JavaScript code that does not contain any of the data shown above. This is because Google is a dynamic website, where the content is populated by JavaScript after the initial page load. Because of this, we cannot use Scrapy alone, as it cannot run JavaScript; what we need is a browser to actually render the page. That's where Selenium and BS4 come in *(a small comparison sketch is shown below)*.
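- To see the difference for yourself, here is a tiny check *(an illustrative sketch, not part of the original walkthrough; the query URL is just an example)* that compares the raw HTML returned by a plain HTTP request with the page source a real browser produces after the JavaScript has run,

``` python
# illustrative check: raw HTML from an HTTP request vs. browser-rendered HTML
import requests
from selenium import webdriver

# example query URL -- any dynamic page will do
url = "https://www.google.com/search?q=site:https://techcrunch.com/2023/11/14/"

# raw HTML as returned by the server to a plain HTTP client
raw_html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text

# HTML after the browser has loaded the page and run its JavaScript
driver = webdriver.Chrome()
driver.get(url)
rendered_html = driver.page_source
driver.quit()

# compare the two to see how much of the visible data only appears after rendering
print(len(raw_html), len(rendered_html))
```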

### Selenium and BS4 Automation

- We will code a generic function to automate the process of opening the Chrome browser, loading the website, and extracting the data from it. To run it for our TechCrunch example, we just need to change the input param.

!!! Note
Before starting, make sure to install the Selenium package and drivers as [explained here](https://selenium-python.readthedocs.io/getting-started.html).

``` python linenums="1"
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import csv
import time

# Set up the Selenium driver (make sure you have the Chrome WebDriver installed)
options = Options()
options.add_argument("--window-size=1920,1200")
driver = webdriver.Chrome(options=options)

# Function to scrape Google using Selenium and BeautifulSoup
def scrape_google(search_query, num_pages, start_page=0):
    results = []
    for page in range(0, num_pages):
        start = (page * 10) + start_page * 10
        url = f"https://www.google.com/search?q={search_query}&start={start}"
        driver.get(url)
        time.sleep(2)  # Sleep to ensure all scripts are loaded properly

        # loading and processing the page source in BS4
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        search_items = soup.find_all('div', class_='g')

        # iterate over all items (search results)
        for item in search_items:
            title = item.find('h3')
            link = item.find('a', href=True)
            description = item.get_text(separator=' ', strip=True)  # full text of the result block as a rough description
            if title and link:
                results.append({
                    'title': title.get_text(),
                    'link': link['href'],
                    'description': description if description else ""
                })
        save_results_to_csv(results, f'google_search_results_{page}.csv')

    driver.quit()
    return results

# Save results to CSV
def save_results_to_csv(results, filename):
    with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = ['title', 'link', 'description']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for result in results:
            writer.writerow(result)

# Use the function to scrape and save results
search_results = scrape_google("site:https://techcrunch.com/2023/11/14/", 10)
```

Let's understand the code in detail,

- `Line 1-6`: Importing the required packages.
- `Line 8-11`: Sets up the Chrome options and initializes the Chrome WebDriver with them.
- `Line 14-40`: Defines a function to scrape Google. It takes a search query, the number of pages to scrape, and an optional starting page. Inside the function, we iterate over the number of pages specified, constructing a URL for each page of Google search results based on the query and the current page. The WebDriver is used to navigate to the URL. Then, we use BS4 to parse the page source and extract the title, link, and description of each search result, appending them to the `results` list. At the end of each iteration, we save the accumulated results to a CSV. Finally, we close the driver and return `results`.
- `Line 42-49`: Defines `save_results_to_csv` function to save the scraped results to a CSV file. It uses Python's `csv` module to write the title, link, and description of each search result to a CSV file.
- `Line 52`: We call the `scrape_google` function to scrape the first 10 pages of Google search results for the specified query.

And we are done! :wave:
Binary file added docs/imgs/dt_scrapwebsite_googlesearch.png
2 changes: 1 addition & 1 deletion docs/machine_learning/model_compression.md
@@ -65,7 +65,7 @@
#### Response Based Knowledge

- Here, we define the final layer output of the teacher model as the knowledge, so the idea is to train a student model that will mimic the final prediction of the teacher model. For example, for a cat vs dog image classification model, if the teacher model classifies an image as 'Cat', a good student model should also learn from the same classification and vice-versa.
- Now the final predictions could also be of multiple types - logits *(the model output)*, soft targets *(the class probabilities)* or hard targets *(the class enums)* ([refer](deep_learning_terms.md#logits-soft-and-hard-targets)). Developers can select any of the prediction types, but **usually soft targets are preferred**, as they contain more information than hard target and are not as specific or architecture dependent as logits.
- Now the final predictions could also be of multiple types - logits *(the model output)*, soft targets *(the class probabilities)* or hard targets *(the class enums)* ([refer](interview_questions.md#what-is-the-difference-between-logits-soft-and-hard-targets)). Developers can select any of the prediction types, but **usually soft targets are preferred**, as they contain more information than hard targets and are not as specific or architecture-dependent as logits.
- Technically, we first predict the responses of the student and teacher models on a sample, then compute the distillation loss on the difference between the logit *(or other prediction)* values generated by both. The distillation loss is denoted by $L_R(z_t, z_s)$, where $L_R()$ denotes the divergence loss and $z_t$ and $z_s$ denote the logits of the teacher and student models respectively. In the case of soft targets, we compute the probability of the logits of one class w.r.t. the other classes using the softmax function,

$$p(z_i, T) = \frac{\exp(z_i/T)}{\sum_{j} \exp(z_j/T)}$$
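- As a small numerical sketch *(illustrative only -- the logits and the temperature below are made-up values)*, the softened probabilities and a divergence-style distillation loss can be computed as follows,

``` python
# illustrative sketch of response-based distillation with soft targets;
# the logits and the temperature T are made-up values for demonstration
import numpy as np

def softmax_with_temperature(z, T):
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

z_teacher = [4.0, 1.0, 0.2]      # teacher logits
z_student = [2.5, 1.5, 0.5]      # student logits
T = 2.0                          # temperature > 1 softens the distributions

p_t = softmax_with_temperature(z_teacher, T)
p_s = softmax_with_temperature(z_student, T)

# KL divergence between the soft targets acts as the distillation loss L_R
distillation_loss = np.sum(p_t * np.log(p_t / p_s))
print(p_t, p_s, distillation_loss)
```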
4 changes: 2 additions & 2 deletions mkdocs.yml
@@ -90,6 +90,7 @@ nav:
- 'Data-to-Text Generation': 'natural_language_processing/data_to_text_generation.md'
- 'Named Entity Recognition' : 'natural_language_processing/named_entity_recognition.md'
- 'Natural Language Querying': 'natural_language_processing/nlq.md'
- 'Retrieval Augmented Generation (RAG)' : 'natural_language_processing/rag.md'
# - 'Techniques':
# - 'natural_language_processing/metrics.md'
- 'Blogs':
@@ -126,8 +127,7 @@ nav:
- 'data_science_tools/python_good_practices.md'
- 'data_science_tools/version_control.md'
- 'data_science_tools/compute_and_ai_services.md'
- 'Blogs':
- 'data_science_tools/scraping_websites.md'
- 'data_science_tools/scraping_websites.md'

- 'Machine Learning':
- 'machine_learning/introduction.md'
