A small web app that uses NLTK's Arabic stemming algorithms to identify the roots of Arabic words

Wollaston/ArabicStemmer

ArabicStemmer

A simple web app that allows the user to enter an Arabic word and retrieve stem predictions from three of NLTK's Arabic stemming algorithms.

Example usage of ArabicStemmer

This was created as a learning project to explore Python stemming algorithms for the Arabic language, to experiment with SvelteKit (especially its API functionality), and to try out Node child processes.

How To Use

Enter an Arabic word in the prompt. Submitting the request runs three of NLTK's Arabic stemming algorithms on the word and returns their predictions to the user in table form.

The user can enter words in two ways:

  • Type the word using an Arabic keyboard
  • Type using the Latin script, and use the incorporated Yamli tool to select the transliterated Arabic

How It Works

The application takes the form entry, calls the Python script, and returns the result to the user.

To do this, the web app was scaffolded using SvelteKit.

In this example, the app's API has Node spawn a child process that runs the Python script with NLTK. The script takes the form entry as input and returns the predicted stems as JSON. In this way, the script is coupled with the app for a convenient example, but it can also be easily decoupled and hosted elsewhere for standard API function calls.
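The Python side of this flow can be sketched as follows. This is a minimal illustration, not the script shipped in the repo: the function names, the file name `stem.py`, and the JSON field names (`word`, `stems`, `error`) are assumptions. The NLTK classes `ISRIStemmer`, `ARLSTem`, and `ARLSTem2` are the three stemmers named in this README, imported lazily so the JSON logic can be exercised without NLTK installed.

```python
import json
from typing import Callable, Dict, List, Optional

Stemmers = Dict[str, Callable[[str], str]]


def load_nltk_stemmers() -> Stemmers:
    """Build the three NLTK Arabic stemmers (imported lazily)."""
    from nltk.stem.arlstem import ARLSTem
    from nltk.stem.arlstem2 import ARLSTem2
    from nltk.stem.isri import ISRIStemmer

    return {
        "isri": ISRIStemmer().stem,
        "arlstem": ARLSTem().stem,
        "arlstem2": ARLSTem2().stem,
    }


def build_response(word: str, stemmers: Stemmers) -> str:
    """Run every stemmer on the word and serialize the predictions as JSON.

    json.dumps escapes non-ASCII characters by default, so the output is
    plain ASCII and safe to send through a pipe regardless of its encoding.
    """
    predictions = {name: stem(word) for name, stem in stemmers.items()}
    return json.dumps({"word": word, "stems": predictions})


def main(argv: List[str], stemmers: Optional[Stemmers] = None) -> int:
    """Hypothetical entry point: argv[1] is the word passed by the caller."""
    if len(argv) != 2:
        print(json.dumps({"error": "usage: stem.py <word>"}))
        return 1
    print(build_response(argv[1], stemmers or load_nltk_stemmers()))
    return 0
```

Under these assumptions, a script would end with `sys.exit(main(sys.argv))`, and the Node side would spawn it with the word as an argument, read stdout, and parse it as JSON. Passing the stemmers in as a dictionary also keeps the script easy to decouple from the app later.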

Use the App

In its current basic form, the app requires a few steps to get up and running.

  1. Clone the repo from GitHub (using SSH)
git clone git@github.com:Wollaston/ArabicStemmer.git
  2. Create a virtual environment for working with the Python component of the app
python3 -m venv venv
  3. Activate the virtual environment and install NLTK in it
source venv/bin/activate   # macOS/Linux; on Windows run venv\Scripts\activate
pip install nltk
  4. Install the SvelteKit and Node dependencies
npm install
  5. Launch the app on localhost (the command prints a link with the proper port)
npm run dev

Why Three Predictions?

During experimentation, it became clear that NLTK's existing Arabic stemming algorithms are not perfect, especially at identifying word roots, although they are generally accurate with standard vocabulary.

Therefore, the app provides three predictions so the user can compare them when assessing the accuracy of the responses.

These algorithms are:

  1. ISRI Stemmer
  2. ARLSTem Stemmer
  3. ARLSTem2 Stemmer
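One way to help the user weigh the three predictions is to show where the stemmers agree. The sketch below is illustrative only and is not part of the app: the function names are hypothetical, and the example values are placeholder strings rather than real stemmer output.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_prediction(predictions: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each distinct predicted stem to the stemmers that produced it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for stemmer_name, stem in predictions.items():
        groups[stem].append(stemmer_name)
    return dict(groups)


def agreement_summary(predictions: Dict[str, str]) -> str:
    """Describe how many distinct stems the stemmers returned."""
    distinct = len(group_by_prediction(predictions))
    if distinct == 1:
        return "all stemmers agree"
    return f"{distinct} distinct predictions"


# Placeholder values, not real stemmer output.
example = {"ISRI": "stem_a", "ARLSTem": "stem_b", "ARLSTem2": "stem_b"}
groups = group_by_prediction(example)
summary = agreement_summary(example)
```

A stem returned by two or three of the algorithms may deserve more trust than a stem returned by only one, though agreement alone does not guarantee a correct root.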

Next Steps

  • Explore additional Arabic stemming algorithms and incorporate accordingly
  • Decouple the Python script from the App for efficient hosting options
  • Create a local desktop app for a standalone client
  • Provide the ability to link stemmed responses to a root-based Arabic dictionary and/or provide examples of words based on that stem and root
  • Add further tooling and guidance to the App, for example the Buckwalter Arabic Morphological Analyzer
  • Implement proper error checking
  • Incorporate stem and root verifiers, and warn the user accordingly if the predicted stem does not match an established Arabic stem or root
    • This may be useful when working with roots that are not three letters, or with hamzated/geminated/assimilated words
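As a rough idea of what a root verifier might look like, the sketch below checks a predicted stem against a reference set and returns a warning when it is not found. Everything here is an assumption made for illustration: the name `verify_root`, the warning wording, and the tiny `KNOWN_ROOTS` set, which stands in for a full root dictionary.

```python
from typing import Optional, Set

# Hypothetical reference list; a real verifier would load a full root dictionary.
KNOWN_ROOTS: Set[str] = {"كتب", "درس", "علم"}


def verify_root(predicted: str, known_roots: Set[str] = KNOWN_ROOTS) -> Optional[str]:
    """Return None if the prediction is a known root, else a warning string."""
    if predicted in known_roots:
        return None
    if len(predicted) != 3:
        # Roots that are not three letters long need extra care, so say so.
        return (
            f"'{predicted}' has {len(predicted)} letters and is not in the "
            "reference list; it may not be a root"
        )
    return f"'{predicted}' is not in the reference list of known roots"
```

The length check counts Unicode code points, so input carrying diacritics would need to be normalized first. Hamzated, geminated, and assimilated words would likely need rules beyond a simple set lookup.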