Day-1:- GenAI NLP (Natural Language Processing)

  User Query

   ↓

FastAPI/Flask (API Backend)

   ↓

NLP Pipeline (Hugging Face Transformers)

   ↓

Vector Database (Pinecone, Weaviate, or FAISS)

   ↓

LLM (OpenAI GPT, Hugging Face, or LLaMA-based models)

   ↓

LangChain (Workflow Orchestration)

   ↓

Post-Processing (Python, Formatting, Export Options)

   ↓

Frontend (Streamlit/React) or API Response
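A minimal sketch of how the API backend layer in the flow above could be wired together, assuming FastAPI; retrieve_context and generate_answer are hypothetical placeholder helpers standing in for the vector-database and LLM/LangChain stages.

# Hypothetical FastAPI endpoint tying the pipeline stages together.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    text: str

def retrieve_context(question: str) -> str:
    # Placeholder for the vector-database lookup (Pinecone/Weaviate/FAISS).
    return "relevant context"

def generate_answer(question: str, context: str) -> str:
    # Placeholder for the LLM / LangChain call.
    return f"Answer to '{question}' using: {context}"

@app.post("/ask")
def ask(query: Query):
    context = retrieve_context(query.text)          # Vector DB step
    answer = generate_answer(query.text, context)   # LLM step
    return {"answer": answer}                       # API response / frontend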


 Roadmap NLP:

Open source libraries:

NLTK:- https://www.nltk.org/

spaCy: https://spacy.io/

- **Text Preprocessing**:


https://www.nltk.org/api/nltk.tokenize.html


Technique types:-

The techniques below are part of classical ML.

1. Tokenization --> breaks our text into meaningful units (tokens) that can later be converted to vectors

  • Stemming
  • Lemmatization
  • Stop words

2. **Bag of Words (BoW)**, **TF-IDF**

3. **Word Embeddings**: Word2Vec, GloVe, and FastText (learn how embeddings improve similarity detection)


The techniques below are part of deep learning and can be achieved with TensorFlow or PyTorch.

4. **Word Embeddings**: Word2Vec, GloVe, and FastText (learn how embeddings improve similarity detection)

5. **Sentence Embeddings**: Sentence-BERT (SBERT), Universal Sentence Encoder


============================================================


Tokenization:-

Tokenization in NLP is the process of breaking down text into smaller, manageable pieces called tokens. Tokens can represent words, subwords, sentences, or even characters, depending on the level of granularity required for the task. Tokenization is essential as it allows NLP models to process and analyze text by breaking it down into understandable units.


  • Corpus -- the whole body of text (the paragraphs)
  • Documents -- sentences
  • Vocabulary -- unique words
  • Words

### **Tokenization in Practice**


- **Tools for Tokenization**:

  - **NLTK**: Provides basic word, sentence, and regex-based tokenizers.

  - **spaCy**: Efficient tokenization for various languages, including handling punctuation, entities, and special cases.

  - **Hugging Face Transformers**: Offers tokenizers tailored for transformer models (e.g., BERT, GPT) with options for WordPiece, BPE, and more (see the sketch after this list).

  - **SentencePiece**: Unsupervised tokenization with subword units, compatible with various models like T5 and XLM.
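A small sketch of subword tokenization with a Hugging Face tokenizer, as mentioned in the list above; "bert-base-uncased" is just an example checkpoint, and the snippet assumes the transformers package is installed (the checkpoint is downloaded on first run).

# Subword (WordPiece) tokenization with a Hugging Face tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("Tokenization converts raw text into subword units.")
print(tokens)   # e.g. ['token', '##ization', 'converts', ...]
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)      # integer IDs the model actually consumes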


---


### **Considerations When Tokenizing Text**


- **Language Dependency**: Tokenization rules differ by language; for instance, Chinese and Japanese require segmentation methods as they lack whitespace.

- **Special Cases**: Handling of numbers, dates, URLs, or abbreviations (like “Dr.” or “U.S.”) varies by tokenizer.

- **Efficiency**: Tokenization methods can significantly impact the speed and memory requirements of NLP models.


---

Tokenization serves as the foundation for various downstream NLP tasks by creating the “input” that models learn from. The choice of tokenization technique impacts model performance and adaptability, especially in language models designed for generative AI.


Ref:- https://www.nltk.org/howto.html

https://jupyter.org/try-jupyter/lab/

import nltk
# nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize

input_text = "Welcome, llfv , kjsfnkv to Educative"
senttest = sent_tokenize(input_text)          # splits the text into sentences
individual_words = word_tokenize(input_text)  # splits the text into individual words

print(individual_words)
print(senttest)

for sentence in senttest:
    print(sentence)

#############Text Preprocessing######################

Stemming:

Stemming reduces related word forms to a common root. For example, in a paragraph that talks about "go", "going", and "goes", the common root is "go"; the process of finding that root word is called stemming.

Types:

  • Porter Stemmer
  • Regex Stemmer
  • Snowball Stemmer -- gives better results compared to the Porter Stemmer

Disadvantages:

It can slightly change the form/meaning of a word, e.g. studying --> studi.

import nltk
# nltk.download('punkt_tab')
from nltk.stem import PorterStemmer
stem_words = ["running", "runs", "runner", "jumped", "jumping", "easily", "happier", "happiness", "children", "child", "studies", "studying", "beautiful", "going", "cars", "fishery", "watched", "watching"]

stemming = PorterStemmer()

# Porter stemmer: strips suffixes to reach a (sometimes non-word) stem
for word in stem_words:
    print(word + "-----> " + stemming.stem(word))


# Snowball gives better stems when compared to the Porter stemmer
from nltk.stem import SnowballStemmer
snowball_stemmer = SnowballStemmer('english')
stem_words = ["running", "runs", "runner", "jumped", "jumping", "easily", "happier", "happiness", "children", "child", "studies", "studying", "beautiful", "going", "cars", "fishery", "watched", "watching"]

for word in stem_words:
    print(snowball_stemmer.stem(word))


Stemming still leaves some issues (the stems are not always valid words); to fix that we need lemmatization.

-----------Text Processing Lemmatization------------------

WordNet Lemmatizer

WordNetLemmatizer.lemmatize(word, pos) uses a specified part of speech (POS) to improve accuracy. A helper such as get_wordnet_pos can convert POS tags from nltk.pos_tag into WordNet's format.

For each word, this approach finds its POS and lemmatizes it accordingly, reducing it to its base form.

Example Output

With POS-aware lemmatization (a short sketch follows below), the words list ["running", "cats", "better", "ate"] would be lemmatized as follows:

Lemmatized Words: ['run', 'cat', 'good', 'eat']

This process is essential for natural language processing, as it helps standardize words into a consistent form for analysis or machine learning tasks.
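A short sketch of the POS-aware lemmatization described above; get_wordnet_pos is the helper mentioned in the text, written here as an assumed implementation (requires the 'wordnet' corpus and the POS tagger data).

# POS-aware lemmatization: map nltk.pos_tag tags to WordNet POS, then lemmatize.
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
# nltk.download('wordnet')
# nltk.download('averaged_perceptron_tagger_eng')

def get_wordnet_pos(treebank_tag):
    # Convert Penn Treebank tags (JJ, VB, RB, NN, ...) into WordNet's format.
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

lemmatizer = WordNetLemmatizer()
words = ["running", "cats", "better", "ate"]
tagged = nltk.pos_tag(words)
lemmas = [lemmatizer.lemmatize(w, get_wordnet_pos(t)) for w, t in tagged]
print("Lemmatized Words:", lemmas)  # ideally ['run', 'cat', 'good', 'eat']; exact POS tags may vary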

# Types:
# 1. WordNetLemmatizer ref: https://www.nltk.org/api/nltk.stem.WordNetLemmatizer.html?highlight=wordnet
# Note: Lemmatization takes more time to process than stemming
import nltk
# nltk.download('wordnet')  # corpus required by WordNetLemmatizer
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()
stem_words = ["running", "runs", "runner", "jumped", "jumping", "easily", "happier", "happiness", "children", "child", "studies", "studying", "beautiful", "going", "cars", "fishery", "watched", "watching"]

for word in stem_words:
    print(word + "-----> " + wnl.lemmatize(word, pos='v'))  # pos: 'v' = verb, 'a' = adjective, 'n' = noun

----------------------------------------------------Stop Words--------------------------------------------------

# Before vectorization we remove stop words; on the remaining words you can apply Porter, Snowball, or lemmatization.
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
snowball_stemmer = SnowballStemmer('english')
# nltk.download('stopwords')  # download the English stop word list
# stopwords.words('english')
paragraph = """I have three visions for India. In 3000 years of our history, people from all over
               the world have come and invaded us, captured our lands, conquered our minds.
               From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,
               the French, the Dutch, all of them came and looted us, took over what was ours.
               Yet we have not done this to any other nation. We have not conquered anyone.
               We have not grabbed their land, their culture,
               their history and tried to enforce our way of life on them.
               Why? Because we respect the freedom of others.That is why my
               first vision is that of freedom. I believe that India got its first vision of
               this in 1857, when we started the War of Independence. It is this freedom that
               we must protect and nurture and build on. If we are not free, no one will respect us.
               My second vision for India’s development. For fifty years we have been a developing nation.
               It is time we see ourselves as a developed nation. We are among the top 5 nations of the world
               in terms of GDP. We have a 10 percent growth rate in most areas. Our poverty levels are falling.
               Our achievements are being globally recognised today. Yet we lack the self-confidence to
               see ourselves as a developed nation, self-reliant and self-assured. Isn’t this incorrect?
               I have a third vision. India must stand up to the world. Because I believe that unless India
               stands up to the world, no one will respect us. Only strength respects strength. We must be
               strong not only as a military power but also as an economic power. Both must go hand-in-hand.
               My good fortune was to have worked with three great minds. Dr. Vikram Sarabhai of the Dept. of
               space, Professor Satish Dhawan, who succeeded him and Dr. Brahm Prakash, father of nuclear material.
               I was lucky to have worked with all three of them closely and consider this the great opportunity of my life.
               I see four milestones in my career"""
sentences = nltk.sent_tokenize(paragraph)  # split the paragraph into sentences

##### Now remove the stop words and then apply stemming
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    words = [snowball_stemmer.stem(word) for word in words if word not in set(stopwords.words('english'))]
    sentences[i] = ' '.join(words)  # join the remaining stems back into a sentence
print(sentences)




import nltk
nltk.download('averaged_perceptron_tagger_eng')  # prerequisite for nltk.pos_tag
nltk.download('maxent_ne_chunker_tab')           # prerequisite for nltk.ne_chunk
# nltk.download('words')                         # also needed by the NE chunker in some NLTK versions
sentence = "A paragraph is a collection of words combined together to make a longer unit than a sentence. It's a set of sentences that are well-organized and coherent"
words = nltk.word_tokenize(sentence)
tag_element = nltk.pos_tag(words)  # tag each word with its part of speech (noun, verb, adjective, ...)
nltk.ne_chunk(tag_element).draw()  # group tagged words into named-entity chunks and draw the tree (opens a GUI window)

==============================Text to vector===========================

Natural Language Processing (NLP) uses various techniques to convert text into vector representations, essential for machine learning models to understand and process text data. Here are some popular methods for text-to-vector conversion:


### 1. **One-Hot Encoding**

   - **Description**: Each word in the vocabulary is represented by a vector with all zeros except for a single "1" at the index corresponding to the word.

   - **Limitations**: Doesn’t capture semantic relationships between words. The vector dimension can become extremely large as the vocabulary grows.
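A minimal sketch of one-hot encoding built by hand, just to make the idea concrete; the tiny vocabulary here is an assumed toy example (real pipelines would typically use a library encoder).

# Build a tiny vocabulary and represent each word as a one-hot vector.
sentence = "the cat sat on the mat"
vocabulary = sorted(set(sentence.split()))       # unique words
word_to_index = {w: i for i, w in enumerate(vocabulary)}

def one_hot(word):
    vec = [0] * len(vocabulary)
    vec[word_to_index[word]] = 1                 # single 1 at the word's index
    return vec

for w in vocabulary:
    print(w, one_hot(w))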


### 2. **Bag of Words (BoW)**

   - **Description**: Represents text as a fixed-size vector, with each element corresponding to the frequency of a word in the document.

   - **Limitations**: Ignores word order and doesn’t capture the contextual meaning.
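A minimal BoW sketch, assuming scikit-learn's CountVectorizer; each row of the resulting matrix holds the word frequencies for one document, with word order discarded.

# Bag of Words with scikit-learn: word counts per document, order ignored.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())   # the learned vocabulary
print(bow.toarray())                        # frequency vectors, one row per document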


### 3. **TF-IDF (Term Frequency-Inverse Document Frequency)**

   - **Description**: Assigns a weight to each word based on its frequency in a document and how common it is across documents. Helps to reduce the influence of frequently occurring, less informative words.

   - **Limitations**: Similar to BoW, it doesn’t capture word order or semantic relationships.
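A similar sketch for TF-IDF, assuming scikit-learn's TfidfVectorizer; words that appear in every document receive a lower weight than rarer, more informative ones.

# TF-IDF with scikit-learn: frequent-but-uninformative words are down-weighted.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))             # weighted vectors, one row per document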


### 4. **Word Embeddings**

   - **Description**: Dense vector representations of words in a continuous vector space, where semantically similar words have similar representations. Examples include:

     - **Word2Vec**: Trained using a neural network model to capture semantic relationships by predicting context words (skip-gram) or the central word (CBOW) in a context.

     - **GloVe** (Global Vectors for Word Representation): Generates word vectors by factorizing the word co-occurrence matrix, aiming to capture global statistical information.

     - **FastText**: Extends Word2Vec by treating each word as a bag of character n-grams, useful for capturing subword information and handling out-of-vocabulary words.

   - **Limitations**: Word embeddings are typically pre-trained on large corpora, and fine-tuning may be required for specific domains.
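A small Word2Vec sketch, assuming the gensim library; with such a tiny toy corpus the vectors are not meaningful, the snippet only shows the API shape (sg=1 selects skip-gram).

# Training a toy Word2Vec model with gensim (skip-gram via sg=1).
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "log"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
print(model.wv["cat"][:5])           # first few dimensions of the word vector
print(model.wv.most_similar("cat"))  # nearest words in the embedding space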


### 5. **Document Embeddings**

   - **Description**: Extends word embeddings to represent entire documents rather than single words. Examples include:

     - **Doc2Vec**: Builds on Word2Vec by training vectors for documents alongside word vectors.

     - **Sent2Vec**: Averages word embeddings to produce a vector for sentences or paragraphs.

   - **Limitations**: Works well for capturing global meanings but may miss out on finer, sentence-level details.
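A minimal Doc2Vec sketch, again assuming gensim; each document gets its own tagged vector trained alongside the word vectors, and unseen text can be embedded with infer_vector.

# Doc2Vec with gensim: learn a vector per document, not just per word.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "sat", "on", "the", "log"]]
tagged = [TaggedDocument(words=d, tags=[str(i)]) for i, d in enumerate(docs)]
model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=20)
print(model.dv["0"][:5])                                       # vector of the first document
new_vec = model.infer_vector(["a", "cat", "on", "a", "mat"])   # embed unseen text
print(new_vec[:5])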

---------------------------------One Hot Encoding----------------------------

