Methods for Semantic Text Segmentation Prior to Generating Text Embeddings (Vectorization)

Anthony
Mar 6, 2023

Popular applications of large language models (LLMs) such as ChatGPT and Flan include knowledge bases and chatbots that answer users’ questions based on corpora of documents. Under the hood, these applications rely on semantically searching text via embeddings stored in vector databases. In this article, we’ll discuss methods of text segmentation/splitting/chunking to improve document retrieval from vector databases based on similarity search functions.

Introduction

A simplified system diagram for ingesting text documents into an LLM knowledge base is as follows.

Ingestion Flow System Diagram

We begin with a denormalized document, which is fed into a chunking algorithm. The chunking algorithm in the example performs sliding window segmentation, which takes the first 3 sentences, groups them, skips 1, and repeats. These chunks are then fed into our embedding model, which generates vectors that we map to our original chunk IDs. The vectors are ultimately stored in our vector database.
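To make the ingestion flow concrete, here is a minimal sketch of the pipeline. It assumes the sentence-transformers package for the embedding model and uses a plain Python dict as a stand-in for the vector database and chunk-ID mapping; split_text is the chunking function we’ll build throughout this article.

from typing import Dict, List

from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 produces 384-dimensional embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")

def ingest(document: str, split_text) -> Dict[int, dict]:
    # Chunk the document, embed each chunk, and map chunk IDs to text + vector
    chunks: List[str] = split_text(document)
    vectors = model.encode(chunks)
    # In production, the vectors go to a vector database and the chunk texts
    # to a relational store, both keyed by the same chunk ID.
    return {
        chunk_id: {"text": chunk, "vector": vector}
        for chunk_id, (chunk, vector) in enumerate(zip(chunks, vectors))
    }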

For completeness, let’s look at a simple querying system diagram.

Querying Flow System Diagram

In the querying flow, we begin by asking a question of our knowledge base. An embedding for the question is generated by the same embedding model, and the vector database is queried using cosine similarity or Euclidean distance. The results are returned ordered by semantic similarity, which lets us retrieve the most similar text chunks from our relational database by chunk ID. These chunks are then inserted into a well-crafted prompt that encourages the LLM to use only the supporting text in its answer. Finally, the LLM answers our question.
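A corresponding sketch of the querying flow, reusing the model and the in-memory store from the ingestion sketch above and computing cosine similarity directly with NumPy (a real vector database would handle this search internally):

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Dot product of the two vectors divided by the product of their norms
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def build_prompt(question: str, store: dict, top_k: int = 3) -> str:
    question_vector = model.encode([question])[0]
    # Rank stored chunks by semantic similarity to the question
    ranked = sorted(
        store.values(),
        key=lambda entry: cosine_similarity(question_vector, entry["vector"]),
        reverse=True,
    )
    supporting_text = "\n\n".join(entry["text"] for entry in ranked[:top_k])
    # The retrieved chunks are inserted into the prompt sent to the LLM
    return (
        "Answer the question using only the supporting text below.\n\n"
        f"{supporting_text}\n\nQuestion: {question}"
    )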

For the knowledge base task, the purpose of the LLM is not to generate new information, but to extract useful content from the sources that are provided in the prompt via a similarity search on the corpus. It is thus important to optimize the procedure of chunk formation prior to vectorization.

In this article, we’ll gradually build and optimize a split_text function in Python, which takes a string argument and returns a list of strings (chunks) to be vectorized.

Naive Methods

1. No segmentation

This method involves simply generating embeddings for an entire document before adding it to the vector store. Since even state-of-the-art embedding models produce fixed-length vectors of modest dimensionality, such as 384 dimensions for all-MiniLM-L6-v2, it’s difficult to capture the semantic content of a large document in a single vector.

from typing import List

def split_text(text: str) -> List[str]:
    return [text]

2. No segmentation, but remove stop words

Stop words are words with little semantic meaning, such as “a,” “the,” “your,” and “by”; spaCy ships with a comprehensive list of English stop words in spacy.lang.en.stop_words.STOP_WORDS. By removing stop words, we can conserve space in the embedding vector for more important concepts. We’ll use the popular spaCy NLP library to remove them.

from typing import List

import spacy

nlp = spacy.load("en_core_web_sm")

def remove_stop_words(text: str) -> str:
    doc = nlp(text)
    # Keep only the tokens that are not stop words
    text_parts = [token.text for token in doc if not token.is_stop]
    return " ".join(text_parts)

def split_text(text: str) -> List[str]:
    return [remove_stop_words(text)]

3. Split at the sentence level

Our similarity search results are improving, but they are still poor when querying massive documents. If we split documents at a finer granularity, such as the sentence level, we can index documents of any length (at the cost of extra computation).

from typing import List

import spacy

nlp = spacy.load("en_core_web_sm")

def remove_stop_words(text: str) -> str:
    doc = nlp(text)
    text_parts = [token.text for token in doc if not token.is_stop]
    return " ".join(text_parts)

def split_sentences(text: str) -> List[str]:
    # spaCy's sentence boundary detection yields one chunk per sentence
    doc = nlp(text)
    return [sent.text for sent in doc.sents]

def split_text(text: str) -> List[str]:
    text_no_stop_words = remove_stop_words(text)
    return split_sentences(text_no_stop_words)

4. Group on multiple sentences

Our next roadblock is that we lose context and semantic meaning when splitting a logical piece of text by sentence. For instance, suppose we were to split the following logical idea by sentence.

“It turns out the dog was not in a cantankerous mood. It was just hungry.”

It would be impossible to determine what mood the dog was really in by reading each sentence out of context. Hence, we can try segmenting sentences into groups of three.

from typing import List

import spacy

nlp = spacy.load("en_core_web_sm")

def remove_stop_words(text: str) -> str:
    doc = nlp(text)
    text_parts = [token.text for token in doc if not token.is_stop]
    return " ".join(text_parts)

def split_sentences(text: str) -> List[str]:
    doc = nlp(text)
    return [sent.text for sent in doc.sents]

def group_sentences(sentences: List[str], size: int) -> List[str]:
    sent_idx = 0
    segments = []
    while sent_idx + size < len(sentences):
        segments.append(" ".join(sentences[sent_idx:sent_idx + size]))
        sent_idx += size
    # Join whatever remains into a final (possibly smaller) segment
    segments.append(" ".join(sentences[sent_idx:]))
    return segments

def split_text(text: str) -> List[str]:
    text_no_stop_words = remove_stop_words(text)
    sentences = split_sentences(text_no_stop_words)
    return group_sentences(sentences, 3)

Better Approaches

1. Sliding window

You’ve probably guessed where this was going. Grouping sentences with no overlap introduces context loss in the same fashion as sentence-level segmentation. One of the most popular workarounds is sliding window segmentation, which I touched on in the introduction. We add an overlap parameter that determines how many sentences two subsequent chunks share. From my experience, grouping by 5 sentences with a 2-sentence overlap works well for English.

from typing import List

import spacy

nlp = spacy.load("en_core_web_sm")

def remove_stop_words(text: str) -> str:
    doc = nlp(text)
    text_parts = [token.text for token in doc if not token.is_stop]
    return " ".join(text_parts)

def split_sentences(text: str) -> List[str]:
    doc = nlp(text)
    return [sent.text for sent in doc.sents]

def group_sentences_with_overlap(sentences: List[str], size: int, overlap: int) -> List[str]:
    sent_idx = 0
    segments = []
    while sent_idx < len(sentences):
        segments.append(" ".join(sentences[sent_idx:sent_idx + size]))
        if sent_idx + size >= len(sentences):
            break  # the last window reached the end of the document
        sent_idx += size - overlap  # slide forward, keeping `overlap` sentences
    return segments

def split_text(text: str) -> List[str]:
    text_no_stop_words = remove_stop_words(text)
    sentences = split_sentences(text_no_stop_words)
    return group_sentences_with_overlap(sentences, 5, 2)

2. Semantic segmentation

Similar to semantic segmentation of images and video, text can be segmented based on meaning. This is a better bucketing technique for vectors because chunks will have sufficiently different meanings and hence significantly different vectors. We can use spaCy to vectorize sentences (using its bundled static word vectors under the hood), and then group consecutive sentences together when their similarity exceeds a threshold. We’ll use a cosine similarity threshold of 0.8, where cosine similarity is the dot product of two vectors divided by the product of their norms. Note that the small en_core_web_sm model ships without word vectors, so a model such as en_core_web_md is needed for meaningful similarity scores.

from typing import List

import spacy

# en_core_web_sm has no word vectors; use a model that ships with them
nlp = spacy.load("en_core_web_md")

def remove_stop_words(text: str) -> str:
    doc = nlp(text)
    text_parts = [token.text for token in doc if not token.is_stop]
    return " ".join(text_parts)

def split_sentences(text: str) -> List[str]:
    doc = nlp(text)
    return [sent.text for sent in doc.sents]

def group_sentences_semantically(sentences: List[str], threshold: float) -> List[str]:
    docs = [nlp(sentence) for sentence in sentences]
    segments = []

    start_idx = 0
    end_idx = 1
    segment = [sentences[start_idx]]
    while end_idx < len(docs):
        # Extend the current segment while sentences stay semantically close
        if docs[start_idx].similarity(docs[end_idx]) >= threshold:
            segment.append(sentences[end_idx])
        else:
            segments.append(" ".join(segment))
            start_idx = end_idx
            segment = [sentences[start_idx]]
        end_idx += 1

    if segment:
        segments.append(" ".join(segment))

    return segments

def split_text(text: str) -> List[str]:
    text_no_stop_words = remove_stop_words(text)
    sentences = split_sentences(text_no_stop_words)
    return group_sentences_semantically(sentences, 0.8)

While there are other text segmentation techniques for specific types of data, such as clausal segmentation for legal datasets and time bucketing for time series, the techniques above work well across a broad range of datasets. They may even be used in tandem if computation and storage constraints allow, as in the sketch below.
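As one illustration of combining techniques, here is a hypothetical split_text that reuses the helpers defined above: it groups sentences semantically first, then re-splits any oversized semantic segment with the sliding window. The max_sentences parameter and the re-splitting step are my own additions, not part of the methods described above.

def split_text(text: str, max_sentences: int = 8) -> List[str]:
    text_no_stop_words = remove_stop_words(text)
    sentences = split_sentences(text_no_stop_words)
    semantic_segments = group_sentences_semantically(sentences, 0.8)

    chunks: List[str] = []
    for segment in semantic_segments:
        segment_sentences = split_sentences(segment)
        if len(segment_sentences) <= max_sentences:
            chunks.append(segment)
        else:
            # Re-split oversized semantic segments with a sliding window so no
            # single chunk exceeds what the embedding model can represent well
            chunks.extend(group_sentences_with_overlap(segment_sentences, 5, 2))
    return chunks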

That’s a wrap for this article, though I’ll continue to provide updates as new methods for text segmentation surface. If you have any questions, feel free to leave them below.

This article was not made in affiliation with any entities, and all content and opinions in this article are my own.
