TextSegmentation

Documentation for TextSegmentation.

TextSegmentation.Utils.calculate_cosin_similarityMethod
calculate_cosin_similarity(elements_dct_1, elements_dct_2) -> Float64

Calculates the cosine similarity between two dictionaries.

Arguments

  • elements_dct_1: Token dictionary contained in the left block including the reference sentence.
  • elements_dct_2: Token dictionary contained in the block to the right of the reference sentence.
source
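As a rough illustration of the computation described above (not the package's exact implementation; the helper name `dict_cosine` is hypothetical), cosine similarity over two token-count dictionaries treats each dictionary as a sparse vector:

```julia
# Illustrative sketch: cosine similarity between two token-count
# dictionaries, treating them as sparse vectors over their keys.
function dict_cosine(d1, d2)
    # Dot product over the keys of d1 (missing keys in d2 count as 0).
    dot = sum(get(d2, k, 0) * v for (k, v) in d1)
    norm1 = sqrt(sum(v^2 for v in values(d1)))
    norm2 = sqrt(sum(v^2 for v in values(d2)))
    return dot / (norm1 * norm2)
end

d1 = Dict("the" => 2, "cat" => 1)
d2 = Dict("the" => 1, "dog" => 1)
dict_cosine(d1, d2)  # 2 / sqrt(10) ≈ 0.632
```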
TextSegmentation.Utils.count_elementsMethod
count_elements(sequence) -> Dict{String, Int64}

Counts the occurrences of each token in a word-segmented sentence.

Arguments

  • sequence: A word-segmented sequence.
source
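A minimal sketch of this kind of counting, assuming the input is an already word-segmented sequence (a vector of token strings); `count_tokens` is an illustrative name, not the package's function:

```julia
# Illustrative sketch: count occurrences of each token in a
# word-segmented sequence.
function count_tokens(sequence)
    counts = Dict{String, Int}()
    for token in sequence
        counts[token] = get(counts, token, 0) + 1
    end
    return counts
end

count_tokens(["the", "cat", "sat", "the"])  # Dict("the" => 2, "cat" => 1, "sat" => 1)
```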
TextSegmentation.Utils.merge_elementsMethod
merge_elements(dct)

Merges token-count dictionaries produced per sentence, accumulating the count for each token.

Arguments

  • dct: A dictionary with the number of elements counted for each token.
source
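The merge step can be sketched as follows (an illustration, not the package's code; `merge_counts` is a hypothetical name): counts for the same token are summed across dictionaries.

```julia
# Illustrative sketch: merge several token-count dictionaries into one,
# summing the counts for each shared token.
merge_counts(dicts) = reduce((a, b) -> mergewith(+, a, b), dicts)

merge_counts([Dict("a" => 1), Dict("a" => 2, "b" => 1)])  # Dict("a" => 3, "b" => 1)
```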
TextSegmentation.Utils.tokenizeMethod
tokenize(sentence) -> Vector{SubString{String}}

Performs preprocessing such as removing symbols, converting uppercase letters to lowercase, and word segmentation.

Arguments

  • sentence: A sentence in a document.
source
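A rough sketch of the preprocessing just described: lowercase, strip non-alphanumeric symbols, and split on whitespace. The exact symbol handling in Utils.tokenize may differ; `simple_tokenize` is an illustrative stand-in.

```julia
# Illustrative sketch: lowercase, remove symbols, split on whitespace.
function simple_tokenize(sentence)
    cleaned = replace(lowercase(sentence), r"[^a-z0-9\s]" => "")
    return split(cleaned)
end

simple_tokenize("The cat, and the Dog!")  # ["the", "cat", "and", "the", "dog"]
```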
TextSegmentation.TextTiling.SegmentObjectType
TextTiling.SegmentObject(window_size, do_smooth, smooth_window_size, tokenizer)

TextTiling is a method for finding segment boundaries based on lexical cohesion and similarity between adjacent blocks.

Arguments

  • window_size: Sliding window size.
  • do_smooth: If true, the depth scores are smoothed.
  • smooth_window_size: Window size for smoothing depth scores.
  • tokenizer: Tokenizer for word segmentation.
source
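The depth-score idea behind TextTiling can be sketched as follows (a simplified illustration, not the package's code): for each gap between adjacent blocks, depth measures how far the gap's similarity score sits below the nearest peaks on both sides; deep valleys are candidate boundaries.

```julia
# Simplified TextTiling depth-score sketch (illustrative only).
function depth_scores(scores)
    n = length(scores)
    depths = zeros(n)
    for i in 1:n
        # Climb left and right while scores keep rising to find local peaks.
        left = scores[i]
        for j in i-1:-1:1
            scores[j] >= left || break
            left = scores[j]
        end
        right = scores[i]
        for j in i+1:n
            scores[j] >= right || break
            right = scores[j]
        end
        depths[i] = (left - scores[i]) + (right - scores[i])
    end
    return depths
end

depth_scores([0.8, 0.2, 0.7])  # the middle gap has the largest depth
```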
TextSegmentation.TextTiling.segmentFunction
TextTiling.segment(seg, document, [num_topics]) -> String

Splits the document passed in the document argument into segments.

Arguments

  • seg: Segment object.
  • document: The document to be text segmented.
  • num_topics: The number of topics in the document. If this value is specified, segment boundaries are determined according to num_topics, starting from the highest depth score.

Examples

using TextSegmentation

# document is assumed to be a vector of sentences prepared beforehand.
window_size = 2
do_smooth = false
smooth_window_size = 1
num_topics = 3
tt = TextTiling.SegmentObject(window_size, do_smooth, smooth_window_size, Utils.tokenize)
result = TextTiling.segment(tt, document, num_topics)
println(result)
00010001000
source
TextSegmentation.C99.SegmentObjectType
C99.SegmentObject(window_size, similarity_matrix, rank_matrix, sum_matrix, std_coeff, tokenizer)

C99 is a method for determining segment boundaries through segmented clustering.

Arguments

  • window_size: Specifies the range of adjacent sentences referenced when building the rank matrix.
  • similarity_matrix: Matrix of cosine similarities computed between sentences.
  • rank_matrix: Each value in the similarity matrix is replaced by its rank in the local region, where a rank is the number of neighboring elements with a lower similarity score.
  • sum_matrix: Sum of the rank matrix over the segment region from i to j.
  • std_coeff: Coefficient for the threshold that determines segment boundaries, where μ and v are the mean and variance of the gradient δD(n) of the inside density.
  • tokenizer: Tokenizer for word segmentation.
source
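The rank replacement described above can be sketched like this (an illustration with simple boundary clipping; C99's exact window handling may differ, and `rank_matrix` here is a hypothetical helper):

```julia
# Illustrative rank-matrix sketch: each entry becomes the number of
# entries in its local neighborhood with a strictly lower similarity.
function rank_matrix(sim, window)
    n = size(sim, 1)
    rank = zeros(Int, n, n)
    for i in 1:n, j in 1:n
        for a in max(1, i - window):min(n, i + window), b in max(1, j - window):min(n, j + window)
            rank[i, j] += sim[a, b] < sim[i, j]
        end
    end
    return rank
end

rank_matrix([1.0 0.5; 0.5 0.2], 1)  # entry (1,1) outranks its 3 neighbors
```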
TextSegmentation.C99.segmentMethod
C99.segment(seg, document, n) -> String

Splits the document passed in the document argument into segments.

Arguments

  • seg: Segment object.
  • document: The document to be text segmented.
  • n: Document length (the number of sentences).

Examples

using TextSegmentation

# document is assumed to be a vector of sentences prepared beforehand.
n = length(document)
window_size = 2
std_coeff = 1.2

# Pass independent matrices so the similarity, rank, and sum matrices
# do not alias the same underlying array.
c99 = C99.SegmentObject(window_size, zeros(n, n), zeros(n, n), zeros(n, n), std_coeff, Utils.tokenize)
result = C99.segment(c99, document, n)
println(result)
00010001000
source
TextSegmentation.TopicTiling.SegmentObjectType
TopicTiling.SegmentObject(window_size, do_smooth, smooth_window_size, lda_model, dictionary)

TopicTiling is an extension of TextTiling that uses the topic IDs of words in a sentence to calculate the similarity between blocks.

Arguments

  • window_size: Sliding window size.
  • do_smooth: If true, the depth scores are smoothed.
  • smooth_window_size: Window size for smoothing depth scores.
  • lda_model: Trained LDA topic model.
  • dictionary: A gensim dictionary mapping words to IDs.
source
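The block comparison TopicTiling performs can be sketched as follows (an illustration, not the package's code): each word in a block is represented by its LDA topic ID, blocks become topic-ID count vectors, and the vectors are compared with cosine similarity. The topic IDs below are made up for the example.

```julia
# Illustrative sketch: compare two blocks by the LDA topic IDs of their
# words, using cosine similarity over topic-ID counts.
function topic_block_similarity(ids_left, ids_right)
    topic_counts(ids) = begin
        c = Dict{Int, Int}()
        for t in ids
            c[t] = get(c, t, 0) + 1
        end
        c
    end
    c1, c2 = topic_counts(ids_left), topic_counts(ids_right)
    dot = sum(get(c2, k, 0) * v for (k, v) in c1)
    return dot / (sqrt(sum(v^2 for v in values(c1))) * sqrt(sum(v^2 for v in values(c2))))
end

topic_block_similarity([1, 1, 2], [1, 2, 2])  # 0.8
```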
TextSegmentation.TopicTiling.segmentFunction
TopicTiling.segment(seg, document, [num_topics]) -> String

Splits the document passed in the document argument into segments.

Arguments

  • seg: Segment object.
  • document: The document to be text segmented.
  • num_topics: The number of topics in the document. If this value is specified, segment boundaries are determined according to num_topics, starting from the highest depth score.

Examples

using TextSegmentation
using PyCall

# LDA Topic Model
pygensim = pyimport("gensim")

# train_document
# Data used to train the LDA topic model.
# Data from the same domain as the text to be segmented is preferred.
function read_file(file_path)
    f = open(file_path, "r")
    text = replace(read(f, String), "\r" => "")
    close(f)
    return filter(i -> length(i) > 5, split(lowercase(text), "\n"))
end

file_path = [
    "/data/Relativity the Special and General Theory.txt",
    "/data/On Liberty.txt",
    "/data/Dream Psychology Psychoanalysis for Beginners.txt",
]

train_document = []
for i in file_path
    append!(train_document, read_file(i))
end

tokenized_train_document = [Utils.tokenize(i) for i in train_document]
dictionary = pygensim.corpora.Dictionary(tokenized_train_document)
corpus = [dictionary.doc2bow(text) for text in tokenized_train_document]
lda_model = pygensim.models.ldamodel.LdaModel(
    corpus = corpus,
    id2word = dictionary,
    minimum_probability = 0.0001,
    num_topics = 3,
    random_state=1234,
)

# TopicTiling
# document is assumed to be a vector of sentences prepared beforehand.
window_size = 2
do_smooth = false
smooth_window_size = 1
num_topics = 3
to = TopicTiling.SegmentObject(window_size, do_smooth, smooth_window_size, lda_model, dictionary)
result = TopicTiling.segment(to, document, num_topics)
println(result)
00010010000
source