TextSegmentation

Documentation for TextSegmentation.

TextSegmentation.Utils.calculate_cosin_similarityMethod
calculate_cosin_similarity(elements_dct_1, elements_dct_2) -> Float64

Calculates the cosine similarity between two dictionaries.

Arguments

  • elements_dct_1: Token dictionary contained in the left block including the reference sentence.
  • elements_dct_2: Token dictionary contained in the block to the right of the reference sentence.
source
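As a rough illustration of the computation described above (not the package's exact implementation; the helper name `dict_cosine` is hypothetical), cosine similarity over two token-count dictionaries treats each dictionary as a sparse vector:

```julia
# Illustrative sketch: cosine similarity between two token-count
# dictionaries, treating them as sparse vectors over their keys.
function dict_cosine(d1, d2)
    # Dot product over the keys of d1 (missing keys in d2 count as 0).
    dot = sum(get(d2, k, 0) * v for (k, v) in d1)
    norm1 = sqrt(sum(v^2 for v in values(d1)))
    norm2 = sqrt(sum(v^2 for v in values(d2)))
    return dot / (norm1 * norm2)
end

d1 = Dict("the" => 2, "cat" => 1)
d2 = Dict("the" => 1, "dog" => 1)
dict_cosine(d1, d2)  # 2 / sqrt(10) ≈ 0.632
```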
TextSegmentation.Utils.count_elementsMethod
count_elements(sequence) -> Dict{String, Int64}

Counts the occurrences of each token in a word-segmented sentence.

Arguments

  • sequence: A word-segmented sequence.
source
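A minimal sketch of this kind of counting, assuming the input is an already word-segmented sequence (a vector of token strings); `count_tokens` is an illustrative name, not the package's function:

```julia
# Illustrative sketch: count occurrences of each token in a
# word-segmented sequence.
function count_tokens(sequence)
    counts = Dict{String, Int}()
    for token in sequence
        counts[token] = get(counts, token, 0) + 1
    end
    return counts
end

count_tokens(["the", "cat", "sat", "the"])  # Dict("the" => 2, "cat" => 1, "sat" => 1)
```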
TextSegmentation.Utils.merge_elementsMethod
merge_elements(dct)

Merges token-count dictionaries produced per sentence, accumulating the count for each token.

Arguments

  • dct: A dictionary with the number of elements counted for each token.
source
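The merge step can be sketched as follows (an illustration, not the package's code; `merge_counts` is a hypothetical name): counts for the same token are summed across dictionaries.

```julia
# Illustrative sketch: merge several token-count dictionaries into one,
# summing the counts for each shared token.
merge_counts(dicts) = reduce((a, b) -> mergewith(+, a, b), dicts)

merge_counts([Dict("a" => 1), Dict("a" => 2, "b" => 1)])  # Dict("a" => 3, "b" => 1)
```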
TextSegmentation.Utils.tokenizeMethod
tokenize(sentence) -> Vector{SubString{String}}

Performs preprocessing such as removing symbols, converting uppercase letters to lowercase, and word segmentation.

Arguments

  • sentence: A sentence in a document.
source
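A rough sketch of the preprocessing just described: lowercase, strip non-alphanumeric symbols, and split on whitespace. The exact symbol handling in Utils.tokenize may differ; `simple_tokenize` is an illustrative stand-in.

```julia
# Illustrative sketch: lowercase, remove symbols, split on whitespace.
function simple_tokenize(sentence)
    cleaned = replace(lowercase(sentence), r"[^a-z0-9\s]" => "")
    return split(cleaned)
end

simple_tokenize("The cat, and the Dog!")  # ["the", "cat", "and", "the", "dog"]
```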
TextSegmentation.TextTiling.SegmentObjectType
TextTiling.SegmentObject(window_size, do_smooth, smooth_window_size, tokenizer)

TextTiling is a method for finding segment boundaries based on lexical cohesion and similarity between adjacent blocks.

Arguments

  • window_size: Sliding window size.
  • do_smooth: If true, the depth scores are smoothed.
  • smooth_window_size: Window size for smoothing depth scores.
  • tokenizer: Tokenizer for word segmentation.
source
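The depth-score idea behind TextTiling can be sketched as follows (a simplified illustration, not the package's code): for each gap between adjacent blocks, depth measures how far the gap's similarity score sits below the nearest peaks on both sides; deep valleys are candidate boundaries.

```julia
# Simplified TextTiling depth-score sketch (illustrative only).
function depth_scores(scores)
    n = length(scores)
    depths = zeros(n)
    for i in 1:n
        # Climb left and right while scores keep rising to find local peaks.
        left = scores[i]
        for j in i-1:-1:1
            scores[j] >= left || break
            left = scores[j]
        end
        right = scores[i]
        for j in i+1:n
            scores[j] >= right || break
            right = scores[j]
        end
        depths[i] = (left - scores[i]) + (right - scores[i])
    end
    return depths
end

depth_scores([0.8, 0.2, 0.7])  # the middle gap has the largest depth
```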
TextSegmentation.TextTiling.segmentFunction
TextTiling.segment(seg, document, [num_topics]) -> String

Splits the document passed in the document argument into segments.

Arguments

  • seg: Segment object.
  • document: The document to be text segmented.
  • num_topics: The number of topics in the document. If this value is specified, segment boundaries are determined according to num_topics, starting from the highest depth score.

Examples

using TextSegmentation

# document is assumed to be a vector of sentences prepared beforehand.
window_size = 2
do_smooth = false
smooth_window_size = 1
num_topics = 3
tt = TextTiling.SegmentObject(window_size, do_smooth, smooth_window_size, Utils.tokenize)
result = TextTiling.segment(tt, document, num_topics)
println(result)
00010001000
source
TextSegmentation.C99.SegmentObjectType
C99.SegmentObject(window_size, similarity_matrix, rank_matrix, sum_matrix, std_coeff, tokenizer)

C99 is a method for determining segment boundaries through segmented clustering.

Arguments

  • window_size: Specifies the range of adjacent sentences referenced when building the rank matrix.
  • similarity_matrix: Matrix of cosine similarities computed between sentences.
  • rank_matrix: Each value in the similarity matrix is replaced by its rank in the local region, where a rank is the number of neighboring elements with a lower similarity score.
  • sum_matrix: Sum of the rank matrix over the segment region from i to j.
  • std_coeff: Coefficient for the threshold that determines segment boundaries, where μ and v are the mean and variance of the gradient δD(n) of the inside density.
  • tokenizer: Tokenizer for word segmentation.
source
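The rank replacement described above can be sketched like this (an illustration with simple boundary clipping; C99's exact window handling may differ, and `rank_matrix` here is a hypothetical helper):

```julia
# Illustrative rank-matrix sketch: each entry becomes the number of
# entries in its local neighborhood with a strictly lower similarity.
function rank_matrix(sim, window)
    n = size(sim, 1)
    rank = zeros(Int, n, n)
    for i in 1:n, j in 1:n
        for a in max(1, i - window):min(n, i + window), b in max(1, j - window):min(n, j + window)
            rank[i, j] += sim[a, b] < sim[i, j]
        end
    end
    return rank
end

rank_matrix([1.0 0.5; 0.5 0.2], 1)  # entry (1,1) outranks its 3 neighbors
```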
TextSegmentation.C99.segmentMethod
C99.segment(seg, document, n) -> String

Splits the document passed in the document argument into segments.

Arguments

  • seg: Segment object.
  • document: The document to be text segmented.
  • n: Document length (the number of sentences).

Examples

using TextSegmentation

# document is assumed to be a vector of sentences prepared beforehand.
n = length(document)
window_size = 2
std_coeff = 1.2

# Pass independent matrices so the similarity, rank, and sum matrices
# do not alias the same underlying array.
c99 = C99.SegmentObject(window_size, zeros(n, n), zeros(n, n), zeros(n, n), std_coeff, Utils.tokenize)
result = C99.segment(c99, document, n)
println(result)
00010001000
source
TextSegmentation.TopicTiling.SegmentObjectType
TopicTiling.SegmentObject(window_size, do_smooth, smooth_window_size, lda_model, dictionary)

TopicTiling is an extension of TextTiling that uses the topic IDs of words in a sentence to calculate the similarity between blocks.

Arguments

  • window_size: Sliding window size.
  • do_smooth: If true, the depth scores are smoothed.
  • smooth_window_size: Window size for smoothing depth scores.
  • lda_model: Trained LDA topic model.
  • dictionary: A gensim dictionary mapping words to IDs.
source
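The block comparison TopicTiling performs can be sketched as follows (an illustration, not the package's code): each word in a block is represented by its LDA topic ID, blocks become topic-ID count vectors, and the vectors are compared with cosine similarity. The topic IDs below are made up for the example.

```julia
# Illustrative sketch: compare two blocks by the LDA topic IDs of their
# words, using cosine similarity over topic-ID counts.
function topic_block_similarity(ids_left, ids_right)
    topic_counts(ids) = begin
        c = Dict{Int, Int}()
        for t in ids
            c[t] = get(c, t, 0) + 1
        end
        c
    end
    c1, c2 = topic_counts(ids_left), topic_counts(ids_right)
    dot = sum(get(c2, k, 0) * v for (k, v) in c1)
    return dot / (sqrt(sum(v^2 for v in values(c1))) * sqrt(sum(v^2 for v in values(c2))))
end

topic_block_similarity([1, 1, 2], [1, 2, 2])  # 0.8
```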
TextSegmentation.TopicTiling.segmentFunction
TopicTiling.segment(seg, document, [num_topics]) -> String

Splits the document passed in the document argument into segments.

Arguments

  • seg: Segment object.
  • document: The document to be text segmented.
  • num_topics: The number of topics in the document. If this value is specified, segment boundaries are determined according to num_topics, starting from the highest depth score.

Examples

using TextSegmentation
using PyCall

# LDA Topic Model
pygensim = pyimport("gensim")

# train_document
# Data used to train the LDA topic model.
# Data from the same domain as the text to be segmented is preferred.
function read_file(file_path)
    f = open(file_path, "r")
    text = replace(read(f, String), "\r" => "")
    close(f)
    return filter(i -> length(i) > 5, split(lowercase(text), "\n"))
end

file_path = [
    "/data/Relativity the Special and General Theory.txt",
    "/data/On Liberty.txt",
    "/data/Dream Psychology Psychoanalysis for Beginners.txt",
]

train_document = []
for i in file_path
    append!(train_document, read_file(i))
end

tokenized_train_document = [Utils.tokenize(i) for i in train_document]
dictionary = pygensim.corpora.Dictionary(tokenized_train_document)
corpus = [dictionary.doc2bow(text) for text in tokenized_train_document]
lda_model = pygensim.models.ldamodel.LdaModel(
    corpus = corpus,
    id2word = dictionary,
    minimum_probability = 0.0001,
    num_topics = 3,
    random_state=1234,
)

# TopicTiling
# document is assumed to be a vector of sentences prepared beforehand.
window_size = 2
do_smooth = false
smooth_window_size = 1
num_topics = 3
to = TopicTiling.SegmentObject(window_size, do_smooth, smooth_window_size, lda_model, dictionary)
result = TopicTiling.segment(to, document, num_topics)
println(result)
00010010000
source