TextSegmentation
Documentation for TextSegmentation.
TextSegmentation.Utils.calculate_cosin_similarity
TextSegmentation.Utils.count_elements
TextSegmentation.Utils.merge_elements
TextSegmentation.Utils.tokenize
TextSegmentation.TextTiling.SegmentObject
TextSegmentation.TextTiling.segment
TextSegmentation.C99.SegmentObject
TextSegmentation.C99.segment
TextSegmentation.TopicTiling.SegmentObject
TextSegmentation.TopicTiling.segment
TextSegmentation.Utils.calculate_cosin_similarity
— Method
calculate_cosin_similarity(elements_dct_1, elements_dct_2) -> Float64
Calculates the cosine similarity between two dictionaries.
Arguments
elements_dct_1 : Token dictionary of the left block, which includes the reference sentence.
elements_dct_2 : Token dictionary of the block to the right of the reference sentence.
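As an illustration, the similarity between two token-count dictionaries can be sketched in plain Julia. This is a simplified re-implementation, not the package's exact code; `cosine_similarity_sketch` and the sample dictionaries are hypothetical names.

```julia
# Sketch of cosine similarity over token-count dictionaries.
# Illustrative only; not the package's exact implementation.
function cosine_similarity_sketch(d1::Dict{String,Int}, d2::Dict{String,Int})
    dotprod = sum(get(d2, k, 0) * v for (k, v) in d1)   # products over shared tokens
    norm1 = sqrt(sum(v^2 for v in values(d1)))          # vector length of d1
    norm2 = sqrt(sum(v^2 for v in values(d2)))          # vector length of d2
    return dotprod / (norm1 * norm2)
end

left  = Dict("the" => 2, "cat" => 1)   # counts from the left block
right = Dict("the" => 1, "dog" => 1)   # counts from the right block
cosine_similarity_sketch(left, right)  # 2 / sqrt(10) ≈ 0.632
```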
TextSegmentation.Utils.count_elements
— Method
count_elements(sequence) -> Dict{String, Int64}
Counts the number of elements per token from a word segmented sentence.
Arguments
sequence : A word-segmented sequence of tokens.
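A minimal sketch of such a count in plain Julia (illustrative; `count_elements_sketch` is a hypothetical name, not the package function):

```julia
# Count occurrences of each token in a word-segmented sequence.
function count_elements_sketch(sequence)
    counts = Dict{String,Int}()
    for token in sequence
        counts[token] = get(counts, token, 0) + 1
    end
    return counts
end

count_elements_sketch(["the", "cat", "the"])  # Dict("the" => 2, "cat" => 1)
```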
TextSegmentation.Utils.merge_elements
— Method
merge_elements(dct)
Merges token-count dictionaries, combining the counts for each element.
Arguments
dct : A dictionary with the number of elements counted for each token.
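Assuming the merge sums the counts of each token across dictionaries (an assumption; the exact behavior is not specified here), the operation can be sketched with Julia's `mergewith`:

```julia
# Assumption: merging sums the counts of each token across dictionaries.
per_sentence = [Dict("the" => 2, "cat" => 1), Dict("the" => 1, "dog" => 1)]
merged = reduce((a, b) -> mergewith(+, a, b), per_sentence)
# merged == Dict("the" => 3, "cat" => 1, "dog" => 1)
```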
TextSegmentation.Utils.tokenize
— Method
tokenize(sentence) -> Vector{SubString{String}}
Performs preprocessing such as removing symbols, converting uppercase letters to lowercase, and word segmentation.
Arguments
sentence : A sentence in a document.
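The preprocessing steps can be sketched in plain Julia (illustrative; the package's exact cleaning rules may differ, and `tokenize_sketch` is a hypothetical name):

```julia
# Strip symbols, lowercase, and split on whitespace.
function tokenize_sketch(sentence)
    cleaned = replace(lowercase(sentence), r"[^a-z0-9\s]" => "")
    return split(cleaned)
end

tokenize_sketch("The cat, the dog!")  # ["the", "cat", "the", "dog"]
```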
TextSegmentation.TextTiling.SegmentObject
— Type
TextTiling.SegmentObject(window_size, do_smooth, smooth_window_size, tokenizer)
TextTiling is a method for finding segment boundaries based on lexical cohesion and similarity between adjacent blocks.
Arguments
window_size : Sliding window size.
do_smooth : If true, depth scores are smoothed.
smooth_window_size : Window size for smoothing depth scores.
tokenizer : Tokenizer for word segmentation.
TextSegmentation.TextTiling.segment
— Function
TextTiling.segment(seg, document, [num_topics]) -> String
Splits the document passed in the document argument.
Arguments
seg : Segment object.
document : The document to be segmented.
num_topics : The number of topics in the document. If this value is specified, num_topics segment boundaries are selected, starting with the highest depth score.
Examples
using TextSegmentation
# document is a vector of sentences to be segmented (prepared by the user).
window_size = 2
do_smooth = false
smooth_window_size = 1
num_topics = 3
tt = TextTiling.SegmentObject(window_size, do_smooth, smooth_window_size, Utils.tokenize)
result = TextTiling.segment(tt, document, num_topics)
println(result)
00010001000
TextSegmentation.C99.SegmentObject
— Type
C99.SegmentObject(window_size, similarity_matrix, rank_matrix, sum_matrix, std_coeff, tokenizer)
C99 is a method for determining segment boundaries by divisive clustering.
Arguments
window_size : Specifies the range of adjacent sentences referenced when building the rank matrix.
similarity_matrix : Matrix of cosine similarities computed between sentences.
rank_matrix : Each value of the similarity matrix replaced by its rank in the local region; the rank is the number of neighboring elements with a lower similarity score.
sum_matrix : Sum of the rank matrix over segment regions i to j.
std_coeff : Used in the threshold that determines segment boundaries, where μ and v are the mean and variance of the gradient δD(n) of the internal density.
tokenizer : Tokenizer for word segmentation.
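The ranking step can be sketched as follows. This is a simplified illustration of the idea; `rank_matrix_sketch` is a hypothetical name, and the package may compute or normalize ranks differently.

```julia
# For each cell, count the neighbors within `window` that have a lower
# similarity score (simplified version of the C99 rank transform).
function rank_matrix_sketch(sim::Matrix{Float64}, window::Int)
    n = size(sim, 1)
    rank = zeros(Int, n, n)
    for i in 1:n, j in 1:n
        for di in -window:window, dj in -window:window
            (di == 0 && dj == 0) && continue
            ni, nj = i + di, j + dj
            if 1 <= ni <= n && 1 <= nj <= n && sim[ni, nj] < sim[i, j]
                rank[i, j] += 1
            end
        end
    end
    return rank
end
```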
TextSegmentation.C99.segment
— Method
C99.segment(seg, document, n) -> String
Splits the document passed in the document argument.
Arguments
seg : Segment object.
document : The document to be segmented.
n : Document length.
Examples
using TextSegmentation
# document is a vector of sentences to be segmented (prepared by the user).
n = length(document)
init_matrix = zeros(n, n)
window_size = 2
std_coeff = 1.2
c99 = C99.SegmentObject(window_size, init_matrix, init_matrix, init_matrix, std_coeff, Utils.tokenize)
result = C99.segment(c99, document, n)
println(result)
00010001000
TextSegmentation.TopicTiling.SegmentObject
— Type
TopicTiling.SegmentObject(window_size, do_smooth, smooth_window_size, lda_model, dictionary)
TopicTiling is an extension of TextTiling that uses the topic IDs of words in a sentence to calculate the similarity between blocks.
Arguments
window_size : Sliding window size.
do_smooth : If true, depth scores are smoothed.
smooth_window_size : Window size for smoothing depth scores.
lda_model : Trained LDA topic model.
dictionary : A dictionary of word-id mappings.
TextSegmentation.TopicTiling.segment
— Function
TopicTiling.segment(seg, document, [num_topics]) -> String
Splits the document passed in the document argument.
Arguments
seg : Segment object.
document : The document to be segmented.
num_topics : The number of topics in the document. If this value is specified, num_topics segment boundaries are selected, starting with the highest depth score.
Examples
using TextSegmentation
using PyCall

# LDA Topic Model
pygensim = pyimport("gensim")

# train_document
# Data to be used when training the LDA topic model.
# Data from the same domain as the text to be segmented is preferred.
function read_file(file_path)
    f = open(file_path, "r")
    document = filter(i -> length(i) > 5, split(lowercase(replace(read(f, String), "\r" => "")), "\n"))
    close(f)
    return document
end
file_path = [
"/data/Relativity the Special and General Theory.txt",
"/data/On Liberty.txt",
"/data/Dream Psychology Psychoanalysis for Beginners.txt",
]
train_document = []
for i in file_path
append!(train_document, read_file(i))
end
tokenized_train_document = [Utils.tokenize(i) for i in train_document]
dictionary = pygensim.corpora.Dictionary(tokenized_train_document)
corpus = [dictionary.doc2bow(text) for text in tokenized_train_document]
lda_model = pygensim.models.ldamodel.LdaModel(
corpus = corpus,
id2word = dictionary,
minimum_probability = 0.0001,
num_topics = 3,
random_state=1234,
)
# TopicTiling
# document is a vector of sentences to be segmented (prepared by the user).
window_size = 2
do_smooth = false
smooth_window_size = 1
num_topics = 3
to = TopicTiling.SegmentObject(window_size, do_smooth, smooth_window_size, lda_model, dictionary)
result = TopicTiling.segment(to, document, num_topics)
println(result)
00010010000