Sci-Blogger is a two-stage pipeline, described in detail as follows:
In the first stage, a heuristic-based function takes the title and abstract of the research paper and extracts the information most relevant for the next step; we experimented with several such heuristics, as described below. In the second stage, this intermediate output is fed into a sequence-to-sequence neural generation model, which generates the title of the blog post.
For stage 1 - given our dataset T = (bt, pt, abs), where bt is the blog title, pt is the paper title and abs is the abstract, we define a heuristic function H(pt, abs) that takes a paper title and abstract as parameters and outputs a sequence s. We train our seq2seq model SS to take s as input and generate bt' as output, with a loss function L(bt, bt').
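Concretely, the stage-1 interface maps each dataset triple to a (source, target) pair for seq2seq training. A minimal sketch follows; the function names and the example triple are ours, not from the paper:

```python
# Sketch of the stage-1 data preparation (identifiers are illustrative).
def heuristic_identity(pt: str, abs_: str) -> str:
    """H(pt, abs) = pt: use the paper title itself as the intermediate sequence s."""
    return pt

def make_training_pairs(dataset, H):
    """Turn each (bt, pt, abs) triple into an (s, bt) pair for seq2seq training."""
    return [(H(pt, abs_), bt) for (bt, pt, abs_) in dataset]

# Hypothetical single-example dataset T = (bt, pt, abs).
dataset = [("Why robots dream", "On Generative Models of Sleep", "We study ...")]
pairs = make_training_pairs(dataset, heuristic_identity)
# pairs == [("On Generative Models of Sleep", "Why robots dream")]
```

The seq2seq model SS is then trained on these pairs, with the loss L(bt, bt') comparing the generated title bt' against the reference bt.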
The various heuristic functions H we explored are outlined below:
• H(pt, abs) = pt : In this heuristic, we assume that the paper title alone encapsulates sufficient information to generate the blog title.
• H(pt, abs) = RP(abs) : Here, we define RP(abs) as the most representative sentence in abs, where the representativeness of a sentence is the sum of the TF-IDF values of its words; the rest of the procedure follows the previous heuristic.
• H(pt, abs) = RPD(abs) : Let nRD(abs) and nRP(abs) be the normalized readability and representativeness of a sentence respectively, where normalization is performed across all sentences. We define RPD(abs) = nRD(abs) × nRP(abs).
We also experimented with combinations of the above heuristics: H(pt, abs) = pt | RD(abs), H(pt, abs) = pt | RP(abs), H(pt, abs) = pt | RPD(abs) and H(pt, abs) = pt | abs, where | denotes concatenation of the associated heuristics.
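The sentence-scoring heuristics above can be sketched as follows. This is our illustrative reading, not the authors' code: each sentence of the abstract is treated as one TF-IDF document, the readability metric (not specified in this excerpt) is replaced by a stand-in proxy, and max-normalization is an assumption.

```python
import math
import re

def sentences(text):
    """Naive sentence splitter (illustrative only)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokens(sent):
    return re.findall(r"[a-z]+", sent.lower())

def tfidf_scores(sents):
    """RP-score of each sentence: the sum of TF-IDF weights of its words,
    treating each sentence of the abstract as one document."""
    docs = [tokens(s) for s in sents]
    n = len(docs)
    df = {}
    for d in docs:
        for w in set(d):
            df[w] = df.get(w, 0) + 1
    def score(d):
        if not d:
            return 0.0
        return sum(d.count(w) / len(d) * math.log(n / df[w]) for w in set(d))
    return [score(d) for d in docs]

def readability_scores(sents):
    """Stand-in readability proxy (shorter words => more readable);
    the paper's actual readability metric is not given in this excerpt."""
    return [1.0 / (1.0 + sum(map(len, tokens(s))) / max(len(tokens(s)), 1))
            for s in sents]

def rp(abs_):
    """H = RP(abs): the most representative sentence of the abstract."""
    sents = sentences(abs_)
    scores = tfidf_scores(sents)
    return sents[max(range(len(sents)), key=scores.__getitem__)]

def rpd(abs_):
    """H = RPD(abs): argmax of normalized readability x normalized
    representativeness (max-normalization assumed)."""
    sents = sentences(abs_)
    rp_s, rd_s = tfidf_scores(sents), readability_scores(sents)
    def norm(xs):
        m = max(xs) or 1.0
        return [x / m for x in xs]
    combined = [a * b for a, b in zip(norm(rp_s), norm(rd_s))]
    return sents[max(range(len(sents)), key=combined.__getitem__)]
```

A combined heuristic such as pt | RP(abs) then reduces to simple string concatenation, e.g. `pt + " " + rp(abs_)`.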
In stage 2 - we leverage a sequence-to-sequence (seq2seq) architecture to generate blog titles from the intermediate output sequence produced in stage 1.
Sequence-to-sequence networks have been applied successfully to summarization and neural machine translation, where an attention mechanism over the input sequence allows the network to focus on specific parts of the input text while generating output.
One recent advancement in this direction is the pointer-generator framework, which extends attention-based models by computing a probability Pgen that decides whether the next word in the sequence should be copied from the source or generated from the rest of the vocabulary. Such a framework aids in copying factual information from the source, which we hypothesize will be useful when generating blog titles. Hence, we use the pointer-generator model as our sequence-to-sequence framework for the second stage.
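For intuition, the pointer-generator's final output distribution mixes the generation distribution with the copy (attention) distribution, P(w) = Pgen · Pvocab(w) + (1 − Pgen) · Σ attention over source positions holding w. A minimal sketch with toy numbers (not the actual model) is:

```python
def final_distribution(p_gen, p_vocab, attention, src_ids):
    """Mix generation and copy distributions as in a pointer-generator:
    P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum of attention weights
    on source positions where w occurs."""
    p = [p_gen * pv for pv in p_vocab]
    for a_i, w in zip(attention, src_ids):
        p[w] += (1 - p_gen) * a_i
    return p

# Toy example: 4-word vocabulary, 3-token source (word ids 2, 0, 2).
p_vocab = [0.1, 0.2, 0.3, 0.4]     # generation distribution (sums to 1)
attention = [0.5, 0.3, 0.2]        # attention over source tokens (sums to 1)
src_ids = [2, 0, 2]
p = final_distribution(0.8, p_vocab, attention, src_ids)
# Both inputs sum to 1, so the mixture also sums to 1.
```

Because word 2 receives attention mass from two source positions, its final probability is boosted relative to the pure generation distribution, which is exactly the copying behaviour the framework provides.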