literature
paper_from_response
def paper_from_response(openalex_response)
generate a paper object from an openalex response
Arguments:
openalex_response: openalex response json
Returns:
a paper object
paper_from_link
def paper_from_link(link)
generate a paper object from an openalex link; this is useful for references and related works
Arguments:
link: openalex link
Returns:
a paper object
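Example:
A minimal sketch of both helpers, assuming the package exposes them as literature.paper_from_response and literature.paper_from_link, and that utils.search_openalex (documented below) returns the openalex response dict; the ids are placeholders.

from literature import paper_from_response, paper_from_link
from utils import search_openalex

response = search_openalex("pubmed", "12345678")  # placeholder pubmed id
paper = paper_from_response(response)

# references and related works arrive as openalex links, hence the second helper
ref = paper_from_link("https://openalex.org/W2741809807")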
LitSearch Objects
class LitSearch()
__init__
def __init__(pubmed_api_key=None, email=None, sort_by="relevance")
create the necessary framework for searching
Arguments:
pubmed_api_key: api key for pubmed, optional
email: email to use for the pubmed api
sort_by: relevance or pub+date
search
def search(query, database="pubmed", results="id", max_results=1000)
search pubmed and arxiv for a query; this is just keyword search, no other params are implemented at the moment
Arguments:
query: a string passed to the search; as long as it is a valid query it will work, and other fields can be specified
database: pubmed or arxiv
results: what to return; the default is the paper id (PMID or arxiv id)
max_results: max number of results to return, default 1000
Returns:
paper ids specific to the database
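Example:
A hedged sketch, assuming LitSearch is importable from the literature module; the email and query are placeholders.

from literature import LitSearch

searcher = LitSearch(email="you@example.org", sort_by="relevance")
# returns database-specific ids (PMIDs for pubmed), capped at max_results
pmids = searcher.search("crispr base editing", database="pubmed", max_results=50)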
PaperInfo Objects
@dataclass
class PaperInfo()
Dataclass to hold information about a paper; it is constructed inside the Paper class and designed to be compatible with semantic search and embedding-distance searches
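Example:
The fields are not listed in this documentation; the sketch below is a guess at a plausible shape, inferred from the get_* docstrings further down (each of which "fills in the paper info ..."), not the actual definition.

from dataclasses import dataclass, field

@dataclass
class PaperInfoSketch:
    # hypothetical field set inferred from the Paper method docstrings
    title: str = ""
    abstract: str = ""
    authors: list = field(default_factory=list)
    openalex_info: dict = field(default_factory=dict)
    download_link: str = ""
    references: list = field(default_factory=list)
    related_works: list = field(default_factory=list)
    cited_by: list = field(default_factory=list)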
Paper Objects
class Paper()
__init__
def __init__(paper_id, id_type="pubmed", get_abstract=True)
This class is used to download and process a paper from a given id; it can also be used to process a paper from a file
Arguments:
paper_id: the paper id
id_type: pubmed or arxiv
get_abstract: whether to fetch the abstract on construction
filepath: if you already have the pdf file you can pass it here; mutually exclusive with paper_id
citations: if you want to get the citations for the paper; requires a paper id, cannot be done from a pdf
references: if you want to get the references for the paper; requires a paper id, cannot be done from a pdf
related_works: if you want to get the related works for the paper; requires a paper id, cannot be done from a pdf
get_abstract
def get_abstract()
get the abstract of the paper from pubmed or arxiv
Returns:
fills in the paper info abstract, title and authors
search_info
def search_info()
search openalex for the paper info and download link
Returns:
fills in the paper info openalex_info and download_link
download
def download(destination)
download the paper pdf to the destination folder
Arguments:
destination: the folder to download the paper into; it must already exist, as it will not be created or checked for existence
Returns:
nothing; saves the paper pdf to the destination folder
get_references
def get_references()
get the references of the paper from openalex
Returns:
fills in the paper info references
get_related_works
def get_related_works()
get the related works of the paper from openalex
Returns:
fills in the paper info related_works
get_cited_by
def get_cited_by(cursor="*")
get the papers that cite this paper from openalex
Arguments:
cursor: the user does not need to worry about this; it is used for pagination and recursive calls
Returns:
fills in the paper info cited_by
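Example:
A typical end-to-end workflow for a single paper; the pubmed id is a placeholder and the import path is an assumption.

import os
from literature import Paper

paper = Paper("12345678", id_type="pubmed", get_abstract=True)
paper.search_info()        # fills openalex_info and download_link

dest = "papers"
os.makedirs(dest, exist_ok=True)  # download() will not create the folder
paper.download(dest)

paper.get_references()     # fills references
paper.get_related_works()  # fills related_works
paper.get_cited_by()       # fills cited_by; pagination is handled internally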
paper_processor
PaperProcessor Objects
class PaperProcessor()
paper processor class. This is the main class for extracting text and figures and for generating embeddings for the papers. The pipeline method is the main caller, where you can specify which steps you would like to run; all the necessary parameters are passed in a config dict, so there are no hard-coded values and no values to fill.
extract
def extract(model, file_path, zoom=2)
extract text and images from a pdf; this method gets all the figures and tables from the pdf and returns them as images,
as well as extracting the pdf text using tesseract.
Arguments:
model: the model used to detect figures and tables in the pdf
file_path: pdf file path
zoom: zoom factor used when rendering pdf pages, default 2
Returns:
text, figures and tables as pillow images
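Example:
A hedged call sketch; neither the PaperProcessor constructor nor the expected detection model is documented here, so pp, layout_model and the unpacking of the return value are assumptions.

# pp: a configured PaperProcessor instance (constructor not shown in these docs)
# layout_model: whatever figure/table detection model your config specifies
text, images = pp.extract(layout_model, "papers/example.pdf", zoom=2)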
text_embeddings
def text_embeddings(chunker, model, text, splitting_strategy="semantic")
generate text embeddings using a chunking strategy and an embedding model. The model is a huggingface sentence transformer
and the chunker is a chonkie semantic chunker
Arguments:
chunker: chonkie semantic chunker
model: sentence transformer embedding model
text: text to embed
splitting_strategy: whether to use semantic chunking or not
Returns:
chunks and embeddings; if not chunked, the whole text and its embedding
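Example:
A sketch using the libraries the docstring names, chonkie for chunking and sentence-transformers for embedding; the checkpoint and the tuple return shape are assumptions.

from chonkie import SemanticChunker
from sentence_transformers import SentenceTransformer

chunker = SemanticChunker()                      # library defaults are fine for a sketch
model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder checkpoint
chunks, embeddings = pp.text_embeddings(chunker, model, text,
                                        splitting_strategy="semantic")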
image_embeddings
def image_embeddings(images, processor, model)
generate image embeddings using a vision model and its processor
Arguments:
images: the images; these can be tables or figures
processor: the image processor, a huggingface processor
model: the vl model, a huggingface model
Returns:
the embeddings as a list. Depending on the kind of model used this can be a 1D or 2D embedding; the current implementation of this function does not care, but your knowledgebase and the Project class will break if the necessary changes are not made to accommodate the embedding shape
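Example:
A hedged sketch with a CLIP checkpoint; the docstring only asks for a huggingface processor and model, so the specific checkpoint is a placeholder.

from transformers import AutoModel, AutoProcessor

processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
model = AutoModel.from_pretrained("openai/clip-vit-base-patch32")
embeddings = pp.image_embeddings(images, processor, model)  # images from extract()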
interpret_image
def interpret_image(image, prompt, model, processor, max_tokens=100)
This function takes an image and a prompt and generates a text description of the image using a vision-language model.
The default model is Qwen2_5_VL.
Arguments:
image: PIL image, no need to save to disk
prompt: image prompt, see configs for default
model: model class from huggingface
processor: processor class from huggingface
max_tokens: number of tokens to generate; more tokens = more text but does not mean more information
device: gpu or cpu; if cpu, keep it short
Returns:
string
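Example:
Since the docstring names Qwen2_5_VL as the default, a sketch might look as follows, assuming a recent transformers version; the checkpoint and prompt are placeholders.

from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

checkpoint = "Qwen/Qwen2.5-VL-3B-Instruct"  # placeholder checkpoint
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(checkpoint)
processor = AutoProcessor.from_pretrained(checkpoint)
caption = pp.interpret_image(image, "Describe this figure.", model, processor,
                             max_tokens=100)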
pipeline
def pipeline(papers,
extract=True,
embed_text=True,
embed_images=True,
interpret_images=False,
embed_iterpretations=False)
whole paper processing pipeline
Arguments:
papers: list of papers, see literature.Paper for details
extract: extract text, figures and tables (the latter two are images)
embed_text: chunk and embed the pdf text
embed_images: embed images
interpret_images: run a vision language model on the images to generate text
embed_iterpretations: embed the interpretations of the images
Returns:
the Paper instances with the requested attributes filled in
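Example:
Driving the pipeline from a search result; this builds on the LitSearch and PaperProcessor sketches above (searcher and pp), which are assumptions.

from literature import Paper

ids = searcher.search("protein structure prediction", max_results=10)
papers = [Paper(pid, id_type="pubmed") for pid in ids]
processed = pp.pipeline(papers, extract=True, embed_text=True,
                        embed_images=True, interpret_images=False)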
text_score
def text_score(query, papers)
score papers based on text similarity to a query; this is used in the Project class to rank papers based on their relevance to a project description
Arguments:
query: a description of what you are looking for
papers: a list of Paper instances
Returns:
a list of scores corresponding to the papers, one per paper
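Example:
A hedged ranking sketch, continuing from the pipeline example above.

scores = pp.text_score("methods for long-read genome assembly", processed)
# pair each paper with its score and sort best-first
ranked = sorted(zip(scores, processed), key=lambda pair: pair[0], reverse=True)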
utils
extract_pdfs_from_tar
def extract_pdfs_from_tar(file, destination)
extract all pdf files from a tar.gz file to a destination folder and return the paths to the extracted pdf files.
This is there to process PMC tar.gz files.
Arguments:
file: downloaded tar.gz file
destination: where to extract the pdf files
Returns:
a list of paths to the extracted pdf files
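Example:
A minimal sketch; the archive name is a placeholder and the import path is an assumption.

import os
from utils import extract_pdfs_from_tar

os.makedirs("pdfs", exist_ok=True)
pdf_paths = extract_pdfs_from_tar("oa_package.tar.gz", "pdfs")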
filter_openalex_response
def filter_openalex_response(response,
fields=[
"id", "ids", "doi", "title", "topics",
"keywords", "concepts", "mesh",
"best_oa_location", "referenced_works",
"related_works", "cited_by_api_url",
"datasets"
])
filters the openalex response to only include the specified fields
Arguments:
response: openalex response
fields: which fields to include, a list of strings
Returns:
new response with only the specified fields
search_openalex
def search_openalex(id_type, paper_id, fields=None)
api call for openalex to retrieve paper information
Arguments:
id_type: pubmed or arxiv
paper_id: the id
fields: which fields to get, passed to filter_openalex_response
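Example:
A hedged sketch; the pubmed id is a placeholder. The fields argument is forwarded to filter_openalex_response, so this exercises both helpers.

from utils import search_openalex

info = search_openalex("pubmed", "12345678",
                       fields=["id", "title", "referenced_works"])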
search_semantic_scholar
def search_semantic_scholar(paper_id, id_type, api_key=None, fields=None)
api call for semantic scholar to retrieve paper information; requires an api key
Arguments:
paper_id: paper id
id_type: id type; one of doi, arxiv, mag, pubmed, pmcid, ACL
api_key: api key for semantic scholar
fields: which fields to retrieve, list of strings
Returns:
a dict with the paper information; this is currently not used, nor is it compatible with the Paper class or other supporting functions