Paper

The Paper class handles downloading and processing individual papers. All the paper information is stored in a python dataclass under the paper.info attribute.

Usage

from benchmate.literature.literature import Paper

# Initialize from PubMed ID
paper = Paper(
    paper_id="12345678",
    id_type="pubmed",
    citations=True,      # Get citation data
    references=True,     # Get reference data
    related_works=True   # Get related papers
)

# Initialize from arXiv ID 
paper = Paper(
    paper_id="2101.12345",
    id_type="arxiv"
)

# Initialize from local PDF file
paper = Paper(
    paper_id=None,
    filepath="/path/to/paper.pdf"
)

# If you use an arxiv or pubmed id the abstract and the paper title will be automatically extracted
print(paper.info.title)
print(paper.info.abstract)

# you can additional information about the paper via openalex

paper.search_info()
paper.get_references()
paper.get_cited_by()
paper.get_related_works()

These methods will modify the paper class in place. The paper_info dataclass stores all the relevant information about the paper. about the paper. Openalex provides a lot of information, including whether a paper is available via open access. If this is the case there will be a link to the PDF that is stored in the paper.info.pdf_link attribute.

To download the PDF to a location of your choice, you can use the download_pdf method.

paper.download(destination="/path/to/destination")

There are a few limitations to downloading papers. Due to NCBI api key restrictions (that is I don’t have one). I cannot write an additional method to download papers using pubmed. Therefore I have not written the code for that. And since I do not have a pubmed API key, I am limited to open access papers, that are indexed by openalex. Even among these there are restrictions for making simple get requests that sometimes (or all the times) the publishers may refuse to return data for such requests. There is nothing I can do about this as I do not control what different publishers do with their servers. That said if you have a list of id=pdf key value store you can fill in the paper.info.download_link attribute manually and continue with paper processing discussed under paper processor module

PaperInfo Dataclass

We created the PaperInfo dataclass to store all the information that is associated with that paper. This includes all the information that is generated after processing. The dataclass looks like this:

@dataclass
class PaperInfo:
    """
    Dataclass to hold information about a paper, this is constructed inside the Paper class and desined to be compatible with
    semantic search and embedding distance searches
    """
    id: str
    id_type: str
    title: Optional[str] = None
    authors: Optional[list] = None
    abstract: Optional[str] = None
    abstract_embeddings: Optional[np.ndarray]  = None
    text: Optional[str] = None
    text_chunks: Optional[list] = None
    chunk_embeddings: Optional[np.ndarray] = None
    figures: Optional[list] = None
    figure_embeddings: Optional[np.ndarray] = None
    tables: Optional[list] = None
    table_embeddings: Optional[np.ndarray] = None
    figure_interpretation: Optional[str] = None
    table_interpretation: Optional[str] = None
    download_link: str = None
    downloaded: bool = False
    file_path: str = None
    openalex_info: Optional[dict] = None
    references: Optional[list] = None
    related_works: Optional[list] = None
    cited_by: Optional[list] = None

Most of these are populated by the PaperProcessor