Documentation

Sergio Castillo Lara edited this page Sep 12, 2017 · 3 revisions

Documentation of ppaxe.core and ppaxe.report classes.

PMQuery

Attributes

ids: list no default

List of PubMed identifiers to query.

database: str default = "PMC"

Database to download the articles or abstracts from. PMC or PUBMED.

articles: list no default

List of downloaded Article objects.

found: set no default

PubMed identifiers of the articles found in database.

notfound: set no default

PubMed identifiers of the articles not found in database.
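How ids, found and notfound relate can be pictured with a small stand-in. This is only a sketch: the `available` dictionary is a hypothetical replacement for the remote database, and the identifiers are made up for illustration. The real class downloads records over the network.

```python
# Hypothetical stand-in for the remote database: PubMed id -> article text.
available = {"100": "Text of article 100", "300": "Text of article 300"}

# Identifiers requested by the user (attribute "ids").
ids = ["100", "200", "300"]

# Ids present in the database end up in "found", the rest in "notfound".
found = {pmid for pmid in ids if pmid in available}
notfound = set(ids) - found

print(sorted(found))     # ['100', '300']
print(sorted(notfound))  # ['200']
```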

Methods

get_articles()

Retrieves the full text or the abstracts of the specified articles.

Article

Attributes

pmid: str no default

PubMed identifier of the article.

pmcid: str no default

PubMedCentral identifier of the article.

journal: str no default

Journal id of the article.

fulltext: str no default

Whole text of the article.

abstract: str no default

Abstract of the article.

sentences: list no default

List of Sentence objects in article (fulltext or abstract).

Methods

as_html()

Writes tokenized sentences as HTML

count_genes()

Returns how many times each gene appears, as a dictionary with gene objects as keys and counts as values.
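The shape of such a count can be sketched with the standard library. This is an assumption-laden illustration, not the ppaxe implementation: plain symbol strings stand in for the gene objects, and `sentence_symbols` is a hypothetical input.

```python
from collections import Counter

# Hypothetical input: gene symbols found in each sentence of an article.
sentence_symbols = [["RAF1", "MAP2K1"], ["RAF1"], ["MAPK1", "RAF1"]]

# Count every mention across all sentences.
gene_counts = Counter(
    symbol for sentence in sentence_symbols for symbol in sentence
)

print(gene_counts["RAF1"])  # 3
```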

extract_sentences(mode="split", source="fulltext")

Finds sentence boundaries and stores the resulting list of Sentence objects in the attribute "sentences".

  • mode str optional default = "split"

Split the sentences ("split") or use the whole "source" as a single sentence ("no-split"). Useful for developing and debugging.

  • source str optional default = "fulltext"

Use the "fulltext" or the "abstract" to extract sentences.
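The difference between the two modes can be sketched as follows. The splitting rule is a naive stand-in chosen for illustration, and `split_sentences` is a hypothetical helper. The splitter ppaxe actually uses is not described here.

```python
import re


def split_sentences(text, mode="split"):
    """Return a list of sentence strings (naive stand-in, not ppaxe's splitter)."""
    if mode == "no-split":
        # Treat the whole source as a single sentence.
        return [text]
    # Naive rule: break after '.', '!' or '?' when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


text = "MAPK1 binds RAF1. The complex is stable."
split_result = split_sentences(text)                   # two sentences
whole_result = split_sentences(text, mode="no-split")  # one sentence
```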

make_wordcloud()

Creates a wordcloud image

predict_interactions(source="fulltext")

Simple wrapper method to avoid calls to multiple methods.

  • source str optional default = "fulltext"

Retrieve the interactions in the article from the source (fulltext or abstract).

Sentence

Attributes

originaltext: str no default

Original text string of the sentence.

tokens: list no default

List of tokens retrieved from StanfordCoreNLP. Each element is a dictionary with keys:

    "index" : Position of token (1-Indexed).
    "word"  : Word of the token.
    "lemma" : Lemma of the token.
    "ner"   : Protein ("P") or Other ("O").
    "pos"   : Part-of-Speech tag.
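A token list with these keys might look like the sketch below. The values are made up for illustration, and the Penn Treebank style of the "pos" tags is an assumption.

```python
# Illustrative tokens for the sentence "RAF1 activates MAP2K1" (made-up values).
tokens = [
    {"index": 1, "word": "RAF1", "lemma": "RAF1", "ner": "P", "pos": "NN"},
    {"index": 2, "word": "activates", "lemma": "activate", "ner": "O", "pos": "VBZ"},
    {"index": 3, "word": "MAP2K1", "lemma": "MAP2K1", "ner": "P", "pos": "NN"},
]

# 1-indexed positions of the tokens tagged as protein.
protein_positions = [t["index"] for t in tokens if t["ner"] == "P"]

# Words whose part-of-speech tag marks a verb (assuming tags starting with "VB").
verbs = [t["word"] for t in tokens if t["pos"].startswith("VB")]
```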

candidates: list no default

List of Candidate objects in sentence.

proteins: list no default

List of Protein objects found in sentence.

Methods

annotate()

Annotates the genes/proteins in the sentence using the trained StanfordCoreNLP NER tagger. Adds a list of tokens to the attribute "tokens".

get_candidates()

Gets the interaction candidates for the sentence (attribute: candidates) and all the proteins (attribute: proteins).

to_html()

Converts the sentence to an HTML string in which the proteins and the verbs are marked with tags.

Protein

Attributes

symbol: str no default

Symbol of the protein or the gene.

positions: list no default

List of the positions of the protein in the tokenized sentence (1-Indexed).

sentence: Sentence no default

Sentence object to which the protein belongs.

synonym: str no default

Synonymous symbol of the protein/gene.

count: int no default

Length of the positions list.

Methods

disambiguate()

Method for disambiguating the gene (convert it to the approved symbol if possible).

InteractionCandidate

Attributes

prot1: Protein no default

Protein object of the first protein involved in the possible interaction.

prot2: Protein no default

Protein object of the second protein involved in the possible interaction.

between_idxes: tuple no default

Indexes of end of Protein_1 and start of Protein_2 (1-Indexed).

label: bool no default

Label of Candidate when prediction is performed. True for interacting proteins and False for non-interacting proteins. True if votes >= 0.55.

votes: float no default

Percentage of votes of the Random Forest Classifier.
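The rule linking votes and label can be written as a short sketch. The function name `label_from_votes` is hypothetical, and only the 0.55 threshold comes from the description of label.

```python
VOTE_THRESHOLD = 0.55  # threshold stated in the description of "label"


def label_from_votes(votes, threshold=VOTE_THRESHOLD):
    """Return True (interacting) when the vote share reaches the threshold."""
    return votes >= threshold


labels = [label_from_votes(v) for v in (0.30, 0.55, 0.90)]
```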

feat_cols: list no default

Feature column indexes of the non-zero features computed for Candidate.

feat_current_col: int no default

Store the current feature column index that has been computed.

feat_vals: list no default

Values of the non-zero features.

features_sparse: sparse.coo_matrix no default

Sparse Coo matrix with features for Candidate.

Methods

compute_features()

Computes all the necessary features to predict if this InteractionCandidate is a real interaction. Fills attribute features_sparse, which is a Scipy sparse matrix.

features_todense()

Returns features as a plain python list. Used for testing.
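How feat_cols and feat_vals describe a dense feature list can be sketched in plain Python. This is an illustration under assumptions: `to_dense` and `n_features` are hypothetical names, column indexes are taken to be 0-based, and the real method works on the Scipy sparse matrix.

```python
def to_dense(feat_cols, feat_vals, n_features):
    """Expand parallel lists of column indexes and values into a dense list."""
    dense = [0] * n_features
    for col, val in zip(feat_cols, feat_vals):
        dense[col] = val  # every column not listed stays at zero
    return dense


# Two non-zero features, at columns 1 and 4, out of 6 features in total.
dense = to_dense(feat_cols=[1, 4], feat_vals=[2, 1], n_features=6)
```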

predict()

Computes the votes (prediction) of the candidate by using the Random Forest classifier trained with scikit-learn.

to_html()

Converts the candidate to HTML, tagging only the two proteins involved and only the verbs located between them.

ReportSummary

Attributes

articles: list or PMQuery no default

List of Article objects or PMQuery with Article objects in attribute "articles".

protsummary: ProteinSummary no default

ProteinSummary object of the analysis.

graphsummary: GraphSummary no default

GraphSummary object of the analysis.

Methods

create_pdf(outfile)

Creates a pdf out of an html file.

  • outfile str required no default

Output filename of the pdf report. Will append ".pdf" to it.

make_report(outfile="report")

Runs all the steps needed to produce the report.

  • outfile str optional default = "report"

Filename of the output file. Will append ".html" or ".pdf".

write_html(outfile)

Writes an html with the report to outfile.

  • outfile str required no default

Output filename of the html report. Will append ".html" to it.

ProteinSummary

Attributes

articles: list no default

List of Article objects.

prot_table: dict no default

Dictionary of dictionaries with information about protein counts in articles. Keys:

        symbol: symbol of the protein
            'totalcount' : total number of occurrences of the protein.
            'int_count'
                'left'  : Occurrences of the protein on the left-hand side of an interaction.
                'right' : Occurrences of the protein on the right-hand side of an interaction.

Methods

makesummary()

Makes the summary of the proteins found using the NER

table_to_html(sorted_by="totalcount", reverse=True)

Returns an html string with the desired protein count table.

  • sorted_by str optional default = "totalcount"

Sort the table by the total number of occurrences of the protein in sentences (sorted_by="totalcount"), by the total number of occurrences in interactions (sorted_by="int_count"), by occurrences on the left-hand side of an interaction (sorted_by="left") or on the right-hand side (sorted_by="right").

  • reverse bool optional default = True

Sort proteins in descending order (from bigger to smaller) according to the sorted_by rule if True. Ascending order (smaller to bigger) if False.
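The four sorting rules can be sketched against a prot_table with the structure described for the attribute. The counts are made up, the helpers `sort_key` and `sorted_symbols` are hypothetical, and treating "int_count" as the sum of the left and right counts is an assumption.

```python
# Made-up counts following the documented prot_table structure.
prot_table = {
    "RAF1":   {"totalcount": 5, "int_count": {"left": 2, "right": 1}},
    "MAP2K1": {"totalcount": 7, "int_count": {"left": 0, "right": 2}},
    "MAPK1":  {"totalcount": 3, "int_count": {"left": 1, "right": 0}},
}


def sort_key(entry, sorted_by):
    """Pick the number a protein entry is ranked by."""
    if sorted_by == "totalcount":
        return entry["totalcount"]
    if sorted_by == "int_count":
        # Assumption: total occurrences in interactions = left + right.
        return entry["int_count"]["left"] + entry["int_count"]["right"]
    return entry["int_count"][sorted_by]  # "left" or "right"


def sorted_symbols(table, sorted_by="totalcount", reverse=True):
    """Return the protein symbols ordered by the chosen rule."""
    return sorted(
        table, key=lambda symbol: sort_key(table[symbol], sorted_by), reverse=reverse
    )


by_total = sorted_symbols(prot_table)
by_left = sorted_symbols(prot_table, sorted_by="left")
ascending = sorted_symbols(prot_table, reverse=False)
```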

GraphSummary

Attributes

articles: list no default

List of Article objects.

interactions: list no default

List of lists with interactions in articles. Elements:

        [
            [
                votes,
                prot1.symbol,
                prot1.disambiguate(),
                prot2.symbol,
                prot2.disambiguate(),
                candidate.to_html(),
                article.pmid
            ],
            ...
        ]

numinteractions: int no default

Number of interactions in articles.

uniqinteractions: set no default

Set with symbols of interactions in articles to remove redundant interactions.

uniqinteractions_count: int no default

Number of unique interactions in articles.
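The counting attributes can be sketched from an interactions list with the row layout shown for the attribute. The rows are made up, and identifying an interaction by the ordered pair of disambiguated symbols is an assumption about how redundancy is removed.

```python
# Made-up rows: [votes, symbol1, disambiguated1, symbol2, disambiguated2, html, pmid]
interactions = [
    [0.82, "Raf-1", "RAF1", "MEK1", "MAP2K1", "<html snippet>", "100"],
    [0.61, "RAF1", "RAF1", "MEK1", "MAP2K1", "<html snippet>", "200"],
    [0.57, "MEK1", "MAP2K1", "ERK2", "MAPK1", "<html snippet>", "200"],
]

numinteractions = len(interactions)

# Assumption: two rows are redundant when their disambiguated symbols match.
uniqinteractions = {(row[2], row[4]) for row in interactions}
uniqinteractions_count = len(uniqinteractions)
```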

Methods

graph_to_json()

Returns a JSON string with the graph prepared for Cytoscape.
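A graph serialized for Cytoscape could look like the sketch below. The nodes/edges layout with "data" entries follows the Cytoscape.js elements format. The exact fields ppaxe emits are not documented here, so the "votes" field and the edge ids are assumptions.

```python
import json

# Made-up interactions: (source symbol, target symbol, votes).
pairs = [("RAF1", "MAP2K1", 0.82), ("MAP2K1", "MAPK1", 0.57)]

# One node per distinct protein symbol.
nodes = sorted({symbol for source, target, _ in pairs for symbol in (source, target)})

graph = {
    "nodes": [{"data": {"id": symbol}} for symbol in nodes],
    "edges": [
        {"data": {"id": f"{s}-{t}", "source": s, "target": t, "votes": v}}
        for s, t, v in pairs
    ],
}

graph_json = json.dumps(graph)
```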

makesummary()

Makes the summary of the interactions retrieved.

table_to_html()

Returns an HTML string with the interactions sorted by votes/confidence.
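Ordering by votes can be sketched as a sort on the first element of each interaction row. The rows are made up and shortened to three fields, and descending order (most confident first) is an assumption.

```python
# Made-up rows, shortened to [votes, symbol1, symbol2] for brevity.
rows = [
    [0.61, "RAF1", "MAP2K1"],
    [0.93, "MAP2K1", "MAPK1"],
    [0.57, "RAF1", "MAPK1"],
]

# Assumption: the most confident interactions come first.
rows_by_votes = sorted(rows, key=lambda row: row[0], reverse=True)
top_votes = [row[0] for row in rows_by_votes]
```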
