Commit

Merge pull request #11 from deepcurator/development
Merge for Milestone 7
dmitra79 authored Sep 30, 2019
2 parents f7c267e + db5a4fa commit 3052e73
Showing 6,505 changed files with 1,096,776 additions and 44,860 deletions.
The diff you're trying to view is too large. We only load the first 3000 changed files.
2 changes: 1 addition & 1 deletion .travis.yml
Original file line number Diff line number Diff line change
@@ -18,4 +18,4 @@ install:

script:
- cd script
- python script_run_lightweight_method.py -opt 3 -ipt ..\core\test\fashion_mnist
- python script_lightweight.py -opt 3 -ip ..\core\test\fashion_mnist
17 changes: 12 additions & 5 deletions conf/conf.yaml
@@ -4,22 +4,29 @@
# All paths are relative to code level, ex: from {root}/src/text2graph/ or {root}/src/image2graph/

RAW_DATA_FOLDER: ../../Data/raw_data/


# folder with raw pdfs
PDF_PATH: 'papers_pdf/'



# location where to extract text files for further analysis:
EXTRACT_TEXT_PATH: '../../Data/extracted_text/'
# location where to extract abstracts for further analysis:
EXTRACT_ABSTRACT_PATH: '../../Data/extracted_abstracts/'

# location where to extract text files for further analysis:
EXTRACT_IMAGE_PATH: 'extracted_images/'

# text2graph parameters
PAPERS_IN_XML_PATH: 'Data/Papers-In-XML-format/'
ANNOTATED_TEXT_PATH: 'Data/Abstracts-annotated-whole/'
SENTENCE_ANNOTATED_TEXT_PATH: 'Data/Abstracts-annotated/'
SENTENCE_ANNOTATED_TEXT_PATH: 'Data/Papers-annotated-Brat/'
SENTENCE_ANNOTATED_TEXT_PATH_SEMEVAL: 'Data/Papers-annotated-SemEval/'
TEST_DATA_PATH: 'Data/TestData/'
MODEL_PATH: 'Models/'
TEXT_OUTPUT_PATH: 'Output/'


# location to CSO:
CSO_PATH: '../../../Ontologies'
Binary file not shown.
Binary file not shown.
Binary file not shown.
7 changes: 6 additions & 1 deletion src/code2graph/.gitignore
@@ -5,4 +5,9 @@ data_tf/*
rdf_triples/*
rdf_graphs/*
dataset/*
data_tf/*
data_tf/*
UCI_TF_Papers/*
core/pyan_temp/*
event_files/*
comp_rdf/*
lightweight_rdf*
72 changes: 42 additions & 30 deletions src/code2graph/README.md
@@ -1,21 +1,18 @@
# Code2graph

The code2graph is a Python module that aims to transform the source code related to Deep Learning Architectures and methodologies into RDF graphs. In code2graph, the building blocks of the pipeline are implemented with a flexible architecture.
The code2graph is a module in project [Deep Code Curator](https://github.com/deepcurator/DCC) (DCC) which aims to extract the information from scientific publications and the corresponding source code related to Deep Learning architectures and methodologies.
code2graph is a sub-module of DCC ([Deep Code Curator](https://github.com/deepcurator/DCC)), which aims to extract semantic information from the text, images, code, and equations that accompany scientific DL papers. The purpose of code2graph is to build a pipeline of methodologies for extracting Resource Description Framework (RDF) graphs, particularly from the code repositories related to DL publications. The figure below illustrates the current architecture.

Currently, two methodologies are included in code2graph.
1. Computation-based Approach, see [graphHandler.py](https://github.uci.edu/AICPS/code2graph/blob/master/core/graphHandler.py).
2. The Lightweight Approach, see [graphlightweight.py](https://github.uci.edu/AICPS/code2graph/blob/master/core/graphlightweight.py).
![](https://github.com/louisccc/DCC/blob/master/src/code2graph/figs/architecture.jpg?raw=true)

Computation-based Approach (MNist) | The Lightweight Approach (VGG)
:-------------------------:|:-------------------------:
![](https://github.uci.edu/AICPS/code2graph/blob/master/figs/Sample_Output_0.png?raw=true) | ![](https://github.uci.edu/AICPS/code2graph/blob/master/figs/Sample_Output_1_.png?raw=true)
Two methodologies are studied in code2graph.
1. The Computational Graph-Based Approach ([graphHandler.py](https://github.com/deepcurator/DCC/blob/master/src/code2graph/core/graphHandler.py))
2. The Lightweight Approach ([graphlightweight.py](https://github.com/deepcurator/DCC/blob/master/src/code2graph/core/graphlightweight.py))

The following figure illustrates the current pipeline architecture of code2graph:
![](https://github.uci.edu/AICPS/code2graph/blob/master/figs/architecture.jpg?raw=true)
You can find details in the [Technical Report on Code2Graph](http://cecs.uci.edu/files/2019/05/TR-19-01.pdf). A sample visualization of the graphs generated by both methods is shown below (using the [fashion MNIST program example](https://github.com/deepcurator/DCC/blob/master/src/code2graph/test/fashion_mnist/testGraph_extensive.py)):

To understand the pipeline of code2graph better, you can refer to
- [Deep Code Curator - Technical Report on Code2Graph](http://cecs.uci.edu/files/2019/05/TR-19-01.pdf)
Computational Graph-based Approach (MNist) | The Lightweight Approach (MNist)
:-------------------------:|:-------------------------:
<img src="https://github.com/louisccc/DCC/blob/master/src/code2graph/figs/Sample_Output_0.png?raw=true">|<img src="https://github.com/louisccc/DCC/blob/master/src/code2graph/figs/Sample_Output_1_.png?raw=true" width="850">

## Software Dependencies

@@ -26,11 +23,10 @@ To understand the pipeline of code2graph better, you can refer to

## Installation Guide

Step 1: Clone the git repository by running one of the commands shown in the following snippets.
Step 1: Clone the git repository by running the command below.

```shell
git clone https://github.uci.edu/AICPS/code2graph.git
git clone [email protected]:AICPS/code2graph.git
git clone https://github.com/deepcurator/DCC.git
```

Step 2: Create a python virtual environment using your favorite package management system (conda, virtualenv, etc).
@@ -48,31 +44,47 @@ Step 3: Install the required packages to your virtual environment.
```shell
pip install -r requirements.txt
```

## Package Dependencies

* jupyter==1.0.0 => Jupyter notebook.
* jupyter-console==5.0.0 => Jupyter notebook.
* ipython==5.3.0 => Jupyter notebook.
* pyvis==0.1.6.0 => RDF graph visualization.
* astor==0.7.1 => AST manipulation and printing.
* beautifulsoup4==4.7.1 => Webscraping.
* Keras==2.2.4 => Compile Keras projects.
* tensorflow==1.13.1 => Compile tensorflow projects.
* matplotlib==3.0.2
* networkx==2.2
* rdflib==4.2.2 => RDF graph construction.
* requests==2.21.0 => Webscraping.
* scikit-learn==0.20.2
* selenium==3.141.0 => Webscraping.
* urllib3==1.24.1 => Webscraping.
* wget==3.2 => Webscraping.
* lxml==4.3.4 => Webscraping.
* showast==0.2.4 => Visualizing AST.
* autopep8==1.4.4 => Preprocess data.
* apscheduler==3.6.1 => Scheduler for web crawler.

## Usage Examples
### Running Computation-Based Approach
Under Construction, or you can also refer to the [notebook](testScript/computational_graph_based.ipynb).
Refer to the [notebook](testScript/computational_graph_based.ipynb).

### Running Lightweight Approach
Run the following command, or you can also refer to the [notebook](testScript/light_weight.ipynb).

```shell
python script_run_lightweight_method.py -ipt [PATH_TO_CODE] -opt [N [N ...]] --arg
```
-ipt: Path to the directory that contains the source code of your machine learning model.
Refer to the [notebook](testScript/light_weight.ipynb).

-opt: Types of output: 1 = call graph, 2 = call trees, 3 = RDF graphs, 4 = TensorFlow sequences.
## Dataset

--arg: Show arguments on graph (Hidden by default).
--url: Show url/is_type relations on graph (Hidden by default).
Using our script, we scraped around 600 papers from the paperswithcode.com website. Of these 600 papers, 120 have a TensorFlow implementation. We ran the lightweight method on those TensorFlow repositories, and it succeeded on half of them. You can download the RDF graphs and triples we generated [here](https://osf.io/zrusg/?view_only=f6ed10613af94c6d8050796a30f1568b).

### Running Webscraper for Paperswithcode website

```shell
python script_scrape_paperswithcode.py -cd [PATH_TO_CHROMEDRIVER]
python script_service_pwc_scraper.py -cd [PATH_TO_CHROMEDRIVER] -sp [SAVE_PATH]
```

-cd: Path to ChromeDriver. To get the ChromeDriver compatible with your browser go to the following website - http://chromedriver.chromium.org/downloads and download the ChromeDriver for the version of Chrome you are using.
-cd: Path to ChromeDriver. To get the ChromeDriver compatible with your browser go to the following website - [ChromeDriver](http://chromedriver.chromium.org/downloads) and download the ChromeDriver for the version of Chrome you are using.

### Running The Summary File Extractor
### Running Computation-Based Approach
-sp: The script will save the scraped data in this path.
File renamed without changes.
128 changes: 91 additions & 37 deletions src/code2graph/config/config.py
@@ -1,7 +1,11 @@
from pathlib import Path
from argparse import ArgumentParser
import os

try:
from pathlib import Path
except ImportError:
from pathlib2 import Path

class LightWeightMethodArgParser:
'''
config class argument parser used solely for lightweight method.
@@ -11,23 +15,34 @@ def __init__(self):
self.parser = ArgumentParser(
description='The parameters for the Lightweight Approach.')

# default code_path is pointed to fashion mnist example.
self.parser.add_argument('-ipt', dest='code_path',
# default code_path is pointed to fashion mnist example.
self.parser.add_argument('-ip', dest='input_path',
default='../test/fashion_mnist', type=str,
help='Path to the source code. Default: ../test/fashion_mnist')

self.parser.add_argument('-r', dest='recursive',
action='store_true',
help='Recursively apply Lightweight method on all the papers in the code path.')
self.parser.set_defaults(recursive=False)
self.parser.add_argument('-dp', '--dest_path',
default='../rdf', type=str,
help='Path to store generated triples/graphs.')
self.parser.add_argument('--ct', dest='combined_triples_only',
action='store_true',
help='Only save the combined_triples in destination path.')
self.parser.set_defaults(combined_triples_only=False)
self.parser.add_argument('-opt', dest='output_types',
metavar='N', type=int,
nargs='+', choices={1, 2, 3, 4, 5},
default={1},
help='Types of output: 1 = call graph, 2 = call trees, 3 = RDF graphs, 4 = TensorFlow sequences, 5 = Extract triples.')
nargs='+', choices={1, 2, 3, 4, 5, 6},
default={5},
help='Types of output: 1 = Call graph, 2 = Call trees, 3 = RDF graph (html format),'
'4 = TensorFlow sequences, 5 = Extract triples, 6 = RDF graph (turtle format).')
self.parser.add_argument('--arg', dest='show_arg',
action='store_true',
help='Show arguments on graph')
self.parser.set_defaults(Pshow_arg=False)
help='Show arguments on graph.')
self.parser.set_defaults(show_arg=False)
self.parser.add_argument('--url', dest='show_url',
action='store_true',
help='Show url on graph')
help='Show url on graph.')
self.parser.set_defaults(show_url=False)

def get_args(self, args):
@@ -40,8 +55,10 @@ class LightWeightMethodConfig:
'''

def __init__(self, arg):
self.code_path = Path(arg.code_path)
self.code_path = self.code_path.resolve()
self.input_path = Path(arg.input_path).resolve()
self.recursive = arg.recursive
self.dest_path = Path(arg.dest_path).resolve()
self.combined_triples_only = arg.combined_triples_only
self.output_types = arg.output_types
self.show_arg = arg.show_arg
self.show_url = arg.show_url
@@ -59,8 +76,8 @@ def __init__(self):
type=str,
help="Path to code directory.")

def get_args(self):
return self.parser.parse_args()
def get_args(self, args):
return self.parser.parse_args(args)


class GraphHandlerArgParser:
@@ -79,32 +96,69 @@ def __init__(self):
type=str,
help='directory for saved graph')

def get_args(self):
return self.parser.parse_args()
def get_args(self, args):
return self.parser.parse_args(args)

class PWCConfigArgParser:

class PaperswithcodeArgParser:
'''
Argument Parser for Paperswithcode script
Argument Parser for Paperswithcode service.
'''

def __init__(self):
self.parser = ArgumentParser(
description="The parameters for Paperswithcode script.")
default_path = Path("../")/"core"/"chromedriver"
default_path = default_path.resolve()
self.parser.add_argument('-cd', '--chromedriver',
default=default_path,
type=str,
help="Path to chromedriver.")
self.parser.add_argument('-url',
default="https://paperswithcode.com/latest",
type=str,
help="URL to Paperswithcode website")
self.parser.add_argument('-limit',
default=-1,
type=int,
help="Number of paper/code to download.")

def get_args(self):
return self.parser.parse_args()
self.parser = ArgumentParser(description="The parameters for PWC service.")

self.parser.add_argument('-cp', dest="chromedriver", default="../core/chromedriver", type=str, help='path of chromedriver.')
self.parser.add_argument('-sp', dest="save_path", default="./data", type=str, help="path of storing data.")
self.parser.add_argument('-cred', dest="cred_path", default="../config/credentials.cfg", type=str, help='Path to .cfg file with email credentials.' )

def get_args(self, args):
return self.parser.parse_args(args)


class PWCConfig:
''' Config for Paperswithcode service '''

def __init__(self, args):
self.chrome_driver_path = Path(args.chromedriver)
self.chrome_driver_path = str(self.chrome_driver_path.resolve())
self.cred_path = str(Path(args.cred_path).resolve())
self.storage_path = Path(args.save_path)
self.tot_paper_to_scrape_per_shot = 1

class GraphASTArgParser:
'''
Argument Parser for graphast script.
'''

def __init__(self):
self.parser = ArgumentParser(description='The parameters for graphast method.')

self.parser.add_argument('-ip', dest='input_path',
default='../test/fashion_mnist', type=str,
help='Path to the source code. Default: ../test/fashion_mnist')
self.parser.add_argument('-r', dest='recursive',
action='store_true',
help='Recursively apply graphast method on all papers in the input path.')
self.parser.set_defaults(recursive=False)
self.parser.add_argument('-dp', '--dest_path',
default='../graphast_output', type=str,
help='Path to save output files.')
self.parser.add_argument('-res', dest='resolution',
default='function', type=str,
help='Processing resolution of the method: function or method. Default: function.')

def get_args(self, args):
return self.parser.parse_args(args)


class GraphASTConfig:
'''
Config class for graphast method.
'''

def __init__(self, arg):
self.input_path = Path(arg.input_path).resolve()
self.recursive = arg.recursive
self.dest_path = Path(arg.dest_path).resolve()
self.resolution = arg.resolution
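
Note on the `get_args(self, args)` change in this file: passing the argument list explicitly lets the parsers be driven from a notebook or a test without touching `sys.argv`. The following is a minimal sketch of that pattern, not the project's code. `SketchLightweightArgParser` is a hypothetical stand-in that copies only a few of the flags shown in the diff (`-ip`, `-r`, `-dp`, `-opt`), and `some/repo` is a made-up input path.

```python
from argparse import ArgumentParser
from pathlib import Path


class SketchLightweightArgParser:
    """Hypothetical stand-in mirroring a subset of LightWeightMethodArgParser."""

    def __init__(self):
        self.parser = ArgumentParser(
            description='Sketch of the Lightweight Approach parameters.')
        self.parser.add_argument('-ip', dest='input_path',
                                 default='../test/fashion_mnist', type=str)
        self.parser.add_argument('-r', dest='recursive', action='store_true')
        self.parser.add_argument('-dp', '--dest_path',
                                 default='../rdf', type=str)
        self.parser.add_argument('-opt', dest='output_types',
                                 metavar='N', type=int, nargs='+',
                                 choices={1, 2, 3, 4, 5, 6}, default={5})

    def get_args(self, args):
        # An explicit list is parsed instead of sys.argv.
        return self.parser.parse_args(args)


# Parse an explicit argument list, as a notebook or test might do.
parsed = SketchLightweightArgParser().get_args(
    ['-ip', 'some/repo', '-r', '-opt', '3', '6'])

# Mirror the config classes in the diff, which resolve paths to absolute form.
input_path = Path(parsed.input_path).resolve()

# With an empty list, every option falls back to its default.
defaults = SketchLightweightArgParser().get_args([])
```

When `-opt` is omitted, `output_types` keeps the set default `{5}`; when it is given, argparse returns a list of ints, so downstream code should accept either container.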
14 changes: 14 additions & 0 deletions src/code2graph/config/credentials.cfg
@@ -0,0 +1,14 @@
[PWCScraper_email]
# modify to sender gmail address
email_address = [email protected]
# modify to sender gmail password
password = sender_password
# modify to recipient gmail address (delimited by comma)
recipients = [email protected],[email protected]

[Database]
user = code2graph
password = testing123
host = 127.0.0.1
port = 5432
database = code2graph
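
A file with this layout can be read with the standard-library `configparser` module. The sketch below is an assumption about how such a file might be consumed, not the scraper's actual loading code. `load_credentials` is a hypothetical helper, and the `example.com` addresses are placeholders because the real addresses are not visible in this view.

```python
import configparser

# Sample content following the section and key names from the diff.
# The e-mail addresses are placeholders (assumption), not the project's values.
SAMPLE_CFG = """
[PWCScraper_email]
email_address = sender@example.com
password = sender_password
recipients = first@example.com,second@example.com

[Database]
user = code2graph
password = testing123
host = 127.0.0.1
port = 5432
database = code2graph
"""


def load_credentials(text):
    """Hypothetical helper: parse the .cfg text into a plain dict."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    email = parser['PWCScraper_email']
    database = parser['Database']
    return {
        'sender': email['email_address'],
        # Recipients are comma-delimited, as the comment in the file states.
        'recipients': [r.strip() for r in email['recipients'].split(',')
                       if r.strip()],
        'db_host': database['host'],
        'db_port': database.getint('port'),
        'db_name': database['database'],
    }


creds = load_credentials(SAMPLE_CFG)
```

To read the file from disk instead, `parser.read(path)` can replace `read_string`. Since the file holds passwords, real values are better kept out of version control.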