OpenHub API Interfacing for Project Search #317
Hi @beydlern, this is what I originally sent in an e-mail a while back:
Here's an example of our project listed there: https://openhub.net/p/kaiaulu. You can also take a look at others from our project config files, like OpenSSL, etc. This is a task I know less about, so part of the issue is to assess the viability of what we want as part of the task itself. For instance, there are a few things @rnkazman would like to know when considering a project (Rick, feel free to chime in):
What you want to do for Kaiaulu is try to create one function per endpoint to begin with. For example, if you look at R/github.R in Kaiaulu, you will see that even the docs of each function tell you which endpoint it is accessing. So, start by documenting in your specification whether you can get the information above (which is displayed on the interface of OpenHub), and afterwards any other information I did not consider (or you can point me to a PDF or page with all endpoints).

Remember that github.R and jira.R both use APIs (I'd recommend github, as you are using it as part of this project and can relate), so much of the code you may need is already there for you to use as an example. Reusing code logic will also automatically help you ensure consistency. mail.R is not an API (it is what @daomcgill is working on), so I'd recommend against using it as a reference.

Depending on your findings, we may also simply add a few more endpoints to github.R to collect some of this information. However, OpenHub is preferred because they can extract info beyond GitHub itself. Let me know if you have questions.
@carlosparadis Before I can take an in-depth look at the XML-formatted data, which is the response format after I make a project request, I must register for an API key. What should I put under the application name, description, and redirect URI fields?
I'm not sure what it wants as the redirect URI, but you can put the app name as ics 496 kaiaulu. The description can be capstone class project.
From my understanding, the Ohloh API allows users with a valid API key to request XML-formatted data in response to an HTTP GET request for a project. This XML file for the specific project contains an analysis section that holds general information about the project, such as the total LOC, the main language, and the number of contributors who made at least one commit in the last year. This analysis section comes with its own unique ID, the analysis_id.
To my knowledge, as long as the OpenHub website has computed and stored these statistics in an analysis, these requests are possible with the Ohloh API. However, the analysis for the current date may be slightly inaccurate, as the OpenHub website must be given time to compute the analysis: the latest month it has computed (shown in the max_month field) may refer to an older month than our current month.
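As a minimal sketch of the request described above, assuming the httr and xml2 packages and a hypothetical OHLOH_API_KEY environment variable holding a registered key:

library(httr)
library(xml2)

# API key from a hypothetical environment variable; register on OpenHub to obtain one
api_key <- Sys.getenv("OHLOH_API_KEY")

# GET the project XML for kaiaulu (project endpoint)
response <- GET("https://www.openhub.net/p/kaiaulu.xml",
                query = list(api_key = api_key))
project_xml <- read_xml(content(response, as = "text", encoding = "UTF-8"))

# The analysis_id mentioned above lives under <project><analysis_id>
analysis_id <- xml_text(xml_find_first(project_xml, "//project/analysis_id"))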
https://github.com/blackducksoftware/ohloh_api/blob/main/README.md#xml-response-format I see, nice finding. So the reference folder basically contains, in every file, the format of the XML you will get if you go after that endpoint, is that right?
I took a quick look at the wiki and I don't see an example file. Could you try retrieving the analysis XML for kaiaulu so we could take a look? It seems some of the XML files are summary statistics coming out of this file too, so this may be all we need. Unfortunately, they delete old files:
So in that sense, the Ohloh API will never serve to comprehensively analyze a project's history, but it does conveniently offer summary statistics. I also want to remind you that our goal here is to survey "the sea of projects" for the criteria we want, rather than use Ohloh to analyze them on our behalf. For example, rather than analysis.md, what we may need is something like this: https://openhub.net/orgs/apache/projects https://github.com/blackducksoftware/ohloh_api/blob/main/reference/portfolio_projects.md That gives us a list of projects.

Another thing that would be useful is knowing which type of issue tracker a given project uses: https://openhub.net/p/apache/links You can imagine that if that was returned in the XML, we could parse the issue tracker URL for certain words to find whether it is Bugzilla, JIRA, or GitHub, and then report that in a table for the user.

I do not know if OpenHub will let us search all the projects they index, or if we can only search at most per organization. Could you check what else in OpenHub could give us a bird's-eye view of all the projects? We could still create a two-step pipeline to maybe first obtain the names of the projects via one endpoint, and then make more API calls for the analysis to obtain the detailed information, although this would be less than ideal.
@carlosparadis I was able to take a look at kaiaulu's project information. Here is the XML data for the project:

<response>
<status>success</status>
<result>
<project>
<id>760420</id>
<name>kaiaulu</name>
<url>https://openhub.net/p/kaiaulu.xml</url>
<html_url>https://openhub.net/p/kaiaulu</html_url>
<created_at>2021-09-27T02:32:26Z</created_at>
<updated_at>2024-10-14T05:19:13Z</updated_at>
<description>A data model for Software Engineering data analysis</description>
<homepage_url>http://itm0.shidler.hawaii.edu/kaiaulu</homepage_url>
<download_url>https://github.com/sailuh/kaiaulu</download_url>
<url_name>kaiaulu</url_name>
<vanity_url>kaiaulu</vanity_url>
<medium_logo_url>https://s3.amazonaws.com/cloud.ohloh.net/attachments/94361/logo_med.png</medium_logo_url>
<small_logo_url>https://s3.amazonaws.com/cloud.ohloh.net/attachments/94361/logo_small.png</small_logo_url>
<user_count>0</user_count>
<average_rating/>
<rating_count>0</rating_count>
<review_count>0</review_count>
<analysis_id>207699501</analysis_id>
<tags>
<tag>code_analysis</tag>
<tag>codemanagement</tag>
<tag>mining-software-repositories</tag>
<tag>socialnetwork</tag>
<tag>softwareengineering</tag>
<tag>static_analysis</tag>
</tags>
<analysis>
<id>207699501</id>
<url>https://openhub.net/p/kaiaulu/analyses/207699501.xml</url>
<project_id>760420</project_id>
<updated_at>2024-10-14T05:19:13Z</updated_at>
<oldest_code_set_time>2024-10-13T17:03:09Z</oldest_code_set_time>
<min_month>2020-05-01</min_month>
<max_month>2024-10-01</max_month>
<twelve_month_contributor_count>6</twelve_month_contributor_count>
<total_contributor_count>14</total_contributor_count>
<twelve_month_commit_count>18</twelve_month_commit_count>
<total_commit_count>186</total_commit_count>
<total_code_lines>5085</total_code_lines>
<factoids>
<factoid type="FactoidCommentsVeryHigh">
Very well-commented source code </factoid>
<factoid type="FactoidAgeOld">
Well-established codebase </factoid>
<factoid type="FactoidTeamSizeAverage">
Average size development team </factoid>
<factoid type="FactoidActivityDecreasing">
Decreasing Y-O-Y development activity </factoid>
</factoids>
<languages graph_url="https://openhub.net/p/kaiaulu/analyses/207699501/languages.png">
<language percentage="100" color="198CE7" id="65">
R </language>
</languages>
<main_language_id>65</main_language_id>
<main_language_name>R</main_language_name>
</analysis>
<similar_projects>
<project>
<id>360</id>
<name>FindBugs</name>
<vanity_url>findbugs</vanity_url>
</project>
<project>
<id>712198</id>
<name>Prospector (Python)</name>
<vanity_url>landscapeio-prospector</vanity_url>
</project>
<project>
<id>733309</id>
<name>SpotBugs</name>
<vanity_url>spotbugs</vanity_url>
</project>
<project>
<id>1865</id>
<name>GNU cflow</name>
<vanity_url>cflow</vanity_url>
</project>
</similar_projects>
<licenses>
</licenses>
<project_activity_index>
<value>30</value>
<description>Very Low</description>
</project_activity_index>
</project>
</result>
</response>

On the topic of searching for projects, we are able to write a query to request a set of projects filtered through some specification. For the issue trackers inquiry, if the project page has a links collection, it looks like this:

<links>
<link>
<title>Current Release Docs</title>
<url>http://httpd.apache.org/docs/current/</url>
<category>Documentation</category>
</link>
<link>
<title>Next release "coming soon" docs</title>
<url>http://httpd.apache.org/docs/trunk/</url>
<category>Documentation</category>
</link>
<link>
<title>Apache Bugzilla</title>
<url>https://issues.apache.org/bugzilla/</url>
<category>Issue Trackers</category>
</link>
<link>
<title>Bugzilla Search</title>
<url>https://issues.apache.org/bugzilla/query.cgi</url>
<category>Issue Trackers</category>
</link>
</links>

We could parse through these links to see which issue trackers the project is using.
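As a hedged sketch of that parsing idea, assuming the links XML above has already been read into project_xml with xml2, and using data.table:

library(xml2)
library(data.table)

links <- xml_find_all(project_xml, "//links/link")
link_table <- data.table(
  title = xml_text(xml_find_first(links, "title")),
  url = xml_text(xml_find_first(links, "url")),
  category = xml_text(xml_find_first(links, "category"))
)

# Keep only issue tracker links, then guess the tracker from URL keywords
tracker_urls <- link_table[category == "Issue Trackers"][["url"]]
tracker <- fcase(
  any(grepl("bugzilla", tracker_urls, ignore.case = TRUE)), "Bugzilla",
  any(grepl("jira", tracker_urls, ignore.case = TRUE)), "JIRA",
  any(grepl("github", tracker_urls, ignore.case = TRUE)), "GitHub",
  default = "unknown"
)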
This is going in the right direction, thank you for the additional information! In regards to what you said:
I looked at the URL and saw:
Could you check exactly what we can query for? It seems we can query by language across all of OpenHub; if so, this is already a great start. It is not the end of the world to do follow-up checks on other API endpoints if that is the only way forward. However, this then begs the question: let's say that our query returns 300 or so projects. For every project, in order for us to find whether or not the project uses JIRA, and also the other information I said above (n contributors, LOC, etc.), how many API calls will that require per project? Also, what was the limit again on API calls? And what was the time period? (Does it reset per day?)
For project queries, the project reference documentation states:
I believe that we can query for anything; to be specific, the query string acts as a search pattern, and the Ohloh API searches through every tag to check if the query string is contained. When the Ohloh API returns the XML data for a list of projects (if the query returns projects), it returns a maximum of ten per page (found through personal testing), and it also lists the total number of items (projects) available. An example of querying a list of projects with a query string:

<status>success</status>
<items_returned>10</items_returned>
<items_available>80</items_available>
<first_item_position>0</first_item_position>

According to the documentation:
The next set of ten projects is listed on the next pages, where we may do simple arithmetic with the items_available and items_returned fields to determine the number of pages.
In this case, with ten projects listed per page, it would take 330 API calls: 30 API calls to page through the projects (there are 30 pages of projects at ten projects displayed per page), plus one additional API call per project using its analysis_id to retrieve its analysis.
The number of API calls a user can make per API key is 1000, and this resets every 24 hours.
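The arithmetic above, spelled out (the 300-project figure is the hypothetical from the earlier comment):

items_available <- 300  # projects matched by the query (hypothetical)
items_per_page <- 10    # observed maximum items returned per page
page_calls <- ceiling(items_available / items_per_page)  # 30 calls to list all pages
analysis_calls <- items_available                        # 1 analysis call per project
total_calls <- page_calls + analysis_calls               # 330, under the 1000/day limit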
When you say we could query anything, would we be able to create, for example, a query that asks for LOC >= 50k? And would we be able to add "and" conditions, e.g. LOC > 50k & n.contributors >= 20? I am equally curious whether we can also add language to the query. If you could share on the shared drive and e-mail me the URL of the longer version of the file in a single page, it would help me understand a bit further. I am still slightly confused about the query for all projects.
@carlosparadis There is no extra functionality with the query collection request parameter, so there is no Boolean logic or mathematical relationships. To clarify, when searching through projects, the query command takes a query string, which is just an alphanumeric string:
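For concreteness, such a project query is a single request parameter appended to the projects endpoint described in the reference documentation (the api_key value is elided here):

https://www.openhub.net/projects.xml?query=bugzilla&api_key=<key>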
@beydlern I looked at the XML you sent, thanks! Did you query the name property? Because I see only "bugzilla"-named projects in it. Would you be able to send me something where the query is a project that is written in Java? I think at the very least you should start with the organization one: https://github.com/blackducksoftware/ohloh_api/blob/main/reference/organization.md and try the Apache Software Foundation. The analysis endpoint also seems promising. For the pagination, you can take a look at the GitHub and JIRA downloaders; I believe both implement similar logic. Might as well reuse it for consistency. If you could send me an example of both XML files, that would be great. Just remember what type of information we are after in our search, and consider how we can get there via endpoints.
It looks like we can't query the name tag/property specifically, or query a search for a pattern in any one tag specifically. External code (in config.R) may be needed to complement the "bugzilla" query search by looking at each returned project's name field.
Example: to get a project whose primary language is Java, starting with a given organization, "Apache Software Foundation", I request the organization's XML data, viewing its portfolio projects, to get the detailed_page_url:
...
<detailed_page_url>/orgs/apache/projects</detailed_page_url>
...
Another request for this new page URL's XML data will give us a paginated list of portfolio projects belonging to the organization (a sample with one project returned):

<response>
<status>success</status>
<items_returned>20</items_returned>
<items_available>320</items_available>
<first_item_position>0</first_item_position>
<result>
<portfolio_projects>
<project>
<name>Apache Tomcat</name>
<activity>High </activity>
<primary_language>java</primary_language>
<i_use_this>1684</i_use_this>
<community_rating>4.2</community_rating>
<twelve_mo_activity_and_year_on_year_change>
<commits>1059</commits>
<change_in_commits>-35</change_in_commits>
<percentage_change_in_commits>3</percentage_change_in_commits>
<contributors>24</contributors>
<change_in_contributors>-14</change_in_contributors>
<percentage_change_in_committers>36</percentage_change_in_committers>
</twelve_mo_activity_and_year_on_year_change>
</project>
...
</portfolio_projects>
</result>
</response>

The portfolio projects entity doesn't allow collection request commands, such as queries or sorting, so external code (in config.R) is needed to read each project to find the desired information. (There is no external code yet; this is just an example, and the pagination logic needed for this is found in the GitHub and JIRA downloaders.) In this example, the external code would cycle through each page to find a project that contains the string "Java" or "java" in its primary_language field. Requesting the project endpoint with that project's name as the query then returns:

<response>
<status>success</status>
<items_returned>10</items_returned>
<items_available>38</items_available>
<first_item_position>0</first_item_position>
<result>
<project>
<id>3562</id>
<name>Apache Tomcat</name>
<url>https://openhub.net/p/tomcat.xml</url>
<html_url>https://openhub.net/p/tomcat</html_url>
<created_at>2006-11-12T20:40:37Z</created_at>
<updated_at>2024-10-20T08:17:44Z</updated_at>
<description>The Apache Tomcat software is an open source implementation of the Java Servlet, JavaServer Pages, Java Expression Language and Java WebSocket technologies.</description>
<homepage_url>http://tomcat.apache.org/</homepage_url>
<download_url>http://tomcat.apache.org/download-60.cgi</download_url>
<url_name>tomcat</url_name>
<vanity_url>tomcat</vanity_url>
<medium_logo_url>https://s3.amazonaws.com/cloud.ohloh.net/attachments/831/tomcat_med.png</medium_logo_url>
<small_logo_url>https://s3.amazonaws.com/cloud.ohloh.net/attachments/831/tomcat_small.png</small_logo_url>
<user_count>1684</user_count>
<average_rating>4.23101</average_rating>
<rating_count>316</rating_count>
<review_count>4</review_count>
<analysis_id>208382336</analysis_id>
<tags>
...
</tags>
<similar_projects>
...
</similar_projects>
<licenses>
...
</licenses>
<project_activity_index>
...
</project_activity_index>
<links>
...
</links>
</project>
...
</result>
</response>

Every project name is unique in OpenHub, so once we find a matching project, we can use its analysis_id to request its latest analysis collection:

<response>
<status>success</status>
<result>
<analysis>
<id>208382336</id>
<url>https://openhub.net/p/tomcat/analyses/208382336.xml</url>
<project_id>3562</project_id>
<updated_at>2024-10-20T08:17:44Z</updated_at>
<oldest_code_set_time>2024-10-19T15:16:40Z</oldest_code_set_time>
<min_month>2006-03-01</min_month>
<max_month>2024-10-01</max_month>
<twelve_month_contributor_count>24</twelve_month_contributor_count>
<total_contributor_count>181</total_contributor_count>
<twelve_month_commit_count>1059</twelve_month_commit_count>
<total_commit_count>26695</total_commit_count>
<total_code_lines>474323</total_code_lines>
<factoids>
<factoid type="FactoidAgeVeryOld">
Mature, well-established codebase </factoid>
<factoid type="FactoidTeamSizeLarge">
Large, active development team </factoid>
<factoid type="FactoidCommentsAverage">
Average number of code comments </factoid>
<factoid type="FactoidActivityStable">
Stable Y-O-Y development activity </factoid>
</factoids>
<languages graph_url="https://openhub.net/p/tomcat/analyses/208382336/languages.png">
<language percentage="82" color="9A63AD" id="5">
Java </language>
<language percentage="9" color="555555" id="3">
XML </language>
<language percentage="7" color="556677" id="35">
XML Schema </language>
<language percentage="2" color="000000" id="">
10 Other </language>
</languages>
<main_language_id>5</main_language_id>
<main_language_name>Java</main_language_name>
</analysis>
</result>
</response>

This path from endpoint to endpoint allows us to get all the relevant information on a project in at least 4 API calls. This example focused on requesting a project in a specified organization whose primary language is Java. How is this approach?
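A hedged sketch of those four calls using httr/xml2 directly (the endpoint URLs follow the reference .md files and the XML shown above; the get_xml helper and the key handling are illustrative, not Kaiaulu functions):

library(httr)
library(xml2)

# Illustrative helper: GET an endpoint and parse the XML response
get_xml <- function(url, query = list()) {
  query$api_key <- Sys.getenv("OHLOH_API_KEY")
  read_xml(content(GET(url, query = query), as = "text", encoding = "UTF-8"))
}

# Call 1: organization endpoint, gives detailed_page_url
org <- get_xml("https://www.openhub.net/orgs/apache.xml")

# Call 2 (+ pagination): portfolio projects, scan primary_language for "java"
portfolio <- get_xml("https://www.openhub.net/orgs/apache/projects.xml",
                     query = list(page = 1))

# Call 3: project endpoint queried by the (unique) project name
projects <- get_xml("https://www.openhub.net/projects.xml",
                    query = list(query = "Apache Tomcat"))
url_name <- xml_text(xml_find_first(projects, "//project/url_name"))
analysis_id <- xml_text(xml_find_first(projects, "//project/analysis_id"))

# Call 4: analysis endpoint, gives LOC, contributor counts, main language
analysis <- get_xml(sprintf("https://www.openhub.net/p/%s/analyses/%s.xml",
                            url_name, analysis_id))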
I am not clear on which endpoints you are actually using in your responses above. Could you edit the message to make that clearer (pointing to the .md documentation), and then post a comment to let me know you edited?
@carlosparadis
Thanks, this makes a lot more sense to me now.
Is there no way to go from the project name in the portfolio straight into its analysis instead? Having to do a global search for the project seems redundant. Also:
Do we need to perform this global search? Can't we use the name of the project from the organization search on the project.md endpoint to retrieve it? https://github.com/blackducksoftware/ohloh_api/blob/main/reference/project.md I guess one thing I am still confused about is what in the .md tells you whether you can query or not. Is it just doing a ctrl+f on the entire output against whatever you query? I don't understand yet how you are specifying a tag.

All things considered, you can go ahead and start the function for the organization search, and include a parameter where you can specify the language we are looking for. The notebook can exemplify Apache, since that is often studied. Remember to update the specification too on the first issue. Do let me know on the two questions in this comment so we can sort out the final path here, but at least that lets you start going on the code. I'd recommend using R/github.R as a reference on how to do the pagination. Try to reuse the code as much as possible so it stays consistent with everything else. Thanks!
To get the latest analysis collection for a project (the latest analysis is the current best analysis collection for a single project), you need its analysis_id, taken from the project's XML data.
To access a specific project's collection, we need to specify its id.
The project collection's documentation for the supported query request states:
Following this and my experience with the query command, it seems to just do a ctrl+f on the entire output. I'm not specifying a specific tag to query; I'm planning to use external code that will aid in paginating and will "actually query" for the tag that I am looking for. The query command they use is helpful to narrow down a list of matches, but my external code (which I will write) will use this narrowed list to look for the tag and information that I am interested in. I'll get started on the code!
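In other words, the narrowing the API query cannot do happens locally. A sketch, assuming one page of results has already been parsed into a data.table named project_table with one column per tag:

library(data.table)

# The API query was a ctrl+f over the whole XML; filter the exact tag here
bugzilla_named <- project_table[grepl("bugzilla", name, ignore.case = TRUE)]
java_projects <- project_table[main_language_name == "Java"]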
Sounds good, thank you for confirming the strange ctrl+f mechanism it uses. I would 100% document this in the function that deals with this endpoint. Your last comment also has much of what I think should be in the notebook, since you explain why you are calling the functions in that order. Please remember to update your issue specification with the function signatures, and a bit of the summarized rationale of your comment above (I would also put a reference to this particular comment, since that was the one that did the trick in helping me make sense of it).

Remember, the questions I asked here are likely some of the questions someone reading the notebook will have, so you can use them as guidance on what to put in the notebook (if it addresses all my questions, then it is off to a good start!). It should also help you consider what should go in the function documentation. In one of the calls we can consider what else OpenHub offers, but for now at least we can search per organization, as I expect Apache will be heavily utilized due to often using JIRA as the issue tracker (which in turn means bugs are documented). Nice work!
- The functions openhub_api_analyses, openhub_api_iterate_pages, openhub_api_organizations, openhub_api_portfolio_projects, openhub_api_projects, openhub_parse_analyses, openhub_parse_organizations, openhub_parse_portfolio_projects, and openhub_parse_projects were implemented (WITHOUT DOCUMENTATION).
- A notebook, openhub_project_search, demonstrating the capability of interacting with the OpenHub project database through the Ohloh API to search for projects has been implemented.
- Documentation builder functions incorrectly updated RoxygenNote version, this commit reverts this change.
- Added function documentation to all openhub API interfacing functions.
- openhub_project_search.Rmd had documentation added to describe the project gathering process.
- R/config.R: removed and commented out print statements for cleaner output from the openhub_* functions.
@carlosparadis If a configuration file is to be used, appropriate getter functions may be needed. Let me know if this is something I should also do, and on which issue I should do it.
I think the first thing is discussing with me what the config should look like. The configs, I think, already have an openhub section (or maybe not)? If so, we need to revise that first. All the config file formats should be updated accordingly, which, this time around, will only affect the get() functions being added or edited, rather than the notebooks, right? I am hoping the m1 merge for that will get done tomorrow. Sorry it has taken so long.
@beydlern mentioned I should chime in on the formatting, so here is my suggestion.
Suggested new format:
These are my suggestions; I'm not sure if Nick would want to use them as is, or if he'll be making further modifications to suit the purpose of the new fields.
This comment is a good learning opportunity: not everything that is a parameter in the notebook goes in the config file. We have to account for what the "granularity" of the config is (which is a project), and also whether said information can be obtained automatically or must be typed manually (which is more human-time-consuming). The openhub_url, for example, is consistent with the project granularity, and it is something you have to specify as a starting point. The language, however, is something we could ask Kaiaulu to get from the OpenHub API, right? I would suspect the organization name is something we can also infer.

Some endpoints that we care a lot about are inconsistent with the granularity of a project configuration file, for example, the organization being Apache. This means the parameter will be hardcoded in the notebook (in a code block right below the config loads so it is easier for people to find). Maybe in the future, if Kaiaulu supports various organization-level analyses, we could consider a section for the config, or an entirely new config.

There is a trade-off in this decision on the exec scripts, since they generally should take as input the config file rather than additional parameters. Which is to say, I expect this notebook to not have an exec/. In this case at least this somewhat makes sense, since, contrary to a downloader that we want to throw on a server to download data, it strikes me as a bit odd that someone would be using the OpenHub API to do that. The OpenHub API in Kaiaulu's case is to help us select the project, but nothing beyond that. Let me know if this makes sense.
That makes sense. I take it this would be similar to how Dao previously removed the month_year parameter for her mailing list from the config.
To be clear: I am agreeing with your organization here: #317 (comment) Perhaps what is missing is:

openhub:
website: https://www.openhub.net/p/ambari
organization_folder_path: ../../rawdata/openhub/studyname/organization/
portfolio_project_folder_path: ../../rawdata/openhub/studyname/portfolio/
project_folder_path: ../../rawdata/openhub/studyname/project/
analysis_folder_path: ../../rawdata/openhub/studyname/analysis/

I added a folder path for each endpoint. Now, we discussed on call that the openhub section should go in a config file, but in hindsight, it can't go in a project configuration file, as it is not project specific. It also means @crepesAlot should not handle openhub folder creation, since that function is to initialize a project-specific folder organization, of which this code logic is not.

Ultimately, this is uncharted territory in Kaiaulu. Everything up to this point was project specific, not "study" specific. My final suggestion is thus that you simply leave a code block at the start of the notebook that acknowledges the rationale said here: the notebook is unique among its peers in that it concerns itself not with one project but with their selection, and thus it does not use the project configuration file architecture nor the project initialization. The functions to create the folders are thus hardcoded in the notebook itself. Which I think is OK for now. If this module evolves, as Kaiaulu evolved over time, we can refactor that notebook into something else, as we did with config.R this session. Let me know if anything is not clear. The main difference is just hardcoding the folder paths in the notebook itself.
Should the function that creates the folders be defined in the notebook itself rather than in config.R? For clarification on the refresh mechanic: in this case, I'm planning on deleting the old response files in the specific folder once a new download of response files is going to occur in that selected folder.
Yes, I would just hardcode the path. Do check the R/io.R functions, as Kaiaulu should already have functions to create folders, so don't re-create that. You can see how they are used in R/example.R, I believe, or if not, in the unit tests that use example.R.

I think it is OK if you just save the file, but do so with a unix timestamp to avoid conflicts. Deletion is dangerous since it may catch people off guard and wipe their files. We just need to make sure we only read the files of that particular unix timestamp when reading back into the notebook. You can do that by using the native R function list.files (I think that is what it is called) over the folder, then matching the paths associated with the timestamp of interest. Confirm with me what the file names will look like. This may look like a refresh on the surface, but it is not, since we do not guarantee the downloaded files are cumulative: they are only cumulative among the ones obtained in a given API request (+ pagination).
The file names will take the form:

With this method, how should my notebook handle the user running the notebook multiple times in a day? Duplicate entries are bound to occur, so I suggest that I filter the resultant data tables from each endpoint for uniqueness (for example, names of projects are unique, so the output data table would be sanitized by the name column).

Another simpler way (let me know if this is correct):
Downloader logic should be simple: get the raw data, get some useful identifying information about the raw data, and store it in the file. Parser logic should be simple: convert JSON, XML, or whatever it is to a table. Let the notebook handle the rest. Adding assumptions to the parser complicates assumptions made when users make use of the API.

Your file name depends on what the endpoint is doing; it can generally just be endpointname_someuniqueinformation_unixtimestamp.xml (I think you said it was XML?). A unix timestamp is not YYMMDD; it captures, in a single number, date and time down to the second, so there is no way there will be file overlap. The associated parser function can assume that file is the input and will parse it. The user can worry about which of the files they want to use and code the logic around it.

The someuniqueinformation is dependent on the data. For example, the endpoint that gets projects under an organization can have someuniqueinformation be the organization the projects are from, plus a page number. The unix timestamp would likely need to be when the request for project information was made. This file name does not make sense if your endpoint is to get a single project's information; then it could be endpointname_projectname_page_unixtimestamp.

Try to give me each endpoint, what it does in summary, and your suggestion of a file name, depending on what makes sense for someone to see in the file. Seeing the unix timestamp when you run for projects under an organization helps someone know whether they ran the query recently, and the organization name helps them know at a glance which org they queried, for example. What would be most helpful to see in a file name for each endpoint?
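A sketch of that naming scheme and of reading one batch back (the folder path and endpoint name are illustrative only):

save_folder <- "../../rawdata/openhub/organization/"
org_name <- "apache"
page <- 1
request_time <- as.integer(Sys.time())  # unix timestamp: a single number

file_name <- sprintf("organization_%s_p%d_%d.xml", org_name, page, request_time)
# e.g. "organization_apache_p1_1729500000.xml"

# Later, read back only the files from that request batch:
batch_files <- list.files(save_folder,
                          pattern = paste0("_", request_time, "\\.xml$"),
                          full.names = TRUE)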
organization endpoint:
portfolio project endpoint:
Carlos Edit: organization_name here is for example

project endpoint:
analysis endpoint:
Carlos Edit: the updated_at parameter has to be parsed and used as the unix timestamp here! In the function docs you must clarify that this timestamp behavior captures the created_at, which is very different from all the other unix timestamps.
Organization makes sense. I can't make sense of portfolio project or project endpoint. There are too many projects, and the endpoints are all inter-used in some field or column which, being unable to see right now, makes this hard to understand. Do you have some time today for a quick call so we can go over this? Delaying to tomorrow is probably going to leave you with little time to address this.

In an ideal world, every endpoint maps to something sane: an issue, a commit, a project, a person, etc. Here my conceptual understanding is just organization with projects, then portfolio project (whatever that is), then project again. Analysis is also a very generic name. I understand these names are not ones you chose, but this is what I am struggling with in deciding what best to represent in the file name. If you are unable to meet, I suggest editing the comment above and adding a sample of the first 2 rows of every table, so when you say "I used this column to call this other endpoint" I can make sense of it.

Also, note a unix timestamp is just a single number, so the file format is more like _dddddddd.xml instead of YYMMDDHHMMSS.xml. You need to call a function to convert to and from a unix timestamp. Please edit the file names above accordingly.
@carlosparadis Yes, we can meet. I sent you an email.
- The current working version of the notebook and functions to search for projects (downloaders and parsers are not yet generalized and still contain overloaded functionality, which is to be resolved and implemented in the notebook, openhub_project_search.Rmd).
- In config.R, the organization downloader and parser as well as the portfolio project downloader and parser were refactored to become more generalized and work with the iterator function, openhub_api_iterate_pages().
- In config.R, the project downloader and parser as well as the analysis downloader and parser were refactored to become more generalized and work with the iterator function, openhub_api_iterate_pages().
- config.R: the openhub interfacing functions had their documentation updated.
- openhub_project_search.Rmd: revised the notebook with up-to-date information.
The analysis endpoint parser function is still overloaded for the

As for the

Other than these two outliers, the revisions you requested have been made.
Can you elaborate on this? I'm trying to remember what we discussed as far as the overload is concerned. I think you are referring to my request to not add hardcoded link extraction and instead return the links to the user. What is the code line this refers to? (You can paste the copied permalink; hold shift to pick the region to post here so it points me to it.)
Yes, return every link available, and then move your code logic that checks for specifics (mailing list, etc.) into the notebook, where you can explain in English right before it what you are doing and why. It is useful to have, but I want to make sure it is at the notebook level so that if, in the future, others are looking for anything else, the function will make it available to them. Can you give me an example of what the table before and after would look like? You can mock up a table with markdown here too.
Here is the permalink to the analysis section of code, lines 349 to 358 in e439c12:
Using Apache Tomcat for the first row of both tables, plus a fictional project without links: the table before has a column for the mailing list URL only, extracted with the hardcoded regex.
The table after will have the project_links column containing every available link for the project (or "N/A" for the fictional project without links).
I'm not sure how to accurately place newline characters or line-break elements to separate the different project links so they display nicely in the data table using the gt() package, so for now they don't wrap and are separated by commas.
- Modified openhub_parse_projects() overloaded functionality (using regex pattern to search for mailing list) to simply extract the list of project links (if available).
Hmm, the proper way to do this is to get the project_links, use stringi::stri_split, and make a separate table where every row is of the form "project_name", "project_link". Your function would return a named list of two data.tables: the first would be one row per project with the information you already have, and the second would be this one where every row is a project link. Anyone could then merge both or do as they prefer. The Raven code for Scitools, which I just merged, does this for the transform_ function, where the function returns node_list and edge_list.
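A hedged mockup of that return shape (the function name is illustrative; the input is assumed to be the single data.table the parser currently builds, with comma-separated links in project_links):

library(data.table)
library(stringi)

split_project_links <- function(parsed_projects) {
  # One row per link: repeat each project name once per comma-separated link
  links <- stri_split_fixed(parsed_projects$project_links, ",")
  project_links <- data.table(
    name = rep(parsed_projects$name, lengths(links)),
    project_link = trimws(unlist(links))
  )
  # Named list, mirroring the node_list/edge_list convention mentioned above
  list(projects = parsed_projects[, !"project_links"],
       project_links = project_links)
}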
- The openhub_parse_projects() function returns two tables, one about the project's specific data and the other about the project's external links. The notebook, openhub_project_search.Rmd, has been updated to reflect these changes.
- Incorrect table was being merged, fixed in notebook.
@carlosparadis
Thanks! Can you give me some screenshots of the output, or paste them here, just to make sure we are on the same page?
@carlosparadis
@beydlern this is perfect. I'd suggest adding the "id" column to the project_links table too.
- openhub_parse_projects(): the project_links data table had a field (column) added, the project ID. The notebook, openhub_project_search.Rmd, has been updated to reflect these changes.
@carlosparadis
@beydlern looks good! Is it ready for review? Send me a code review request when ready.
@carlosparadis
- All description tags for function documentation were removed in config.R.
- Data tables are displayed using pipe operators with the gt() and head() functions in openhub_project_search.Rmd.
- Parser sections have their "eval = FALSE" removed from their code blocks in openhub_project_search.Rmd.
- Removed the "invisible()" function from all functions in config.R.
- Added OpenHub API interfacing functions to _pkgdown.yml.
- Added missing parameter function documentation to the openhub_download function in config.R.
- Removed debug print statements from openhub_api_iterate_pages in config.R.
1. Purpose

OpenHub is a website that indexes open-source projects with their respective project information (i.e. lines of code, contributors, etc.). The purpose of this task is to extend R/config.R to host a collection of functions that interface with OpenHub's API, Ohloh, to help facilitate locating open-source projects for analysis.

2. Process

Create a collection of functions implemented in R/config.R, where each function will grab one endpoint (an item of project information, such as the number of lines of code). Create a notebook to demonstrate how to use these R/config.R Ohloh API interfacing functions to request information on an open-source project on OpenHub.

Checklist for Extractable Project Information

- name: The name of the project.
- id: The project's unique ID on OpenHub.
- primary_language: The primary code language used by the project.
- activity: The project's activity level (Very Low, Low, Moderate, High, or Very High).
- html_url: The project's URL on OpenHub.
- project_links: The project's links, which may contain the mailing list URL (may be "N/A" if no project links are found; checking the project's html_url to verify is advised).
- min_month: OpenHub's first recorded year and month of the project's data (typically the date of the project's first commit, YYYY-MM format).
- twelve_month_contributor_count: The number of contributors who made at least one commit to the project source code in the past twelve months.
- total_contributor_count: The total number of contributors who made at least one commit to the project source code since the project's inception.
- twelve_month_commit_count: The total number of commits to the project source code in the past twelve months.
- total_commit_count: The total number of commits to the project source code since the project's inception.
- total_code_lines: The most recent total count of all source code lines.
- code_languages: A language breakdown with percentages for each substantial contributing language in the project's source code (less-contributing languages are grouped and renamed "Other", as determined by OpenHub).

Example Endpoint Pathing

This specific comment in this issue thread details the endpoint pathing process to look for a specific project's analysis collection under an organization's portfolio projects, specified by project name (project names are unique in OpenHub).

3. Task List

- Implement the Ohloh API interfacing functions in R/config.R.
- Create a notebook, vignettes/openhub_project_search.Rmd, to demonstrate how to use the new R/config.R functions that interface with the Ohloh API to extract useful information about a project and search through OpenHub's database of projects for a project based on a set of filters for analysis.

Function Signatures