
OpenHub API Interfacing for Project Search #317

Open
29 tasks done
beydlern opened this issue Oct 11, 2024 · 65 comments · May be fixed by #325
beydlern commented Oct 11, 2024

1. Purpose

OpenHub is a website that indexes open-source projects along with their project information (e.g. lines of code, contributors). The purpose of this task is to extend R/config.R to host a collection of functions that interface with OpenHub's API, Ohloh, to help facilitate locating open-source projects for analysis.

2. Process

Create a collection of functions in R/config.R, where each function grabs one endpoint (an item of project information, such as the number of lines of code). Then create a notebook demonstrating how to use these R/config.R Ohloh API interfacing functions to request information about an open-source project on OpenHub.

Checklist for Extractable Project Information

  • name: The name of the project.
  • id: The project's unique ID on OpenHub.
  • primary_language: The primary code language used by the project.
  • activity: The project's activity level (Very Low, Low, Moderate, High, and Very High).
  • html_url: The project's URL on OpenHub.
  • project_links: The project's links, which may include the mailing list URL ("N/A" if no project links are found; checking the project's html_url to verify is advised).
  • min_month: OpenHub's first recorded year and month of the project's data (typically the date of the project's first commit, YYYY-MM format).
  • twelve_month_contributor_count: The number of contributors who made at least one commit to the project source code in the past twelve months.
  • total_contributor_count: The total number of contributors who made at least one commit to the project source code since the project's inception.
  • twelve_month_commit_count: The total number of commits to the project source code in the past twelve months.
  • total_commit_count: The total number of commits to the project source code since the project's inception.
  • total_code_lines: The most recent total count of all source code lines.
  • code_languages: A language breakdown with percentages for each substantial contributing language in the project's source code (as determined by OpenHub; lesser-contributing languages are grouped and renamed as "Other").

Example Endpoint Pathing

This specific comment in this issue thread details the endpoint pathing process for looking up a specific project's analysis collection under an organization's portfolio projects, specified by project name (project names are unique in OpenHub).

3. Task List

  • Apply for an API key for Ohloh API.
  • Understand how to form a request to Ohloh API.
  • Understand the response XML format after an HTTP GET request.
  • Create the interfacing functions to extract the project information listed above and to search for projects; implement these functions in R/config.R.
  • Create a notebook, vignettes/openhub_project_search.Rmd, demonstrating how to use the new R/config.R functions that interface with the Ohloh API to extract useful information about a project and to search OpenHub's database of projects, based on a set of filters, for a project to analyze.

Function Signatures

  • openhub_api_organizations()
  • openhub_api_portfolio_projects()
  • openhub_api_projects()
  • openhub_api_analyses()
  • openhub_parse_organizations()
  • openhub_parse_portfolio_projects()
  • openhub_parse_projects()
  • openhub_parse_analyses()
  • openhub_download()
  • openhub_retrieve()
  • openhub_api_iterate_pages()
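As a rough illustration of the shape these functions might take, a downloader could wrap a single endpoint behind httr (a minimal sketch; the parameter names and body are assumptions, not the final Kaiaulu implementation):

# Hypothetical downloader sketch for the project collection endpoint,
# https://www.openhub.net/projects.xml. Assumes the httr package;
# NULL query entries are dropped by httr automatically.
openhub_api_projects <- function(token, query = NULL, page = 1) {
  httr::GET(
    "https://www.openhub.net/projects.xml",
    query = list(api_key = token, query = query, page = page)
  )
}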
beydlern self-assigned this Oct 11, 2024
@carlosparadis (Member) commented:

Hi @beydlern,

This is what I originally sent in an e-mail a while back:

Add module to interface with OpenHub API to facilitate locating open source projects for studies. API details here. May complement with extracting information from GitHub hosted projects.

Here's an example of our project listed there: https://openhub.net/p/kaiaulu you can also take a look at others from our project config files like OpenSSL, etc.

This is a task I know less about, so part of the issue is to assess the viability of what we want as part of the task itself. For instance, there are a few things @rnkazman would like to know when considering a project (Rick, feel free to chime in):

  • LOC on current date (so we know the size of the project)
  • Commits per month over the last year (or any time range available): So we know the project is still alive
  • Contributors per month: Good to contrast with LOC to know if this is a one person project
  • Language (so we know the language of the project)

What you want to do for Kaiaulu is try to create one function per endpoint to begin with. For example, if you look at R/github.R in Kaiaulu, you will see that even the function docs tell you which endpoint is being accessed. So start by documenting in your specification whether you can get the information above (which is displayed on OpenHub's interface), and afterwards any other information I did not consider (or point me to a PDF or page with all endpoints).

Remember that github.R and jira.R both use APIs (I'd recommend github.R as a reference, since you are using it as part of this project and can relate), so much of the code you may need is already there to use as an example. Reusing code logic will also automatically help ensure consistency. mail.R (what @daomcgill is working on) does not use an API, so I'd recommend against using it as a reference.


Depending on your findings, we may also simply add a few more endpoints to github.R to collect some of this information. However, OpenHub is preferred because they can extract info beyond GitHub itself. Let me know if you have questions.


beydlern commented Oct 15, 2024

@carlosparadis Before I can take an in-depth look at the XML-formatted data, which is the response format after I make a project request, I must register for an API key. What should I put under the Application Name, Redirect URI, and Description sections of the API key request application?

@carlosparadis (Member) commented:

I'm not sure what it wants as the redirect URI, but you can put the app name as ics 496 kaiaulu. The description can be capstone class project.

@beydlern (Collaborator, Author) commented:

@carlosparadis

. . . whether you can get the information above (which is displayed on OpenHub's interface), and afterwards any other information I did not consider (or point me to a PDF or page with all endpoints).

From my understanding, the Ohloh API allows users with a valid API key to request XML-formatted data in response to an HTTP GET request for a project. This XML file for the specific project contains an analysis section that holds general information about the project, such as the total LOC, the main language, and the number of contributors who made at least one commit in the last year. This analysis section comes with a unique ID, id, that may be used to locate its children, size_facts and activity_facts. The size_facts statistics provide monthly running totals (months are shown explicitly in YYYY-MM-DD format) of LOC, commits, and developer effort, expressed through cumulative total months of effort by all contributors on the project (man_months), and the activity_facts statistics provide the changes in LOC, commits, and contributors per month (also shown in YYYY-MM-DD format).

  • LOC on current date (so we know the size of the project)
  • Commits per month over the last year (or any time range available): So we know the project is still alive
  • Contributors per month: Good to contrast with LOC to know if this is a one person project
  • Language (so we know the language of the project)

To my knowledge, as long as the OpenHub website has computed and stored these statistics in an analysis, these requests are possible with the Ohloh API. However, the analysis for the current date may be slightly out of date, since OpenHub must be given time to compute it, and the latest computed month may be older than the current month. As stated for max_month in analysis.md of Ohloh's API documentation: "The last month for which monthly historical statistics are available for this project. Depending on when this analysis was prepared, max_month usually refers to the current month, but it may be slightly older."
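To make this concrete, such a request might be formed in R as follows (a minimal sketch, assuming the httr and xml2 packages; the environment variable holding the API key is illustrative):

library(httr)
library(xml2)

api_key <- Sys.getenv("OHLOH_API_KEY")
response <- GET(paste0("https://openhub.net/p/kaiaulu.xml?api_key=", api_key))
project_xml <- read_xml(content(response, as = "text", encoding = "UTF-8"))
# The analysis id needed to locate size_facts and activity_facts:
analysis_id <- xml_text(xml_find_first(project_xml, "//analysis_id"))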

@carlosparadis (Member) commented:

https://github.com/blackducksoftware/ohloh_api/blob/main/README.md#xml-response-format

I see, nice finding. So each file in the reference folder basically contains the format of the XML you will get if you go after that endpoint, is that right?

@carlosparadis (Member) commented:

I took a quick look at the wiki and I don't see an example file. Could you try retrieving the analysis XML for kaiaulu so we can take a look? It seems some XMLs are summary statistics coming out of this file too, so this may be all we need.

Unfortunately they delete old files:

An individual Analysis never changes. When a Project’s source code is modified, a completely new Analysis is generated for that Project. Eventually, old analyses are deleted from the database. Therefore, you should always obtain the ID of the best current analysis from the project record before requesting an analysis.

So in that sense, Ohloh API will never serve to comprehensively analyze a project history, but it does conveniently offer summary statistics.

I also want to remind you that our goal here is to survey "the sea of projects" for the criteria we want, rather than use Ohloh to analyze them on our behalf. For example, rather than analysis.md, what we may need is something like this:

https://openhub.net/orgs/apache/projects

https://github.com/blackducksoftware/ohloh_api/blob/main/reference/portfolio_projects.md

That gives us a list of projects.

Another thing that would be useful is knowing which type of issue tracker a given project uses: https://openhub.net/p/apache/links. You can imagine that if this were returned in the XML, we could parse the issue tracker URL for certain words to find whether it is Bugzilla, JIRA, or GitHub, and then report that in a table for the user.

I do not know if OpenHub will let us search all the projects they index, or if we can only search at most per organization.

Could you check what else in OpenHub could give us a bird's-eye view of all the projects? We could still create a two-step pipeline: first obtain the names of the projects via one endpoint, and then make more API calls for the analyses to obtain the detailed information, although this would be less than ideal.


beydlern commented Oct 15, 2024

@carlosparadis I was able to take a look at kaiaulu's project information. Here is the XML file data for the project:

<response>
  <status>success</status>
  <result>
<project>
  <id>760420</id>
  <name>kaiaulu</name>
  <url>https://openhub.net/p/kaiaulu.xml</url>
  <html_url>https://openhub.net/p/kaiaulu</html_url>
  <created_at>2021-09-27T02:32:26Z</created_at>
  <updated_at>2024-10-14T05:19:13Z</updated_at>
  <description>A data model for Software Engineering data analysis</description>
  <homepage_url>http://itm0.shidler.hawaii.edu/kaiaulu</homepage_url>
  <download_url>https://github.com/sailuh/kaiaulu</download_url>
  <url_name>kaiaulu</url_name>
  <vanity_url>kaiaulu</vanity_url>
  <medium_logo_url>https://s3.amazonaws.com/cloud.ohloh.net/attachments/94361/logo_med.png</medium_logo_url>
  <small_logo_url>https://s3.amazonaws.com/cloud.ohloh.net/attachments/94361/logo_small.png</small_logo_url>
  <user_count>0</user_count>
  <average_rating/>
  <rating_count>0</rating_count>
  <review_count>0</review_count>
  <analysis_id>207699501</analysis_id>
  <tags>
    <tag>code_analysis</tag>
    <tag>codemanagement</tag>
    <tag>mining-software-repositories</tag>
    <tag>socialnetwork</tag>
    <tag>softwareengineering</tag>
    <tag>static_analysis</tag>
  </tags>
<analysis>
  <id>207699501</id>
  <url>https://openhub.net/p/kaiaulu/analyses/207699501.xml</url>
  <project_id>760420</project_id>
  <updated_at>2024-10-14T05:19:13Z</updated_at>
  <oldest_code_set_time>2024-10-13T17:03:09Z</oldest_code_set_time>
  <min_month>2020-05-01</min_month>
  <max_month>2024-10-01</max_month>
  <twelve_month_contributor_count>6</twelve_month_contributor_count>
  <total_contributor_count>14</total_contributor_count>
  <twelve_month_commit_count>18</twelve_month_commit_count>
  <total_commit_count>186</total_commit_count>
  <total_code_lines>5085</total_code_lines>
  <factoids>
    <factoid type="FactoidCommentsVeryHigh">
Very well-commented source code    </factoid>
    <factoid type="FactoidAgeOld">
Well-established codebase    </factoid>
    <factoid type="FactoidTeamSizeAverage">
Average size development team    </factoid>
    <factoid type="FactoidActivityDecreasing">
Decreasing Y-O-Y development activity    </factoid>
  </factoids>
  <languages graph_url="https://openhub.net/p/kaiaulu/analyses/207699501/languages.png">
    <language percentage="100" color="198CE7" id="65">
R    </language>
  </languages>
  <main_language_id>65</main_language_id>
  <main_language_name>R</main_language_name>
</analysis>
  <similar_projects>
    <project>
      <id>360</id>
      <name>FindBugs</name>
      <vanity_url>findbugs</vanity_url>
    </project>
    <project>
      <id>712198</id>
      <name>Prospector (Python)</name>
      <vanity_url>landscapeio-prospector</vanity_url>
    </project>
    <project>
      <id>733309</id>
      <name>SpotBugs</name>
      <vanity_url>spotbugs</vanity_url>
    </project>
    <project>
      <id>1865</id>
      <name>GNU cflow</name>
      <vanity_url>cflow</vanity_url>
    </project>
  </similar_projects>
  <licenses>
  </licenses>
  <project_activity_index>
    <value>30</value>
    <description>Very Low</description>
  </project_activity_index>
</project>
  </result>
</response>

On the topic of searching for projects, we are able to write a query to request a set of projects filtered through some specification.
https://www.openhub.net/projects.xml?api_key={api_key}&page={number}&sort={keyword}&query={keyword}
If the query returns multiple projects, we get a filtered-collection response XML from which we can get the name and id of each project; those in turn let us fetch each project's analysis through another request for general information on each project, which, as you said, is less than ideal. However, this allows us to search through ALL projects that OpenHub indexes.
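For instance, composing that paginated query URL in R might look like this (a sketch; the key placeholder and values are illustrative):

# Build the global project search URL described above
api_key <- Sys.getenv("OHLOH_API_KEY")
url <- sprintf(
  "https://www.openhub.net/projects.xml?api_key=%s&page=%d&query=%s",
  api_key, 1, URLencode("bugzilla")
)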

For the issue trackers inquiry: if the project page has a links section, the project's XML response will also contain a links section. For example, the links on the page https://openhub.net/p/apache/links appear in the project's XML file as:

<links>
  <link>
    <title>Current Release Docs</title>
    <url>http://httpd.apache.org/docs/current/</url>
    <category>Documentation</category>
  </link>
  <link>
    <title>Next release "coming soon" docs</title>
    <url>http://httpd.apache.org/docs/trunk/</url>
    <category>Documentation</category>
  </link>
  <link>
    <title>Apache Bugzilla</title>
    <url>https://issues.apache.org/bugzilla/</url>
    <category>Issue Trackers</category>
  </link>
  <link>
    <title>Bugzilla Search</title>
    <url>https://issues.apache.org/bugzilla/query.cgi</url>
    <category>Issue Trackers</category>
  </link>
</links>

We could parse through these links to see what issue trackers the project is using.

@carlosparadis (Member) commented:

This is going in the right direction, thank you for the additional information!

In regards to what you said:

On the topic of searching for projects, we are able to write a query to request a set of projects filtered through some specification.

I looked at the URL and saw:

query - Results will be filtered by the provided string. Only items that contain the query string in their names or descriptions will be returned. Filtering is case insenstive. Only alphanumeric characters are accepted. All non-alphanumeric characters will be replaced by spaces. Filtering is not available on all API methods, and the searched text depends on the type of object requested. Check the reference documentation for specifics.

Could you check what exactly we can query for? It seems we can query by language across all of OpenHub; if so, this is already a great start. It is not the end of the world to do follow-up checks on other API endpoints if that is the only way forward. However, this then begs the question: let's say that our query returns 300 or so projects. For every project, in order for us to find whether the project uses JIRA or not, plus the other information I mentioned above (n contributors, LOC, etc.), how many API calls will that require per project?

Also, what was the limit on API calls again? And what was the time period? (Does it reset per day?)


beydlern commented Oct 15, 2024

@carlosparadis

For project queries, the project reference documentation states:

query If supplied, only Projects matching the query string will be returned. A Project matches if its name, description, or any of its tags contain the query string.

I believe we can query for anything; to be specific, the query string acts as a search pattern, and the Ohloh API searches through every tag to check whether the query string is contained.

When the Ohloh API returns the XML data for a list of projects (if the query returns projects), it returns a maximum of ten per page (found through personal testing), and it also lists the total number of items (projects) available.

An example of querying a list of projects with the query string bugzilla (the result information, the list of projects, is not shown here, as it is too long):

<status>success</status>
<items_returned>10</items_returned>
<items_available>80</items_available>
<first_item_position>0</first_item_position>

According to the documentation:

page - In most cases, the Ohloh API returns at most 25 items per request. Pass this parameter to request subsequent items beyond the first page. This parameter is one-based, with a default value of 1. If you pass a value outside the range of available pages, you will receive the first page.

The next sets of ten projects are listed on the following pages. Dividing the items_available tag value by the non-zero items_returned value and taking the ceiling gives the number of pages we may increment through to get the rest of the projects' XML data.
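In R, that arithmetic is just (using the example values from the bugzilla query above):

# ceiling(items_available / items_returned) gives the page count
items_available <- 80
items_returned <- 10
total_pages <- ceiling(items_available / items_returned) # 8 pages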

Let's say that our query returns 300 or so projects. For every project, in order for us to find if the project is or not jira, and also the other information i said above (n contributors, LOC, etc), how many API calls will that require per project?

In this case, with ten projects listed per page, it would take 330 API calls: 30 API calls to page through the project list (300 projects at ten per page), plus one more API call per project, using its analysis_id, to grab its corresponding analysis and extract the project information (LOC, number of contributors, etc.).

Also, what was the limit on API calls again? And what was the time period? (Does it reset per day?)

The number of API calls a user can make per API key is 1000, and this resets every 24 hours.

@carlosparadis (Member) commented:

When you say we could query anything, would we be able to create, for example, a query that asks for LOC >= 50k? And would we be able to add "and" conditions, e.g. LOC > 50k & n.contributors >= 20? Equally curious whether we can also add language to the query. If you could share the longer version of the file on the shared drive, in a single page, and email me the URL, it would help me understand a bit further. I am still slightly confused about the query for all projects.

@beydlern (Collaborator, Author) commented:

@carlosparadis
My mistake, I wasn't clear about what I meant by "anything". I meant that when we search for projects, we can search for any string pattern via the query string, which is matched against each project's searchable properties. The number of properties we can search through is limited, unless I also make an API call for each project to open its analysis child and search through its properties, where the LOC and number of contributors are found. However, this would quickly add up in the number of API calls.

There is no extra functionality in the query collection request parameter, so there is no Boolean logic or mathematical relationships. To clarify, when searching through projects, the query command takes a query string, which is just an alphanumeric string:

query If supplied, only Projects matching the query string will be returned. A Project matches if its name, description, or any of its tags contain the query string.

@carlosparadis (Member) commented:

@beydlern I looked at the XML you sent, thanks! Did you query the name property? Because I see only "bugzilla"-named projects in it. Would you be able to send me the result of a query for a project that is written in Java?

I think at the very least you should start with the organization one: https://github.com/blackducksoftware/ohloh_api/blob/main/reference/organization.md

And try on Apache Software Foundation.

The analysis endpoint also seems promising.

For the pagination, you can take a look at the GitHub and JIRA downloaders; I believe both implement similar logic. Might as well reuse it for consistency. If you could send me an example of both XMLs, that would be great.

Just remember what type of information we are after in our search, and consider how we can get there via endpoints.


beydlern commented Oct 21, 2024

@carlosparadis

query If supplied, only Projects matching the query string will be returned. A Project matches if its name, description, or any of its tags contain the query string.

It looks like we can't query the name tag/property specifically, or search for a pattern in one specific tag. External code (in config.R) may be needed to complement the "bugzilla" query by checking each <name> tag.

Would you be able to send me the result of a query for a project that is written in Java? ... And try on Apache Software Foundation.

Example: To get a project whose primary language is Java, starting with a given organization, "Apache Software Foundation", I request the organization's XML data, viewing its portfolio projects, to get the <detailed_page_url> field:
https://openhub.net/orgs/apache.xml?api_key={api_key}&view=portfolio_projects

...
<detailed_page_url>/orgs/apache/projects</detailed_page_url>
...

Another request for this new page URL's XML data will give us a paginated list of portfolio projects belonging to the organization (a sample with one project returned):
https://openhub.net/orgs/apache/projects.xml?api_key={api_key}

<response>
  <status>success</status>
  <items_returned>20</items_returned>
  <items_available>320</items_available>
  <first_item_position>0</first_item_position>
  <result>
    <portfolio_projects>
      <project>
        <name>Apache Tomcat</name>
        <activity>High </activity>
        <primary_language>java</primary_language>
        <i_use_this>1684</i_use_this>
        <community_rating>4.2</community_rating>
        <twelve_mo_activity_and_year_on_year_change>
          <commits>1059</commits>
          <change_in_commits>-35</change_in_commits>
          <percentage_change_in_commits>3</percentage_change_in_commits>
          <contributors>24</contributors>
          <change_in_contributors>-14</change_in_contributors>
          <percentage_change_in_committers>36</percentage_change_in_committers>
        </twelve_mo_activity_and_year_on_year_change>
      </project>
      ...
    </portfolio_projects>
  </result>
</response>

The portfolio projects entity doesn't allow collection request commands, such as queries or sorting, so external code (in config.R) is needed in this example to cycle through each page and find a project whose <primary_language> tag contains "Java" or "java" (no such external code exists yet, this is just an example; the pagination logic needed for it is found in the GitHub and JIRA downloaders). For further analysis, we also copy the selected project's <name> tag, and then go to the global paginated project list in XML format and query for the name of the project (querying for the name of the project will also return every project whose tags or description contain either the word "Apache" or "Tomcat", which is why 38 projects were returned):
https://openhub.net/p.xml?api_key={api_key}&query=Apache%20Tomcat

<response>
  <status>success</status>
  <items_returned>10</items_returned>
  <items_available>38</items_available>
  <first_item_position>0</first_item_position>
  <result>
<project>
  <id>3562</id>
  <name>Apache Tomcat</name>
  <url>https://openhub.net/p/tomcat.xml</url>
  <html_url>https://openhub.net/p/tomcat</html_url>
  <created_at>2006-11-12T20:40:37Z</created_at>
  <updated_at>2024-10-20T08:17:44Z</updated_at>
  <description>The Apache Tomcat software is an open source implementation of the Java Servlet, JavaServer Pages, Java Expression Language and Java WebSocket technologies.</description>
  <homepage_url>http://tomcat.apache.org/</homepage_url>
  <download_url>http://tomcat.apache.org/download-60.cgi</download_url>
  <url_name>tomcat</url_name>
  <vanity_url>tomcat</vanity_url>
  <medium_logo_url>https://s3.amazonaws.com/cloud.ohloh.net/attachments/831/tomcat_med.png</medium_logo_url>
  <small_logo_url>https://s3.amazonaws.com/cloud.ohloh.net/attachments/831/tomcat_small.png</small_logo_url>
  <user_count>1684</user_count>
  <average_rating>4.23101</average_rating>
  <rating_count>316</rating_count>
  <review_count>4</review_count>
  <analysis_id>208382336</analysis_id>
  <tags>
    ...
  </tags>
  <similar_projects>
    ...
  </similar_projects>
  <licenses>
    ...
  </licenses>
  <project_activity_index>
    ...
  </project_activity_index>
  <links>
    ...
  </links>
</project>
...
  </result>
</response>

Every project name is unique in OpenHub, so once we find a matching <name> tag, we may take its <id>, its unique project id, to find its latest analysis https://openhub.net/p/3562/analyses/latest.xml?api_key={api_key}:

<response>
  <status>success</status>
  <result>
<analysis>
  <id>208382336</id>
  <url>https://openhub.net/p/tomcat/analyses/208382336.xml</url>
  <project_id>3562</project_id>
  <updated_at>2024-10-20T08:17:44Z</updated_at>
  <oldest_code_set_time>2024-10-19T15:16:40Z</oldest_code_set_time>
  <min_month>2006-03-01</min_month>
  <max_month>2024-10-01</max_month>
  <twelve_month_contributor_count>24</twelve_month_contributor_count>
  <total_contributor_count>181</total_contributor_count>
  <twelve_month_commit_count>1059</twelve_month_commit_count>
  <total_commit_count>26695</total_commit_count>
  <total_code_lines>474323</total_code_lines>
  <factoids>
    <factoid type="FactoidAgeVeryOld">
Mature, well-established codebase    </factoid>
    <factoid type="FactoidTeamSizeLarge">
Large, active development team    </factoid>
    <factoid type="FactoidCommentsAverage">
Average number of code comments    </factoid>
    <factoid type="FactoidActivityStable">
Stable Y-O-Y development activity    </factoid>
  </factoids>
  <languages graph_url="https://openhub.net/p/tomcat/analyses/208382336/languages.png">
    <language percentage="82" color="9A63AD" id="5">
Java    </language>
    <language percentage="9" color="555555" id="3">
XML    </language>
    <language percentage="7" color="556677" id="35">
XML Schema    </language>
    <language percentage="2" color="000000" id="">
10 Other    </language>
  </languages>
  <main_language_id>5</main_language_id>
  <main_language_name>Java</main_language_name>
</analysis>
  </result>
</response>

This path from endpoint to endpoint allows us to get all the relevant information on a project in as few as 4 API calls. This example focused on requesting a project in a specified organization whose primary language is Java. How does this approach sound?
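Condensed into R, the path might be sketched as below (fetch_xml is a hypothetical helper wrapping httr::GET and xml2::read_xml; the URLs follow the Ohloh documentation and api_key is assumed to be set):

# 1. Organization, to get <detailed_page_url>
org <- fetch_xml(sprintf("https://openhub.net/orgs/apache.xml?api_key=%s&view=portfolio_projects", api_key))
# 2. Portfolio projects, paging until a <primary_language> of "java" is found; keep its <name>
portfolio <- fetch_xml(sprintf("https://openhub.net/orgs/apache/projects.xml?api_key=%s", api_key))
# 3. Global project list queried by name, to exact-match <name> and take <id>
project <- fetch_xml(sprintf("https://openhub.net/p.xml?api_key=%s&query=%s", api_key, URLencode("Apache Tomcat")))
# 4. Latest analysis for that project id (3562 here)
analysis <- fetch_xml(sprintf("https://openhub.net/p/3562/analyses/latest.xml?api_key=%s", api_key))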

@carlosparadis (Member) commented:

I am not clear on which endpoints you are actually using in your responses above. Could you edit the message to make that clearer (pointing to the .md documentation), and then post a comment to let me know you edited?

@beydlern (Collaborator, Author) commented:

@carlosparadis
I edited the message for clarity. Each URL is now a link that points to its respective .md documentation page for the Ohloh API.

@carlosparadis (Member) commented:

Thanks, this makes a lot more sense to me now.

The portfolio projects entity doesn't allow collection request commands, such as queries or sorting, so external code (in config.R) is needed in this example to cycle through each page and find a project whose <primary_language> tag contains "Java" or "java" (no such external code exists yet, this is just an example; the pagination logic needed for it is found in the GitHub and JIRA downloaders). For further analysis, we also copy the selected project's <name> tag, and then go to the global paginated project list in XML format and query for the name of the project (querying for the name of the project will also return every project whose tags or description contain either the word "Apache" or "Tomcat", which is why 38 projects were returned):

Is there no way to go from the project name from the portfolio, straight into its analysis instead? Having to do a global search for the project seems redundant.

Also:

For further analysis, we also copy the selected project's <name> tag, and then go to the global paginated project list in XML format and query for the name of the project (querying for the name of the project will also return every project whose tags or description contain either the word "Apache" or "Tomcat", which is why 38 projects were returned): https://openhub.net/p.xml?api_key={api_key}&query=Apache%20Tomcat

Do we need to perform this global search? Can't we use the name of the project from the organization search on the project.md endpoint to retrieve it? https://github.com/blackducksoftware/ohloh_api/blob/main/reference/project.md

I guess one thing I am still confused about is what in the .md says what you can query or not. Is it just doing a ctrl+F on the entire output against whatever you query? I don't understand yet how you are specifying a tag.


All things considered, you can go ahead and start the function for the organization search, and include a parameter where you can specify the language we are looking for. The notebook can exemplify Apache, since that is often studied. Remember to update the specification on the first issue too.

Do let me know about the two questions in this comment so we can sort out the final path here, but at least this lets you get going on the code. I'd recommend using R/github.R as a reference for how to do the pagination. Try to reuse the code as much as possible so it stays consistent with everything else.

Thanks!

@beydlern (Collaborator, Author) commented:

@carlosparadis

Is there no way to go from the project name from the portfolio, straight into its analysis instead? Having to do a global search for the project seems redundant.

To get the latest analysis collection for a project (the latest analysis is the current best analysis collection for a single project), you need its project_id tag (called id in the project collection), which is only found in the project collection. A good unique key that portfolio_projects and project both have is the name tag. The name tag allows me to jump from the portfolio_projects endpoint to the project endpoint to get the project_id tag, and with it access the latest analysis collection for that project: https://www.openhub.net/projects/{project_id}/analyses/latest.xml.

Do we need to perform this global search? Can't we use the name of the project from the organization search on the project.md endpoint to retrieve it? https://github.com/blackducksoftware/ohloh_api/blob/main/reference/project.md

To access a specific project's collection, we need to specify its project_id:
https://www.openhub.net/projects/{project_id}.xml
We must perform this global search, but the term "global" may be misleading, because the number of API calls against this global project list is almost always 1 (from personal experimentation, each data request to the global project list returns ten projects). Querying with the name of the project almost always yields the correct project on the first page of the returned list, so we may get the project_id tag.

I guess one thing I am still confused is what on the .md says that you can query or not. Is it just doing a ctrl+f on the entire output against whatever you query? I don't understand yet how you are specifying a tag.

From the project collection's supported query request, it states:

query If supplied, only Projects matching the query string will be returned. A Project matches if its name, description, or any of its tags contain the query string.

Following this and my experience with the querying command, it seems like it just does a ctrl+F on the entire output. I'm not specifying a specific tag to query; I'm planning to use external code that will aid in paginating and "actually query" for the tag that I am looking for. The query command they provide is helpful to narrow down a list of matches, but my external code (that I will write) will use this narrowed list to look for the tag and information that I am interested in.

I'll get started on the code!

@carlosparadis (Member) commented:

Sounds good, and thank you for confirming the strange ctrl+F mechanism it uses. I would 100% document this on the function that deals with this endpoint. Your last comment also has much of what I think should be in the notebook, as you explain why you are calling the functions in that order.

Please remember to update your issue specification with the function signatures and a bit of the summarized rationale from your comment above (I would also put a reference to this particular comment, since that was the one that did the trick in helping me make sense of this).

Remember, the questions I asked here are likely some of the questions someone reading the notebook will have, so you can use them as guidance on what to put in the notebook (if it addresses all my questions, then it is off to a good start!). It should also help you consider what should go in the function documentation.

In one of the calls we can consider what else OpenHub offers, but for now at least we can search per organization, as I expect Apache will be heavily utilized due to its frequent use of JIRA as an issue tracker (which in turn means bugs are documented).

Nice work!

beydlern added a commit that referenced this issue Oct 27, 2024
- The functions, openhub_api_analyses, openhub_api_iterate_pages, openhub_api_organizations, openhub_api_portfolio_projects, openhub_api_projects, openhub_parse_analyses, openhub_parse_organizations, openhub_parse_portfolio_projects, and openhub_parse_projects, were implemented (WITHOUT DOCUMENTATION).
- A notebook, openhub_project_search, demonstrating the capability of interacting with the OpenHub project database through the Ohloh API to search for projects has been implemented.
beydlern linked a pull request Oct 27, 2024 that will close this issue
beydlern added a commit that referenced this issue Oct 27, 2024
- Documentation builder functions incorrectly updated RoxygenNote version, this commit reverts this change.
beydlern added a commit that referenced this issue Oct 27, 2024
- Documentation builder functions incorrectly updated RoxygenNote version, this commit reverts this change.
beydlern added a commit that referenced this issue Nov 1, 2024
- Added function documentation to all openhub API interfacing functions.
beydlern added a commit that referenced this issue Nov 2, 2024
- openhub_project_search.Rmd had documentation added to describe the project gathering process.
- R/config.R removed and commented out print statements for cleaner output from the openhub_* functions.

beydlern commented Nov 2, 2024

@carlosparadis
Did you want me to create a configuration file, or edit an existing one (e.g. kaiaulu.yml), to add the configuration variables from the notebook, openhub_project_search.Rmd (the code language to search for, language, and the organization name, organization_name)?

If a configuration file is to be used, appropriate getter functions may be needed. Let me know if this is something I should also do, and in which issue I should do it.

@carlosparadis (Member) commented:

I think the first thing is discussing with me what the config should look like. The configs, I think, already have an openhub section (or maybe not)? If so, we need to revise that first. All the config file formats should be updated accordingly, which, this time around, will only affect the get() functions being added or edited, rather than the notebooks, right?

I am hoping the m1 merge for that will get done tomorrow. Sorry it has taken so long.

@crepesAlot (Collaborator) commented:

@beydlern mentioned me chiming in on the formatting, so here is my suggestion.
The current format in the config file is:

project:
  website: https://thrift.apache.org
  openhub: https://www.openhub.net/p/thrift

Suggested new format:

project_url: https://thrift.apache.org

api:
  openhub:
    # URL of project on OpenHub 
    openhub_url: https://www.openhub.net/p/thrift
    # Name of the organization
    organization_name: temp_name
    # What language to filter for
    language: java
  • Moved the project URL to be separate from the openhub section, as it isn't specific to openhub.
  • Made a broader api section so that it will be easier if any future API needs specifications in the config files.

These are my suggestions; I'm not sure if Nick would want to use them as is, or if he'll be making further modifications to suit the purpose of the new fields.
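If this format were adopted, a getter could be as simple as the following sketch (the function name is a suggestion, not an existing Kaiaulu function; config is the parsed YAML, e.g. from yaml::read_yaml):

get_openhub_project_url <- function(config) {
  # Reads api/openhub/openhub_url from the proposed layout above
  config[["api"]][["openhub"]][["openhub_url"]]
}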

@carlosparadis (Member) commented:

This comment is a good learning opportunity: not everything that is a parameter in the notebook goes to the config file. We have to account for the "granularity" of the config (which is a project), and also whether said information can be obtained automatically or has to be typed manually (which is more human-time consuming).

The openhub_url, for example, is consistent with the project granularity, and it is something you have to specify as a starting point. The language, however, is something we could ask Kaiaulu to get from the OpenHub API, right? I would suspect the organization name is something we can also infer.

Some endpoints that we care a lot about are inconsistent with the granularity of a project configuration file, for example, the organization being Apache. This means the parameter will be hardcoded in the notebook (in a code block right below where the config loads, so it is easier for people to find). Maybe in the future, if Kaiaulu supports various organization-level analyses, we could consider a section of the config, or an entirely new config. There is a trade-off in this decision for the exec scripts, since they generally should take as input the config file rather than additional parameters. Which is to say, I expect this notebook not to have an exec/.

In this case at least this somewhat makes sense, since, contrary to a downloader that we want to throw on a server to download data, it strikes me as a bit odd that someone would use the OpenHub API that way. The OpenHub API, in Kaiaulu's case, is there to help us select the project, but nothing beyond that.

Let me know if this makes sense.

@crepesAlot (Collaborator) commented:

That makes sense. I take it this would be similar to how Dao previously removed the month_year parameter for her mailing list from the config.

@carlosparadis (Member) commented:

To be clear: I am agreeing with your organization here: #317 (comment)

Perhaps what is missing is:

openhub:
  website: https://www.openhub.net/p/ambari
  organization_folder_path: ../../rawdata/openhub/studyname/organization/
  portfolio_project_folder_path: ../../rawdata/openhub/studyname/portfolio/
  project_folder_path: ../../rawdata/openhub/studyname/project/
  analysis_folder_path: ../../rawdata/openhub/studyname/analysis/

I added a studyname folder. One selects projects to conduct a study. In that way, the rawdata/openhub folder lets your notebook be used to "sample projects for different studies".

Now, we discussed on the call that the openhub section should go in a config file, but in hindsight, it can't go in a project configuration file, as it is not project specific. It also means @crepesAlot should not handle openhub folder creation, since that function is meant to initialize a project-specific folder organization, which this code logic is not.

Ultimately, this is uncharted territory in Kaiaulu. Everything up to this point was project specific, not "study" specific. My final suggestion is thus that you simply leave a code block at the start of the notebook that acknowledges the rationale stated here: the notebook is unique among its peers in that it concerns itself not with one project but with their selection, and thus it uses neither the project configuration file architecture nor the project initialization. The functions to create the folders are thus hardcoded in the notebook itself, which I think is OK for now. If this module evolves, as Kaiaulu has evolved over time, we can refactor the notebook into something else, as we did with config.R this session.

Let me know if anything is not clear. The main difference is just hardcoding the studyname folder in your notebook as I did above.


beydlern commented Dec 4, 2024

Should the functions that create the folders be defined in the notebook itself rather than in config.R?

For clarification on the refresh mechanic: in this case, I'm planning to delete the old response files in the selected folder whenever a new download of response files is about to occur in that folder.

@carlosparadis (Member) commented:

Yes, I would just hardcode the path. Do check R/io.R, as Kaiaulu should already have functions to create folders, so don't re-create that. You can see how they are used in R/example.R, I believe, or if not, in the unit tests that use example.R.

I think it is OK if you just save the file, but do so with a unix timestamp to avoid conflicts. Deletion is dangerous, since it may catch people off guard and wipe their files. We just need to make sure we only read the files of that particular unix timestamp when reading back into the notebook. You can do that by using the native R function list.files (I think that is what it is called) over the folder, then matching the paths associated with the timestamp of interest.
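For example, the read-back filtering might look like this (a sketch; the folder path and timestamp are placeholders):

# Keep only the response files whose names carry the timestamp of interest
folder_path <- "../../rawdata/openhub/studyname/portfolio/"
timestamp_of_interest <- "1733700000" # hypothetical unix timestamp
file_paths <- list.files(folder_path, full.names = TRUE)
file_paths <- file_paths[grepl(timestamp_of_interest, basename(file_paths), fixed = TRUE)]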

Confirm with me what the file names will look like. This may look like a refresh on the surface, but it is not, since we do not guarantee the downloaded files are cumulative: they should only be cumulative among the ones obtained in a given API request (+ pagination).


beydlern commented Dec 4, 2024

The file names will take the form response_etag_organizationname_YYYYMMDD across the different folders. etag, the entity tag, is an HTTP response header field that is used to make file names unique. organizationname is the name of the organization in the search criteria (sanitized to all lowercase letters with no spaces). Since the endpoints follow one another (from organization to portfolio to projects to analysis), once an organization response is selected, the notebook will obtain its timestamp and use it to limit its selection to files matching that timestamp for the rest of the notebook.

With this method, how should my notebook handle the user running the notebook multiple times in a day? Duplicate entries are bound to occur, so I suggest filtering the resultant data tables from each endpoint for uniqueness (for example, names of projects are unique, so the output data table can be sanitized with the unique function). However, if the user, on the same day, changes the organization name but keeps the code language in the search criteria, the notebook would otherwise use projects belonging to another organization in the data tables. Thus, the organization name is used in the file name too.


Another, simpler way (let me know if this is correct):
Using studyname for the folders, the user might just run one study per studyname folder. If this is the case, the files can simply take the form response_etag_YYYYMMDD, as the same search criteria will be used, but the same set of data can be acquired at different timestamps (for example, the user specifies a studyname and runs the notebook; 3 months later, the user runs the same notebook under the same studyname, and the notebook will still work).

@carlosparadis (Member) commented:

Downloader logic should be simple: get the raw data, get some useful identifying information about the raw data, and store it in the file.

The parser should be simple: convert JSON, XML, or whatever it is into a table, and let the notebook handle the rest. Adding assumptions to the parser complicates the assumptions users must make when they use the API.

Your file name depends on what the endpoint is doing; it can generally just be endpointname_someuniqueinformation_unixtimestamp.xml (I think you said it was XML?). A unix timestamp is not YYYYMMDD; it captures, in a single number, the date and time down to the second, so there is no way files will overlap. The associated parser function can assume that file is the input, and will parse it. The user can worry about which of the files they want to use and code the logic around that.

The someuniqueinformation is dependent on the data. For example, the endpoint that gets projects under an organization can have someuniqueinformation be the organization the projects are from, and also include page number information. The unix timestamp would likely need to be when the request for project information was made.

This file name does not make sense if your endpoint is to get a single project's information; then it could be endpointname_projectname_page_unixtimestamp.

Try to give me each endpoint, what it does in summary, and your suggestion for the file name, depending on what makes sense for someone to see in the file. Seeing the unixtimestamp when you run for projects under an organization helps someone know whether they ran the query recently, and the organization name helps them know at a glance which org they queried, for example. What would be most helpful to see in a file name for each endpoint?


beydlern commented Dec 4, 2024

organization endpoint:

  • Summary: Using the input name of the organization, we find the specific organization and its API response page (a single page returned using the API's query system, which may return more than one organization on that page, as the query system is a glorified ctrl+F search matching the organization name against every tag and field).
  • File Name Suggestion: organization_orgname_unixtimestamp.xml. orgname is the short, unique handle for the organization (e.g. "apache").

portfolio project endpoint:

  • Summary: Using the parsed organization's data table, select an organization's html_url_projects field, the url containing pages of projects that belong to that specific organization (called portfolio projects). Every API response will only contain one page of projects using the organization's html_url_projects project list, so if multiple pages are requested, multiple files will be returned, where the next file corresponds to the next page.
  • File Name Suggestion: portfolio_unixtimestamp_orgname_pagenumber.xml. orgname is the short, unique handle for the organization these projects belong to, and the page number corresponds to the current page.

Carlos Edit: organization_name here is, for example, apache, which is the org_name parameter needed to query this endpoint.
Carlos Edit 2: Remove the language parameter from both the downloader and the parser, because the endpoint is not really subsetting. If it is not, our functions shouldn't either; the user can subset after they get the table.

project endpoint:

  • Summary: Using the parsed portfolio projects data table (which has been parsed and filtered to keep only the projects matching the desired code language), iterate through every portfolio project, grabbing its name (which is unique), and query the global list of projects to acquire its project ID. Every API response file will contain one page containing the matching project by name (using the API's query system, which may return multiple projects, but the parser will filter these).
  • File Name Suggestion: project_unixtimestamp_projectname_pagenumber.xml. Following the lowercase with no spaces sanitization, projectname is the name of the project located in the response file.

analysis endpoint:

  • Summary: Using the merged data table of the parsed portfolio projects and parsed projects, iterate through every project, grabbing its project ID (which is unique), and make a URL-specific GET request to retrieve its analysis collection. Every API response file will contain exactly one page, containing the matching analysis collection by project ID.
  • File Name Suggestion: analysis_projectname_unixtimestamp.xml. Following the lowercase with no spaces sanitization, projectname is the name of the project located in the response file.

Carlos Edit: The updated_at parameter has to be parsed and used as the unix timestamp here! In the function docs you must clarify that this timestamp behavior captures the created_at, which is very different from all the other unix timestamps.

@carlosparadis (Member) commented:

Organization makes sense. I can't make sense of the portfolio project or project endpoints. There are too many projects, and the endpoints are all inter-used through some field or column which, being unable to see right now, makes this hard to understand.

Do you have some time today for a quick call so we can go over this? Delaying to tomorrow is probably going to leave you with little time to address this.

In an ideal world, every endpoint maps to something sane: an issue, a commit, a project, a person, etc. Here my conceptual understanding is just organization with projects, then portfolio project (whatever that is), then project again. Analysis is also a very generic name. I understand these names are not what you chose, but this is why I am struggling to know what is best to represent in the file names.

If you are unable to meet, I suggest editing the comment above and adding a sample of the first 2 rows of every table, so that when you say "I used this column to call this other endpoint" I can make sense of it.

Also, note that a unix timestamp is just a single number, so the file format is more like _dddddddd.xml instead of YYMMDDHHMMSS.xml. You need to call a function to convert to and from unix timestamps. Please edit the file names above accordingly.
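For reference, converting to and from a unix timestamp in base R is a one-liner each way:

unix_timestamp <- as.integer(Sys.time()) # e.g. 1733700000, seconds since 1970-01-01
as.POSIXct(unix_timestamp, origin = "1970-01-01", tz = "UTC") # back to a date-time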


beydlern commented Dec 4, 2024

@carlosparadis Yes, we can meet; I sent you an email.

beydlern added a commit that referenced this issue Dec 6, 2024
- The current working version of the notebook and functions to search for projects (Downloaders and Parsers are not yet generalized and still contain overloading functionality, which is to be resolved and implemented in the notebook, openhub_project_search.Rmd).
beydlern added a commit that referenced this issue Dec 6, 2024
- In config.R, the organization downloader and parser as well as the portfolio project downloader and parser were refactored to become more generalized and work with the iterator function, openhub_api_iterate_pages().
beydlern added a commit that referenced this issue Dec 7, 2024
- In config.R, the project downloader and parser as well as the analysis downloader and parser were refactored to become more generalized and work with the iterator function, openhub_api_iterate_pages().
beydlern added a commit that referenced this issue Dec 8, 2024
- config.R, the openhub interfacing functions had their documentation updated.
- openhub_project_search.Rmd, revised the notebook with up-to-date information.

beydlern commented Dec 8, 2024

@carlosparadis

The analysis endpoint parser function is still overloaded for the code_languages column, which displays each language with its corresponding percentage of the codebase. I believe this is a correct way to parse the information.

As for the mailing_list column (found in the project endpoint), did you want me to paste all of the links associated with the selected project, along with their category descriptions, and rename the column to project_links? At this time, the parser, openhub_parse_projects, is overloaded to read each link associated with a project and use a regular expression pattern to decipher the links and acquire the mailing list link.

Other than these two outliers, the revisions you requested have been made.

@carlosparadis (Member) commented:

The analysis endpoint parser function is still overloaded for the code_languages column, which displays each language with its corresponding percentage of the codebase. I believe this is a correct way to parse the information.

Can you elaborate on this? I'm trying to remember what we discussed as far as the overload is concerned. I think you are referring to my request not to add hardcoded link extraction and instead return the links to the user. Which line of code does this refer to? (You can paste the copied permalink, holding shift to pick the region to post here, so it points me to it.)

As for the mailing_list column (found in the project endpoint), did you want me to paste all of the links associated with the selected project, along with their category descriptions, and rename the column to project_links? At this time, the parser, openhub_parse_projects, is overloaded to read each link associated with a project and use a regular expression pattern to decipher the links and acquire the mailing list link.

Yes, return every link available, and then move your code logic that checks for specifics (mailing list, etc.) into the notebook, where you can explain in English, right before the code, what you are doing and why. It is useful to have, but I want to make sure it lives at the notebook level so that if others are looking for anything else in the future, the function will make it available to them. Can you give me an example of what the table would look like before and after? You can mock up a table with markdown here too.


beydlern commented Dec 8, 2024

Can you elaborate on this? I'm trying to remember what we discussed as far as the overload is concerned. I think you are referring to my request not to add hardcoded link extraction and instead return the links to the user. Which line of code does this refer to? (You can paste the copied permalink, holding shift to pick the region to post here, so it points me to it.)

Here is the permalink to the analysis section of code for the code_languages column:

kaiaulu/R/config.R

Lines 349 to 358 in e439c12

languages <- XML::xmlChildren(returnItems[[1]][[14]]) # <result><analysis><languages> children tags
code_languages_data_text <- list()
for (i in seq_along(languages)) {
  language <- languages[[i]]
  # Append a percent sign to the percentage attribute value for the code language
  code_language_percentage <- paste0(XML::xmlGetAttr(language, "percentage"), "%")
  # Grab the code language text, then strip newline characters and surrounding whitespace
  code_language <- stringi::stri_trim_both(stringi::stri_replace_all_fixed(XML::xmlValue(language), "\n", ""))
  code_languages_data_text[[i]] <- paste(code_language_percentage, code_language)
}
code_languages_data_text <- paste(code_languages_data_text, collapse = ", ")
parsed_response[["code_languages"]] <- append(parsed_response[["code_languages"]], code_languages_data_text)

Yes, return every link available, and then move your code logic to check for specifics for mailing list, etc in the Notebook where you can explain in english right before what you are doing and why. It is useful to have, but I want to make sure it is at Notebook level in case in the future others are looking for anything else, the function will make available for them. Can you give me an example of the table before and after would look like? You can mockup a table with markdown here too.

Using Apache Tomcat for the first row for both tables and a fictional project without links:

The table before has a column for mailing_list (the parser used a regex pattern to extract the mailing list; if no matches were found among the links, "N/A" was returned):

| mailing_list |
| --- |
| http://tomcat.apache.org/lists.html |
| N/A |

The table after will have the mailing_list column renamed to project_links (if no project links are found, "N/A" is returned):

| project_links |
| --- |
| Forums http://tomcat.apache.org/lists.html, Documentation http://tomcat.apache.org/faq/, Documentation http://tomcat.apache.org/bugreport.html, Other http://www.amazon.com/Apache-Tomcat-Bible-Jon-Eaves/dp/0764526065, Other http://www.amazon.com/Tomcat-Definitive-Guide-Jason-Brittain/dp/0596101066 |
| N/A |

I'm not sure how to accurately place newline characters or line-break elements to separate the different project links so they display nicely in the data table via the gt package, so for now they don't wrap and are separated by commas.

beydlern added a commit that referenced this issue Dec 8, 2024
- Modified openhub_parse_projects() overloaded functionality (using regex pattern to search for mailing list) to simply extract the list of project links (if available).
@carlosparadis (Member) commented:

Hmm, the proper way to do this is to get the project_links, use stringi::stri_split, and make a separate table where every row is of the form

"project_name" "project_link"

Your function would return a named list of two data.tables: the first would be one row per project with the information you already have, and the second would be this one, where every row is a project link. Anyone could then merge both or do as they prefer. The Raven code for Scitools, which I just merged, does this for the transform_ function, where the function returns a node_list and an edge_list.
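A sketch of that return shape (column values are illustrative only):

# The parser returns both tables in a named list; callers can merge on "id"
project_data <- data.table::data.table(
  id = 3562, name = "Apache Tomcat", primary_language = "java"
)
project_links <- data.table::data.table(
  id = c(3562, 3562),
  name = c("Apache Tomcat", "Apache Tomcat"),
  link = c("http://tomcat.apache.org/lists.html", "http://tomcat.apache.org/faq/")
)
list(project_data = project_data, project_links = project_links)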

beydlern added a commit that referenced this issue Dec 9, 2024
- The openhub_parse_projects() function returns two tables, one about the project's specific data and the other about the project's external links. The notebook, openhub_project_search.Rmd, has been updated to reflect these changes.
beydlern added a commit that referenced this issue Dec 9, 2024
- Incorrect table was being merged, fixed in notebook.

beydlern commented Dec 9, 2024

@carlosparadis
I implemented these changes in my most recent commit; there are now two data tables that follow your suggested format.

@carlosparadis (Member) commented:

Thanks! Can you give me some screenshots of the output, or paste them here, just to make sure we are on the same page?


beydlern commented Dec 9, 2024

@carlosparadis
Here are screenshots of the two tables that are displayed to the user (the columns are explained to the user):
The first data table (called project_data in the notebook):
Image
The second data table (called project_links in the notebook):
Image

@carlosparadis (Member) commented:

@beydlern this is perfect. I'd suggest adding the "id" column to the project_links table too, just in case in the future there is more than one project with the same name.

beydlern added a commit that referenced this issue Dec 9, 2024
- openhub_parse_projects() the project_links data table had a field added (column added), which is the project ID. The notebook, openhub_project_search.Rmd, has been updated to reflect these changes.

beydlern commented Dec 9, 2024

@carlosparadis
I added the "id" column to the project_links data table too in my recent commit.
Image

@carlosparadis (Member) commented:

@beydlern looks good! Is it ready for review? Send me a code review request when ready.


beydlern commented Dec 9, 2024

@carlosparadis
Code review has been requested in the associated PR #325.

beydlern added a commit that referenced this issue Dec 9, 2024
- All description tags for function documentations were removed in config.R.
- Data tables are displayed using pipe operators with gt() function and head() function in openhub_project_search.Rmd.
- Parser sections have their "eval = FALSE" removed for their code blocks in openhub_project_search.Rmd.
- Removed "invisible()" function from all functions in config.R.
beydlern added a commit that referenced this issue Dec 11, 2024
- Added OpenHub API interfacing functions to _pkgdown.yml.
- Added missing parameter function documentation to openhub_download function in config.R.
- Removed debug print statements openhub_api_iterate_pages in config.R.