TBD suitable techniques for querying git repos #7
Comments
Filesystem queries for local instances of code repos
Local repos contain project files (and files created and maintained by git). Filesystem queries can therefore obtain pertinent information from the project files without requiring code that 'understands' git. Python 3 includes pathlib (see https://realpython.com/python-pathlib/), which seems a useful starting point, e.g. to find program files that match various commonly used extensions. This approach doesn't collect any git-related information.
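A minimal sketch of the filesystem approach using pathlib might look like the following. The extension set and the function names `find_code_files` and `count_by_extension` are assumptions chosen for illustration; they are not taken from the project's code. The `.git` folder is skipped so that only project files are reported.

```python
from collections import Counter
from pathlib import Path

# Assumption: an illustrative set of extensions; adjust to the codebases being analysed.
CODE_EXTENSIONS = {".py", ".js", ".java", ".kt", ".go", ".rs"}


def find_code_files(repo_root, extensions=CODE_EXTENSIONS):
    """Return project files under repo_root whose suffix is in extensions."""
    root = Path(repo_root)
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file()
        and path.suffix.lower() in extensions
        # Skip files that git itself creates and maintains.
        and ".git" not in path.relative_to(root).parts
    )


def count_by_extension(paths):
    """Summarise how many files were found for each (lower-cased) extension."""
    return Counter(path.suffix.lower() for path in paths)
```

Calling `count_by_extension(find_code_files("path/to/clone"))` on a local clone would give a quick per-extension tally, with no git-related information involved.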
Related work
Sense checking
Relying on untested code on unknown codebases would be sad and poor practice. Let's cross-check using various techniques to increase our confidence in the results our software returns.
Basic sanity checks
Commands such as the *nix
in folder:
The We're more interested in filenames than folder names at this stage. Nonetheless
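One way to cross-check a file listing is to walk the same folder with two independent techniques and compare the results. The sketch below does not reproduce the shell command used in this comment. It swaps in `os.walk` as the second technique alongside pathlib, and the function names are hypothetical.

```python
import os
from pathlib import Path


def files_via_pathlib(root):
    """Relative paths of all files found with Path.rglob."""
    root = Path(root)
    return {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}


def files_via_os_walk(root):
    """Relative paths of all files found with os.walk."""
    root = Path(root)
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.add(Path(dirpath, name).relative_to(root).as_posix())
    return found


def cross_check(root):
    """Report where the two traversals disagree; filenames only, not folders."""
    a = files_via_pathlib(root)
    b = files_via_os_walk(root)
    return {
        "agree": a == b,
        "only_pathlib": sorted(a - b),
        "only_os_walk": sorted(b - a),
    }
```

Any disagreement (for example around broken symbolic links, which the two traversals may treat differently) is a prompt to investigate before trusting either result.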
Some interim thoughts
I wonder if it'd be worth us amending the data frame(s) so that they can record:
I don't yet know enough about pandas dataframes to understand whether they support structures within cells and/or nested data. Similarly, we'll eventually need to communicate this information to NLnet using an RDF structure, and that may place its own constraints on how the info can be communicated.
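On the question of nested data: pandas can hold arbitrary Python objects such as lists and dicts in a cell, storing the column with `object` dtype, though vectorised operations on such columns are limited. The sketch below uses hypothetical repo names, column names, and function names purely to illustrate the idea; it says nothing about what the eventual RDF export will require.

```python
import json

import pandas as pd


def build_frame():
    """Build a small frame whose cells hold lists and dicts (hypothetical data)."""
    return pd.DataFrame(
        {
            "repo": ["example/repo-a", "example/repo-b"],
            "test_files": [["tests/test_a.py", "tests/test_b.py"], []],
            "counts": [{"py": 2}, {}],
        }
    )


def one_row_per_file(df):
    """Flatten the list column: one row per test file (NaN where the list is empty)."""
    return df.explode("test_files")


def as_records(df):
    """Round-trip through JSON to show that the nested values can be serialised."""
    return json.loads(df.to_json(orient="records"))
```

If nested cells prove awkward, `one_row_per_file` shows the alternative of a flat "long" table, which may map more directly onto one statement per fact.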
Part of the work for #7. The results are not yet stored in the dataframe and similarly not exported yet.
Context
NLnet funds open source projects; these projects host their code on a variety of code hosting services, including GitHub, GitLab, and others. Some of these, including GitHub, provide mechanisms to query the codebases they host. The mechanisms are likely to vary in their APIs, methods, and the data they return. We will start iteratively by using GitHub's APIs to bootstrap the analysis of information about testing, which will help shape our understanding of which information we find pertinent. That will in turn help us determine how we might obtain this information from the various code hosting services.
Objectives
Abstractions
Broadly the information can be obtained from various sources, including asking:
Each provides distinct facets of information, including about the accuracy, completeness, and perception of any tests that are related to the repo. We aim to obtain answers from at least one of these for every repository supported by the NLnet Foundation. Where practical, the one providing the most insight will be chosen, and at least one of the code-based sources will be queried in addition to whatever perspective developers can provide.
Querying git
Much of the underlying information will be in an instance of the codebase that includes the git history. Therefore it'll be worth us investigating how we might interface with a git repo independently of where the repo is hosted, e.g. on a local cloned copy of the respective repo.
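As a starting point for querying a local clone independently of its host, one option is to call the `git` command-line tool from Python and parse its output. This is a sketch under assumptions: it requires `git` to be installed, the function names are hypothetical, and a dedicated library might turn out to be a better fit once the needs are clearer.

```python
from __future__ import annotations

import subprocess
from pathlib import Path

# %H = commit hash, %aI = author date (strict ISO 8601), %s = subject, %x09 = tab.
LOG_FORMAT = "--format=%H%x09%aI%x09%s"


def run_git(repo: Path, *args: str) -> str:
    """Run a git command inside the given local clone and return its stdout."""
    result = subprocess.run(
        ["git", "-C", str(repo), *args],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def parse_log(output: str) -> list[dict[str, str]]:
    """Turn tab-separated `git log` lines into dicts."""
    commits = []
    for line in output.splitlines():
        if not line.strip():
            continue
        # maxsplit=2 keeps any tabs inside the subject intact.
        sha, date, subject = line.split("\t", 2)
        commits.append({"sha": sha, "date": date, "subject": subject})
    return commits


def commits_touching(repo: Path, pathspec: str) -> list[dict[str, str]]:
    """List commits that touched files matching pathspec, e.g. 'tests'."""
    return parse_log(run_git(repo, "log", LOG_FORMAT, "--", pathspec))
```

For example, `commits_touching(Path("path/to/clone"), "tests")` would list the commits that changed anything under a `tests` folder, which is history the filesystem-only approach cannot see.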
Work in this area
Source code analysis in codebases is an active area of academic research e.g. as part of Mining Software Repositories (MSR) and there are very likely to be tools and techniques we can use and apply to help us with our work.