TBD suitable techniques for querying git repos #7

Open
julianharty opened this issue Mar 8, 2024 · 4 comments
Labels: documentation, enhancement, help wanted, question

Comments

@julianharty
Member

julianharty commented Mar 8, 2024

Context

NLnet funds open source projects; these projects host their code on a variety of code hosting services including GitHub, GitLab, and others. Some of these, including GitHub, provide mechanisms to query the codebases they host. The mechanisms are likely to vary in their APIs, methods, and the data they return. We will start by using GitHub's APIs to bootstrap the analysis of information about testing, and then iterate; this will help shape our understanding of which information is pertinent. That, in turn, will help us determine how we might obtain this information from the various code hosting services.

Objectives

  • To find reliable, performant, and viable mechanisms to extract pertinent information from git codebases and/or their hosting providers related to the testing of those codebases.
  • To encourage an iterative, learning-oriented approach that supports multiple approaches especially during the early stages of learning.

Abstractions

Broadly the information can be obtained from various sources, including asking:

  1. the developer(s)
  2. the testing framework(s)
  3. the hosting provider e.g. github.com
  4. the operating system and file system
  5. git

Each provides distinct facets of information, including about the accuracy, completeness, and perception of any tests that are related to the repo. We aim to obtain answers from at least one of these for every repository supported by NLnet foundation. Where practical, the one providing the most insight will be chosen, and at least one of the code-based sources will be queried in addition to whatever perspective developers can provide.

Querying git

Much of the underlying information will be in an instance of the codebase that includes the git history. Therefore it'll be worth us investigating how we might interface with a git repo independently of where the repo is hosted, e.g. on a local cloned copy of the respective repo.
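As a minimal sketch of interfacing with a local clone independently of the hosting provider, the snippet below shells out to the git command line from Python. It assumes git is installed and on the PATH; the helper names run_git, head_commit, and tracked_files are hypothetical and only illustrate the idea, they are not part of this project.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def run_git(repo: Path, *args: str) -> str:
    """Run a git command inside `repo` and return its stdout as text."""
    result = subprocess.run(
        ["git", "-C", str(repo), *args],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if git reports a failure
    )
    return result.stdout


def head_commit(repo: Path) -> str:
    """Return the full hash of the commit currently checked out."""
    return run_git(repo, "rev-parse", "HEAD").strip()


def tracked_files(repo: Path) -> list[str]:
    """Return the paths git tracks, relative to the repo root."""
    # -z separates entries with NUL so unusual filenames survive intact.
    output = run_git(repo, "ls-files", "-z")
    return [path for path in output.split("\0") if path]
```

Recording head_commit alongside any result would tie that result to a specific state of the codebase, and tracked_files avoids counting files that happen to be on disk but are not part of the repo.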

Work in this area

Source code analysis in codebases is an active area of academic research, e.g. as part of Mining Software Repositories (MSR), and there are very likely to be tools and techniques we can apply to help us with our work.

@julianharty julianharty added the documentation, enhancement, help wanted, and question labels Mar 8, 2024
@julianharty julianharty self-assigned this Mar 8, 2024
@julianharty
Member Author

Filesystem queries for local instances of code repos

Local repos contain project files (and files created and maintained by git). Therefore filesystem queries can obtain pertinent information contained in the project files without requiring code that 'understands' git. Python 3 includes the pathlib module (see https://realpython.com/python-pathlib/), which seems a useful starting point, e.g. to find program files that match commonly used extensions such as .java for Java, .kt for Kotlin, .js for JavaScript, and .py for Python. Paths can also be queried for substrings, e.g. test, that may indicate the existence of code intended for testing purposes, albeit there may be false positives.

This approach doesn't collect any git-related information.
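A rough sketch of the pathlib idea follows. The extension table only covers the four languages mentioned above, and the names LANGUAGE_BY_EXTENSION, find_source_files, and looks_like_test are hypothetical, chosen for illustration.

```python
from __future__ import annotations

from pathlib import Path

# Extensions mentioned in the comment above; extend as needed.
LANGUAGE_BY_EXTENSION = {
    ".java": "Java",
    ".kt": "Kotlin",
    ".js": "JavaScript",
    ".py": "Python",
}


def find_source_files(root: Path) -> list[Path]:
    """Return program files under `root`, skipping git's own .git folder."""
    found = []
    for path in root.rglob("*"):
        if ".git" in path.relative_to(root).parts:
            continue
        if path.is_file() and path.suffix.lower() in LANGUAGE_BY_EXTENSION:
            found.append(path)
    return sorted(found)


def looks_like_test(path: Path, root: Path) -> bool:
    """Heuristic: 'test' appears in the path relative to `root`, any case."""
    # Using the relative path avoids matching a parent folder outside the repo.
    return "test" in path.relative_to(root).as_posix().lower()
```

The substring check illustrates the false positives mentioned above: a file such as src/main/Latest.java would be flagged because "latest" contains "test".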

@julianharty
Member Author

Sense checking

To rely on untested code on unknown codebases would be sad and poor practice. Let's cross-check using various techniques to increase our confidence in the results our software returns.

Basic sanity checks

Commands such as the *nix find command can be used to search for files and folders that match text we provide. As many developers apply conventions, such as placing automated tests in a directory branch such as src/test/, we can use these as heuristics when searching for code that contains tests. Additional *nix utilities can help: grep for filtering and wc for counting items.

In the folder spring-petclinic/src/test of a local clone of the repo, find . -type f -print0 | xargs -0 file | grep -i source returns:

find . -type f -print0 | xargs -0 file | grep -i source
./java/org/springframework/samples/petclinic/vet/VetTests.java:                           C++ source text, ASCII text
./java/org/springframework/samples/petclinic/vet/VetControllerTests.java:                 C++ source text, ASCII text
./java/org/springframework/samples/petclinic/owner/PetControllerTests.java:               C++ source text, ASCII text
./java/org/springframework/samples/petclinic/owner/OwnerControllerTests.java:             C++ source text, ASCII text
./java/org/springframework/samples/petclinic/owner/VisitControllerTests.java:             C++ source text, ASCII text
./java/org/springframework/samples/petclinic/owner/PetTypeFormatterTests.java:            C++ source text, ASCII text
./java/org/springframework/samples/petclinic/system/CrashControllerTests.java:            C++ source text, ASCII text
./java/org/springframework/samples/petclinic/system/CrashControllerIntegrationTests.java: C++ source text, ASCII text
./java/org/springframework/samples/petclinic/PetClinicIntegrationTests.java:              Java source text, ASCII text
./java/org/springframework/samples/petclinic/model/ValidatorTests.java:                   C++ source text, ASCII text
./java/org/springframework/samples/petclinic/MySqlIntegrationTests.java:                  C++ source text, ASCII text
./java/org/springframework/samples/petclinic/service/EntityUtils.java:                    Java source, ASCII text
./java/org/springframework/samples/petclinic/service/ClinicServiceTests.java:             C++ source text, ASCII text
./java/org/springframework/samples/petclinic/MysqlTestApplication.java:                   Java source text, ASCII text
./java/org/springframework/samples/petclinic/PostgresIntegrationTests.java:               Java source text, ASCII text

The file command identified many of the Java files as C++, hence I used grep to match source rather than Java: I'd prefer to err on the side of including the files identified as C++ since they probably contain tests. It's possible, and may be productive, to drill into the various files to extract the names of individual tests. That's a later exercise, and not needed just yet.
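For when that later exercise becomes relevant, here is a very rough sketch of extracting test names from Java source. It assumes tests are methods annotated with JUnit-style @Test, which is an assumption rather than something confirmed for every repo, and it uses a regular expression rather than a real Java parser. The names TEST_METHOD and extract_test_names are hypothetical.

```python
from __future__ import annotations

import re

# Assumption: a test is a method annotated with @Test (JUnit style).
# After the annotation, skip ahead lazily (without crossing ';' or '{')
# to the first identifier that is directly followed by '('.
TEST_METHOD = re.compile(r"@Test\b[^;{]*?\b(\w+)\s*\(", re.DOTALL)


def extract_test_names(java_source: str) -> list[str]:
    """Return the names of methods that appear to be annotated with @Test."""
    return TEST_METHOD.findall(java_source)
```

This is a heuristic only: it ignores comments, does not recognise other annotations such as parameterized tests, and can be misled when another annotation with arguments sits between @Test and the method. Its counts are best treated as a cross-check rather than ground truth.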

We're more interested in filenames than folder names at this stage. Nonetheless find . -type d -iname '*test*' -print performs a case-insensitive search for folders/directories that have the word test in them. There may be projects that have tests in folders in a test folder branch where the filenames do not include the word test. The output of this command can be passed to xargs to perform a subsequent search for source files that might include automated tests. We'd then want to investigate the contents of those files to determine if they do actually include tests.
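The directory-based search can also be sketched in Python, which may be convenient if the results need to feed into later processing. This roughly mirrors find . -type d -iname '*test*' followed by a search for source files inside the matches; the extension set and the names SOURCE_EXTENSIONS, find_test_directories, and candidate_test_files are hypothetical.

```python
from __future__ import annotations

from pathlib import Path

# Illustrative set of source extensions; extend as needed.
SOURCE_EXTENSIONS = {".java", ".kt", ".js", ".py"}


def find_test_directories(root: Path) -> list[Path]:
    """Directories under `root` whose own name contains 'test', any case."""
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_dir()
        and "test" in path.name.lower()
        and ".git" not in path.relative_to(root).parts
    )


def candidate_test_files(root: Path) -> set[Path]:
    """Source files anywhere beneath a directory that looks test-related."""
    candidates = set()
    for directory in find_test_directories(root):
        for path in directory.rglob("*"):
            if path.is_file() and path.suffix.lower() in SOURCE_EXTENSIONS:
                candidates.add(path)
    return candidates
```

Because the match is on the directory name, files such as src/test/java/Helper.java are picked up even though their filenames do not include the word test. Their contents would still need checking to confirm they include tests.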

@julianharty
Member Author

Some interim thoughts

I wonder if it'd be worth us amending the data frame(s) so that they can record:

  1. the method by which the information was obtained
  2. the commit (and perhaps the branch if needed)
  3. the count of tests

I don't yet know enough about pandas dataframes to understand whether they support structures within cells and/or nested data. Similarly, we'll eventually need to communicate this information to NLnet using an RDF structure, and that may place its own constraints on how the info can be communicated.

TODO

  • Determine constraints and workarounds for storing the info in dataframes
  • Determine the constraints and workarounds for RDF structure(s).
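One possible way to sidestep the nested-data question for the three fields listed earlier is to keep one flat row per repo, method, and commit, so every cell holds a single value. The sketch below uses only the standard library; the class name RepoTestCount, its field names, and the helper to_rows are hypothetical, and whether a flat layout maps well onto the eventual RDF structure is still an open question.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class RepoTestCount:
    """One observation: how many tests one method found at one commit."""

    repo: str
    method: str  # how the information was obtained, e.g. "pathlib" or "find"
    commit: str  # commit hash the count applies to
    test_count: int
    branch: Optional[str] = None  # only recorded if needed


def to_rows(records: list[RepoTestCount]) -> list[dict]:
    """Convert records to plain dicts, one per row, with scalar values only."""
    return [asdict(record) for record in records]
```

A list of dicts like the one to_rows returns can be passed to pandas.DataFrame(...) to get one column per field. Results from several methods for the same repo then appear as separate rows, which keeps cross-checking between methods straightforward.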
