All notable changes to this project will be documented in this file. This change log follows the conventions of keepachangelog.com.
- Historical document versions
- Timestamps for historical versions
- Optimize (speed+size of) low level index format
"Happy New Year"-release!
- Throw a sane error when opening an EMPTY database!
Support nippy-based files in a zip-based database, called zippy.
This could be the new go-to nippy database format, much more space efficient than "classic" ndnippy:
- 300-400% faster query speeds!
- With just 20-25% larger size than ZIP'ed EDNs
  - for comparison, ndnippy is usually as large as the uncompressed EDNs
Update clarch to enhance decompression of zip entries
- Re-opening a ZIP database no longer needs an explicit :doc-parser, only a :doc-type parameter of either :json or :edn.
- ZIP-databases can now be generated even if some ZIP entries are not deflated.
Fix reopening zip databases: for now this needs an explicit :doc-parser!
Upgrade clarch dependency
- Replaced cheshire with charred, for fewer dependencies
- Upgraded clojure to version 1.12.0
- Experimental alpha for treating .zip files as databases
- Does ndfile-md5 work for zip files? Otherwise tweak serialized-db-filepath
- :as-of isn't set in the parse-db'ed zip database
- Slim down by extracting compress ns to new library: com.luposlip/clarch
or-q now queries a seq of databases (by single ID), until a non-nil result is returned
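
The or-q behaviour described above (try each database in order, stop at the first non-nil result) can be sketched as follows. This is only an illustration under assumptions: the "databases" are plain maps, and `lookup` and `first-hit` are hypothetical names, not functions from nd-db.

```clojure
;; Hypothetical stand-in for a single-ID query: each "database" here is
;; just a map from ID to document, not a real nd-db value.
(defn lookup [db id]
  (get db id))

(defn first-hit
  "Queries dbs in order by id; returns the first non-nil result, or nil."
  [dbs id]
  (reduce (fn [_ db]
            ;; `reduced` stops the traversal, so later dbs are never queried
            (when-some [doc (lookup db id)]
              (reduced doc)))
          nil
          dbs))

(def newest {1 {:id 1 :v 2}})
(def oldest {1 {:id 1 :v 1}
             2 {:id 2 :v 1}})

(first-hit [newest oldest] 1) ; found in the first database
(first-hit [newest oldest] 2) ; falls through to the second database
(first-hit [newest oldest] 3) ; no database has it, so nil
```

Using `reduce` with `reduced` rather than a lazy `keep` avoids querying extra databases when the input seq is chunked.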
- Wrapping gz outputstream with tar outputstream for new fn tar-gz-outputstream
- Added compress fns to work with .tar.gz archives
- Updated deps
- Specific error message when serialized nippy cannot be read (probably because of nippy versioning discrepancies)
- nippy library updated from 3.2.0 to 3.3.0
- cheshire and commons-compress also updated
- Indexing didn't work when appending more than [batch size] documents
NB: [batch size] is currently set to 128.
- Append documents to existing nd-db files (previously v1.0.0)
- Optional end-pointer parameter for versioning
  - no parameter:
    - use everything in the file, including new doc versions
    - added docs will update index (thus prevent getting new db value until done)
    - the index will contain only the newest version of each document
    - a future version of nd-db might contain historical versions
  - parameter:
    - look for nddbmeta using same line (name of index reflecting lines)
      - if nddbmeta doesn't exist, stop indexing after passed line number
      - this will create a new .nddbmeta file with a hash and metadata reflecting
Multiple documents are automatically written to db (and index) in batches of 128.
Minor refactoring
Minor refactoring
nd-db.compress namespace containing input- and output-stream convenience fns
- Support for CSV/TSV files as databases
- Revamp documentation
Utility function nd-db.convert/upgrade-nddbmeta! converts your old pre-v0.9.0 nddbmeta files to the new format, and keeps the old file under the same name with _old appended to the file name.
Internally the database is now no longer a future. Instead the :index is a delay. This means immediate initialization of the db value, and that the :index doesn't get realized until you start querying.
This also means that lazy-docs and lazy-ids make even better sense if you just want to traverse the database sequentially, because in that case you're not using the realized index at all.
The external API for the library is unchanged. You initialize the database value in the same way, and you query it the same way too.
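
To illustrate why wrapping the :index in a delay gives immediate initialization, here is a minimal sketch using only clojure.core. The map shape, the file name and `build-index` are assumptions made for the example and do not reflect nd-db's internals.

```clojure
;; Counts how often the (pretend) expensive index build actually runs.
(def build-count (atom 0))

(defn build-index
  "Hypothetical stand-in for indexing a database file."
  []
  (swap! build-count inc)
  {1 0, 2 42}) ; doc ID -> byte offset (made-up data)

;; The db value exists immediately; the delay body has not run yet.
(def db {:filename "example.ndjson"
         :index    (delay (build-index))})

(realized? (:index db)) ; false: nothing has been built so far

(get @(:index db) 2)    ; first deref runs build-index once
@(:index db)            ; later derefs reuse the cached result

@build-count            ; 1: the index was built a single time
```

A future would start the work on another thread as soon as it is created, whereas a delay does nothing until it is first dereferenced, which matches the "not realized until you start querying" behaviour described above.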
lazy-ids failed in some cases when moving the index
lazy-ids now works when moving the nddbmeta file around (i.e. together with the db file)
- Default now is to generate the index in the same folder as the database. Previously the default was the filesystem temp folder.
- nddbmeta files now only contain the serialized filename, as opposed to before where it was the complete path.
- Reader for compressed non-nippy nd* files
- Parallelized index-creation - takes 2/3 less time than before (mbp m1)!
- Potentially more stable serialization of index (flushing every 1000 lines)
lazy-ids has internal BufferedReader. Should be passed from with-open.
- conversion function for pre-v0.9.0 .nddbmeta files.
- skip the realization of the index when generating the db value (= refactor)
lazy-docs now works with eager indexes:

    (lazy-docs nd-db)

Or with lazy indexes:

    (with-open [r (nd-db.index/reader nd-db)]
      (->> nd-db
           (lazy-docs r)
           (drop 1000000)
           (filter (comp pos? :amount))
           (sort-by :priority)
           (take 10)))

NB: For convenience this also works, without any penalty:

    (with-open [r (nd-db.index/reader nd-db)]
      (->> r
           (lazy-docs nd-db)
           ...
           (take 10)))

Still need to make the conversion function for pre-v0.9.0 .nddbmeta files.
WIP! lazy-docs might change signature when using the new index-reader!
- new format for metadata/index .nddbmeta file
The new format makes it much faster to initialize and sequentially read through the whole database. The change will make the most impact for humongous databases with millions of huge documents.
Old indexes will not be readable anymore. Good news is that there will be a new nd-db.convert/upgrade-nddbmeta! utility function, which converts your old file to the new format and overwrites it.
The downside to the support for laziness is the size of the meta+index files, which in my tested scenarios have grown by 100%. This means the meta+index file for a database containing ~300k huge documents (of 200-300KB each in raw JSON/EDN form) has grown from ~5MB to ~10MB.
This is not a problem at all in real life, since when you need the realized in-memory index (for ad-hoc querying by ID), it still consumes the same amount of memory as before (in the above example ~3MB).
And compared to the database it describes it's nothing - the above mentioned index is for a .ndnippy database of 16.8GB.
- because of the change to the metadata format, the lazy-docs introduced with v0.8.0 is now much more efficient. Again this is most noticeable when you need to read sequentially through parts of a huge database.
Dependency buddy/buddy-core not needed anymore. Instead using built-in similar functionality from com.taoensso/nippy.
- Utility function to get lazy seq of all indexed IDs: nd-db.io/lazy-docs
- Updated nippy
- Bugfix release, downgrade nippy
Using projects couldn't compile nd-db with nippy version 3.2.0
- Make serialized databases portable (not bound to a specific filesystem path)
- Support for clojure 1.11 keyword function parameters: https://clojure.org/news/2021/03/18/apis-serving-people-and-programs
- Upgrade dependencies
- Upgrade Clojure 1.10.3 -> 1.11.1
- Upgraded other dependencies
- Minor optimizations
Fix issue when creating index for ndjson/ndedn
Eliminate a reflective call when serializing the database.
0.6.0 - introducing .ndnippy!
Now you can use .ndnippy as database format. It's MUCH faster to load than .ndjson and .ndedn, meaning better query times. Particularly when querying multiple documents at once.
Also a new util namespace lets you convert from .ndjson and .ndedn to .ndnippy.
.ndnippy, like .ndedn, isn't really a standard. But it probably should be. I implemented the encoding for .ndnippy myself; it's somewhat naive, but really fast anyhow. If you have ideas on how to make it even faster, let me know. Because version 0.6.0 introduces the .ndnippy format, it may change several times in the future, possibly making old .ndnippy files incompatible with new versions. Now you're warned. Thankfully the generation of new .ndnippy files is quite fast.
NB: .ndnippy isn't widely used (this is, as far as I know, the first and only use), and probably isn't a good distribution format, unless you can distribute the nd-db library with it.
NB: For small documents (like the ones in the test samples), .ndnippy files are actually bigger than their json/edn alternatives. Even the Twitter sample .ndjson file mentioned in the README becomes bigger as .ndnippy. With the serialization mechanism used right now, the biggest benefits are when the individual documents are huge (i.e. 10s of KBs). We've done experiments with methods that actually make the resulting size the same as the input, even for small documents. But there's a huge performance impact to using them, which is counterproductive.
- Bug when using new :id-rx-str as input
- Eliminate reflective call when querying file
- Persist the processed index to disk for fast re-initialization
- Saves to temp filesystem folder (default)
- optionally a different folder to persist index between system reboots
- Uses filename, content and optionally regex-string to name the index file
0.4.0 - simpler and smaller!
- Auto-infer :doc-type from db file extension (*.ndedn -> :doc-type :edn)
  - This means you have to use either db extension .ndedn|.ndjson or :doc-type :json|:edn
  - Defaults to :json if extension is unknown and :doc-type isn't set
- Laying the groundwork for improved indexing performance via parallelization.
- Need more work to limit memory consumption for huge databases, before enabling it
- Rename core namespace to nd-db.core, and the library from luposlip/ndjson-db to com.luposlip/nd-db!
- Removed core.memoize and timbre (not used anymore)
- Smaller deployable!
- Updated clojure (-> 1.10.3)
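
The :doc-type auto-inference listed for 0.4.0 could look roughly like the sketch below. `infer-doc-type` is a hypothetical helper written for illustration, and letting an explicit :doc-type take precedence over the file extension is an assumption; the changelog only states the extension mapping and the :json default.

```clojure
(require '[clojure.string :as str])

(defn infer-doc-type
  "Hypothetical helper. Assumption: an explicit doc-type wins over the
  extension. Unknown extensions fall back to :json."
  [filename doc-type]
  (or doc-type
      (cond
        (str/ends-with? filename ".ndedn")  :edn
        (str/ends-with? filename ".ndjson") :json
        :else                               :json)))

(infer-doc-type "docs.ndedn" nil)  ; inferred from the extension
(infer-doc-type "docs.txt" nil)    ; unknown extension, default applies
(infer-doc-type "docs.txt" :edn)   ; explicit doc-type is used
```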
- Add support for the .ndedn file format, where all lines are well formed EDN documents.
- Updated dependencies (timbre, memoize, cheshire)
- Updated dependencies (clojure, cheshire, memoize)
- Completely revamped API
clear-all-indices!! -> clear-all-indexes!!
- Using timbre for logging
- Now using Apache License, Version 2.0 (instead of Eclipse Licence 2.0)
- API for using default id-fn for querying by json name with type string or integer
- Added clear-all-indices!! and clear-index!
- Added documentation in README
- Enhanced API for lazy/streaming usage
- Initial public release
- Example on how to query huge datasets