Generative AI - NLP
Everything you need to know about GenAI for NLP
GENAI
Krishnav Dave
1/3/2025
5 min read
Table of contents
Procure data
Enrich data
Embeddings
Vector storage
Chunking
RAG
LLM
Evaluate performance
Model observability
GenAI Ops
Cloud for GenAI
Hardware for GenAI
Procure data
Gather all available data
1. Open Data Sources: Leverage publicly available datasets
List of common licenses for datasets:
1. Public Domain (CC0): The dataset is free for use without restrictions; no attribution is required.
2. Creative Commons Attribution (CC BY): Allows usage, sharing, and adaptation, provided proper credit is given.
3. Creative Commons Attribution-ShareAlike (CC BY-SA): Requires adaptations of the dataset to be shared under the same license.
4. Creative Commons Attribution-NoDerivatives (CC BY-ND): Allows use and sharing but prohibits modifications.
5. Creative Commons Attribution-NonCommercial (CC BY-NC): Restricts usage to non-commercial purposes, with attribution required.
6. Creative Commons Attribution-NonCommercial-ShareAlike (CC BY-NC-SA): Non-commercial use is allowed, with adaptations shared under the same license.
7. Creative Commons Attribution-NonCommercial-NoDerivatives (CC BY-NC-ND): Most restrictive; non-commercial use only, no modifications, and attribution required.
8. GNU General Public License (GPL): Written for software; allows free use, modification, and distribution, but distributed derivatives must also be licensed under the GPL.
9. GNU Lesser General Public License (LGPL): Similar to the GPL but allows linking from software under other licenses, including proprietary software.
10. MIT License: Permissive license that allows usage, modification, and distribution with minimal restrictions.
11. Apache License 2.0: Allows usage, modification, and distribution, with a requirement to provide a copy of the license and state modifications.
12. Open Data Commons Public Domain Dedication and License (PDDL): Public domain equivalent for data, allowing unrestricted use.
13. Open Data Commons Attribution License (ODC-By): Requires attribution for data usage or redistribution.
14. Open Data Commons Open Database License (ODbL): Allows usage and adaptation but requires sharing derivatives under the same license and providing attribution.
15. Proprietary License: Grants specific rights for dataset use, often with significant restrictions defined by the provider.
16. Free Use License: Permits usage without charge but may impose restrictions like attribution or non-commercial use.
17. Non-Commercial License: Restricts use of the dataset to non-commercial purposes.
18. End-User License Agreement (EULA): Custom, provider-specific terms for dataset usage, often associated with proprietary data.
19. Data Use Agreement (DUA): Defines specific conditions for accessing and using data, typically in academic or research collaborations.
20. BSD License: Permissive license allowing use, modification, and distribution with minimal conditions.
21. CC BY-SA 4.0 (Data-specific): Version 4.0 of CC BY-SA (item 3), which explicitly covers database rights; adaptations must be shared under the same license, with attribution.
22. Software-Specific Licenses: Licenses like the PostgreSQL License or SQLite License are applied to datasets associated with software.
Always review and comply with the terms of a dataset’s license to avoid legal issues.
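One way to make that review repeatable is to encode the terms of each license as data and check them before a dataset enters the pipeline. The sketch below is a minimal illustration covering only the Creative Commons entries from the list above; the `LicenseTerms` class, the `LICENSES` table, and the `usable_for` helper are hypothetical names, and the table is a simplification of the license text, not legal advice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LicenseTerms:
    """Simplified summary of what a license requires or permits."""
    attribution: bool     # credit must be given
    commercial_use: bool  # commercial use is permitted
    derivatives: bool     # modified versions may be shared
    share_alike: bool     # adaptations must keep the same license


# Hypothetical lookup table, mirroring items 1-7 of the list above.
LICENSES = {
    "CC0":         LicenseTerms(False, True,  True,  False),
    "CC BY":       LicenseTerms(True,  True,  True,  False),
    "CC BY-SA":    LicenseTerms(True,  True,  True,  True),
    "CC BY-ND":    LicenseTerms(True,  True,  False, False),
    "CC BY-NC":    LicenseTerms(True,  False, True,  False),
    "CC BY-NC-SA": LicenseTerms(True,  False, True,  True),
    "CC BY-NC-ND": LicenseTerms(True,  False, False, False),
}


def usable_for(license_id: str, commercial: bool, modify: bool) -> bool:
    """Return True if the license appears to permit the intended use."""
    terms = LICENSES.get(license_id)
    if terms is None:
        # Unknown or custom license: reject until a person has reviewed it.
        return False
    if commercial and not terms.commercial_use:
        return False
    if modify and not terms.derivatives:
        return False
    return True


for name in ("CC BY", "CC BY-NC", "CC BY-ND", "Some custom EULA"):
    print(name, usable_for(name, commercial=True, modify=True))
```

Rejecting unknown licenses by default is a deliberate choice here: proprietary licenses, EULAs, and DUAs have provider-specific terms that a lookup table cannot capture.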
2. APIs: Extract data from available online services
Here’s a list of 50 popular APIs for procuring datasets for AI projects, grouped by use case:
General Data APIs
1. Kaggle API: Access thousands of datasets and competitions from Kaggle.
2. Google Dataset Search: Discover publicly available datasets across domains (primarily a search interface rather than a programmatic API).
3. Data.gov API: US government datasets for open data and research.
4. Open Data Portal API (EU): Access datasets from European Union countries.
5. World Bank API: Offers global economic, development, and financial datasets.
Social Media APIs
6. Twitter (now X) API: Real-time posts, trends, and social media insights.
7. Reddit API: Access subreddit discussions, comments, and metadata.
8. Facebook Graph API: Insights from Facebook pages, posts, and analytics.
9. YouTube Data API: Video metadata, comments, and engagement stats.
10. TikTok API: Insights into trending content and user interactions.
E-Commerce APIs
11. Amazon Product Advertising API: Product data, reviews, and pricing.
12. eBay API: Marketplace data, including listings and pricing.
13. Walmart Open API: Product data, inventory, and reviews.
14. Rakuten API: Data from one of Japan's largest e-commerce platforms.
15. Shopify API: Access to store, product, and transaction data.
Healthcare and Genomics APIs
16. NIH Open Access API: Research articles and medical data from NIH.
17. ClinVar API: Genetic variation and medical condition data.
18. PubMed API: Biomedical literature and research abstracts.
19. Healthdata.gov API: US government health data.
20. Human Genome API: Genomic sequencing and variant datasets.
Geospatial APIs
21. Google Maps API: Geographic data, including geocoding and places.
22. OpenStreetMap API: Open-source maps and geographic data.
23. NASA Earth Science API: Satellite imagery and climate data.
24. USGS Earth Explorer API: Geological and geospatial data.
25. HERE Maps API: Location-based data and mapping services.
Finance and Economics APIs
26. Alpha Vantage API: Stock market and financial data.
27. Quandl API (now Nasdaq Data Link): Economic, financial, and alternative datasets.
28. Yahoo Finance API: Market data and financial news.
29. Open Exchange Rates API: Real-time and historical currency rates.
30. IMF Data API: Global economic indicators from the IMF.
Weather and Climate APIs
31. OpenWeatherMap API: Real-time and historical weather data.
32. NOAA API: US climate, weather, and ocean data.
33. Weatherstack API: Simple access to global weather data.
34. AccuWeather API: Detailed forecasts and meteorological insights.
35. Climacell API (now Tomorrow.io): Hyper-local weather predictions and historical data.
News and Media APIs
36. NewsAPI: Aggregated news articles from global sources.
37. The Guardian API: Articles and multimedia content from The Guardian.
38. NY Times API: News and archives from The New York Times.
39. MediaStack API: Real-time news data from multiple sources.
40. GDELT API: Global news event and sentiment analysis data.
Education and Research APIs
41. arXiv API: Research papers in physics, mathematics, computer science (including AI), and related fields.
42. Springer API: Scientific books, journal metadata, and abstracts.
43. UNESCO Data API: Global educational, scientific, and cultural data.
44. CORE API: Academic papers and open-access research data.
45. Semantic Scholar API: AI-driven insights into academic publications.
Entertainment APIs
46. Spotify API: Music metadata, playlists, and user behavior.
47. IMDb API: Movies, TV shows, and ratings data.
48. TMDb API: Movie and TV metadata, including reviews.
49. Last.fm API: Music listening behavior and trends.
50. Twitch API: Streaming data, including viewers and trends.
These APIs offer diverse datasets for AI applications like NLP, CV, and predictive modeling, and should be used per their licensing and usage policies.
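Most of these services follow the same pattern: build a query URL, fetch the response, and parse it into records. The sketch below illustrates that pattern with the arXiv API (item 41), which returns an Atom XML feed. The helper names `build_query_url` and `parse_feed` are hypothetical, the endpoint and query parameters should be checked against arXiv's current documentation, and the sample feed is made up so the example runs without network access.

```python
import urllib.parse
import xml.etree.ElementTree as ET
from typing import Dict, List

# Assumed endpoint and Atom namespace; verify against the arXiv API manual.
ARXIV_ENDPOINT = "http://export.arxiv.org/api/query"
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def build_query_url(search_terms: str, max_results: int = 5) -> str:
    """Build a search URL that matches the terms against all fields."""
    params = {
        "search_query": f"all:{search_terms}",
        "start": 0,
        "max_results": max_results,
    }
    return f"{ARXIV_ENDPOINT}?{urllib.parse.urlencode(params)}"


def parse_feed(atom_xml: str) -> List[Dict[str, str]]:
    """Extract title and summary from each entry of an Atom feed."""
    root = ET.fromstring(atom_xml)
    records = []
    for entry in root.findall("atom:entry", ATOM_NS):
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        summary = entry.findtext("atom:summary", default="", namespaces=ATOM_NS)
        # Collapse line breaks and repeated spaces inside the text fields.
        records.append({
            "title": " ".join(title.split()),
            "summary": " ".join(summary.split()),
        })
    return records


# Made-up feed standing in for a real response body.
SAMPLE_FEED = """
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>A Sample
      Paper Title</title>
    <summary>  A short abstract used for illustration.  </summary>
  </entry>
</feed>
"""

print(build_query_url("retrieval augmented generation", max_results=3))
print(parse_feed(SAMPLE_FEED))
```

In a real pipeline the feed would come from an HTTP request to the built URL (for example with `urllib.request.urlopen`), with delays between calls to respect the provider's rate limits.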
3. Web Scraping: Extract data from websites and online services
4. Data Licensing: Procure licenses for data from third-party providers, ensuring the usage rights cover the intended commercial or non-commercial use.
5. Synthetic Data Generation: Create synthetic datasets using rule-based systems, simulations, or pre-trained generative models to mimic real-world data.
6. Enterprise Data: Utilize proprietary data from the organization's internal systems.
7. Crowdsourcing Platforms: Employ crowdsourcing platforms to gather data.
8. IoT and Sensor Data: Collect real-time data from IoT devices, sensors, or industrial equipment.
9. Data Partnerships: Collaborate with other organizations to share or exchange data under mutual agreements.
10. User-Contributed Data: Encourage users to provide data through surveys, forms, or interaction with apps and platforms.
11. Social Media Platforms: Mine data from social media platforms while adhering to their terms of service.
12. Mobile App Data: Collect data from user interactions with mobile applications or services.
13. Log and Telemetry Data: Analyze server logs or telemetry data from software systems for user behavior insights.
14. Cloud Marketplaces: Access pre-labeled datasets and data services from cloud providers.
15. Satellite and Geospatial Data: Obtain satellite images or GIS data for applications in agriculture, urban planning, or environmental monitoring.
16. Research Collaboration: Partner with academic or research institutions to access niche or curated datasets.
17. Data Augmentation: Use existing data and apply transformations (e.g., cropping, rotation, noise injection) to expand dataset diversity.
18. Annotated Dataset Providers: Purchase labeled datasets from companies specializing in data annotation services.
19. Public Forums and Communities: Collect data from forums like Reddit or specialized communities for domain-specific information.
20. Embedded Systems: Leverage edge devices and embedded systems to collect decentralized, real-time data.
21. Manual Collection: Conduct manual data gathering through field research, interviews, or direct observation.
Each method should comply with data privacy regulations like GDPR, CCPA, or HIPAA, depending on the project and region.
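For web scraping in particular, a common first step before fetching any page is to check the site's robots.txt. The sketch below uses Python's standard `urllib.robotparser` for that check; the `is_allowed` helper, the `research-bot` user agent, and the sample rules are hypothetical, and the rules are inlined so the example runs offline. Passing this check does not by itself establish compliance with a site's terms of service or with privacy regulations.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for url in (
    "https://example.com/articles/1",
    "https://example.com/private/report",
):
    print(url, is_allowed(SAMPLE_ROBOTS_TXT, "research-bot", url))
```

Against a live site, the parser can load the rules itself with `set_url(...)` followed by `read()` instead of receiving them as a string.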